bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data

1 bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data
Ilkay Altintas (1), Jianwu Wang (2), Daniel Crawl (1), Shweta Purawat (1)
(1) San Diego Supercomputer Center, UC San Diego
(2) UMBC
WorDS.sdsc.edu

2 A Toolbox with Many Tools
Data: search, database access, I/O operations, streaming data in real time
Compute: data-parallel patterns, external execution, network operations
Provenance and fault tolerance
Need expertise to identify which tool to use when and how!
Require computation models to schedule and optimize execution!

3 Workflows are Used in These Diverse Scenarios in Biological Sciences
Data Acquisition / Generation: sequencers, sensor networks, medical imaging; many forms
Data Analysis: often for data reduction; in real time or offline; data-intensive, HPC, local, exploratory
Data Publication / Archival: from analysis to searchable results; standardization; auto-generation of methods and materials
Workflows foster collaborations! Flexibility and synergy, optimization of resources, increasing reuse, standards compliance

4 CAMERA Example: Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research
Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA)

5 CAMERA is a Collaborative Environment
Data Discovery: GIS and advanced query options
Data Cart: multiple available; mixed collections of CAMERA data (e.g. projects, samples)
User Workspace: single workspace with access to all data and results (private and shared)
Data Analysis: workflow-based analysis
Group Workspace: share specified User Workspace data with collaborators

6 Workflows are a Central Part of CAMERA
CAMERA-supported: 28 existing workflows
Workflows under development: Fragment Recruitment Viewer, Next Generation Sequencing, VIROME Pipeline
Standalone bioinformatics tools: National Center for Genome Research, Joint Genome Institute
User built: currently running in a sandbox; will be ported to a virtual cloud environment
Workflow types: QC filter, BLAST, duplicate filtering, assembly, taxonomy binning, metagenomic annotation and clustering, comparison, statistical analysis, and more
More than 1500 workflow submissions monthly!
Inputs: from local or CAMERA file systems; user-supplied parameters
Outputs: sharable with a group of users and links to the semantic database
All can be reached through the CAMERA portal at: http://portal.camera.calit2.net

7 CAMERA Portal - Workflows

8 CAMERA Workflows RAMMCAP

9 CAMERA Workflows

10 CAMERA Workflows

11 CAMERA Job Status

12 CAMERA Workflow Results

13 Pushing the boundaries of existing infrastructure and workflow system capabilities

14 New Requirements from the User Community
Increase reuse: best development practices by the scientific community; other bio packages
Increase programmability by end users: users with various skill levels to formulate actual domain-specific workflows
Increase resource utilization: optimize execution across available computing resources in an efficient, transparent and intuitive manner
Make analysis a part of the end-to-end scientific model from data generation to publication

15 RAMMCAP: Rapid Clustering and Functional Annotation for Metagenomic Sequences
Annotation features:
- tRNA prediction (tRNAscan)
- rRNA prediction (meta_rna, BLAST)
- ORF call (ORF_finder, Metagene)
- RPS-BLAST against COG etc.
- HMMER against Pfam / TIGRfam
Clustering of reads
Multi-step clustering of ORFs
GO assignment
EC number assignment

16 A number of bioinformatics tools are used in RAMMCAP
Tool: Description
BLAST: Scalable parallel database search with blastn, blastp, tblastn, blastx, tblastx
MegaBLAST: Fast database search with MegaBLAST
Diversity: Diversity analysis for viral metagenome
QC: Quality control for 454 raw reads
CD-HIT-454: Identify artificial duplicates from 454 reads
RAMMCAP: Metagenome annotation (rRNA, tRNA, ORF prediction; reads and ORF clustering; reads and ORF information; family and function annotation with Pfam, TIGRfam, COG; Gene Ontology and Enzyme Classification annotation; combined annotation summary)
FRV: Fragment Recruitment Viewer
Assembly: Consensus-based meta-assembler for 454 reads
KEGG: Pathway annotation by searching the KEGG database with blastp
RDP binning: Taxonomy binning of rRNA sequences using RDP classifier
BLAST binning: Taxonomy binning by querying ref. rRNA DB using blastn
tRNA: Identification of tRNAs from fragments using tRNA-scan
Meta-RNA: Identification of rRNAs from fragments using HMM
BLAST-RNA: Identification of rRNAs by querying ref. rRNA DB using blastn
ORF_finder: ORF call by six reading frame translation
Metagene: ORF call by Metagene
FragGeneScan: ORF call with FragGeneScan from 454 reads
Pfam: Protein family annotation against Pfam using HMMER
TIGRfam: Protein family annotation against TIGRfam using HMMER
COG: Protein family annotation against NCBI COG using rps-blast
KOG: Protein family annotation against NCBI KOG using rps-blast
PRK: Protein family annotation against NCBI PRK using rps-blast
CD-HIT-EST: Clustering of reads
CD-HIT: Clustering of ORFs
H-CD-HIT: Multiple level clustering of ORFs into ORF family
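The ORF_finder entry above describes ORF calling by six reading frame translation. As a rough illustration only, the idea can be sketched in Python as follows. This is not the RAMMCAP or CAMERA code: the function names are hypothetical, the standard genetic code is assumed, and unknown codons (for example those containing N) are rendered as X.

```python
from itertools import product
from typing import Dict

BASES = "TCAG"
# Amino acids of the standard genetic code, listed in TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; a trailing partial codon is dropped."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq: str) -> Dict[str, str]:
    """Translate the three forward and three reverse-strand frames."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames: Dict[str, str] = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames
```

Calling `six_frame_translation("ATGGCCTAA")` returns a dictionary keyed by frame label (`+1` to `+3`, `-1` to `-3`); an ORF caller would then scan each translated frame for stretches between stop codons (`*`).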

17 Original implementation of the annotation workflow in Kepler
Data flow is divided.
A green box is called an actor, which performs a task.
This special actor represents an annotation component, such as BLAST search.
Workflow parameters, which can be specified by users in the portal, are passed to workflow components.

18 Each actor was a wrapper to a web service! Customized web services!

19 RAMMCAP
[Figure: the RAMMCAP tools (QC, cd-hit, metagene, tRNA, hmmer, blast) positioned along four axes: data size (KB, MB, GB, TB), CPU time (second, hour, day, month, year), memory (GB, 10 GB, 100 GB), and parallelization (no need, no, multithreading, MPI, MapReduce).]

20 RAMMCAP: Rapid Clustering and Functional Annotation for Metagenomic Sequences
[Figure: two panels positioning the RAMMCAP tools (QC, cd-hit, metagene, tRNA, hmmer, blast) along four axes: data size (KB, MB, GB, TB), CPU time (minute, hour, day, month, year), memory (GB, 10 GB, 100 GB), and parallelization (no need, no, multithreading, MPI, MapReduce). The second panel marks NGS on the data-size axis between GB and TB.]

21 Other cases: RNA-seq / genomic / metagenomic
Raw reads -> Reads QC -> HQ reads
Mapping (BWA, Bowtie, BLAST) -> Alignments
Assembly (Velvet, SOAPdenovo, Abyss, Oases, Trinity) -> Contigs
Further analysis
[Figure: QC, mapping, and assembly positioned along four axes: data size (KB, MB, GB, TB, with NGS marked), CPU time (minute, hour, day, month), memory (GB, 10 GB, 100 GB), and parallelization (no need, no, multithreading, MPI, MapReduce).]

22 bioKepler implementation: Using bioActors instead of wrapper actors
Wrapper actors: need implementation of underlying computational tools
bioActors: reusable; multiple execution modes; built-in parallel execution capabilities

23 Gateways and other user environments
Kepler and Provenance Framework
bioKepler
BioLinux, Galaxy, Clovr, Hadoop
CLOUD and OTHER COMPUTING RESOURCES, e.g., SGE, Amazon, FutureGrid, XSEDE
A coordinated ecosystem of biological and technological packages for bioinformatics!
May 22nd, 2014, Scalable Bioinformatics Boot Camp

24 The bioKepler Approach
Parallel Computation Framework: use Distributed Data-Parallel (DDP) frameworks, e.g., MapReduce, and other parallelization methods to execute subworkflows
bioActors: configurable and reusable higher-order components for bioinformatics and computational biology
Transparent support for different execution engines and computational environments
Deployment on diverse environments

25 Reuse, Programmability, Execution
Funded by NSF ABI & CI Reuse programs - Altintas (PI) and Li (Co-PI)
Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data
Big improvement on usability and programmability by end users!
Stack: bioKepler; CloudBioLinux, Galaxy, Bio-Linux; Kepler CORE (Distributed Data Parallel, Provenance, Reporting, Run Manager)
Kepler supports:
- Workflows
- Other third-party programming tools, e.g., R, Matlab, KNIME
- Extensible task and data parallelization
- Service orientation
- Execution on multiple engines, e.g., SDF, SGE, Hadoop

26 bioKepler's Conceptual Framework
Kepler
Bioinformatics Tools: Mapping, Assembly, Clustering
bioKepler
Data-Parallel Execution Patterns: Map-Reduce, Master-Slave, All-Pairs
bioActors: BLAST, HMMER, CD-HIT
Workflow: Director, Scheduler, Executable Workflow Plan
Provenance: Data Lineage, Execution History
Reporting: Report Designer, PDF Generation
Run Manager: Search, Tag
Bioinformatician: Customize & Integrate; Deploy & Execute
Execution Engine
Fault-Tolerance: Alternatives, Error Handling
Data: Ensembl, CAMERA, Genbank
Transfer
Compute: Amazon EC2, FutureGrid, Triton Resource, Adhoc Network, Sun Grid Engine

27 bioActors
Set of steps to execute a bioinformatics tool locally or in an external environment
- Locally executable
- Parallelized external execution
- Customizable by the user based on external packages
Tools imported from CloudBioLinux
Tools are evaluated on their computational requirements
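One way to picture a component that runs the same tool either locally or in an external environment is a small dispatcher keyed by execution mode. The Python sketch below is purely illustrative: Kepler actors are built on Kepler's own Java framework, and `ExecutionChoiceSketch`, the mode names, and the trivial upper-casing "tool" are hypothetical stand-ins, not part of bioKepler.

```python
import subprocess
import sys
from typing import Callable, Dict, List

Runner = Callable[[List[str]], List[str]]

# Hypothetical stand-in for an external command-line tool: it upper-cases
# whatever sequences arrive on standard input.
EXTERNAL_SCRIPT = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def run_local(sequences: List[str]) -> List[str]:
    """Run the stand-in tool in the current process."""
    return [seq.upper() for seq in sequences]


def run_external(sequences: List[str]) -> List[str]:
    """Run the stand-in tool as a separate process, as a wrapper would."""
    result = subprocess.run(
        [sys.executable, "-c", EXTERNAL_SCRIPT],
        input="\n".join(sequences),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


class ExecutionChoiceSketch:
    """Dispatch one logical step to the runner registered for a mode."""

    def __init__(self) -> None:
        self._runners: Dict[str, Runner] = {}

    def register(self, mode: str, runner: Runner) -> None:
        self._runners[mode] = runner

    def run(self, mode: str, sequences: List[str]) -> List[str]:
        if mode not in self._runners:
            raise ValueError(f"unknown execution mode: {mode!r}")
        return self._runners[mode](sequences)


def build_actor() -> ExecutionChoiceSketch:
    actor = ExecutionChoiceSketch()
    actor.register("local", run_local)
    actor.register("external", run_external)
    return actor
```

The point of the design is that the workflow refers only to the logical step; which runner executes it is a configuration choice, so `build_actor().run("local", reads)` and `build_actor().run("external", reads)` are interchangeable from the workflow's point of view.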

28 Transparent Execution includes Parallelization Solutions in Distributed Environments
Traditional parallel programming interfaces
- Examples: MPI and OpenMP
- Hard to implement
- Original sequential tools cannot be reused
Parallel job execution
- Examples: SGE and Condor
- Original sequential tools can be reused
- Create small jobs by splitting data or tasks
- Hard to achieve data locality for each job
Data-parallel job execution
- Examples: Hadoop and Stratosphere
- Original sequential tools can be reused
- Support customized and automatic data partition and distribution
- Support data locality for each job through a special distributed file system, HDFS
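The idea shared by the last two approaches, reusing an unmodified sequential tool by splitting its input into small jobs and merging the results, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a thread pool stands in for a real scheduler such as SGE or Hadoop, the GC-content function stands in for a real sequential tool, and all names are hypothetical. It does not model data locality.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Record = Tuple[str, str]        # (FASTA header, sequence)
Result = Tuple[str, float]      # (FASTA header, score)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) records."""
    records: List[Record] = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def partition(records: List[Record], size: int) -> List[List[Record]]:
    """Split the records into consecutive chunks of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def gc_content_tool(chunk: List[Record]) -> List[Result]:
    """Stand-in for an unmodified sequential tool run on one chunk."""
    return [
        (header, (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0)
        for header, seq in chunk
    ]


def run_data_parallel(
    records: List[Record],
    tool: Callable[[List[Record]], List[Result]],
    chunk_size: int,
    workers: int = 4,
) -> List[Result]:
    """Partition the input, run the tool per chunk, merge in input order."""
    chunks = partition(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(tool, chunks))
    return [item for part in partial_results for item in part]
```

Because `pool.map` returns results in the order of its inputs, the merged output keeps the original record order regardless of which chunk finishes first.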

29 Distributed Data-Parallel bioActors
Set of steps to execute a bioinformatics tool in a DDP environment
Customized from the ExecutionChoice actor
Includes:
- Data-parallel patterns, e.g., Map, Reduce, Cross, All-Pairs, etc., to specify data grouping
- I/O to interface with storage
- Data format specifying how to split and join
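The data-parallel patterns named above differ mainly in how they group input data before user code is applied. The Python sketch below illustrates that grouping with plain functions. It is not the Kepler DDP API: the function names are hypothetical, everything runs sequentially in one process, and All-Pairs is interpreted here as every unordered pair within one collection, which the slides do not define.

```python
from collections import defaultdict
from itertools import combinations, product
from typing import Any, Callable, Dict, Iterable, List, Tuple

KeyValue = Tuple[Any, Any]


def map_pattern(fn: Callable[[Any], Iterable[KeyValue]],
                items: Iterable[Any]) -> List[KeyValue]:
    """Map: apply fn to each item independently and collect its pairs."""
    out: List[KeyValue] = []
    for item in items:
        out.extend(fn(item))
    return out


def reduce_pattern(fn: Callable[[Any, List[Any]], Any],
                   pairs: Iterable[KeyValue]) -> Dict[Any, Any]:
    """Reduce: group values by key, then apply fn once per key."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: fn(key, values) for key, values in groups.items()}


def cross_pattern(fn: Callable[[Any, Any], Any],
                  left: Iterable[Any], right: Iterable[Any]) -> List[Any]:
    """Cross: apply fn to every combination of one left and one right item."""
    return [fn(a, b) for a, b in product(left, right)]


def all_pairs_pattern(fn: Callable[[Any, Any], Any],
                      items: Iterable[Any]) -> List[Any]:
    """All-Pairs (as assumed here): fn on each unordered pair of one set."""
    return [fn(a, b) for a, b in combinations(items, 2)]


def emit_2mers(read: str) -> List[KeyValue]:
    """Example map function: emit (2-mer, 1) for every 2-mer in a read."""
    return [(read[i:i + 2], 1) for i in range(len(read) - 1)]


def hamming(a: str, b: str) -> int:
    """Example pairwise function: mismatches between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))
```

For example, `reduce_pattern(lambda key, values: sum(values), map_pattern(emit_2mers, reads))` counts 2-mers across a list of reads, while `cross_pattern` fits query-against-reference comparisons and `all_pairs_pattern` fits within-set distance matrices.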

30 DDP bioActor Usage Model
User: Workflow Developer
bioActor Library: A1, A2, ..., An; DDP Generic; DDP Blast
1. Search
2a. Choose Specific
2b. Choose Generic
2b. Create Sub-Workflow
3. Add to Workflow (DDP Director, Workflow)
4a. Execute -> Results
4b. Add to Larger Workflow
4c. Save in Library

31 Status of bioActors
500+ bioActors are listed under the current bioKepler release; ~40 of them are parallelized.

32 Example bioActors
Alignment: BLAST, BLAT
Profile-Sequence Alignment: PSI-BLAST
Hidden Markov Model: HMMER
Mapping: Bowtie, BWA, Samtools
Multiple Alignment: ClustalW, Muscle
Clustering: CD-HIT, Blastclust
Gene Prediction: Glimmer, Genescan, FragGeneScan
tRNA prediction: tRNA-scan, Meta-RNA
Phylogeny: FastTree, RAxML

33 Example Workflows

34 Current Release
Downloadable as a package at:
A bioKepler VM executable on Amazon EC2, FutureGrid and SDSC Cloud
Builds upon CloudBioLinux, including Bio-Linux and Galaxy
A bioActor template that can be customized for different execution choices, e.g., local vs. Map/Reduce on a specific environment
Example use cases

35 Demo and Questions
WorDS Director: Ilkay Altintas, Ph.D.
