A Cloud-based Dynamic Workflow for Mass Spectrometry Data Analysis


A Cloud-based Dynamic Workflow for Mass Spectrometry Data Analysis
Ashish Nagavaram, Gagan Agrawal, Michael A. Freitas, Kelly H. Telu (The Ohio State University); Gaurang Mehta, Rajiv G. Mayani, Ewa Deelman (USC Information Sciences Institute)
http://pegasus.isi.edu | deelman@isi.edu

The Problem
- Execute a compute-intensive, data-parallel application within user-given time constraints
- The application is sequential
- Want to outsource the computation to the cloud

Outline
- Application
- Benefits of cloud computing
- Approach
  - Parallelize the application
  - Use of workflow technologies
  - Dynamic resource provisioning
- Evaluation on EC2
- Conclusions

Application: Mass Spectrometry
- Searches proteins and peptides from tandem mass spectrometry data
- Uses a protein database
- Sensitive probabilistic scoring model
[Flowchart: sequences are digested from the theoretical protein database; for each sequence, if it has not been searched before, a full-scan search finds matching peptides, otherwise it is not added to the final result; the MS/MS data input file passes through a noise-filtering algorithm that clears insignificant peptides; statistical analysis then generates the results]

Basic characteristics of the application
- Highly data parallel
- Performance depends on the data set, thus not known a priori
- When partitioning the data, results must be combined afterwards
- Opportunity/challenge: adapt the execution environment to the specific problem

Workflows and Clouds
Benefits:
- User control over the environment
- Pay-as-you-go model
- On-demand provisioning / elasticity
- SLAs, reliability, maintenance
Drawbacks:
- Complexity (more control = more work)
- Cost
- Performance
- Resource availability
- Vendor lock-in
[Photo: Google's container-based data center in Belgium, http://www.datacenterknowledge.com/]

Approach
- Parallelize the application
  - Use the workflow paradigm to structure the application
  - Use the Pegasus Workflow Management System to manage the workflow execution
- Use cloud computing for execution
- Adapt the cloud to the application
  - Use a flexible resource provisioning system (Wrangler) to acquire the necessary resources
- Evaluate the parallelization (multi-core machine) and the adaptation (EC2 cloud)

Parallelize the application
[Diagram: a Python script reads the configuration file and splits the input file into split1 ... splitN; each split is processed by a massmatrix instance against the input database, and a merge step combines the results]
- The merge is a sequential phase (see the sketch below)
- Complex data structures (a matrix of matrices)
- Need to re-index while maintaining both local and global indices
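To make the split/merge structure concrete, here is a minimal sketch of the kind of driver the slide describes; the file names, the line-based splitting, and the `massmatrix` command-line invocation are our illustrative assumptions, not the authors' actual script.

```python
import subprocess

def split_input(input_path, n_splits):
    """Split the MS/MS input file into n_splits roughly equal parts."""
    with open(input_path) as f:
        records = f.readlines()
    chunk = (len(records) + n_splits - 1) // n_splits
    paths = []
    for i in range(n_splits):
        part = "split_%d.txt" % i
        with open(part, "w") as out:
            out.writelines(records[i * chunk:(i + 1) * chunk])
        paths.append(part)
    return paths

def run_massmatrix(split_path, database, config):
    """Run one massmatrix instance on a single split (hypothetical CLI)."""
    result = split_path + ".result"
    subprocess.check_call(["massmatrix", config, database, split_path, result])
    return result

def merge(result_paths, final_path):
    """Sequential merge phase. The real application must also re-index its
    complex structures (a matrix of matrices) so per-split local indices
    map onto a single global index; here we just concatenate."""
    with open(final_path, "w") as out:
        for path in result_paths:
            with open(path) as f:
                out.write(f.read())

if __name__ == "__main__":
    parts = split_input("msms_input.txt", 8)
    results = [run_massmatrix(p, "protein.db", "config.txt") for p in parts]
    merge(results, "final_results.txt")
```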

Pegasus Workflow Management System
- Developed since 2001; a collaboration between USC and the Condor Team at UW Madison (includes DAGMan)
- Used by a number of applications in a variety of domains (astronomy, bioinformatics, earthquake science, gravitational-wave physics, etc.)
- Provides reliability: can retry computations from the point of failure
- Provides scalability: can handle large data and many computations (kilobytes to terabytes of data, 1 to 10^6 tasks)
- Automatically captures provenance information
- Can run on resources distributed among institutions: laptop, campus cluster, Grid, Cloud

Pegasus Workflow Management System
- Provides a portable and reusable workflow description
- Enables the construction of complex workflows from computational blocks
- Workflows can be composed via Java, Perl, and Python APIs, or via systems such as Triana and Wings (see the example below)
- Can be incorporated into portals
- Infers data transfers and data registrations
- Lives in user space
- Provides correct, scalable, and reliable execution
- Enforces dependencies between tasks
- Progresses as far as possible in the face of failures
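As an illustration of composing this workflow programmatically, here is a minimal sketch using the Pegasus DAX3 Python API; the executable names (`split`, `massmatrix`, `merge`), the split count, and the file names are our assumptions for this example, not the authors' actual workflow.

```python
from Pegasus.DAX3 import ADAG, File, Job, Link

# Abstract workflow: one split job fans out to N massmatrix jobs,
# whose results are combined by a single merge job.
N_SPLITS = 4
dax = ADAG("massmatrix-workflow")

ms_input = File("msms_input.txt")
split = Job(name="split")
split.addArguments(ms_input, str(N_SPLITS))
split.uses(ms_input, link=Link.INPUT)
dax.addJob(split)

final = File("final_results.txt")
merge = Job(name="merge")
merge.uses(final, link=Link.OUTPUT)
dax.addJob(merge)

for i in range(N_SPLITS):
    part = File("split_%d.txt" % i)
    result = File("result_%d.txt" % i)
    split.uses(part, link=Link.OUTPUT)

    mm = Job(name="massmatrix")
    mm.addArguments(part, result)
    mm.uses(part, link=Link.INPUT)
    mm.uses(result, link=Link.OUTPUT)
    dax.addJob(mm)

    merge.addArguments(result)
    merge.uses(result, link=Link.INPUT)
    dax.depends(parent=split, child=mm)
    dax.depends(parent=mm, child=merge)

# Write the abstract workflow for Pegasus to plan and execute.
with open("massmatrix.dax", "w") as f:
    dax.writeXML(f)
```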

Executable Workflow Generated by Pegasus
Pegasus:
- Selects an execution site
- Selects a data archive
- Creates a workflow that:
  - Creates a sandbox on the execution site
  - Stages in data
  - Invokes the computation
  - Stages out data
  - Registers data and cleans up the execution site
- Captures provenance information
- Performs other optimizations

[Diagram: the submit host]

Outline
- Application
- Benefits of cloud computing
- Approach
  - Parallelize the application
  - Use of workflow technologies
  - Dynamic resource provisioning
- Evaluation on EC2
- Conclusions

A way to make it work
Pegasus makes use of available resources, but cannot control them.
[Diagram: Pegasus WMS, running on a local resource, takes a work definition and data from storage and maps the work onto grid and cloud resources]

A way to make it work
Pegasus makes use of available resources, but cannot control them.
[Diagram: as above, with a resource provisioner added; Pegasus WMS sends resource requests to the provisioner, which acquires nodes from grids and clouds into a virtual resource pool where the work runs]

Building a Virtual Cluster on the Cloud
- Clouds provide resources, but the software is up to the user
- Running on multiple nodes may require cluster services (e.g., a scheduler)
- Dynamically configuring such systems is not trivial
- Workflows need to communicate data, often through files, so they need filesystems (or must stage data in/out for each task)
- Adapt the cluster on demand

Wrangler (Gideon Juve, USC/ISI)
- A service for provisioning and configuring virtual clusters
- The user specifies the virtual cluster configuration, and Wrangler provisions the nodes and configures them according to the user's requirements
- Users can specify custom plugins for nodes by writing simple scripts
- XML format for describing virtual clusters; support for multiple cloud providers, node dependencies and groups, and automatic distribution of configuration files and scripts

Wrangler components (CloudCom 2011); an illustrative plugin sketch follows this list:
- Clients: send requests to the coordinator to launch, query, and terminate deployments
- Coordinator: a web service that manages application deployments; it accepts requests from clients, provisions nodes from cloud providers, collects information about the state of a deployment, and acts as an information broker
- Agents: run on the VMs and manage VM configuration and monitor health; they collect information and report the state of the node to the coordinator, configure the node with the software and services specified by the user, and monitor the node for failures
- Plugins: user-defined scripts that implement the behavior of a node; invoked by the agent to configure and monitor a node; each node can have multiple plugins
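A minimal sketch of what such a node plugin might look like, assuming a start/stop/status command convention and parameters passed via environment variables; these details are our assumptions for illustration, not Wrangler's documented interface.

```python
#!/usr/bin/env python
# Hypothetical Wrangler-style node plugin: the agent invokes the script
# with a command argument; user-supplied parameters arrive as env vars.
import os
import subprocess
import sys

def start():
    # Mount the NFS export named by the (assumed) NFS_SERVER parameter.
    server = os.environ.get("NFS_SERVER", "nfs-server")
    subprocess.check_call(["mount", "-t", "nfs", server + ":/mnt", "/mnt"])

def stop():
    subprocess.check_call(["umount", "/mnt"])

def status():
    # A non-zero exit code signals a node failure to the agent.
    sys.exit(0 if os.path.ismount("/mnt") else 1)

if __name__ == "__main__":
    {"start": start, "stop": stop, "status": status}[sys.argv[1]]()
```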

A cloud Condor/NFS configuration
[Diagram: a Condor/NFS virtual cluster in the cloud; the submit host can be in or out of the cloud]

Parallel Execution
[Plot: speedup versus core count on a multi-core cluster; 2.9x speedup on 4 cores]
- Hardware: an 8-core Intel Xeon node with 6 GB of RAM
- The theoretical database used was 20 MB
- The code was run on 6 different datasets (~50,000 records)
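As a rough sanity check (our own back-of-the-envelope calculation, not from the slides), Amdahl's law with serial fraction s applied to the reported 2.9x speedup on 4 cores implies that roughly 13% of the runtime is sequential, which is consistent with the sequential merge phase:

\[
S(n) = \frac{1}{s + (1-s)/n}, \qquad
S(4) = 2.9 \;\Rightarrow\; s + \frac{1-s}{4} = \frac{1}{2.9} \approx 0.345
\;\Rightarrow\; s \approx 0.13
\]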

Approach to adaptation
1. Create a parallel workflow with P splits (P >> N)
2. Submit it to Pegasus
3. Use Wrangler to acquire N resources
4. Release splits to the available resources
5. Take the max time of the N splits and compare it with T_constraint
6. Determine the additional resources needed (M)
7. Use Wrangler to acquire M resources

[Diagram: the submit host, with a performance monitor, drives Wrangler to provision resources on Amazon EC2]

Determining the number of resources needed (see the sketch below)
- The user specifies the deadline T_constraint
- Capture performance information in the first N executions (T_per_split)
- T_remaining = T_constraint - (2 x T_per_split)
- Calculate the cumulative remaining execution time: T_execution_predicted = T_per_split x (split_count - 2N)
- Estimate the needed cores: Nodes_required = max(ceil(T_execution_predicted / T_remaining), N)
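A minimal sketch of this estimation logic as we read it from the slide; the function name, the ceiling-based rounding, and the example numbers are our assumptions.

```python
import math

def nodes_required(t_constraint, t_per_split, split_count, n_initial):
    """Estimate the total cores needed to finish all splits by the deadline.

    t_constraint: user-specified deadline (seconds)
    t_per_split:  average time per split, measured on the first executions
    split_count:  total number of splits in the workflow
    n_initial:    cores acquired initially (N)
    """
    # Time left after the two measurement rounds on the initial N cores.
    t_remaining = t_constraint - 2 * t_per_split
    # Cumulative work still to do: splits not processed in those rounds.
    t_execution_predicted = t_per_split * (split_count - 2 * n_initial)
    # Cores needed to fit the predicted work into the remaining time,
    # never fewer than what we already have.
    return max(math.ceil(t_execution_predicted / t_remaining), n_initial)

# Example: 400 splits, 8 initial cores, 100 s per split, 20-minute deadline.
n = nodes_required(t_constraint=1200, t_per_split=100,
                   split_count=400, n_initial=8)
print("cores needed:", n)  # 39; the additional cores to acquire are M = n - 8
```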

Execution on EC2
- The algorithm chooses how many cores to add
[Plot: running a dataset on EC2]

Conclusions
- Presented a framework for the dynamic execution of scientific workflows
- A user-specified time constraint can be used to drive the allocation of resources
- Used real-time performance information to choose the number of resources
Possible extensions:
- A more dynamic approach that monitors the execution over the lifetime of the application
- Quality-of-results vs. time trade-offs
- Including other criteria for resource acquisition (e.g., cost)