Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications


Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics, Pervasive Technology Institute Indiana University

Introduction The Fourth Paradigm: data-intensive scientific discovery (DNA sequencing machines, the LHC). Loosely coupled problems: BLAST, Monte Carlo simulations, many image-processing applications, parametric studies. Cloud platforms: Amazon Web Services, Azure Platform. MapReduce frameworks: Apache Hadoop, Microsoft DryadLINQ.

Cloud Computing On-demand computational services over the web, matching the spiky compute needs of scientists: horizontal scaling at no additional cost, increased throughput. Cloud infrastructure services: storage, messaging, tabular storage. Cloud-oriented service guarantees; virtually unlimited scalability.

Amazon Web Services Elastic Compute Cloud (EC2): infrastructure as a service. Cloud storage (S3), queue service (SQS).

Instance Type          Memory    EC2 Compute Units  Actual CPU Cores   Cost per Hour
Large                  7.5 GB    4                  2 x (~2 GHz)       $0.34
Extra Large            15 GB     8                  4 x (~2 GHz)       $0.68
High CPU Extra Large   7 GB      20                 8 x (~2.5 GHz)     $0.68
High Memory 4XL        68.4 GB   26                 8 x (~3.25 GHz)    $2.40
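A quick way to compare these instance types is cost per EC2 compute unit per hour. The sketch below (illustrative; names and structure are my own, figures taken from the table) shows why High CPU Extra Large is the most economical choice for compute-bound work:

```python
# Cost per EC2 compute unit (ECU) per hour, using the figures from the table.
# The dict layout and helper names are illustrative, not part of any AWS API.
instances = {
    "Large":                {"ecu": 4,  "cost_per_hour": 0.34},
    "Extra Large":          {"ecu": 8,  "cost_per_hour": 0.68},
    "High CPU Extra Large": {"ecu": 20, "cost_per_hour": 0.68},
    "High Memory 4XL":      {"ecu": 26, "cost_per_hour": 2.40},
}

def cost_per_ecu(name):
    """Dollars per hour per compute unit -- lower is more economical."""
    i = instances[name]
    return i["cost_per_hour"] / i["ecu"]

cheapest = min(instances, key=cost_per_ecu)
print(cheapest, round(cost_per_ecu(cheapest), 3))  # High CPU Extra Large 0.034
```

This matches the per-instance findings later in the talk, where HCXL comes out as the most economical type for Cap3 and GTM interpolation.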

Microsoft Azure Platform Windows Azure Compute: platform as a service. Azure Storage: queues, blob storage.

Instance Type  CPU Cores  Memory   Local Disk Space  Cost per Hour
Small          1          1.7 GB   250 GB            $0.12
Medium         2          3.5 GB   500 GB            $0.24
Large          4          7 GB     1000 GB           $0.48
Extra Large    8          15 GB    2000 GB           $0.96

Classic cloud architecture
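In the classic cloud architecture used here, tasks are posted to a cloud queue (SQS or Azure Queue) and a pool of worker instances independently pulls and processes them. A minimal sketch of that pull-based pattern, with Python's in-memory queue standing in for the cloud queue and a trivial computation standing in for running Cap3 on one input file:

```python
import queue
import threading

# Illustrative sketch only: queue.Queue stands in for SQS / Azure Queue,
# and squaring a number stands in for processing one input file.
task_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    """Each worker instance polls the queue and processes tasks independently."""
    while True:
        try:
            task = task_queue.get(timeout=1)  # real workers poll the cloud queue
        except queue.Empty:
            return                            # no work left; instance goes idle
        result = task * task                  # stand-in for the real computation
        with results_lock:
            results.append(result)
        task_queue.task_done()                # ack; an unacked task would reappear
                                              # after a visibility timeout

for t in range(8):
    task_queue.put(t)

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The queue's visibility-timeout semantics are what give this architecture its fault tolerance: a task pulled by a worker that then fails simply becomes visible again and is re-executed by another worker, as noted in the comparison table below.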

MapReduce General-purpose massive data analysis in brittle environments (commodity clusters, clouds), with fault tolerance and ease of use. Implementations: Apache Hadoop (with HDFS), Microsoft DryadLINQ.

MapReduce Architecture [Diagram: an input data set in HDFS is split into data files; each Map() task invokes the executable on one file; an optional Reduce phase combines the outputs; results are written back to HDFS.]
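For the pleasingly parallel workloads in this talk, the flow in the diagram reduces to independent map tasks over files plus an optional combine step. A minimal sketch (function names and the length-based "computation" are illustrative stand-ins, not the actual Cap3 invocation):

```python
from functools import reduce

def map_task(input_file):
    """Stand-in for one map task invoking the executable on one input file."""
    return len(input_file)  # placeholder computation

input_files = ["seq_a.fasta", "seq_b.fasta", "seq_c.fa"]

# Map phase: every task is independent, so all can run in parallel.
map_outputs = [map_task(f) for f in input_files]

# Optional reduce phase: combine per-file outputs. For Cap3 this step is
# unnecessary, since outputs can be collected independently.
total = reduce(lambda a, b: a + b, map_outputs, 0)

print(map_outputs, total)  # [11, 11, 8] 30
```

In the real frameworks the map tasks are scheduled across cluster nodes with data locality and failed tasks are re-executed; this sketch only shows the data flow.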

Programming patterns and runtime comparison

Programming patterns — AWS/Azure: independent job execution; Hadoop: MapReduce; DryadLINQ: DAG execution, MapReduce + other patterns.
Fault tolerance — AWS/Azure: task re-execution based on a timeout; Hadoop: re-execution of failed and slow tasks; DryadLINQ: re-execution of failed and slow tasks.
Data storage — AWS/Azure: S3 / Azure Storage; Hadoop: HDFS parallel file system; DryadLINQ: local files.
Environments — AWS/Azure: EC2/Azure, local compute resources; Hadoop: Linux cluster, Amazon Elastic MapReduce; DryadLINQ: Windows HPCS cluster.
Ease of programming — EC2: **, Azure: ***; Hadoop: ****; DryadLINQ: ****.
Ease of use — EC2: ***, Azure: **; Hadoop: ***; DryadLINQ: ****.
Scheduling & load balancing — AWS/Azure: dynamic scheduling through a global queue, good natural load balancing; Hadoop: data-locality- and rack-aware dynamic task scheduling through a global queue, good natural load balancing; DryadLINQ: data-locality- and network-topology-aware scheduling, static task partitions at the node level, suboptimal load balancing.

Performance Metrics: parallel efficiency; per-core per-computation time.
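These two metrics can be written down concretely. A small sketch using the standard definitions (the function names and example numbers are illustrative, not from the talk): parallel efficiency is E = T1 / (p * Tp), and per-core per-computation time normalizes runtime so clusters of different sizes can be compared.

```python
def parallel_efficiency(t_serial, cores, t_parallel):
    """Standard parallel efficiency: E = T1 / (p * Tp)."""
    return t_serial / (cores * t_parallel)

def per_core_time(t_parallel, cores, n_computations):
    """Core-seconds consumed per unit of work (e.g. per file processed)."""
    return (t_parallel * cores) / n_computations

# Illustrative numbers only: a 1000 s sequential job finishing in 70 s on 16 cores.
e = parallel_efficiency(1000.0, 16, 70.0)
print(round(e, 3))  # 0.893
```

An efficiency near 1.0 means the cores are almost fully utilized; the talk's pleasingly parallel workloads approach this because the map tasks share no data.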

Cap3 Sequence Assembly Assembles DNA sequences by aligning and merging sequence fragments to construct whole-genome sequences; motivated by the increased availability of DNA sequencers. A single input file ranges from hundreds of KBs to several MBs. Outputs can be collected independently; no complex reduce step is needed.

[Chart: Sequence assembly performance with different EC2 instance types — compute time (s), amortized compute cost, and per-hour compute cost ($).]

Sequence Assembly in the Clouds [Charts: Cap3 parallel efficiency; Cap3 per-core per-file (458 reads per file) time to process sequences.]

Cost to assemble/process 4096 FASTA files*

Amazon AWS total: $11.19
  Compute: 1 hour x 16 HCXL ($0.68 x 16) = $10.88
  10,000 SQS messages = $0.01
  Storage, per GB per month = $0.15
  Data transfer out, per GB = $0.15

Azure total: $15.77
  Compute: 1 hour x 128 Small ($0.12 x 128) = $15.36
  10,000 queue messages = $0.01
  Storage, per GB per month = $0.15
  Data transfer in/out, per GB = $0.10 + $0.15

Tempest (amortized): $9.43
  24 cores x 32 nodes, 48 GB per node
  Assumptions: 70% utilization, written off over 3 years, including support

* ~1 GB / 1,875,968 reads (458 reads x 4096 files)
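The arithmetic behind the AWS and Azure totals can be checked directly. A small sketch reproducing the slide's figures (prices as listed; the dict structure is my own):

```python
# Line items exactly as listed on the slide, in dollars.
aws = {
    "compute: 1 h x 16 HCXL @ $0.68": 16 * 0.68,
    "10,000 SQS messages":            0.01,
    "storage, 1 GB-month":            0.15,
    "data transfer out, 1 GB":        0.15,
}
azure = {
    "compute: 1 h x 128 Small @ $0.12": 128 * 0.12,
    "10,000 queue messages":            0.01,
    "storage, 1 GB-month":              0.15,
    "data transfer in, 1 GB":           0.10,
    "data transfer out, 1 GB":          0.15,
}

aws_total = round(sum(aws.values()), 2)
azure_total = round(sum(azure.values()), 2)
print(aws_total, azure_total)  # 11.19 15.77
```

Note that compute dominates both bills; the queue, storage, and transfer charges are almost negligible at this scale, which is why instance-type selection matters so much.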

GTM & MDS Interpolation Find an optimal user-defined low-dimensional representation of data in a high-dimensional space; used for visualization. Multidimensional Scaling (MDS): based on pairwise proximity information. Generative Topographic Mapping (GTM): a Gaussian probability density model in vector space. Interpolation: out-of-sample extensions designed to process much larger numbers of data points with a minor trade-off in approximation.

[Chart: GTM interpolation performance with different EC2 instance types — compute time (s), amortized compute cost, and per-hour compute cost ($).] EC2 HM4XL gives the best performance; EC2 HCXL is the most economical; EC2 Large is the most efficient.

Dimension Reduction in the Clouds - GTM Interpolation GTM interpolation parallel efficiency; time per core to process 100k data points, on 26.4 million PubChem data points. DryadLINQ on 16-core machines with 16 GB; Hadoop on 8-core machines with 48 GB; Azure Small instances with 1 core and 1.7 GB.

Dimension Reduction in the Clouds - MDS Interpolation DryadLINQ on a 32-node x 24-core cluster with 48 GB per node; Azure using Small instances.

Next Steps AzureMapReduce AzureTwister

AzureMapReduce SWG [Chart: SWG pairwise distance for 10k sequences — alignment time (ms) per alignment per instance vs. number of Azure Small instances (0-160).]

Conclusions Clouds offer attractive computing paradigms for loosely coupled scientific applications. Both the infrastructure-based models and the MapReduce-based frameworks delivered good parallel efficiencies, given sufficiently coarse-grained task decompositions. The higher-level MapReduce paradigm offered a simpler programming model. Selecting an instance type that suits your application can yield significant time and monetary advantages.

Acknowledgements SALSA Group (http://salsahpc.indiana.edu/): Jong Choi, Seung-Hee Bae, Jaliya Ekanayake, and others. Chemical informatics partners: David Wild, Bin Chen. Amazon Web Services for AWS compute credits. Microsoft Research for technical support on Azure & DryadLINQ.

Questions? Thank You!!