A Distributed Data-Parallel Execution Framework in the Kepler Scientific Workflow System


A Distributed Data-Parallel Execution Framework in the Kepler Scientific Workflow System
Ilkay Altintas and Daniel Crawl, San Diego Supercomputer Center, UC San Diego; Jianwu Wang, UMBC
WorDS.sdsc.edu

Computational and Data Science Workflows: Programmable and Reproducible Scalability
- Real-Time Hazards Management (wifire.ucsd.edu)
- Data-Parallel Bioinformatics (biokepler.org)
- Scalable Automated Molecular Dynamics and Drug Discovery (nbcr.ucsd.edu)
Access and query data, scale computational analysis, increase reuse, save time, energy and money, formalize and standardize.
kepler-project.org | WorDS.sdsc.edu

Kepler is a Scientific Workflow System (www.kepler-project.org)
- A cross-project collaboration initiated August 2003; version 2.5 about to be released
- Builds upon the open-source Ptolemy II framework
- Ptolemy II: a laboratory for investigating design
- Kepler: a problem-solving environment for workflow management
- Kepler = Ptolemy II + X, for scientific workflows

Kepler has been applied to problems in different scientific disciplines (some shown here, and many more):
- Astrophysics, e.g., DIAPL
- Nanotechnology, e.g., ANELLI
- Fusion, e.g., ITER
- Metagenomics, e.g., CAMERA
- Multi-scale biology, e.g., NBCR

So, how can we use Kepler workflows in the context of big data applications, while coupling computing at all scales within a reusable solution?

Application-specific knowledge and questions + Big Data + on-demand compute:
- Allows for data-enabled decision making at scale, using statistics, data mining and graph analytics, but also computational science methods.
- Requires support for experimental work by a multidisciplinary group of experts, and dynamic scalability on many platforms!

Big Data Engineering and Computational Big Data Science
Find data, access data, acquire data, move data, clean data, integrate data, subset data, pre-process data, analyze data, process data, interpret results, summarize results, visualize results, post-process results.
ACCESS - MANAGE - ANALYZE - REPORT
There are many ways to look at the process; not every step is automatable! Build, explore, scale, report.

Build Once, Run Many Times
Big data workflows should support experimental work and dynamic scalability on many platforms. Scalability based on:
- data volume and velocity
- dynamic modeling needs
- highly-optimized HPC codes
- changes in network, storage and computing availability

Scalability in Workflow [figure]: (a) task-level scalability, (b) data-level scalability, (c) pipeline-level scalability over an input data set

Big Batch Data Application Examples
- RAMMCAP: Rapid Analysis of Multiple Metagenomes with Clustering and Annotation
- BNL: Big Data Bayesian Network Ensemble Learning [figure: big data quality evaluation and data partitioning, local ensemble learning (local learners), master ensemble learning (master learner), final BN structure]

Characteristics of the Big Data Applications

Name    | Data Size                                                                        | Tasks                   | Computational Requirements
RAMMCAP | 7~14 million reads (much more with next-generation sequencing (NGS)), 10~100 GB  | 12 bioinformatics tools | Some are computation intensive; some are memory intensive. All run on one node.
BNLA    | 5~100 million records, 10~1000 GB                                                | 3 programs written in R | Run on one node.

Goal: How to easily build efficient Big Data applications?

Distributed Data-Parallel Computing
- A parallel and scalable programming model for Big Data
- Input data is automatically partitioned onto multiple nodes
- Programs are distributed and executed in parallel on the partitioned data blocks
- MapReduce: move the program to the data!
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
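
The map/shuffle/reduce flow described above can be sketched in plain Python. This is a toy single-process simulation of the programming model (word count over two hypothetical partitions), not the Hadoop implementation:

```python
from collections import defaultdict

def map_phase(partition):
    # Map: emit (key, value) pairs from every record in one data block
    for line in partition:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key (done across the network in a real engine)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key
    return {key: sum(values) for key, values in groups.items()}

# Two "partitions" standing in for HDFS blocks on different nodes
partitions = [["big data big compute"], ["big data workflows"]]
pairs = [kv for part in partitions for kv in map_phase(part)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

In a real engine only `map_phase` and `reduce_phase` are user code; partitioning, shuffling and distribution are handled by the framework, which is exactly what "move the program to the data" buys.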

Distributed Data-Parallel (DDP) Patterns
Patterns for data distribution and parallel data processing; a higher-level programming model:
- Moving computation to data
- Good scalability and performance acceleration
- Run-time features such as fault tolerance
- Easier parallel programming than MPI and OpenMP
Images from: http://www.stratosphere.eu/projects/stratosphere/wiki/pactpm

Some Available Big Data Patterns (a.k.a. Distributed Data-Parallel Patterns)
- The executions of the patterns can scale in distributed environments
- The patterns can be applied to different user-defined functions
Images from: http://www.stratosphere.eu

Scalability across platforms: People, Purpose, Process, Platforms

Hadoop
- Open-source implementation of MapReduce
- A distributed file system across compute nodes (HDFS): automatic data partitioning, automatic data replication
- Master and workers/slaves architecture
- Automatic task re-execution for failed tasks

Spark
- Fast Big Data engine; keeps data in memory as much as possible
- Resilient Distributed Datasets (RDDs): evaluated lazily, keep track of lineage for fault tolerance
- More operators than just Map and Reduce
- Can run on YARN (Hadoop v2)
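
The lazy evaluation and lineage tracking of RDDs can be illustrated with a toy Python class. `MiniRDD` is a hypothetical stand-in, vastly simpler than Spark's actual RDD, but it shows the two ideas: transformations are only recorded, and the recorded lineage is what makes recomputation (fault tolerance) possible:

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are recorded lazily
    and only evaluated when an action (collect) is called."""
    def __init__(self, data, lineage=()):
        self._data = data
        self.lineage = lineage  # recorded transformations, used for recovery

    def map(self, fn):
        # No computation happens here; we only extend the lineage
        return MiniRDD(self._data, self.lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._data, self.lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the lineage over the base data. A lost partition
        # could be rebuilt the same way from its parent data.
        out = self._data
        for op, fn in self.lineage:
            if op == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = MiniRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```

Until `collect()` runs, no data is touched; this is what lets Spark pipeline operators and keep intermediate results in memory.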

The scalable Process should be Programmable!
People, Purpose, Process, Platforms, Programmability: reusable and reproducible programming interfaces to systems middleware, analytical tools, and end-user and reporting environments.

Challenge Analysis
- How to easily apply the Big Data patterns in a workflow?
- How to parallelize legacy tools for Big Data?
- How to pick the right pattern(s) for each specific task/tool?
- How to run the same process on top of different Big Data engines, such as Hadoop, Spark and Stratosphere (Apache name: Flink)?

Support Big Data Patterns in Kepler
- We define a separate DDP (Distributed Data-Parallel) task/actor for each pattern
- These DDP actors partition input data and process each partition separately
- User-defined functions are described as sub-workflows of the DDP actors
- DDP director: executes DDP workflows on top of Big Data engines

RAMMCAP in the Kepler Workflow System
- Visual programming
- Parallel execution of the DDP sub-workflows
- Existing actors can be easily reused for new tools
(a) Top-level workflow; (b) sub-workflow for tRNAscan-SE; (c) sub-workflow for Map

Kepler Workflow for the Bayesian Network Learning Application
Big Data actors and non-Big-Data actors work seamlessly within a workflow through hierarchical modeling.
(a) Top-level workflow; (b) network learner sub-workflow; (c) sub-workflow for Map; (d) sub-workflow for Reduce

Legacy Tool Parallelization for Big Data
Black-box approach (we use this approach):
- Run a tool directly; wrap the tool with Big Data techniques
- Can quickly convert a tool into a parallelized one
White-box approach:
- Investigate the source code of a legacy tool and try to re-implement it using Big Data techniques
- Time-consuming; often tightly coupled with a specific Big Data engine
- Could find more parallelization opportunities
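
A minimal sketch of the black-box approach in Python. The "legacy tool" here is a hypothetical stand-in (a tiny command-line program that uppercases its stdin); the point is that the tool binary is unchanged, and only its invocation is wrapped and parallelized over data partitions:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a legacy command-line tool. In the workflows
# described here the same wrapping applies to unmodified bioinformatics
# executables.
LEGACY_TOOL = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read().upper())"]

def run_on_partition(partition):
    # One independent tool execution per data partition (black box:
    # we only see stdin/stdout, never the tool's internals)
    result = subprocess.run(LEGACY_TOOL, input=partition,
                            capture_output=True, text=True, check=True)
    return result.stdout

partitions = ["acgt", "ttga"]  # e.g. blocks of sequence reads
with ThreadPoolExecutor() as pool:
    outputs = list(pool.map(run_on_partition, partitions))
print(outputs)  # ['ACGT', 'TTGA']
```

A DDP engine does the same thing at cluster scale: it ships the partitions to nodes and launches one black-box execution per partition.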

Performance Factors of Legacy Tool Parallelization
Data partitioning:
- By default, data is partitioned into 64 MB blocks in Hadoop
- Load balancing is more important here, because a legacy tool execution can take a long time even with small input
Data size for each legacy tool execution:
- New Big Data analytics applications often try to process minimal input data for each execution
- Because of legacy tool execution overhead, performance is better if each execution processes relatively large input
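
The second factor, batching more records into each tool execution to amortize the per-invocation startup cost, can be sketched as follows (the batch size is purely illustrative; as the slides note, the best value depends on the application and must be measured):

```python
def batches(records, per_exec):
    # Group records so each legacy-tool execution processes a larger
    # input, amortizing its startup overhead; fewer, larger invocations
    for i in range(0, len(records), per_exec):
        yield records[i:i + per_exec]

reads = [f"read{i}" for i in range(10)]
# 10 reads at 4 per execution -> 3 tool invocations instead of 10
print([len(b) for b in batches(reads, 4)])  # [4, 4, 2]
```

The trade-off is against load balancing: very large batches mean fewer parallel units, so a slow batch can leave cores idle.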

Experiments on Performance Factors [figures: total execution time (minutes) vs. data partition number with 48, 64 and 80 cores; workflow performance with different sequence numbers per execution (log scale)]
These findings can be used to analyze the best configuration for a Big Data application.

Big Data Pattern Selection
- A Big Data task can often be executed using different Big Data patterns
- CloudBurst is a parallel read-mapping tool in bioinformatics, implemented using MapReduce
- We re-implement CloudBurst using MapCoGroup and MapMatch, and find they are easier to build
- Our analysis and experiments show that no pattern is always the best in terms of performance
- Performance depends on input data characteristics: the balance of the two input datasets, and the sparseness of the values for each key
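
The semantic difference between the two-input patterns can be sketched in Python. CoGroup hands the user function both complete value lists for a key, while Match produces one output per matching value pair. The data below is a toy example (hypothetical reference/read seeds), not CloudBurst's actual seed-and-extend logic:

```python
from collections import defaultdict

def cogroup(left, right):
    # CoGroup: for each key, the user function sees both full value lists
    l, r = defaultdict(list), defaultdict(list)
    for k, v in left:
        l[k].append(v)
    for k, v in right:
        r[k].append(v)
    return {k: (l[k], r[k]) for k in set(l) | set(r)}

def match(left, right):
    # Match: one result per pair of values that share a key (an equi-join)
    grouped = cogroup(left, right)
    return [(k, lv, rv) for k, (lvs, rvs) in grouped.items()
            for lv in lvs for rv in rvs]

refs = [("ACG", "ref@10"), ("ACG", "ref@42")]   # seed -> reference position
reads = [("ACG", "read7")]                       # seed -> read id
print(len(match(refs, reads)))  # 2 seed matches
```

This is why input characteristics matter for pattern choice: with many values per key, Match materializes the full cross product per key, while CoGroup lets the user function decide how to combine the two lists.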

Kepler Workflow for the CloudBurst Application
(a) Top-level workflow; (b) using MapReduce for CloudBurst; (c) using MapMatch for CloudBurst; (d) using MapCoGroup for CloudBurst

Execution Choices for the CloudBurst Application

Big Data Process Execution
- Adaptability: our DDP director can run the same process/workflow on different Big Data engines
- The director transforms the workflow into jobs based on each Big Data engine's specification
- For Hadoop, the CoGroup, Match and Cross patterns are not supported directly; the director converts them into MapReduce jobs
- Consecutive Map patterns are automatically merged before execution to improve performance
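
The conversion of a Match pattern into a plain MapReduce job can be sketched as a tag-and-join: the map phase tags each record with its source dataset, and the reduce phase pairs tagged values that share a key. This is a standard reduce-side join sketch, not necessarily the director's exact translation:

```python
from collections import defaultdict

def map_tag(dataset, tag):
    # Map: tag every record with its dataset of origin ("L" or "R")
    for key, value in dataset:
        yield key, (tag, value)

def reduce_join(key, tagged_values):
    # Reduce: for one key, pair each left value with each right value
    left = [v for t, v in tagged_values if t == "L"]
    right = [v for t, v in tagged_values if t == "R"]
    for lv in left:
        for rv in right:
            yield key, (lv, rv)

left, right = [("k1", 1), ("k2", 2)], [("k1", "a")]
groups = defaultdict(list)  # the shuffle: group tagged pairs by key
for k, v in list(map_tag(left, "L")) + list(map_tag(right, "R")):
    groups[k].append(v)
joined = [p for k, vs in groups.items() for p in reduce_join(k, vs)]
print(joined)  # [('k1', (1, 'a'))]
```

Cross and CoGroup can be expressed over MapReduce in a similar spirit, which is what makes Hadoop a viable target even though it offers only Map and Reduce natively.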

Engine Configuration of the DDP Director: Available Engines

Overhead Experiments [figures: execution time and Kepler overhead for the CloudBurst application with MapReduce, MapMatch and MapCoGroup, for small and large input data]

Conclusions
Usability/programmability:
- Easy Big Data application construction through visual programming
- The same Big Data application can execute on different Big Data engines via the DDP director
Execution optimization:
- The findings on legacy tool parallelization and Big Data pattern selection can help achieve optimal execution
- The additional workflow-system layer brings only minor overhead

More Research Challenges
End-to-end performance prediction for Big Data applications/workflows (how long to run):
- Knowledge-based: analyze performance using profiling techniques and dependency analysis
- Data-driven: predict performance based on execution history (provenance) using machine learning techniques
On-demand resource provisioning and scheduling for Big Data applications (where and how to run):
- Find the best resource allocation based on execution objectives and performance predictions
- Find the best workflow and task configuration on the allocated resources

Questions?
WorDS Director: Ilkay Altintas, Ph.D. Email: altintas@sdsc.edu