Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

Today (2014): PetaScale Computing, 34 PFlops, 3 million cores
Near future (~2020): Exascale Computing, 1000 PFlops, 1 billion cores
http://s.top500.org/static/lists/2014/06/top500_201406_poster.pdf
http://users.ece.gatech.edu/mrichard/exascalecomputingstudyreports/ecss%20report%20101909.pdf

Manages resources: compute, storage, network
Job scheduling/management: resource allocation, job launch
Data management: data movement, caching

PetaScale -> Exascale (10^18 Flops, 1B cores)

Workload diversity:
Traditional large-scale High Performance Computing (HPC) jobs
HPC ensemble runs
Many-Task Computing (MTC) applications: Medical Image Processing (functional Magnetic Resonance), Chemistry Domain (MolDyn), Molecular Dynamics (DOCK, production runs in drug design), Economic Modeling (MARS), Astronomy Domain (Montage), Data Analytics (Sort and WordCount)

Supercomputing uses and computers involved, by decade:
1970s: Weather forecasting, aerodynamic research (Cray-1)
1980s: Probabilistic analysis, radiation shielding modeling (CDC Cyber)
1990s: Brute force code breaking (EFF DES cracker)
2000s: 3D nuclear test simulations as a substitute for legal conduct of the Nuclear Non-Proliferation Treaty (ASCI Q)
2010s: Molecular Dynamics Simulation (Tianhe-1A)

State-of-the-art Resource Managers:
Centralized paradigm: job scheduling
Large-scale HPC resource managers: SLURM, PBS, Moab, Torque, Cobalt
HTC resource managers: Condor, Falkon, Mesos, YARN

Next-generation Resource Managers:
Distributed paradigm
Scalable, efficient, reliable

http://elib.dlr.de/64768/1/ensim_-_ensemble_simulation_on_hpc_computes_en.pdf

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

Scalability: system scale is increasing; workload size is increasing; processing capacity needs to increase
Efficiency: allocating resources fast; making fast enough scheduling decisions; maintaining a high system utilization
Reliability: still functioning well under failures

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

[Architecture diagrams: four scheduling architectures compared.
MTC centralized scheduling: Falkon (clients submit to a single scheduler that dispatches to executors).
MTC distributed scheduling: MATRIX (every fully-connected compute node runs a scheduler, an executor, and a KVS server).
HPC centralized scheduling: Slurm (a single controller and KVS server managing compute daemons).
HPC distributed scheduling: Slurm++ (fully-connected controllers and KVS servers managing compute daemons).]

[Scheduler specification diagram: each scheduler keeps a wait queue, a dedicated local ready queue, a shared work-stealing ready queue, and a complete queue; programs P1 and P2 move tasks between the queues, and executing threads T1 to T4 run tasks from the ready queues. Details follow below.]

[MATRIX architecture diagram: clients submit tasks to schedulers; each fully-connected compute node runs a scheduler, an executor, and a KVS server.]

Fine-grained Workloads
Load Balancing
Data-Aware Scheduling

[Plot: cumulative distribution of task execution time (0.01 s to 1000 s) for an MTC application trace.]
MTC application trace: 17-month period on an IBM BG/P machine; minimum length 0 sec, maximum length 1470 sec, median length 30 sec, average length 95 sec
MTC application DAGs: Directed Acyclic Graphs (DAG); vertices are discrete tasks, edges are data flows; a task can have parents and children

Load balancing: distribute workloads as evenly as possible; difficult because of each scheduler's local view and the dynamic change of load.
[Diagram: schedulers 1 to 4 balancing load, with idle schedulers stealing tasks from overloaded ones and overloaded schedulers sending tasks.]
Work stealing: a distributed load balancing technique. Popular in shared-memory machines at the thread level, where starved processors steal tasks from overloaded ones; we apply it at the node level in a distributed environment.
Various parameters: number of tasks to steal, number of neighbors, static vs. dynamic neighbors (dynamic multi-random neighbor selection), polling interval (exponential back-off). A minimal sketch of the stealing loop follows.
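The block below is a minimal, single-threaded sketch (in C++, the language MATRIX is written in) of such a stealing loop: dynamic random neighbor selection, stealing from the most loaded neighbor, and an exponentially backed-off polling interval. It is only an illustration under assumed names (Scheduler, pick_neighbors, work_stealing_loop), not MATRIX's actual code; a real implementation needs locking or message passing around the queues.

#include <algorithm>
#include <chrono>
#include <cmath>
#include <deque>
#include <random>
#include <string>
#include <thread>
#include <vector>

struct Scheduler {
    std::deque<std::string> ready_queue;               // shared (work-stealing) ready queue
    size_t num_tasks() const { return ready_queue.size(); }
};

// Pick k dynamic random neighbors out of n schedulers (excluding ourselves).
std::vector<size_t> pick_neighbors(size_t self, size_t n, size_t k, std::mt19937 &rng) {
    std::vector<size_t> ids;
    for (size_t i = 0; i < n; ++i)
        if (i != self) ids.push_back(i);
    std::shuffle(ids.begin(), ids.end(), rng);
    ids.resize(std::min(k, ids.size()));
    return ids;
}

// An idle scheduler polls random neighbors, steals half of the most loaded
// neighbor's ready tasks, and backs off exponentially while everyone is idle.
void work_stealing_loop(std::vector<Scheduler> &schedulers, size_t self) {
    std::mt19937 rng(static_cast<unsigned>(self));
    const size_t k = static_cast<size_t>(std::sqrt(schedulers.size()));   // sqrt(#nodes) neighbors
    std::chrono::milliseconds poll_interval(1);

    while (schedulers[self].ready_queue.empty()) {
        size_t victim = self, max_load = 0;
        for (size_t nb : pick_neighbors(self, schedulers.size(), k, rng))
            if (schedulers[nb].num_tasks() > max_load) {
                max_load = schedulers[nb].num_tasks();
                victim = nb;
            }

        if (max_load > 1) {                             // steal half of the victim's tasks
            for (size_t i = 0; i < max_load / 2; ++i) {
                schedulers[self].ready_queue.push_back(schedulers[victim].ready_queue.back());
                schedulers[victim].ready_queue.pop_back();
            }
            poll_interval = std::chrono::milliseconds(1);   // reset back-off after a successful steal
        } else {                                        // all polled neighbors idle: back off
            std::this_thread::sleep_for(poll_interval);
            poll_interval *= 2;                         // exponential back-off of the polling interval
        }
    }
}

int main() {
    std::vector<Scheduler> schedulers(4);
    for (int i = 0; i < 8; ++i)
        schedulers[1].ready_queue.push_back("task" + std::to_string(i));
    work_stealing_loop(schedulers, 0);                  // scheduler 0 is idle and steals from scheduler 1
    return schedulers[0].ready_queue.size() == 4 ? 0 : 1;
}

Stealing half of the victim's tasks and using a square-root number of dynamic random neighbors are used as defaults here because the evaluation later in the deck reports them as the best simulated parameters.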

Big-data era: data-intensive applications; data-flow driven programming models (MTC); workflow and task execution frameworks.
Work stealing: the original technique is data-locality oblivious; migrating tasks randomly compromises data locality.
We propose a data-aware work stealing technique.

[Diagram: the work-stealing scenario repeated for tasks that each carry input data of different sizes.]
Task       1     2      3     4      5      6    7      8
Length     1ms   1ms    1ms   1ms    1ms    1ms  1ms    1ms
Data size  5GB   20KB   1MB   20GB   30KB   0B   20KB   1MB

Queues maintained by each scheduler:
Wait Queue (WaitQ): holds tasks that are waiting for parents to complete
Dedicated Local Ready Queue (LReadyQ): holds ready tasks that can only be executed on the local node
Shared Work-Stealing Ready Queue (SReadyQ): holds ready tasks that can be shared through work stealing
Complete Queue (CompleteQ): holds tasks that are completed

P1: a program that checks whether a task is ready to run, and moves ready tasks to either ready queue according to the decision-making algorithm
P2: a program that updates the task metadata for each child of a completed task
T1 to T4: the executor has 4 (configurable) executing threads that execute tasks in the ready queues and move a task to the complete queue when it is done

Task metadata kept for every task:

typedef struct TaskMetaData {
    int num_wait_parent;          // number of waiting parents
    vector<string> parent_list;   // schedulers that run each parent task
    vector<string> data_object;   // data object name produced by each parent
    vector<long> data_size;       // data object size (bytes) produced by each parent
    long all_data_size;           // total data size (bytes) produced by all parents
    vector<string> children;      // children of this task
} TMD;

A short sketch of how P1 and P2 use this metadata follows.
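The sketch below shows, for a single scheduler, how P2 and P1 could cooperate around the TMD structure. It is not MATRIX source code: the global queues, the in-memory map holding the metadata (which MATRIX actually keeps in the distributed KVS), and the helper names are assumptions for this example.

#include <deque>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef struct TaskMetaData {     // repeated from the slide above for self-containment
    int num_wait_parent;          // number of waiting parents
    vector<string> parent_list;   // schedulers that run each parent task
    vector<string> data_object;   // data object name produced by each parent
    vector<long> data_size;       // data object size (bytes) produced by each parent
    long all_data_size;           // total data size (bytes) produced by all parents
    vector<string> children;      // children of this task
} TMD;

map<string, TMD> metadata;        // task id -> metadata (MATRIX keeps this in the distributed KVS)
deque<string> wait_queue, local_ready_queue, shared_ready_queue, complete_queue;

// P2: when a task completes, update the metadata of each of its children.
void p2_update_children(const string &done_task, const string &self,
                        const string &object_name, long object_size) {
    for (const string &child : metadata[done_task].children) {
        TMD &c = metadata[child];
        c.num_wait_parent -= 1;
        c.parent_list.push_back(self);            // this scheduler holds the produced data
        c.data_object.push_back(object_name);
        c.data_size.push_back(object_size);
        c.all_data_size += object_size;
    }
}

// P1: move tasks whose parents have all completed from the wait queue to one of
// the two ready queues, according to a data-aware decision (sketched further below).
void p1_move_ready(bool (*prefers_local)(const TMD &)) {
    deque<string> still_waiting;
    for (const string &t : wait_queue) {
        if (metadata[t].num_wait_parent == 0)
            (prefers_local(metadata[t]) ? local_ready_queue : shared_ready_queue).push_back(t);
        else
            still_waiting.push_back(t);
    }
    wait_queue.swap(still_waiting);
}

int main() {
    metadata["A"] = TMD{0, {}, {}, {}, 0, {"B"}};           // A has no parents and one child, B
    metadata["B"] = TMD{1, {}, {}, {}, 0, {}};              // B waits for its single parent A
    wait_queue.push_back("B");
    p2_update_children("A", "scheduler-0", "A.out", 1024);  // A finished and produced 1 KB
    p1_move_ready([](const TMD &) { return false; });       // send everything to the shared queue here
    return shared_ready_queue.size() == 1 ? 0 : 1;
}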

Four policies for placing ready tasks:
MLB: Maximized Load Balancing
MDL: Maximized Data-Locality
RLDS: Rigid Load balancing and Data-locality Segregation
FLDS: Flexible Load balancing and Data-locality Segregation
A hedged sketch of an FLDS-style decision follows.
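The slide only names the four policies; the snippet below is a hedged sketch of what a flexible (FLDS-like) placement decision could look like. The data-size threshold, the queue-length cap, and all names (ReadyTask, flds_place, and the constants) are assumptions made for illustration, not the published definition of the algorithm.

#include <algorithm>
#include <string>
#include <vector>
using namespace std;

struct ReadyTask {
    string id;
    vector<string> parent_schedulers;   // scheduler holding each parent's output
    vector<long> data_size;             // size (bytes) of each parent's output
    long all_data_size;                 // sum of the sizes above
};

enum Placement {
    SHARED_READY_QUEUE,   // may be migrated by work stealing (favors load balancing)
    LOCAL_READY_QUEUE,    // run on this node, which already holds the data
    PUSH_TO_BEST_NODE     // forward to the scheduler holding most of the data
};

Placement flds_place(const ReadyTask &t, const string &self, size_t local_queue_len,
                     long threshold = 10L * 1024 * 1024,   // assumed data-size cutoff (10 MB)
                     size_t max_queue = 1000) {            // assumed cap on local queueing
    if (t.all_data_size <= threshold || t.data_size.empty())
        return SHARED_READY_QUEUE;      // little data to move: let work stealing balance load
    if (local_queue_len > max_queue)
        return SHARED_READY_QUEUE;      // flexible part: don't let locality starve load balancing
    // Otherwise honor locality: find the scheduler holding the biggest input.
    size_t best = max_element(t.data_size.begin(), t.data_size.end()) - t.data_size.begin();
    return (t.parent_schedulers[best] == self) ? LOCAL_READY_QUEUE : PUSH_TO_BEST_NODE;
}

int main() {
    // A big-data task (stand-in for the multi-GB tasks in the table above) and a tiny one.
    ReadyTask big {"task4", {"node-3"}, {20L * 1024 * 1024}, 20L * 1024 * 1024};
    ReadyTask tiny{"task2", {"node-3"}, {20L * 1024},        20L * 1024};
    bool ok = flds_place(big,  "node-0", 0) == PUSH_TO_BEST_NODE &&
              flds_place(tiny, "node-0", 0) == SHARED_READY_QUEUE;
    return ok ? 0 : 1;
}

Applied to the example workload above, the gigabyte-scale tasks would be routed to the nodes holding their data, while the kilobyte-scale tasks stay in the shared queue where work stealing can move them freely.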

A discrete event simulator that simulates the MTC task execution framework
Scales to 1M nodes, 1B cores, 100B tasks
Used to explore the scalability of the fully distributed MTC architecture, the work stealing technique, and the data-aware work stealing technique
A minimal sketch of the underlying discrete-event loop follows.
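For readers unfamiliar with the technique, the block below is a minimal sketch of a discrete-event simulation loop: a priority queue of timestamped events popped in time order. It only illustrates the general mechanism; the event type, names, and scale here are made up and are far simpler than the simulator's actual model.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>
using namespace std;

struct Event {
    double time;                    // simulated time at which the event fires
    function<void()> action;        // what happens (task submission, completion, steal, ...)
    bool operator>(const Event &o) const { return time > o.time; }
};

int main() {
    priority_queue<Event, vector<Event>, greater<Event>> events;   // earliest event first
    double now = 0.0;
    uint64_t completed = 0;

    // Seed the simulation: 1000 tasks of 1 ms each completing on a single simulated node.
    for (int i = 0; i < 1000; ++i)
        events.push({ (i + 1) * 0.001, [&completed] { ++completed; } });

    // The event loop: repeatedly pop the earliest event, advance the clock, run it.
    while (!events.empty()) {
        Event e = events.top();
        events.pop();
        now = e.time;
        e.action();
    }
    printf("simulated %llu tasks in %.3f simulated seconds\n",
           (unsigned long long)completed, now);
    return 0;
}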

[Plots: efficiency vs. scale (up to 1M nodes) for different work-stealing parameters: number of tasks to steal (steal_1, steal_2, steal_log, steal_sqrt, steal_half), number of dynamic random neighbors (nb_1, nb_2, nb_log, nb_sqrt, nb_eighth, nb_quar, nb_half), number of static neighbors, and data-aware work stealing.]
The conclusion drawn about the simulation-based optimal parameters for adaptive work stealing is to steal half the number of tasks from the neighbors, and to use a square-root number of dynamic random neighbors. The data-aware work stealing technique is scalable at extreme scales.

Simulations have shown that the distributed architecture and the work stealing technique are scalable; now we can move on to real implementations.

MATRIX components:
Client: submits tasks, monitors execution progress
Scheduler: schedules tasks for execution
Executor: executes tasks

MATRIX code (https://github.com/kwangiit/matrix_v2):
5K lines of C++ code
8K lines of ZHT code
1K lines of auto-generated code from Google Protocol Buffers

MATRIX goal: a scalable distributed scheduling system for MTC applications. The prototype has been finished; work is ongoing on running real applications.

Testbeds:
IBM Blue Gene/P/Q supercomputers (4K cores)
Probe Kodiak cluster (200 cores)
Amazon EC2 (256 cores)

[Diagram: the MATRIX deployment, with clients submitting to fully-connected compute nodes that each run a scheduler, an executor, and a KVS server.]

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

For sub-second tasks (64 ms), MATRIX achieves efficiencies of 85% and higher at a scale of 1024 nodes (4096 cores), which corresponds to throughputs as high as 13K tasks/sec.

[Plot: efficiency vs. scale (256 to 2048 cores) for sleep tasks of 1, 2, 4, and 8 seconds, comparing MATRIX with Falkon.]

[Application DAGs: AstroPortal image stacking; Biometrics all-pairs.]

[Plot: MATRIX vs. YARN efficiency and average task length at 2 to 256 cores for the bio-informatics application.]
Bio-informatics application: protein-ligand clustering
5 phases of MapReduce jobs
256 MB of data per node
The first phase has the majority of the tasks

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

System scale is approaching exascale
Applications are becoming fine-grained
Next-generation resource management systems need to be highly scalable, efficient and available
Fully distributed architectures are scalable
The data-aware work stealing technique is scalable

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

Hadoop integration of MATRIX: the Hadoop scheduler is centralized; replace the Hadoop scheduler with MATRIX.
Workflow integration of MATRIX: the Swift workflow system; will enable the execution of real scientific applications.
File system integration of MATRIX: the FusionFS distributed file system; takes care of data storage and management and exposes the data to MATRIX.

More information: http://datasys.cs.iit.edu/~kewang/ Contact: kwang22@hawk.iit.edu Questions?