Outline: Introduction & Motivation, Problem Statement, Proposed Work, Evaluation, Conclusions, Future Work
Introduction & Motivation
Today (2014): petascale computing, 34 PFlops, 3 million cores
Near future (~2020): exascale computing, 1000 PFlops, 1 billion cores
http://s.top500.org/static/lists/2014/06/top500_201406_poster.pdf
http://users.ece.gatech.edu/mrichard/exascalecomputingstudyreports/ecss%20report%20101909.pdf
- Manages resources: compute, storage, network
- Job scheduling/management: resource allocation, job launch
- Data management: data movement, caching
PetaScale → Exascale (10^18 Flops, 1B cores)

Workload diversity by decade (uses and computer involved):
- 1970s: Weather forecasting, aerodynamic research (Cray-1)
- 1980s: Probabilistic analysis, radiation shielding modeling (CDC Cyber)
- 1990s: Brute force code breaking (EFF DES cracker)
- 2000s: 3D nuclear test simulations as a substitute for legal conduct under the Nuclear Non-Proliferation Treaty (ASCI Q)
- 2010s: Molecular dynamics simulation (Tianhe-1A)

Traditional domain: large-scale High Performance Computing (HPC) jobs, HPC ensemble runs

Many-Task Computing (MTC) applications:
- Medical image processing: functional Magnetic Resonance imaging
- Chemistry domain: MolDyn; Molecular Dynamics: DOCK, production runs in drug design
- Economic modeling: MARS
- Astronomy domain: Montage
- Data analytics: Sort and WordCount

State-of-the-art resource managers (centralized paradigm, job scheduling):
- HPC resource managers: SLURM, PBS, Moab, Torque, Cobalt
- HTC/MTC resource managers: Condor, Falkon, Mesos, YARN

Next-generation resource managers: distributed paradigm; scalable, efficient, reliable
http://elib.dlr.de/64768/1/ensim_-_ensemble_simulation_on_hpc_computes_en.pdf
Problem Statement
Scalability:
- System scale is increasing
- Workload size is increasing
- Processing capacity needs to increase

Efficiency:
- Allocating resources quickly
- Making scheduling decisions fast enough
- Maintaining high system utilization

Reliability:
- Continuing to function well under failures
Proposed Work
[Figure: four scheduling architectures.
- MTC centralized scheduling: Falkon, with clients submitting to a single scheduler that dispatches tasks to executors.
- MTC distributed scheduling: MATRIX, with fully-connected schedulers; each compute node runs a scheduler, an executor, and a KVS server.
- HPC centralized scheduling: SLURM, with one controller managing all compute daemons (cd).
- HPC distributed scheduling: SLURM++, with fully-connected controllers, each paired with a KVS server and managing a partition of compute daemons.]
[Figure: scheduler specification. Each scheduler keeps a wait queue, a dedicated local ready queue, a shared work-stealing ready queue, and a complete queue; executor threads T1 to T4 pull tasks from the ready queues, while programs P1 and P2 move tasks between queues (detailed below).]
[Figure: MATRIX architecture. Clients submit tasks to fully-connected schedulers; each compute node runs a scheduler, an executor, and a KVS server that communicate with one another.]
- Fine-grained workloads
- Load balancing
- Data-aware scheduling
MTC application trace:
[Figure: cumulative distribution of task execution times, 0.01 s to 1000 s.]
- 17-month period on an IBM BG/P machine
- Minimum task length: 0 sec; maximum: 1470 sec; median: 30 sec; average: 95 sec

MTC application DAGs:
- Directed Acyclic Graphs (DAGs): vertices are discrete tasks, edges are data flows
- A task can have parents and children
Load balancing:
- Distribute workloads across schedulers as evenly as possible
- Difficult because each scheduler has only a local view and the load changes dynamically

Work stealing: a distributed load balancing technique
- Popular in shared-memory machines at the thread level: starved processors steal tasks from overloaded ones
- We apply it at the node level in a distributed environment
- Various parameters: number of tasks to steal, number of neighbors, static vs. dynamic neighbors (dynamic multi-random neighbor selection), polling interval (exponential back-off); a sketch of the stealing loop follows

[Figure: schedulers 1 to 4 exchanging tasks; underloaded schedulers steal tasks from overloaded ones, which send tasks in return.]
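To make the parameter space concrete, here is a minimal sketch of an adaptive work-stealing loop with the parameters above: dynamic multi-random neighbor selection, stealing from the most loaded sampled neighbor, and an exponentially backed-off polling interval. All names (query_load, steal_from, and the stub bodies) are illustrative, not MATRIX's actual code.

// A minimal sketch of an adaptive work-stealing loop; query_load and
// steal_from stand in for the RPCs a real scheduler would issue.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <random>
#include <thread>
#include <vector>

int query_load(int /*neighbor*/) { return 0; }          // stub: remote queue length
int steal_from(int /*victim*/, int /*n*/) { return 0; } // stub: remote steal RPC

// Pick up to k distinct random neighbors (dynamic multi-random selection).
std::vector<int> pick_neighbors(int self, int num_nodes, int k, std::mt19937 &rng) {
  std::vector<int> ids;
  for (int i = 0; i < num_nodes; ++i)
    if (i != self) ids.push_back(i);
  std::shuffle(ids.begin(), ids.end(), rng);
  if ((int)ids.size() > k) ids.resize(k);
  return ids;
}

void work_stealing_loop(int self, int num_nodes) {
  std::mt19937 rng(self);
  // Simulation-derived optimum: sqrt(num_nodes) dynamic random neighbors.
  const int k = std::max(1, (int)std::sqrt((double)num_nodes));
  int poll_ms = 1; // polling interval, doubled after every failed round

  for (;;) {
    int victim = -1, max_load = 0;
    for (int n : pick_neighbors(self, num_nodes, k, rng)) {
      int load = query_load(n); // length of n's work-stealing ready queue
      if (load > max_load) { max_load = load; victim = n; }
    }
    // Steal half of the victim's tasks (the other simulation-derived optimum).
    if (victim >= 0 && steal_from(victim, max_load / 2) > 0) {
      poll_ms = 1; // success: reset the exponential back-off
    } else {
      std::this_thread::sleep_for(std::chrono::milliseconds(poll_ms));
      poll_ms *= 2; // failure: back off to avoid flooding idle neighbors
    }
  }
}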
Big-data era: data-intensive applications
Data-flow driven programming models (MTC): workflow and task execution frameworks
Original work stealing is data-locality oblivious: migrating tasks at random compromises data locality
We propose a data-aware work stealing technique
[Figure: the same work-stealing scenario among schedulers 1 to 4, but now tasks carry inputs of very different sizes.]

Task:      1    2     3    4     5     6   7     8
Length:    1ms  1ms   1ms  1ms   1ms   1ms 1ms   1ms
Data size: 5GB  20KB  1MB  20GB  30KB  0B  20KB  1MB
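To see why random stealing breaks down here, consider an assumed 10 Gbps network (the bandwidth figure is illustrative): stealing task 4 moves 20 GB x 8 bits/byte / 10 Gbps = 16 s of data for a task that runs in 1 ms, more than four orders of magnitude of pure overhead, while task 6 (0 B) can migrate for free. A scheduler should therefore pin tasks like task 4 near their data and let tasks like task 6 be stolen freely.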
Scheduler queues:
- Wait Queue (WaitQ): holds tasks that are waiting for parents to complete
- Dedicated Local Ready Queue (LReadyQ): holds ready tasks that can only be executed on the local node
- Shared Work-Stealing Ready Queue (SReadyQ): holds ready tasks that can be shared through work stealing
- Complete Queue (CompleteQ): holds tasks that are completed
- P1: a program that checks whether a task is ready to run, and moves ready tasks to either ready queue according to the decision-making algorithm
- P2: a program that updates the task metadata for each child of a completed task (a sketch follows this slide)
- T1 to T4: the executor has 4 (configurable) executing threads that execute tasks in the ready queues and move a task to the complete queue when it is done

Task metadata:

typedef struct TaskMetaData {
  int num_wait_parent;        // number of waiting parents
  vector<string> parent_list; // schedulers that run each parent task
  vector<string> data_object; // data object name produced by each parent
  vector<long> data_size;     // data object size (bytes) produced by each parent
  long all_data_size;         // total data size (bytes) produced by all parents
  vector<string> children;    // children of this task
} TMD;
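As a rough illustration of P2's bookkeeping, the sketch below updates each child's metadata when a task completes, assuming simple load_tmd/store_tmd helpers in place of the real KVS calls; all helper names are hypothetical.

#include <string>
#include <vector>
using std::string; using std::vector;

struct TMD { // mirrors the TaskMetaData struct above
  int num_wait_parent;
  vector<string> parent_list, data_object, children;
  vector<long> data_size;
  long all_data_size;
};

TMD load_tmd(const string &) { return TMD{}; } // stub: KVS lookup
void store_tmd(const string &, const TMD &) {} // stub: KVS insert
void hand_to_p1(const string &) {}             // stub: enqueue for P1

// When `task` finishes on scheduler `self`, producing `obj` of `size` bytes,
// update every child's metadata; a child whose last parent just finished
// becomes ready and is handed to P1 for the ready-queue decision.
void on_task_complete(const string &task, const string &self,
                      const string &obj, long size) {
  TMD done = load_tmd(task);
  for (const string &child : done.children) {
    TMD c = load_tmd(child);
    c.num_wait_parent -= 1;        // one more parent satisfied
    c.parent_list.push_back(self); // where the parent's output lives
    c.data_object.push_back(obj);
    c.data_size.push_back(size);
    c.all_data_size += size;
    store_tmd(child, c);
    if (c.num_wait_parent == 0)
      hand_to_p1(child);           // all parents done: child is ready
  }
}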
Ready-queue decision policies (a sketch follows the list):
- MLB: Maximized Load Balancing
- MDL: Maximized Data-Locality
- RLDS: Rigid Load balancing and Data-locality Segregation
- FLDS: Flexible Load balancing and Data-locality Segregation
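A minimal sketch of how P1 might implement FLDS, under two assumptions not spelled out on this slide: a tunable data-size threshold t separating "cheap to migrate" from "expensive to migrate" tasks, and the flexible part, namely spilling a pinned task back to the shared queue when the local wait outweighs the data-movement cost. Names and the threshold value are illustrative.

#include <algorithm>
#include <vector>

const long kThreshold = 100L << 20; // assumed threshold t, e.g. 100 MB

enum QueueChoice { SHARED_READY_Q, LOCAL_READY_Q };

// FLDS decision for a ready task: small inputs go to the shared
// work-stealing queue (favor load balancing); large inputs are pinned to
// the local queue of the node holding the most data (favor locality).
QueueChoice flds_choose(const std::vector<long> &parent_data_sizes) {
  long max_size = 0;
  for (long s : parent_data_sizes) max_size = std::max(max_size, s);
  return (max_size <= kThreshold) ? SHARED_READY_Q : LOCAL_READY_Q;
}

// The "flexible" part: if the local queue grows so long that the estimated
// wait exceeds the cost of moving the data, spill the task back to the
// shared queue so that neighbors can steal it.
bool should_spill(long queued_work_ms, long est_transfer_ms) {
  return queued_work_ms > est_transfer_ms;
}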
A discrete event simulator (a minimal event-loop sketch follows):
- Simulates the MTC task execution framework at up to 1M nodes, 1B cores, and 100B tasks
- Explores the scalability of the fully distributed MTC architecture, the work stealing technique, and the data-aware work stealing technique
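For readers unfamiliar with the approach, here is a minimal, self-contained discrete-event loop of the kind such a simulator is built around (illustrative, not the simulator's actual code): events sit in a priority queue ordered by virtual timestamp, so simulating a million nodes requires no wall-clock waiting.

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Event {
  double time;                // virtual timestamp in seconds
  std::function<void()> fire; // handler; may enqueue future events
};
struct Later { // orders the priority queue as a min-heap on time
  bool operator()(const Event &a, const Event &b) const { return a.time > b.time; }
};

int main() {
  std::priority_queue<Event, std::vector<Event>, Later> q;
  double now = 0;
  q.push({0.000, [] { std::puts("node 0: task launched"); }});
  q.push({0.064, [] { std::puts("node 0: task finished (64 ms)"); }});
  while (!q.empty()) {
    Event e = q.top();
    q.pop();
    now = e.time; // advance the virtual clock directly to the next event
    e.fire();
  }
  std::printf("simulation ended at t = %.3f s\n", now);
  return 0;
}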
[Figures: simulated efficiency vs. scale (up to 1M nodes) when varying the number of tasks to steal (steal_1, steal_2, steal_log, steal_sqrt, steal_half), the number of static neighbors, and the number of dynamic neighbors (nb_1 through nb_half), plus the efficiency of data-aware work stealing.]

The conclusion on the simulation-derived optimal parameters for adaptive work stealing: steal half of a neighbor's tasks, and use a square-root number of dynamic random neighbors. The data-aware work stealing technique is scalable at extreme scales.
Simulations have shown that the distributed architecture and the work stealing technique are scalable; now we can move on to real implementations.
MATRIX components:
- Client: submits tasks and monitors execution progress
- Scheduler: schedules tasks for execution
- Executor: executes tasks

MATRIX code (https://github.com/kwangiit/matrix_v2):
- 5K lines of C++ code
- 8K lines of ZHT code
- 1K lines of code auto-generated by Google Protocol Buffers

MATRIX goal: a scalable distributed scheduling system for MTC applications
The prototype is finished; we are working on running real applications

Testbeds:
- IBM Blue Gene/P/Q supercomputers (4K cores)
- Probe Kodiak cluster (200 cores)
- Amazon EC2 (256 cores)

[Figure: MATRIX deployment, as in the architecture diagram: clients submit to fully-connected schedulers, with each compute node running a scheduler, an executor, and a KVS server.]
Evaluation
For sub-second tasks (64 ms), MATRIX achieves efficiency of 85% and higher at the scale of 1024 nodes (4096 cores), which translates to throughput as high as 13K tasks/sec.
[Figure: efficiency of MATRIX vs. Falkon for sleep 1, 2, 4, and 8 second tasks at 256 to 2048 cores.]
[Figure: two application DAGs. Left: AstroPortal image stacking, where input files f_0 through f_m feed tasks t_0 through t_n that merge into t_end. Right: Biometrics all-pairs, where inputs f_{0,0}, f_{0,1}, f_{1,0}, f_{1,1} are compared pairwise by tasks t_0 through t_3.]
[Figure: MATRIX vs. YARN efficiency and average task length at 2 to 256 cores; MATRIX sustains 91% to 99% efficiency while YARN drops from 69% to 30% as the scale grows.]

Bio-informatics application: protein-ligand clustering
- 5 phases of MapReduce jobs
- 256MB of data per node
- The first phase has the majority of the tasks
Conclusions
- System scale is approaching exascale
- Applications are becoming fine-grained
- Next-generation resource management systems need to be highly scalable, efficient, and available
- Fully distributed architectures are scalable
- The data-aware work stealing technique is scalable
Future Work
Hadoop integration of MATRIX:
- The Hadoop scheduler is centralized; replace it with MATRIX

Workflow integration of MATRIX:
- Swift workflow system; will enable the execution of real scientific applications

File system integration of MATRIX:
- FusionFS distributed file system takes care of data storage and management, and exposes the data to MATRIX
More information: http://datasys.cs.iit.edu/~kewang/ Contact: kwang22@hawk.iit.edu Questions?