Executing dynamic heterogeneous workloads on Blue Waters with RADICAL-Pilot

Executing dynamic heterogeneous workloads on Blue Waters with RADICAL-Pilot Research in Advanced DIstributed Cyberinfrastructure & Applications Laboratory (RADICAL) Rutgers University http://radical.rutgers.edu http://radical-cybertools.github.io

Extreme-Scale Task-Level Parallelism on HPDC. Problems in computational science are naturally amenable to task-level parallelism. Beyond HTC vs. HPC: given access to X cores/nodes, slice/dice or distribute them as needed. Resources and workloads are characterised by a range of properties: multiple resources, heterogeneity, varying availability and performance.

Blue Waters Job Size Distribution

Requirements / goals: Workload with heterogeneous tasks. Dynamic workload, unknown in advance (task N+1 depends on task N). Control over the concurrency of tasks. Varying core counts. Varying applications: MPI / non-MPI, possibly loosely coupled (e.g. replica exchange). ~10k concurrent tasks.

(Why not) batch queue jobs: Low throughput, since every job needs to queue; this breaks down especially in dynamic workload situations. No control over concurrency, and a limit on total concurrency. Maximum of one task per node. Job arrays are too inflexible (and not available on BW). Too many flavours.

Pilot Abstraction. Working definition: a system that generalizes a placeholder to allow application-level control over acquired resources. Figure: in user space, the user application submits to the pilot system, which places pilots according to pilot policies; in system space, the resource manager grants those pilots nodes on resources A-D.
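To make the placeholder idea concrete, here is a minimal sketch using the RADICAL-Pilot API of that period; the resource label, project name, and sizes are illustrative assumptions, not values from the talk.

```python
import radical.pilot as rp

# A session holds all objects and the connection to the database.
session = rp.Session()

# Describe the placeholder: a pilot job that is submitted to the batch
# queue once, and then hands its nodes to the application.
pd = rp.ComputePilotDescription()
pd.resource = 'ncsa.bw'      # illustrative resource label for Blue Waters
pd.project  = 'bw_alloc'     # hypothetical allocation name
pd.cores    = 1024           # size of the placeholder, not of any one task
pd.runtime  = 60             # minutes

pmgr  = rp.PilotManager(session=session)
pilot = pmgr.submit_pilots(pd)   # one queue wait for the whole workload
```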

Advantages of the Pilot Abstraction: Decouples workload management from resource management. Flexible resource management: enables the fine-grained slicing and dicing of resources. Tighter temporal control and the other advantages of application-level scheduling (avoids the limitations of system-level scheduling). Build higher-level frameworks without explicit resource management.

RADICAL-Pilot Overview. Programmable interface, arguably unique: well-defined state models for pilots and units. Supports research whilst supporting production scalable science: pluggable components; introspection. Portability and interoperability: works on Crays, most known clusters, XSEDE resources, OSG, and Amazon EC2; modular pilot agent for different architectures. Scalable: agent, communication, throughput.
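As a hedged illustration of that programmable interface, the sketch below continues the pilot example above and submits a mix of MPI and single-core units with varying core counts; the executables and counts are placeholders, and attribute names follow the RADICAL-Pilot API of that era.

```python
umgr = rp.UnitManager(session=session)
umgr.add_pilots(pilot)

cuds = []
for i in range(1024):
    cud = rp.ComputeUnitDescription()
    if i % 2:
        cud.executable = './mpi_task'      # placeholder MPI application
        cud.cores      = 64                # larger, multi-core MPI task
        cud.mpi        = True
    else:
        cud.executable = './serial_task'   # placeholder single-core task
        cud.cores      = 1
    cuds.append(cud)

units = umgr.submit_units(cuds)            # concurrency is governed by the
umgr.wait_units()                          # pilot size, not the batch queue
session.close()
```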

Agent Architecture. Components enact state transitions for Units. State Updater: communicates with the client library and the DB. Scheduler: maps Units onto compute nodes. Resource Manager: interfaces with the batch queuing system, e.g. PBS, SLURM. Launch Methods: construct the command line, e.g. APRUN, SSH, ORTE, MPIRUN. Task Spawner: executes tasks on the compute nodes.
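To show the kind of decision the Scheduler component makes, here is a simplified, self-contained sketch of mapping core requests onto free node slots; it illustrates the idea only, it is not RADICAL-Pilot's actual scheduler, and the node names and core counts are made up.

```python
# Toy scheduler: place each task's core request on a node with enough
# free cores; tasks that do not fit yet stay in a waiting list.
def schedule(tasks, nodes, cores_per_node=32):
    free = {n: cores_per_node for n in nodes}    # free cores per node
    placement, waiting = {}, []
    for tid, cores in tasks:
        node = next((n for n, c in free.items() if c >= cores), None)
        if node is None:
            waiting.append(tid)                  # no slot yet
        else:
            free[node] -= cores
            placement[tid] = node
    return placement, waiting

tasks = [('t1', 32), ('t2', 1), ('t3', 16)]      # (task id, cores requested)
print(schedule(tasks, ['nid00010', 'nid00011']))
```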

(Why not) RADICAL-Pilot + APRUN. The RP Agent runs on a MOM node and uses aprun to launch tasks onto the worker nodes. Low throughput (ALPS is not designed for short/small tasks). Limit on total concurrency (1000 aprun instances). Maximum of one task per node.

(Why not) RADICAL-Pilot + CCM. The bootstrapper runs on a MOM node, creates the cluster, and uses ccmrun to launch the RP Agent into that cluster. Not universally available.

RADICAL-Pilot + ORTE-CLI (a bit better). ORTE, the Open RunTime Environment, is the isolated layer used by Open MPI to coordinate task layout; it runs a set of daemons over the compute nodes. No ALPS concurrency limits; supports multiple tasks per node. orte-submit is a CLI tool that submits tasks to those daemons; a sub-agent on a compute node executes these submissions. Limited by fork/exec behavior, by open sockets/file descriptors, and by file system interactions.
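The mechanism can be sketched as follows: a persistent set of ORTE daemons (a distributed virtual machine, DVM) is started once across the allocation, and tasks are then handed to it without creating new ALPS reservations. The Python sketch below illustrates that flow; the tool names and flags (orte-dvm, orte-submit, --report-uri, --hnp) reflect the ORTE tools of that Open MPI generation and should be treated as assumptions, as should the executable name.

```python
import subprocess, time

# Start the ORTE DVM once over the allocated nodes; it writes its
# contact URI to a file so that later submissions can find it.
dvm = subprocess.Popen(['orte-dvm', '--report-uri', 'dvm.uri'])
time.sleep(10)   # crude wait for the daemons to come up

# Each task is a cheap submission to the already-running daemons,
# not a new ALPS reservation; many can proceed concurrently.
for i in range(4):
    subprocess.Popen(['orte-submit', '--hnp', 'file:dvm.uri',
                      '-np', '1', './serial_task'])
```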

RADICAL-Pilot + ORTE-LIB (much better). All the same as ORTE-CLI, but using library calls instead of orte-submit processes: no central fork/exec limits, a shared network socket, and hardly any central file system interactions.

RADICAL-Pilot + ORTE on Cray

Micro Benchmark: Scheduler (scheduling only; scheduling and unscheduling)

Micro Benchmark: Executor Scaling

Agent Performance: Full Node Tasks (3 x 64s)

Agent Performance: Concurrent Units (3x)

Agent Performance: Turnaround (3 x 4k x 64s)

Agent Performance: Resource Utilization

Conclusion. There is no one-size-fits-all in HPC. General tools can extend the functionality of Cray HPC systems. Achieved 16k concurrent tasks and a launch rate of ~100 tasks/second. Efficiency is largely dependent on task count and duration. The Cray-specific PMI excludes running Cray-MPI-linked applications.

Future work. RADICAL-Pilot: bulks all the way; agent scheduler overhaul; topology-aware task placement; heterogeneous node scheduling (e.g. GPUs). ORTE: fabrics-based inter-ORTE communication; optimized ORTE communication topology.

References: RADICAL-Pilot: Scalable Execution of Heterogeneous and Dynamic Workloads on Supercomputers, http://arxiv.org/abs/1512.08194. A Comprehensive Perspective on Pilot-Job Systems, http://arxiv.org/abs/1508.04180. RADICAL-Cybertools overview: http://radical-cybertools.github.io/. RADICAL-Pilot on GitHub: https://github.com/radical-cybertools/radical.pilot. RADICAL-Pilot documentation: http://radicalpilot.readthedocs.org/

Micro Benchmark: Exec Rate + Concurrency (1x4k)