ET International HPC Runtime Software
Rishi Khan, ET International. SC'11.
Copyright 2011 ET International, Inc.



Current Programming Models

Shared memory multiprocessing:
- OpenMP: fork/join model
- Pthreads: arbitrary SMP parallelism (but hard to program and debug)
- Cilk: work stealing (only suited to recursive parallelism)

Distributed memory multiprocessing:
- MPI: bulk-synchronous parallelism
- SHMEM, UPC: PGAS

Hybrid models:
- MPI + OpenMP (needed to get performance on multi-core, multi-node systems)

Heterogeneous accelerator parallelism:
- CUDA, OpenCL
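As a concrete illustration of the hybrid model, here is a minimal MPI + OpenMP sketch (an added example, not from the slides): MPI splits the iteration space across nodes, and OpenMP splits each node's share across cores.

    /* Hybrid MPI + OpenMP: compile with e.g. mpicc -fopenmp hybrid.c */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each rank (node) takes a contiguous slice of the work. */
        const long N = 1 << 20;
        long begin = rank * N / nprocs, end = (rank + 1) * N / nprocs;

        /* Fork/join parallelism across the cores of this node. */
        double local = 0.0, total = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = begin; i < end; i++)
            local += (double)i;

        /* Bulk-synchronous combine across nodes. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %.0f\n", total);
        MPI_Finalize();
        return 0;
    }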

Problem Statement

1. Heterogeneous systems require multiple languages and programming models, e.g. MPI across nodes, OpenMP across cores, OpenCL across GPUs.
2. Current programming models are based on the idea of communicating sequential processes (CSP):
- Difficult to program and debug
- Difficult to express dynamic parallelism
- Does not take advantage of dynamic availability of resources
- Extremely hard to exploit for programs with irregular and/or global data accesses

Runtime System Comparisons

[Diagram: MPI, OpenMP, and OpenCL modeled as communicating Turing machines with bulk-synchronous message passing, where active threads spend much of their time waiting, vs. new runtime systems built on asynchronous event-driven tasks with dependencies, constraints, and resources, plus active messages.]

Solution

Express the program as tasks with runtime dependencies and constraints:
- Data: input arguments
- Control: must run before/after certain tasks
- Resource: locks, CPU or GPU, etc.

Tasks can run to completion once all runtime dependencies and constraints are met; the runtime system determines which tasks to run based on runtime resource availability. ETI implements this solution in a technology called SWARM.

What is SWARM?

SWift Adaptive Runtime Machine (SWARM): a runtime system for heterogeneous large-scale systems. It implements an execution model based on specially tagged tasks:
- Non-preemptible pieces of code
- Tagged with dependencies, constraints, and resource demands
- Scheduled when all dependencies and constraints are satisfied
- Once scheduled, run to completion in a non-blocking fashion

These non-blocking tagged tasks are called codelets.

Goal: a unified runtime system for heterogeneous distributed parallel systems that supplants and synergizes the separate abilities of MPI, OpenMP, and OpenCL.
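To make the codelet model concrete, here is a minimal, self-contained sketch (illustrative only; codelet_t, satisfy(), and schedule() are hypothetical names, not SWARM's actual API). Each codelet carries a counter of unmet dependencies; the producer that satisfies the last one makes it runnable.

    /* A codelet: a non-preemptible task tagged with a dependence count. */
    #include <stdatomic.h>
    #include <stdio.h>

    typedef struct codelet {
        void (*fn)(void *arg);   /* body; once started, runs to completion */
        void *arg;
        atomic_int unmet_deps;   /* dependencies not yet satisfied */
    } codelet_t;

    /* Stand-in for the runtime's ready queue: here we just run the task. */
    static void schedule(codelet_t *c) { c->fn(c->arg); }

    /* Each producer satisfies one dependency; the last one enables the task. */
    static void satisfy(codelet_t *c) {
        if (atomic_fetch_sub(&c->unmet_deps, 1) == 1)
            schedule(c);
    }

    static void body(void *arg) { printf("codelet ran: %s\n", (char *)arg); }

    int main(void) {
        codelet_t c = { body, "both inputs ready", 2 };
        satisfy(&c);  /* first input arrives: one dependency still unmet */
        satisfy(&c);  /* second input arrives: codelet becomes runnable */
        return 0;
    }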

How does SWARM achieve its goals?

- Two-level threading system: first-level threads are heavyweight and bound to a processing resource; second-level lightweight threads run non-preemptively.
- Object-oriented design that is easily extended to new architectures and heterogeneous systems, working across cores, nodes, and heterogeneous devices.
- All runtime resources are accessed through split-phase, non-blocking, asynchronous operations; the results of puts/gets are delivered later through asynchronous callbacks.
- A dynamic view of the computation and the machine, in contrast to the static mapping found in current programming models.
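A minimal sketch of the two-level scheme (an illustration under simple assumptions, not SWARM's internals): heavyweight pthreads form the first level, and each drains lightweight tasks that run non-preemptively to completion.

    /* First level: heavyweight worker threads, one per processing resource.
     * Second level: lightweight tasks popped from a shared queue. */
    #include <pthread.h>
    #include <stdio.h>

    typedef struct { void (*fn)(void *); void *arg; } task_t;

    #define QCAP 64
    static task_t queue[QCAP];
    static int qhead, qtail;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    static task_t *pop(void) {
        pthread_mutex_lock(&qlock);
        task_t *t = (qhead != qtail) ? &queue[qhead++ % QCAP] : NULL;
        pthread_mutex_unlock(&qlock);
        return t;
    }

    static void *worker(void *unused) {
        (void)unused;
        for (task_t *t; (t = pop()) != NULL; )
            t->fn(t->arg);            /* non-preemptive: runs to completion */
        return NULL;
    }

    static void say(void *arg) { printf("task: %s\n", (char *)arg); }

    int main(void) {
        queue[qtail++] = (task_t){ say, "a" };   /* enqueue before starting */
        queue[qtail++] = (task_t){ say, "b" };
        pthread_t w[2];
        for (int i = 0; i < 2; i++) pthread_create(&w[i], NULL, worker, NULL);
        for (int i = 0; i < 2; i++) pthread_join(w[i], NULL);
        return 0;
    }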

SWARM Execution Overview

[Diagram: tasks with unsatisfied dependencies become enabled tasks once their dependencies are satisfied; SWARM maps enabled tasks onto available resources (CPUs and a GPU), moving resources between the available pool and the in-use pool as tasks are allocated and released.]

Runtime Resource Access

All communication happens through asynchronous split-phase transactions between resources, e.g.:
- Asynchronous procedure call: put/get into procedure resources
- Data storage: put/get to a storage resource

There are two basic resource access patterns:
1. The producer passes the key to the consumer: the producer puts the data and hands the key onward; the consumer's get(key) fires a callback carrying the data.
2. The producer and consumer know the resource key a priori: the producer's put(key) and the consumer's get(key) each complete through callbacks, in whichever order they arrive.
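The following self-contained sketch (hypothetical helper names, not SWARM's API) illustrates pattern 2: get() never blocks; if the data has not arrived yet, it parks a callback that fires when the matching put() completes.

    /* A storage resource slot whose key both sides know a priori. */
    #include <stdio.h>

    typedef void (*callback_t)(void *data, void *ctx);

    typedef struct {
        void *data;           /* NULL until the producer's put() */
        int full;
        callback_t pending;   /* consumer callback parked by get() */
        void *pending_ctx;
    } slot_t;

    /* Split-phase put: deposit data, then fire any parked consumer callback. */
    static void put(slot_t *s, void *data) {
        s->data = data;
        s->full = 1;
        if (s->pending) { s->pending(s->data, s->pending_ctx); s->pending = NULL; }
    }

    /* Split-phase get: never blocks; runs the callback now or parks it. */
    static void get(slot_t *s, callback_t cb, void *ctx) {
        if (s->full) cb(s->data, ctx);
        else { s->pending = cb; s->pending_ctx = ctx; }
    }

    static void on_data(void *data, void *ctx) {
        (void)ctx;
        printf("consumer received: %s\n", (char *)data);
    }

    int main(void) {
        slot_t slot = {0};
        get(&slot, on_data, NULL);  /* consumer asks first; callback parked */
        put(&slot, "payload");      /* producer arrives; callback fires */
        return 0;
    }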

Motivation: Resource Sharing

Allow access to limited quantities of a resource (for example, a mutex or a queue). The producer puts and the consumer gets:
- A put callback lets the producer know the operation completed.
- A get callback lets the consumer know when the resource is available.

[Diagram: tasks T1/T2 and producers P1/P2 put into queue and mutex resources; consumers C1/C2 get from them, with callbacks on both sides.]
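As one example of such a limited resource, here is a sketch of callback-based mutex sharing (again illustrative; async_mutex_t and its operations are hypothetical): a get parks a callback instead of blocking, and a put passes ownership directly to the next parked waiter.

    #include <stdio.h>

    typedef void (*callback_t)(void *ctx);

    #define MAX_WAITERS 8
    typedef struct {
        int held;
        callback_t waiters[MAX_WAITERS];  /* parked get callbacks, FIFO */
        void *ctxs[MAX_WAITERS];
        int head, tail;
    } async_mutex_t;

    /* Non-blocking acquire: run cb immediately if free, otherwise park it. */
    static void mutex_get(async_mutex_t *m, callback_t cb, void *ctx) {
        if (!m->held) { m->held = 1; cb(ctx); }
        else {
            m->waiters[m->tail] = cb;
            m->ctxs[m->tail] = ctx;
            m->tail = (m->tail + 1) % MAX_WAITERS;
        }
    }

    /* Release: hand ownership straight to the next parked waiter, if any. */
    static void mutex_put(async_mutex_t *m) {
        if (m->head == m->tail) { m->held = 0; return; }
        callback_t cb = m->waiters[m->head];
        void *ctx = m->ctxs[m->head];
        m->head = (m->head + 1) % MAX_WAITERS;
        cb(ctx);   /* the mutex stays held; ownership transfers */
    }

    static async_mutex_t lock;

    static void acquired(void *ctx) { printf("%s holds the mutex\n", (char *)ctx); }

    int main(void) {
        mutex_get(&lock, acquired, "A");  /* free: callback runs immediately */
        mutex_get(&lock, acquired, "B");  /* held: callback is parked */
        mutex_put(&lock);                 /* A releases: B's callback fires */
        return 0;
    }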

Key Features

- Exposes implicit parallelism
- Manages asynchrony
- Migration of data structures, work, and global control
- Global namespace
- Hierarchy of locales for data locality and affinity
- Runtime introspection
- Dynamic adaptive runtime system
- Solves the multicore/multinode problem while keeping physical parallelism transparent to the user
- Diversity of scheduling domains and policies for tasks and resources
- Readable intermediate representation

Moving Global Control

Consider the frontier-expansion loop of a level-synchronized breadth-first search:

    while (visited_list) {
        foreach (v in visited_list) {
            foreach (n in neighbors(v)) {
                if (!visited(n)) {            /* run this block at the owner of n */
                    new_visited_list[pos++] = n;
                    parent[n] = v;
                }
            }
        }
        swap_lists(new_visited_list, visited_list);
    }

Rather than fetching n's state across the network, the annotated block is shipped to the node that owns n, moving control to the data.
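A toy sketch of that transformation (hypothetical names; two "nodes" simulated in one process): send_active_message() stands in for the runtime shipping the block to n's owner, which then touches only its local partition of visited and parent.

    #include <stdio.h>

    #define NV 8
    #define NODES 2

    /* Each simulated node holds state only for the vertices it owns. */
    static int visited[NODES][NV];
    static int parent[NODES][NV];

    static int owner(int n) { return n % NODES; }

    /* The shipped block: runs at the owner, against local state only. */
    static void visit_on_owner(int node, int n, int v) {
        if (!visited[node][n]) {
            visited[node][n] = 1;
            parent[node][n] = v;
            printf("node %d: parent[%d] = %d\n", node, n, v);
        }
    }

    /* Stand-in for an active-message send; a real runtime would enqueue
     * a codelet on the remote node rather than call directly. */
    static void send_active_message(int node, int n, int v) {
        visit_on_owner(node, n, v);
    }

    int main(void) {
        int v = 0, n = 3;                     /* expand one BFS edge v -> n */
        send_active_message(owner(n), n, v);
        send_active_message(owner(n), n, v);  /* duplicate visit is a no-op */
        return 0;
    }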

SWARM: Key Concepts

SWift Adaptive Runtime Machine:
- Unifies across nodes, cores, and accelerators
- Dynamically maps an application's needs onto available resources
- Lets asynchronous programs be expressed so as to maximize performance and hide latency
- Makes communication and synchronization implicit in the task dependencies

SWARM Availability

Current features:
- HAL backends for x86 (32- and 64-bit) and POSIX
- Scheduling of codelets
- Creation of dependencies between codelets
- Basic network support via TCP/IP
- SCALE codelet IR language
- API documentation and programmer's guide

Early-access release: SWARM 0.7.0 is available now at http://www.etinternational.com/swarm

Planned for a new version by early December:
- Full locale support (scheduling and memory)
- Full abstraction of the hardware/OS in the HAL
- Proper network stack
- Codelet/function symmetry

SWARM Future Plans

Hardware support:
- Intel MIC, Runnemede, GPUs, Adapteva

Legacy support:
- Work with MPI/OpenMP and other runtimes:
  - Via recompilation (e.g. OpenMP)
  - Operating side by side (e.g. MPI)
  - Via DLL injection (e.g. OpenCL)
- UPC and SHMEM support

Language:
- Wrestling with a higher-level language: detailed enough for experts, yet simple for the everyday programmer

Monitoring and debugging

Key Takeaways

Contexts where SWARM helps:
- Irregular loads
- Long-latency operations
- Resource constraints other than CPU
- Heterogeneous systems

Benefits:
- Programming productivity
- Higher throughput and higher performance
- Power efficiency
- Purchasing flexibility

Key runtime concepts:
- Asynchronous split-phase resource access
- Hierarchical event-driven scheduling
- Abstraction of resources for unified heterogeneous access

Experience:
- SWARM runtime system
- SCALE codelet IR language

Case Studies
- Mandelbrot
- Barnes-Hut N-body problem
- Graph500

Mandelbrot

[Plot: Mandelbrot speedup over serial for 1-12 threads, comparing ideal, SWARM, OpenMP dynamic, OpenMP guided, and OpenMP static scheduling.]

Barnes-Hut

[Plot: Barnes-Hut speedup over serial for 1-12 threads, comparing ideal, SWARM, and OpenMP.]

Graph500 Implementation with SWARM

Graph500 is a new supercomputing benchmark aimed at more realistic application workloads. It was ported to SWARM, producing results on four different supercomputers:

Supercomputer     Processor type   Speed     Processors/node  Main memory
Sandia Redsky     Nehalem X5570    2.93 GHz  8                12 GB/node
TACC Lonestar     Westmere 5680    3.33 GHz  12               24 GB/node
Intel Endeavor    Westmere X5670   2.93 GHz  12               24 GB/node
ORNL Jaguar       Cray XT5-HE      2.6 GHz   12               16 GB/node

SWARM/MPI Performance Comparison

SWARM shows a consistent speedup over MPI, from 2-fold to 14.5-fold.

[Plot: SWARM speedup over MPI (0% to 1500%) vs. number of nodes (4 to 512) on Lonestar, Redsky, Endeavor, and Jaguar.]

Advantages of SWARM on Graph500
- Lower overhead
- Active messages: fewer copies and round trips
- Shared address space within a node
- Cache utilization can be monitored and allocated
- Idle threads can steal work from other threads
- An effective substitute for MPI + OpenMP + active messages, all in one package with lower overheads
