HPX: High Performance ParalleX. CCT Tech Talk Series. Hartmut Kaiser


HPX: High Performance ParalleX. CCT Tech Talk. Hartmut Kaiser (hkaiser@cct.lsu.edu)

2 What's HPX?
- Exemplar runtime system implementation
  - Targeting conventional architectures (Linux-based SMPs and clusters)
  - Currently a mainly software-only implementation
- Emphasis on
  - Functionality: finding the proper architecture and APIs
  - Performance: finding hotspots and contention points, reducing overheads, hiding latencies
  - API: finding the minimal but complete set of required functions
- Driven by real applications (AMR, contact, graphs, CFD)
- Should allow retargeting to different platforms
  - Stable basis for a long-term migration path of applications
  - Highly modular, allows incorporating different policy implementations
  - First experiments with custom hardware components
- Implemented in C++
  - Utilize the compiler for a highly optimized implementation
  - Utilize the language for the best possible user experience / simplest API

3-5 Technology Demands new Response (figure slides; slide footer: Cetraro Workshop 2011, 6/28/2011)

6 Amdahl's Law (Strong Scaling): S = 1 / ((1 - P) + P/N), where S is the speedup, P the proportion of parallel code, and N the number of processors. Figure courtesy of Wikipedia (http://en.wikipedia.org/wiki/Amdahl's_law)
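
To make the strong-scaling limit concrete, here is a small worked example; the numbers (P = 0.95, N = 1024) are illustrative and not taken from the talk:

```latex
% Amdahl's law with illustrative numbers (not from the slides):
% P = 0.95 (95% of the code parallelizes), N = 1024 processors.
\begin{align*}
  S(N)    &= \frac{1}{(1 - P) + P/N} \\
  S(1024) &= \frac{1}{0.05 + 0.95/1024} \approx 19.6 \\
  \lim_{N \to \infty} S(N) &= \frac{1}{1 - P} = 20
\end{align*}
```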

7 Gustafson's Law (Weak Scaling): S = N - (1 - P)(N - 1), where S is the speedup, P the proportion of parallel code, and N the number of processors. Figure courtesy of Wikipedia (http://en.wikipedia.org/wiki/Gustafson's_law)
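
For contrast, the same illustrative numbers under weak scaling, where the problem grows with the machine (again P = 0.95 and N = 1024, chosen purely for illustration):

```latex
% Gustafson's law with the same illustrative numbers (P = 0.95, N = 1024).
\begin{align*}
  S(N)    &= N - (1 - P)(N - 1) \\
  S(1024) &= 1024 - 0.05 \cdot 1023 \approx 972.9
\end{align*}
```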

8-9 The 4 Horsemen of the Apocalypse: SLOW
- Starvation: insufficient concurrent work to maintain high utilization of resources
- Latencies: time-distance delay of remote resource access and services
- Overheads: work for the management of parallel actions and resources on the critical path which is not necessary in the sequential variant
- Waiting for contention resolution: delays due to lack of availability of an oversubscribed shared resource

10 Main HPX Runtime System Tasks
- Manage parallel execution for the application (Starvation): exposing parallelism; runtime-adaptive management of parallelism and resources; synchronizing parallel tasks; thread (task) scheduling and load balancing
- Mitigate latencies for the application (Latency): latency hiding through overlap of computation and communication; latency avoidance through locality management
- Reduce overhead for the application (Overhead): synchronization, scheduling, load balancing, communication, context switching, memory management, address translation
- Resolve contention for the application (Contention): adaptive routing, resource scheduling, load balancing; localized request buffering for logical resources

11 What's?
- Active global address space (AGAS) instead of PGAS
- Message driven instead of message passing
- Lightweight control objects instead of global barriers
- Latency hiding instead of latency avoidance
- Adaptive locality control instead of static data distribution
- Moving work to data instead of moving data to work (see the sketch below)
- Fine-grained parallelism of lightweight threads instead of Communicating Sequential Processes (CSP/MPI)
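
As a rough illustration of "message driven" and "moving work to data", the sketch below is written against the present-day HPX API (HPX_PLAIN_ACTION, hpx::async on a target locality); the 2011-era interface described in this talk differed in detail, and names such as local_partition and partial_sum are invented for the example.

```cpp
// Sketch only: "moving work to data" with a plain action, using the modern
// HPX API. local_partition and partial_sum are illustrative names, not HPX.
#include <hpx/hpx_main.hpp>   // lets plain main() run as an HPX thread
#include <hpx/hpx.hpp>        // umbrella header, keeps the sketch version-agnostic

#include <iostream>
#include <numeric>
#include <vector>

// Pretend every locality owns one partition of a large distributed data set.
std::vector<double> local_partition(1000000, 1.0);

double partial_sum()
{
    return std::accumulate(local_partition.begin(), local_partition.end(), 0.0);
}
// Wrap the function so it can travel inside a parcel and run remotely.
HPX_PLAIN_ACTION(partial_sum, partial_sum_action);

int main()
{
    double total = 0.0;
    // Ship the (tiny) work description to every locality instead of pulling
    // each multi-megabyte partition across the network.
    for (hpx::id_type const& loc : hpx::find_all_localities())
        total += hpx::async(partial_sum_action(), loc).get();

    std::cout << "total = " << total << "\n";
    return 0;
}
```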

12 HPX Runtime System Design
The current version of HPX provides the following infrastructure on conventional systems, as defined by the execution model:
- Active Global Address Space (AGAS)
- Threads and thread management
- Parcel transport and parcel management
- Local Control Objects (LCOs)
- Processes: namespace and policy management, locality control
- Monitoring subsystem

13 HPX Runtime System Design (same content as slide 12, overlaid with the runtime architecture diagram: process manager, performance counters, performance monitor, AGAS address translation, parcel port, parcel handler, action manager, thread manager, thread pool, LCOs, local memory management, interconnect)

14 Active Global Address Space
- Global address space throughout the system
  - Removes the dependency on static data distribution
  - Enables dynamic load balancing of application and system data
- AGAS assigns global names (identifiers, unstructured 128-bit integers) to all entities managed by HPX
- Unlike PGAS, provides a mechanism to resolve global identifiers into the corresponding local virtual addresses (LVAs)
  - LVAs comprise the locality ID, the type of the entity being referred to, and its local memory address (see the toy sketch below)
  - Moving an entity to a different locality updates this mapping
- The current implementation is based on a centralized database storing the mappings, accessible over the local area network
  - Local caching policies have been implemented to prevent bottlenecks and minimize the number of required round trips
- The current implementation allows autonomous creation of globally unique IDs in the locality where the entity is initially located, and supports memory pooling of similar objects to minimize overhead
- Implemented a garbage collection scheme for HPX objects
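
The following toy sketch (not HPX's actual data structures) illustrates the kind of mapping described above: a 128-bit global id resolves to a local virtual address made up of a locality id, an entity type, and a local memory address, with a local cache in front of the authoritative table to save round trips.

```cpp
// Toy sketch, not HPX's actual AGAS implementation.
#include <cstdint>
#include <optional>
#include <unordered_map>

struct GlobalId { std::uint64_t msb, lsb; };   // unstructured 128-bit id

struct LocalVirtualAddress
{
    std::uint32_t locality;   // which node currently hosts the entity
    std::uint32_t type;       // what kind of entity the id refers to
    void*         address;    // its local memory address on that node
};

inline bool operator==(GlobalId const& a, GlobalId const& b)
{ return a.msb == b.msb && a.lsb == b.lsb; }

struct GlobalIdHash
{
    std::size_t operator()(GlobalId const& id) const
    {
        return std::hash<std::uint64_t>{}(id.msb) ^
               (std::hash<std::uint64_t>{}(id.lsb) << 1);
    }
};

class ToyAgas
{
    std::unordered_map<GlobalId, LocalVirtualAddress, GlobalIdHash> table_;  // authoritative store
    std::unordered_map<GlobalId, LocalVirtualAddress, GlobalIdHash> cache_;  // local cache

public:
    void bind(GlobalId id, LocalVirtualAddress lva) { table_[id] = lva; }

    // Moving an entity just rebinds its id; users keep the same GlobalId.
    void rebind(GlobalId id, LocalVirtualAddress new_lva)
    {
        table_[id] = new_lva;
        cache_.erase(id);
    }

    std::optional<LocalVirtualAddress> resolve(GlobalId id)
    {
        if (auto it = cache_.find(id); it != cache_.end())
            return it->second;                       // cache hit: no round trip
        if (auto it = table_.find(id); it != table_.end())
        {
            cache_[id] = it->second;                 // fill the cache for next time
            return it->second;
        }
        return std::nullopt;                         // unknown global id
    }
};
```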

15 Active Global Address Space (same content as slide 14, shown together with the runtime architecture diagram)

16 Parcel Management
- Active messages (parcels): destination address, function to execute, parameters, continuation
- Any inter-locality messaging is based on parcels
- In HPX, parcels are represented as polymorphic objects
- An HPX entity, on creating a parcel object, hands it to the parcel handler
- The parcel handler serializes the parcel; all dependent data is bundled along with the parcel
- At the receiving locality the parcel is deserialized and causes an HPX thread to be created based on its content
(Figure: a parcel object is put() into the parcel handler on locality 1, travels as a serialized parcel across the interconnect, and is deserialized by the action manager on locality 2, where it spawns HPX threads)
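
A toy sketch of the pieces of information a parcel bundles together, in the spirit of the description above (illustrative only; these are not HPX's actual classes or field names):

```cpp
// Toy sketch, not HPX's actual parcel classes.
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct ToyParcel
{
    std::uint64_t          destination;   // global address of the target entity
    std::string            action_name;   // which registered function to execute there
    std::vector<std::byte> arguments;     // serialized parameters (dependent data travels along)
    std::uint64_t          continuation;  // global id of an LCO to trigger with the result (0 = none)
};

// Receiving side, schematically: the parcel handler deserializes the parcel
// and asks the thread manager to create a lightweight HPX thread that runs
// the named action with the unpacked arguments.
```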

17 Parcel Management (same content as slide 16, shown together with the runtime architecture diagram)

18 Thread Management
- The thread manager is modular and implements a work-queue-based management scheme as specified by the execution model
- Threads are cooperatively scheduled at user level without requiring a kernel transition
- Specially designed synchronization primitives such as semaphores, mutexes, etc. allow PX-threads to be synchronized in the same way as conventional threads (see the sketch below)
- Thread management currently supports several key modes:
  - Global thread queue
  - Local queue (work stealing)
  - Local priority queue (work stealing)
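
A minimal sketch of what this buys the user, written against the modern HPX API (hedged: the 2011 interface differed, and hpx::mutex is the modern spelling of HPX's user-level mutex): a hundred thousand lightweight HPX threads run on a fixed pool of OS worker threads, and a contended lock suspends only the HPX thread, never the worker core it happens to run on.

```cpp
// Minimal sketch, modern HPX API; older releases spell the mutex
// hpx::lcos::local::mutex and use different header names.
#include <hpx/hpx_main.hpp>   // lets plain main() run as an HPX thread
#include <hpx/hpx.hpp>        // umbrella header, keeps the sketch version-agnostic

#include <iostream>
#include <mutex>
#include <vector>

int main()
{
    hpx::mutex mtx;
    long counter = 0;

    std::vector<hpx::future<void>> tasks;
    tasks.reserve(100000);
    for (int i = 0; i != 100000; ++i)
    {
        // Each async call creates a lightweight HPX thread, not an OS thread.
        tasks.push_back(hpx::async([&] {
            std::lock_guard<hpx::mutex> lock(mtx);   // suspends this HPX thread if contended
            ++counter;
        }));
    }
    hpx::wait_all(tasks);

    std::cout << counter << " tasks ran on "
              << hpx::get_num_worker_threads() << " OS worker threads\n";
    return 0;
}
```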

19 Thread Management (same content as slide 18, shown together with the runtime architecture diagram)

20 Constraint-based Synchronization
- Compute dependencies at task instantiation time
- No global barriers; uses constraint-based synchronization (see the dataflow sketch below)
- Computation flows at its own pace
- Message driven
- Symmetry between local and remote task creation/execution
- Possibility to control grain size
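
A sketch of this style of synchronization using futures and hpx::dataflow (hedged: this is the modern spelling, the 2011 code base exposed the functionality differently). The final task is released automatically once both of its inputs exist; nothing blocks and there is no barrier.

```cpp
// Sketch of constraint-based synchronization with futures (modern HPX API).
#include <hpx/hpx_main.hpp>   // lets plain main() run as an HPX thread
#include <hpx/hpx.hpp>        // umbrella header, keeps the sketch version-agnostic

#include <iostream>
#include <utility>

int main()
{
    hpx::future<int> a = hpx::async([] { return 2; });
    hpx::future<int> b = hpx::async([] { return 3; });

    // c is scheduled as soon as a and b become ready; the data dependency is
    // the only synchronization constraint.
    hpx::future<int> c = hpx::dataflow(
        [](hpx::future<int> x, hpx::future<int> y) { return x.get() * y.get(); },
        std::move(a), std::move(b));

    std::cout << c.get() << "\n";   // prints 6
    return 0;
}
```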

21 LCOs (Local Control Objects)
- LCOs provide a means of controlling parallelization and synchronization of PX-threads
- They enable event-driven thread creation and can support in-place data structure protection and on-the-fly scheduling
- Preferably embedded in the data structures they protect
- Abstraction of a multitude of different functionalities:
  - event-driven PX-thread creation
  - protection of data structures from race conditions
  - automatic on-the-fly scheduling of work
- An LCO may create (or reactivate) a PX-thread as a result of being triggered

22 LCOs (Local Control Objects) (same content as slide 21, shown together with the runtime architecture diagram)

23 Exemplar LCO: Futures
- In HPX, the futures LCO refers to an object that acts as a proxy for a result that is initially not known
- When user code invokes a future (using future.get()), the thread does one of two things:
  - If the remote data/arguments are available, the future.get() operation fetches the data and the execution of the thread continues
  - If the remote data is NOT available, the thread may continue until it requires the actual value; then it suspends, allowing other threads to continue execution. The original thread reactivates as soon as the data dependency is resolved (see the sketch below)
(Figure: on locality 1 the consumer thread suspends on the future object while another thread executes; on locality 2 a producer thread executes the future; once the result is returned, the consumer thread resumes)
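
A minimal sketch of that behavior (hedged: a plain local hpx::async stands in for a remote invocation here). get() either returns the value immediately or suspends only this lightweight thread; the worker core runs other HPX threads until the producer fulfils the future.

```cpp
// Minimal future sketch using the modern HPX API.
#include <hpx/hpx_main.hpp>   // lets plain main() run as an HPX thread
#include <hpx/hpx.hpp>        // umbrella header, keeps the sketch version-agnostic

#include <iostream>

int produce_result()
{
    return 42;   // stand-in for work that may run on another locality
}

int main()
{
    // The caller immediately receives a proxy for the not-yet-known result.
    hpx::future<int> f = hpx::async(produce_result);

    // ... the consumer thread can keep doing unrelated work here ...

    std::cout << "result = " << f.get() << "\n";   // fetch, or suspend until ready
    return 0;
}
```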

24 Processes
- Management of namespace and locality
- Centerpiece for a truly distributed AGAS: we completed the first step in re-implementing AGAS towards being distributed
- Encapsulation of blocks of functionality and possibly distributed data
- Completed the software architecture design for processes
- Implemented prototypical processes managing read-only distributed data

25 Recent Results
- First formal release of HPX3 (V0.6); V1.0 scheduled for May
- Re-implemented AGAS on top of the parcel transport layer: better distributed scalability, better asynchrony, latency hiding
- First results implementing processes: distributed (read-only) data partitioning of very large data sets
- First (encouraging) results from distributed runs: demonstrated strong scaling similar to SMP
- Consolidated performance counter monitoring framework: allows measuring almost arbitrary system characteristics using a unified interface
- Implemented remote application steering: used to demonstrate control of power consumption
- New applications: linear algebra, N-body, chess, contact, graph500

26 Scaling & performance: AMR using MPI and HPX (plot: scaling of the MPI AMR application, normalized to 1 core, versus levels of AMR refinement 0-8, for 1, 2, 4, 10, and 20 cores)

27 Scaling & performance: AMR using MPI and HPX (plots: scaling of the MPI AMR application and of the HPX AMR application, normalized to 1 core, versus levels of AMR refinement 0-8, for 1, 2, 4, 10, and 20 cores)

28 Scaling & performance: AMR using MPI and HPX (plots: scaling of the MPI and HPX AMR applications versus levels of AMR refinement, and the wallclock time ratio MPI/HPX (HPX = 1) versus number of cores, for 0-3 levels of refinement (LoR), measured on pollux.cct.lsu.edu with 32 cores)

29-30 Distributed Strong Scaling (figure slides)

31 Overheads: AGAS (plot: AGAS overhead benchmark with 10,000 futures; walltime normalized to 1 OS thread versus number of AGAS worker OS-threads, comparing no caching, caching, and ranged caching)

32 Overhead: Threads (plot: execution time in seconds for 1,000,000 PX threads versus number of OS threads (cores), 0-48, for task durations of 0, 3.5, 7, 14.5, 29, 58, and 115 microseconds)

33 Overhead: Threads (plots: execution time for 1,000,000 PX threads as on the previous slide, and execution time for 100,000 futures versus number of OS threads (cores), for task durations of 0, 10, 19, 38, 76, 150, and 300 microseconds)

34 Results from a Gauss-Seidel Solver (plot: execution time in seconds of the Gauss-Seidel solver depending on grain size N², matrix size 2000x2000, for 1, 2, 4, 8, 16, 32, and 48 cores)

35 Results from a Gauss-Seidel Solver (plots: execution time of the Gauss-Seidel solver depending on grain size, matrix size 2000x2000, for 1 to 48 cores; and strong scaling of the solver, normalized to 1 core, matrix size 2000x2000, grain size 90x90, comparing HPX and OpenMP (OMP) over 1 to 48 cores)
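
To make "grain size" concrete, here is an illustrative sketch: one HPX task per N x N block of the matrix, so a larger N means fewer, heavier tasks (less overhead, worse load balance) and a smaller N the opposite. This is a simplified Jacobi-style sweep with separate source and destination grids, not the solver measured above, and it uses the modern HPX API.

```cpp
// Illustrative grain-size sketch; not the solver from the measurements above.
#include <hpx/hpx_main.hpp>   // lets plain main() run as an HPX thread
#include <hpx/hpx.hpp>        // umbrella header, keeps the sketch version-agnostic

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Relax one n x n block of interior points, reading src and writing dst.
void relax_block(std::vector<double> const& src, std::vector<double>& dst,
                 std::size_t dim, std::size_t r0, std::size_t c0, std::size_t n)
{
    for (std::size_t r = std::max<std::size_t>(r0, 1); r < std::min(r0 + n, dim - 1); ++r)
        for (std::size_t c = std::max<std::size_t>(c0, 1); c < std::min(c0 + n, dim - 1); ++c)
            dst[r * dim + c] = 0.25 * (src[(r - 1) * dim + c] + src[(r + 1) * dim + c] +
                                       src[r * dim + c - 1] + src[r * dim + c + 1]);
}

// One sweep: spawn one lightweight HPX task per block of size grain x grain.
void sweep(std::vector<double> const& src, std::vector<double>& dst,
           std::size_t dim, std::size_t grain)
{
    std::vector<hpx::future<void>> tasks;
    for (std::size_t r = 0; r < dim; r += grain)
        for (std::size_t c = 0; c < dim; c += grain)
            tasks.push_back(hpx::async(relax_block, std::cref(src), std::ref(dst),
                                       dim, r, c, grain));
    hpx::wait_all(tasks);
}

int main()
{
    std::size_t const dim = 2000;                      // matrix size used on the slide
    std::vector<double> src(dim * dim, 1.0), dst(src);
    sweep(src, dst, dim, 90);                          // grain size 90 x 90, as in the scaling plot
    return 0;
}
```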

36 Grain Size: The New Freedom

37-38 Overhead: Load Balancing (figure slides): competing effects determine the optimal grain size, overheads vs. load balancing (starvation)

39 Conclusions
- Are we there yet? Definitely NO! But very promising results support our claim.
- Are we on the right path? Definitely YES! It might not be THE right path, but it's a leap.
- Do we have a cure for those scaling-impaired applications? We're not sure yet! Based on the results, we are optimistic.