Oak Ridge National Laboratory Computing and Computational Sciences


1 Preparing OpenSHMEM for Exascale
Oak Ridge National Laboratory, Computing and Computational Sciences
Presented by: Pavel Shamis (Pasha)
HPC Advisory Council Stanford Conference, California, Feb 2, 2015

2 Outline
- CORAL overview
- Summit
- What is OpenSHMEM?
- Preparing OpenSHMEM for Exascale
- Recent advances

3 CORAL
- CORAL: collaboration of ORNL, ANL, and LLNL
- Objective: procure three leadership computers to be sited at Argonne, Oak Ridge, and Lawrence Livermore in 2017. Two of the contracts have been awarded, with the Argonne contract in process.
- Leadership computers: the RFP requests >100 PF, 2 GB/core main memory, local NVRAM, and science performance 5x-10x that of Titan or Sequoia

4 The Road to Exascale
Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 cores. Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors.
- Jaguar: 2.3 PF, multi-core CPU, 7 MW
- Titan: 27 PF, hybrid GPU/CPU, 9 MW
- Summit: 5-10x Titan, hybrid GPU/CPU, 10 MW
- CORAL system OLCF5: 5-10x Summit, ~20 MW

5 System Summary
- Compute node: POWER-architecture processor, NVIDIA Volta, NVMe-compatible PCIe 800 GB SSD, >512 GB coherent shared memory (HBM + DDR4)
- Compute rack: standard 19", warm-water cooling
- Compute system: Summit, 5x-10x Titan at 10 MW
- Node design: IBM POWER CPU connected via NVLink to NVIDIA Volta with HBM
- Interconnect: Mellanox dual-rail EDR InfiniBand

6 How does Summit compare to Titan?

Feature                                          Summit                       Titan
Application performance                          5-10x Titan                  Baseline
Number of nodes                                  ~3,400                       18,688
Node performance                                 >40 TF                       1.4 TF
Memory per node                                  >512 GB (HBM + DDR4)         38 GB (GDDR5 + DDR3)
NVRAM per node                                   800 GB                       0
Node interconnect                                NVLink (5-12x PCIe 3)        PCIe 2
System interconnect (node injection bandwidth)   Dual-rail EDR-IB (23 GB/s)   Gemini (6.4 GB/s)
Interconnect topology                            Non-blocking fat tree        3D torus
Processors                                       IBM POWER9 + NVIDIA Volta    AMD Opteron + NVIDIA Kepler
File system                                      120 PB, 1 TB/s, GPFS         32 PB, 1 TB/s, Lustre
Peak power consumption                           10 MW                        9 MW

Source: "Present and Future Leadership Computers at OLCF," Buddy Bland

7 Challenges for Programming Models
- Very powerful compute nodes
- Hybrid architecture: multiple CPUs/GPUs, different types of memory
- Must be fun to program ;-)
- MPI + X

8 What is OpenSHMEM?
- A communication library and interface specification that implements a Partitioned Global Address Space (PGAS) programming model
- Processing Element (PE): an OpenSHMEM process
- Symmetric objects have the same address (or offset) on all PEs
Diagram: on each of PE 0 .. PE N-1, global and static variables and the symmetric heap (e.g., X = shmalloc(sizeof(long))) are remotely accessible symmetric data objects; local variables are private.
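As a small illustration of the two kinds of symmetric objects described above (a sketch, assuming the OpenSHMEM 1.1-era C API, where initialization is start_pes()):

    #include <shmem.h>

    long counter;                                  /* global variable: symmetric automatically */

    int main(void) {
        start_pes(0);                              /* initialize the library (1.1-era API) */
        long *x = (long *)shmalloc(sizeof(long));  /* same offset in every PE's symmetric heap */
        /* Both &counter and x are remotely accessible from any PE;
           automatic (stack) variables are private to the PE. */
        shfree(x);
        return 0;
    }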

9 OpenSHMEM Operations
- Remote memory Put and Get:
  void shmem_getmem(void *target, const void *source, size_t len, int pe);
  void shmem_putmem(void *target, const void *source, size_t len, int pe);
- Remote memory atomic operations, e.g.:
  void shmem_int_add(int *target, int value, int pe);
- Collectives: broadcast, reductions, etc.
- Synchronization operations: point-to-point and global
- Ordering operations
- Distributed lock operations
(The diagram repeats the symmetric-memory layout from the previous slide.)

10 OpenSHMEM Code Example
(code listing on slide; not transcribed)

11 OpenSHMEM Code Example
You just learned to program OpenSHMEM! The example walks through library initialization, AMO/PUT/GET, synchronization, and done.
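The listing itself was an image and was not transcribed; below is a minimal sketch of the pattern the slide names (initialization, AMO/PUT/GET, synchronization, done), written against the OpenSHMEM 1.1-era C API:

    #include <stdio.h>
    #include <shmem.h>

    int target;                                      /* global => symmetric */

    int main(void) {
        start_pes(0);                                /* library initialization */
        int me = _my_pe();
        int npes = _num_pes();

        long *src = (long *)shmalloc(sizeof(long));  /* symmetric heap allocation */
        *src = me;
        shmem_barrier_all();                         /* everyone allocated and initialized */

        shmem_int_add(&target, 1, 0);                /* AMO: each PE increments PE 0's target */
        shmem_putmem(src, src, sizeof(long),
                     (me + 1) % npes);               /* PUT my id to my right neighbor */

        shmem_barrier_all();                         /* synchronization */

        long val;
        shmem_getmem(&val, src, sizeof(long), 0);    /* GET PE 0's copy of src */
        if (me == 0)
            printf("target = %d, src on PE 0 = %ld\n", target, val);

        shfree(src);
        return 0;                                    /* done */
    }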

12 OpenSHMEM
- OpenSHMEM is a one-sided communication library with C and Fortran APIs
- Uses symmetric data objects to efficiently communicate across processes
- Advantages: good for irregular applications with latency-driven communication and random memory access patterns; maps really well to hardware/interconnects

Feature        InfiniBand (Mellanox)   Gemini/Aries (Cray)
RMA PUT/GET    yes                     yes
Atomics        yes                     yes
Collectives    yes                     yes

13 OpenSHMEM Key Principles
- Keep it simple: the specification is only ~80 pages
- Keep it fast: as close as possible to the hardware

14 Evolution of OpenSHMEM
- SHMEM library introduced by Cray Research Inc. (T3D systems)
- Adapted by SGI for products based on the NUMAlink architecture and included in the Message Passing Toolkit (MPT)
- Vendor-specific SHMEM libraries emerge (Quadrics, HP, IBM, Mellanox, Intel, GPSHMEM, SiCortex, etc.)
- OpenSHMEM is born: ORNL and UH come together to address the differences between the various SHMEM implementations
- OSSS signs the SHMEM trademark licensing agreement
- OpenSHMEM 1.0 is finalized; reference implementation, V&V, and tools follow
- OpenSHMEM 1.1 released
- 2015 onwards, the next OpenSHMEM specifications: faster, more predictable, more agile. OpenSHMEM is a living specification!

15 OpenSHMEM Roadmap
- OpenSHMEM v1.1 (June 2014): errata and bug fixes; ratified (100+ tickets resolved)
- OpenSHMEM v1.2 (early 2015): API naming convention (finalize(), global_exit()), consistent data-type support, version information, clarifications (zero-length operations, wait, shmem_ptr())
- OpenSHMEM v1.5 (late 2015): non-blocking communication semantics (RMA, AMO), teams/groups, thread safety
- OpenSHMEM v1.6: non-blocking collectives
- OpenSHMEM v1.7: thread-safety update
- OpenSHMEM Next Generation (2.0): let's go wild!!! (Exascale!) Active sets + memory contexts, fault tolerance, exit codes, locality, I/O
- White paper: OpenSHMEM Tools API

16 OpenSHMEM Community Today
Academia, vendors, and government.

17 OpenSHMEM Implementations
- Proprietary: SGI SHMEM, Cray SHMEM, IBM SHMEM, HP SHMEM, Mellanox Scalable SHMEM
- Legacy: Quadrics SHMEM
- Open source: OpenSHMEM Reference Implementation (UH), Portals SHMEM, OSHMPI / OpenSHMEM over MPI (under development), OpenSHMEM with Open MPI, OpenSHMEM with MVAPICH (OSU), TSHMEM (UFL), GatorSHMEM (UFL)

18 OpenSHMEM Eco-system
Around the OpenSHMEM Reference Implementation: development tools, the OpenSHMEM Analyzer, performance analysis (Vampir), and debugging.

19 OpenSHMEM Eco-system
Around the OpenSHMEM specification: Vampir, TAU, DDT, OpenSHMEM Analyzer, UCCS.

20 Upcoming Challenges for OpenSHMEM
Based on what we know about the upcoming architecture:
- Hybrid architecture: communication across different components of the system, locality of resources, multiple CPUs/GPUs
- Thread safety (without performance sacrifices), thread locality, scalability
- Different types of memory, address spaces

21 Hybrid Architecture Challenges and Ideas
- OpenSHMEM for accelerators: "TOC-Centric Communication: a case study with NVSHMEM," OUG/PGAS 2014, Sreeram Potluri
- Preliminary study, prototype concept

22 NVSHMEM
The problem: communication across GPUs requires synchronization with the host, incurring software overheads and the hardware overhead of launching kernels. Research idea/concept proposed by NVIDIA: GPU-initiated communication. The NVSHMEM communication primitives nvshmem_put() and nvshmem_get() move data to/from remote GPU memory, emulated using CUDA IPC (CUDA 4.2).

Traditional loop structure:
    loop {
        Interior compute (kernel launch)
        Pack boundaries (kernel launch)
        Stream synchronize
        Exchange (MPI/OpenSHMEM)
        Unpack boundaries (kernel launch)
        Boundary compute (kernel launch)
        Stream/device synchronize
    }
Drawbacks: kernel-launch overheads and CPU-based blocking synchronization.

The slide is based on "TOC-Centric Communication: a case study with NVSHMEM," OUG/PGAS 2014, Sreeram Potluri.

23 NVSHMEM
Stencil update: u[i][j] = u[i][j] + (v[i+1][j] + v[i-1][j] + v[i][j+1] + v[i][j-1])/x
(Figure: preliminary results, time per step in usec vs. stencil size from 64 to 2K, comparing the traditional approach against a persistent kernel.)
Evaluation results from "TOC-Centric Communication: a case study with NVSHMEM," OUG/PGAS 2014, Sreeram Potluri.

24 Many-Core System Challenges
- It is challenging to provide high-performance THREAD_MULTIPLE support: locks and atomic operations sit in the communication path
- Even though the MPI IMB benchmark benefits from full process memory separation, multi-threaded UCCS obtains comparable performance
Aurelien Bouteiller, Thomas Herault, and George Bosilca, "A Multithreaded Communication Substrate for OpenSHMEM," OUG 2014.

25 Many-Core System Challenges: Old Ideas
SHMEM_PTR (or SHMEM_LOCAL_PTR on Cray): Y = shmem_ptr(&X, PE1)
(Diagram: PE 0 maps PE 1's copy of the symmetric variable X into its own address space; the local variable Y then points directly at PE 1's X.)

26 Many-Core System Challenges: Old Ideas (cont.)
- Provides direct access to a remote PE's data with memory load and store operations
- Supported on systems where SHMEM_PUT/GET are implemented with memory load and store operations
- Usually implemented using XPMEM
Gabriele Jost, Ulf R. Hanebutte, James Dinan, "OpenSHMEM with Threads: A Bad Idea?"
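A short sketch of the shmem_ptr() pattern described above: on platforms where the remote PE is directly addressable (e.g., via XPMEM), the returned pointer can be dereferenced with plain loads and stores; elsewhere it is NULL:

    #include <stdio.h>
    #include <shmem.h>

    long x;                                    /* symmetric */

    int main(void) {
        start_pes(0);
        x = _my_pe();
        shmem_barrier_all();                   /* everyone has written its x */

        long *y = (long *)shmem_ptr(&x, 1);    /* direct pointer to PE 1's x, if possible */
        if (y != NULL)
            printf("PE %d reads PE 1's x = %ld via a plain load\n", _my_pe(), *y);
        else
            printf("PE 1 is not load/store accessible from PE %d\n", _my_pe());
        return 0;
    }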

27 Many-Core System Challenges: New Ideas
OpenSHMEM contexts, proposed by Intel: James Dinan and Mario Flajslik, "Contexts: A Mechanism for High Throughput Communication in OpenSHMEM," PGAS 2014.
- Explicit API for allocation and management of communication contexts
(Diagram: threads 0-2 of an OpenSHMEM application each drive their own context in the OpenSHMEM library, issuing puts and gets independently.)
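For flavor, a sketch of how a per-thread context might be used under this proposal. The names below follow the shmem_ctx_* form the proposal eventually took in OpenSHMEM 1.4; at the time of the talk this was still a research proposal, so treat the signatures as illustrative:

    #include <stddef.h>
    #include <shmem.h>

    /* Each thread drives its own context, so its operations can be issued
       and completed without serializing against other threads. */
    void thread_put(void *dst, const void *src, size_t len, int pe) {
        shmem_ctx_t ctx;
        if (shmem_ctx_create(0, &ctx) != 0)
            return;                                /* no resources for a context */
        shmem_ctx_putmem(ctx, dst, src, len, pe);  /* issued on this context only */
        shmem_ctx_quiet(ctx);                      /* completes this context's ops,
                                                      not other threads' */
        shmem_ctx_destroy(ctx);
    }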

28 Many-Core System Challenges: New Ideas (cont.)
Cray's proposal of "hot threads": Monika ten Bruggencate (Cray Inc.), "Cray SHMEM Update," First OpenSHMEM Workshop: Experiences, Implementations and Tools.
- Idea: each thread is registered with the OpenSHMEM library; the library allocates and automatically manages communication resources (a context) for the application
- Compatible with the current API

29 Address Space and Locality Challenges
- The symmetric heap is not flexible enough: all PEs have to allocate the same amount of memory
- No concept of locality
- How do we manage different types of memory? What is the right abstraction?

30 Memory Spaces
Aaron Welch, Swaroop Pophale, Pavel Shamis, Oscar Hernandez, Stephen Poole, Barbara Chapman, "Extending the OpenSHMEM Memory Model to Support User-Defined Spaces," PGAS 2014.
- Concept of teams: the original OpenSHMEM active-set (group of PEs) concept is outdated, BUT very lightweight (a local operation)
- Memory spaces, with each memory space associated with a team
- Similar concepts can be found in MPI, Chapel, etc.

31 Teams
- Explicit method of grouping PEs
- Fully local objects and operations - fast
- New (sub)teams created from parent teams
- Re-indexing of PE ids with respect to the team
- Strided teams and axial splits: no need to maintain a translation array, since all translations can be done with simple arithmetic (see the sketch below)
- Ability to specify a team index for remote operations
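A sketch of the index arithmetic that makes strided teams cheap (the struct and helpers below are hypothetical illustrations, not part of any specification):

    /* Hypothetical representation of a strided team: PEs start, start+stride, ... */
    typedef struct { int start; int stride; int size; } strided_team_t;

    /* Team-relative PE id -> global PE id: one multiply-add, no table. */
    static inline int team_to_global(const strided_team_t *t, int team_pe) {
        return t->start + team_pe * t->stride;
    }

    /* Global PE id -> team-relative id, or -1 if the PE is not a member. */
    static inline int global_to_team(const strided_team_t *t, int pe) {
        int off = pe - t->start;
        if (off < 0 || off % t->stride != 0 || off / t->stride >= t->size)
            return -1;
        return off / t->stride;
    }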

32 Spaces
(diagram on slide; not transcribed)

33 Spaces
- Space and team creation are decoupled
- Faster memory allocation compared to shmalloc
- Future directions: different types of memory, locality, separate address spaces, asymmetric RMA access
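A hypothetical sketch of the decoupled team-then-space flow described above (the shmemx_* names below are invented for illustration in the style of OpenSHMEM extensions; the actual proposal is in the PGAS 2014 paper cited earlier):

    #include <stddef.h>
    #include <shmem.h>

    /* Hypothetical extension API; all shmemx_* names invented for illustration. */
    double *alloc_on_even_pes(int npes, size_t n) {
        shmemx_team_t even_pes;
        shmemx_space_t fast_mem;

        /* Team creation is a local, cheap operation: every other PE from 0. */
        shmemx_team_split_strided(SHMEMX_TEAM_WORLD, 0, 2, npes / 2, &even_pes);

        /* The space exists only on team members, so odd PEs allocate nothing;
           shmalloc, by contrast, allocates symmetrically across ALL PEs. */
        shmemx_space_create(even_pes, 1 << 20, &fast_mem);
        return (double *)shmemx_space_malloc(fast_mem, n * sizeof(double));
    }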

34 Fault Tolerance?
- How do we run in the presence of faults?
- What is the responsibility of the programming model and communication libraries?
Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop Pophale, Aaron Welch, Stephen Poole, Barbara Chapman, "Fault Tolerance for OpenSHMEM," PGAS/OUG 2014.

35 Fault Tolerance
Basic idea: in-memory checkpointing of the symmetric memory regions, with either symmetric recovery or memory-only recovery.

36 Fault Tolerance Code Snippet
(code listing on slide; not transcribed)
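Since the snippet was not transcribed, the following hypothetical sketch conveys the in-memory checkpoint idea (the shmemx_* names are invented for illustration; see the PGAS/OUG 2014 paper for the proposed interface):

    #include <stddef.h>
    #include <shmem.h>

    /* Hypothetical checkpoint/restore calls, invented for illustration. */
    void compute_with_checkpoints(long *state /* symmetric */, size_t nbytes) {
        /* Replicate the symmetric region into remote memory. */
        shmemx_checkpoint(state, nbytes);

        /* ... compute; a PE may fail here ... */

        /* On failure, restore the region (symmetric or memory-only recovery). */
        if (shmemx_restore(state, nbytes) != 0) {
            /* recovery impossible: fall back to a full restart */
        }
    }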

37 Fault Tolerance
- Work in progress; OpenSHMEM is just one piece of the puzzle (run-time, I/O, drivers, etc.)
- The system has to provide fault-tolerance infrastructure: error notification, coordination, etc.
- Leveraging existing work/research in the HPC community: MPI, Hadoop, etc.

38 Summary
- This is just a snapshot of some of the ideas. Other active research & development topics: non-blocking operations, counting operations, signaled operations, asymmetric memory access, etc.
- These challenges are relevant for many other HPC programming models
- The key to success: co-design of hardware and software, and generic solutions that target the broader community
- The challenges are common across different fields: storage, analytics, big data, etc.

39 How to Get Involved?
- Join the mailing list
- Join the OpenSHMEM Redmine and GitHub: OpenSHMEM reference implementation, test suites, benchmarks, etc.
- Participate in our upcoming events: workshops, user group meetings, and conference calls

40 Upcoming Events
OpenSHMEM Workshop 2015, August 4th-6th.

41 Upcoming Events
Co-located with PGAS 2015, the International Conference on Partitioned Global Address Space Programming Models, Washington, DC.

42 Acknowledgements
This work was supported by the United States Department of Defense and used resources of the Extreme Scale Systems Center at Oak Ridge National Laboratory. "Empowering the Mission"

43 Questions?

44 Backup Slides

45 NVSHMEM Code Example
Using NVSHMEM, device code (reconstructed from the slide's two-column listing; the slide's elisions are kept as /* ... */ comments):

    /* peers[] array has left and right PE ids */
    __global__ void one_kernel(float *u, float *v, int *sync /* , ... */) {
        int i = threadIdx.x;
        for (/* ... */) {
            /* fetch halo elements from the neighboring GPUs */
            if (i + 1 > nx) {
                v[i + 1] = nvshmem_float_g(&v[1], right_pe);
            }
            if (i - 1 < 1) {
                v[i - 1] = nvshmem_float_g(&v[nx], left_pe);
            }
            /* ... */
            if (i < 2) {
                nvshmem_int_p(&sync[i], 1, peers[i]);  /* signal neighbor */
                nvshmem_quiet();
                nvshmem_wait_until(&sync[i], EQ, 1);   /* wait for neighbor's signal */
            }
            /* intra-process sync */
            /* compute v from u and sync */
            u[i] = (u[i] + (v[i + 1] + v[i - 1] /* ... */));
        }
    }

Evaluation results from "TOC-Centric Communication: a case study with NVSHMEM," OUG/PGAS 2014, Sreeram Potluri.
