A One-Sided View of HPC: Global-View Models and Portable Runtime Systems


1. A One-Sided View of HPC: Global-View Models and Portable Runtime Systems
James Dinan, James Wallace Givens Postdoctoral Fellow, Argonne National Laboratory

2. Why Global-View?
[Figure: a global address space spanning Proc 0 .. Proc n, with a distributed array X partitioned across the shared portion of each process's memory; each process also retains private memory.]
- Mechanism for dealing with large datasets: similar convenience to shared memory, but data are not operated on in place and there is no coherence across data copies.
- Mechanism for coping with sparse, unbalanced computations: work distribution is decoupled from data distribution, enabling more flexible load balancing.
- Computational chemistry applications exhibit both of these characteristics; we specifically investigate NWChem and Global Arrays.

3. Why One-Sided?
[Figure: two nodes, each with memory banks attached to a NIC, communicating directly over the network.]
Explicit message exchange couples two operations:
1. Synchronization of processes: the operation completes only when the sender calls send and the receiver calls receive.
2. Data movement.
One-sided communication decouples these operations, enabling asynchronous data movement, and is supported efficiently by RDMA-capable networks.

4. Global Arrays and ARMCI: the Aggregate Remote Memory Copy Interface
- Global Arrays (GA): distributed, shared arrays.
- ARMCI is GA's runtime system: it manages array data and provides portability.
- Asynchronous, one-sided communication: get, put, accumulate, ...; noncontiguous operations; mutexes, atomics, collectives, processor groups, ...
- Consistency model: remote accesses are location consistent; local accesses may use direct load/store.
- A GA call such as GA_Put({x,y},{x',y'}) is translated into an ARMCI call such as ARMCI_PutS(rank, addr, ...). A minimal ARMCI usage sketch follows below.
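To make the ARMCI layer concrete, here is a minimal sketch (not taken from the slides) of a program in which each process writes into its right neighbor's shared segment. It assumes the classic ARMCI C API (ARMCI_Init, ARMCI_Malloc, ARMCI_Put, ARMCI_Fence); buffer sizes are illustrative.

#include <mpi.h>
#include <armci.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int me, nproc;
    MPI_Init(&argc, &argv);
    ARMCI_Init();                       /* ARMCI runs on top of the MPI processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Collectively allocate one shared segment per process; every rank
     * receives the base address of every other rank's segment. */
    void **base = malloc(nproc * sizeof(void *));
    ARMCI_Malloc(base, 8 * sizeof(double));

    double src[8];
    for (int i = 0; i < 8; i++) src[i] = me + 0.1 * i;

    int right = (me + 1) % nproc;
    ARMCI_Put(src, base[right], 8 * sizeof(double), right); /* one-sided: no receive on the target */
    ARMCI_Fence(right);                 /* wait for remote completion at the target */

    ARMCI_Barrier();
    ARMCI_Free(base[me]);
    free(base);
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}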

5. The GA/ARMCI Software Stack
- ARMCI support is natively implemented, but vendor support is sparse and implementations lag behind systems. MPI, by contrast, is ubiquitous and has supported one-sided communication for 15 years.
- Goal: use MPI RMA to implement ARMCI.
  1. Portable one-sided communication for NWChem users.
  2. MPI-2: drive implementation performance and one-sided tools.
  3. MPI-3: motivate features.
  4. Interoperability: increase the resources available to the application; ARMCI and MPI share progress, buffer pinning, and network and host resources.
- Challenge: mismatch between MPI RMA and ARMCI.
[Figure: the native stack (NWChem over Global Arrays over a native ARMCI, alongside MPI) versus the ARMCI-MPI stack, where ARMCI itself runs over MPI.]

6. MPI Remote Memory Access Interface
- Active and passive target modes: in active mode the target participates; in passive mode it does not.
- Window: exposes memory for RMA, with logical public and private copies.
- Conservative data consistency model.
- Accesses must occur within an epoch: Lock(window, rank) ... Unlock(window, rank). The access mode can be exclusive or shared, and operations are not ordered within an epoch. A passive-target example follows below.
[Figure: rank 0 opens a passive-target epoch on rank 1 with Lock, issues Put(X) and Get(Y), and closes it with Unlock; completion updates rank 1's public copy, which is separate from its private copy.]
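The following sketch shows the passive-target pattern described above: expose a buffer in a window, then update one element on a remote rank inside a lock/unlock epoch. The buffer size, displacement, and function name are illustrative; the window creation and free are collective, so every rank must call the function.

#include <mpi.h>

void expose_and_put(int target, double value) {
    static double win_buf[1024];        /* memory exposed for RMA on every rank */
    MPI_Win win;

    /* Collectively expose win_buf for RMA (disp_unit = sizeof(double)). */
    MPI_Win_create(win_buf, sizeof(win_buf), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive target: the target makes no matching call. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);   /* open access epoch  */
    MPI_Put(&value, 1, MPI_DOUBLE, target,
            /* target displacement = */ 0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);                        /* close epoch; the Put is complete */

    MPI_Win_free(&win);
}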

7. MPI-2 RMA Separate Memory Model
- Concurrent, conflicting accesses are erroneous: conflicts can arise from the same source, within the same epoch, or from different sources, and between RMA on the public copy and load/store on the private copy.
- Conservative, but extremely portable: compatible with non-coherent memory systems.

8. Translation: Global Memory Regions (GMR)
- Translate between the ARMCI and MPI shared data segment representations: ARMCI uses an array of base pointers, MPI uses a window object.
- Translate between MPI and ARMCI communication parameters: MPI uses (window, window rank, displacement); ARMCI uses (absolute rank, address).
- Preserve MPI RMA semantics: manage access epochs, avoid conflicting accesses, protect shared buffers. A hypothetical translation sketch follows below.
[Figure: per-allocation metadata mapping absolute process IDs to base pointers, alongside allocation metadata such as: MPI_Win window; int size[nproc]; ARMCI_Group grp; ARMCI_Mutex rmw;]
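A minimal sketch of the translation step described on this slide: map an ARMCI (absolute rank, remote address) pair to the (window, window rank, displacement) triple that MPI RMA needs. The names gmr_t and gmr_translate are illustrative stand-ins, not the actual ARMCI-MPI identifiers.

#include <mpi.h>
#include <stddef.h>

typedef struct {
    MPI_Win   win;        /* window exposing this allocation              */
    MPI_Comm  comm;       /* communicator (ARMCI group) of the window     */
    void    **base;       /* base[i]: segment base pointer on window rank i */
    size_t   *size;       /* size[i]: segment length on window rank i     */
} gmr_t;

/* Returns 0 and fills win_rank/disp if remote_addr lies in this allocation. */
static int gmr_translate(gmr_t *gmr, int abs_rank, void *remote_addr,
                         int *win_rank, MPI_Aint *disp) {
    MPI_Group world_grp, win_grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
    MPI_Comm_group(gmr->comm, &win_grp);
    MPI_Group_translate_ranks(world_grp, 1, &abs_rank, win_grp, win_rank);
    MPI_Group_free(&world_grp);
    MPI_Group_free(&win_grp);

    char *base = gmr->base[*win_rank];
    if ((char *)remote_addr < base ||
        (char *)remote_addr >= base + gmr->size[*win_rank])
        return -1;                               /* address not in this allocation */
    *disp = (MPI_Aint)((char *)remote_addr - base);
    return 0;
}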

9. GMR: Preserving MPI Local Access Semantics
Example: ARMCI_Put(src = 0x9e0, dst = 0x39b, size = 8 bytes, rank = 1);
- Problem: the local source buffer is itself shared, so it cannot be accessed outside an epoch. Can we lock it? Locking the same window twice is not allowed, and locking a different window can deadlock.
- Solution: copy the source to a private buffer (expanded in the sketch below):
  src_copy = Lock; memcpy; Unlock
  xrank = GMR_Translate(comm, rank)
  Lock(win, xrank); Put(src_copy, dst, size, xrank); Unlock(win, xrank)
[Figure: the same GMR metadata table, with allocation metadata window = win; size = {1024, ...}; grp = comm.]
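A sketch of the fix outlined in the pseudocode above, assuming the GMR lookup has already produced the windows, window ranks, and target displacement. The function name and parameter list are illustrative; the point is that only one window is locked at a time.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void put_from_shared_src(void *src, int bytes,
                         MPI_Win src_win, int my_wrank,
                         MPI_Win dst_win, int dst_wrank, MPI_Aint dst_disp) {
    /* 1. The source lives in a window we may not touch outside an epoch:
     *    copy it to a private buffer under a brief local exclusive lock. */
    void *src_copy = malloc(bytes);
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_wrank, 0, src_win);
    memcpy(src_copy, src, bytes);
    MPI_Win_unlock(my_wrank, src_win);

    /* 2. Put from the private copy inside an epoch on the destination window;
     *    the two epochs never overlap, so there is no deadlock. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, dst_wrank, 0, dst_win);
    MPI_Put(src_copy, bytes, MPI_BYTE, dst_wrank, dst_disp, bytes, MPI_BYTE, dst_win);
    MPI_Win_unlock(dst_wrank, dst_win);
    free(src_copy);
}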

10. ARMCI Noncontiguous Operations: I/O Vector
Generalized noncontiguous transfer with uniform segment size:

typedef struct {
  void **src_ptr_array;  // Source addresses
  void **dst_ptr_array;  // Destination addresses
  int    bytes;          // Length of every segment
  int    ptr_array_len;  // Number of segments
} armci_giov_t;

ARMCI_GetV(...)

Three methods to support this in MPI:
1. Conservative (one epoch): Lock, Put/Get/Acc, Unlock, ...
2. Batched (multiple epochs): Lock, Put/Get/Acc, ..., Unlock
3. Direct: generate an MPI indexed datatype for the source and the destination, so a single operation per epoch (Lock, Put/Get/Acc, Unlock) hands the processing off to MPI. A sketch of this method follows below.
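A sketch of the direct method for a put: one armci_giov_t descriptor (as defined on this slide) becomes a single MPI_Put with hindexed datatypes on both sides. It assumes the GMR lookup has supplied the window, the window rank, and the target's window base, and that the window uses a displacement unit of 1 byte; error handling is omitted.

#include <mpi.h>
#include <stdlib.h>

void giov_put_direct(armci_giov_t *iov, MPI_Win win, int win_rank,
                     void *target_base /* window base at the target */) {
    int n = iov->ptr_array_len;
    int *lens = malloc(n * sizeof(int));
    MPI_Aint *src_disp = malloc(n * sizeof(MPI_Aint));
    MPI_Aint *dst_disp = malloc(n * sizeof(MPI_Aint));

    for (int i = 0; i < n; i++) {
        lens[i] = iov->bytes;                                  /* uniform segment length */
        src_disp[i] = (char *)iov->src_ptr_array[i] - (char *)iov->src_ptr_array[0];
        dst_disp[i] = (char *)iov->dst_ptr_array[i] - (char *)target_base;
    }

    MPI_Datatype src_type, dst_type;
    MPI_Type_create_hindexed(n, lens, src_disp, MPI_BYTE, &src_type);
    MPI_Type_create_hindexed(n, lens, dst_disp, MPI_BYTE, &dst_type);
    MPI_Type_commit(&src_type);
    MPI_Type_commit(&dst_type);

    /* One epoch, one operation: MPI performs the gather/scatter. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, win_rank, 0, win);
    MPI_Put(iov->src_ptr_array[0], 1, src_type, win_rank, 0, 1, dst_type, win);
    MPI_Win_unlock(win_rank, win);

    MPI_Type_free(&src_type);
    MPI_Type_free(&dst_type);
    free(lens); free(src_disp); free(dst_disp);
}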

11. Avoiding Concurrent, Conflicting Accesses
- Contiguous operations: we don't know what other processes are doing, so wrap each call in an exclusive epoch.
- Noncontiguous operations: I/O vector segments may overlap, which is an MPI error. Since MPI errors can be fatal, we must detect conflicts ourselves and fall back to conservative mode when needed.
- Generate a conflict tree (a sorted, self-balancing AVL tree):
  1. Search the tree for a match.
  2. If a match is found: conflict!
  3. Else: insert the interval into the tree.
  The search and insert steps are merged into a single traversal (see the sketch below).
[Figure: an AVL tree of address intervals such as [0x1a6, 0x200] and [0x3f0, 0x517].]
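A sketch of the merged search/insert conflict check: before adding each segment [lo, hi) of an I/O vector to the epoch, walk a tree of already-seen intervals; any overlap signals a conflict. ARMCI-MPI uses a self-balancing AVL tree; a plain binary search tree keeps this sketch short.

#include <stdlib.h>
#include <stdint.h>

typedef struct node {
    uintptr_t lo, hi;              /* interval [lo, hi) covered by one segment */
    struct node *left, *right;
} node_t;

/* Merged search/insert: returns 1 on conflict, 0 if the interval was added. */
static int insert_or_conflict(node_t **root, uintptr_t lo, uintptr_t hi) {
    if (*root == NULL) {
        node_t *n = malloc(sizeof(node_t));
        n->lo = lo; n->hi = hi; n->left = n->right = NULL;
        *root = n;
        return 0;
    }
    if (hi <= (*root)->lo)                       /* entirely below: go left   */
        return insert_or_conflict(&(*root)->left, lo, hi);
    if (lo >= (*root)->hi)                       /* entirely above: go right  */
        return insert_or_conflict(&(*root)->right, lo, hi);
    return 1;                                    /* overlap: conflict detected */
}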

12. Experimental Setup

System               Nodes    Cores    Memory   Interconnect   MPI
IBM BG/P (Intrepid)  40,960   1 x 4    2 GB     3D Torus       IBM MPI
IB cluster (Fusion)           x 4      36 GB    IB QDR         MVAPICH2 1.6
Cray XT5 (Jaguar)    18,688   2 x 6    16 GB    Seastar 2+     Cray MPI
Cray XE6 (Hopper)    6,392    2 x      GB       Gemini         Cray MPI

Communication benchmarks: contiguous bandwidth; noncontiguous bandwidth.
NWChem performance evaluation: CCSD(T) calculation on a water pentamer.
IB, XT5: native ARMCI is much better than ARMCI-MPI (needs tuning / MPI-3).

13. Contiguous Communication Bandwidth (BG/P and XE6)
- BG/P: native is better for small to medium message sizes; in the bandwidth regime, get/put are the same and accumulate achieves ~15% less bandwidth.
- XE6: ARMCI-MPI is 2x better for get/put; double-precision accumulate is 2x better for small transfers and the same for large transfers.

14. Strided Communication Bandwidth (BG/P)
- Segment size: 1 kB.
- The batched method is best for Get; the other methods always pack, and packing in the host CPU is slower than injecting into the network. The MPI implementation should select this automatically.
- Performance is close to native.
[Figure: Get, Put, and Acc bandwidth (GB/sec) versus number of contiguous segments.]

15. Strided Communication Benchmark (XE6)
- Segment size: 1 kB.
- The batched method is best for accumulate; the picture is not clear for the other operations.
- Significant performance advantage over the current native implementation, which is under active development.
[Figure: Get, Put, and Acc bandwidth (GB/sec) versus number of contiguous segments.]

16. NWChem Performance (BG/P)
[Figure: CCSD time (min) versus number of nodes on Blue Gene/P, comparing ARMCI-MPI and native ARMCI.]

17. NWChem Performance (XE6)
[Figure: CCSD and (T) time (min) versus number of cores on the Cray XE6, comparing ARMCI-MPI and native ARMCI.]

18. ARMCI on MPI-2: Implementation Checklist
1. One-sided operations: translation provided by GMR; effort is needed to avoid concurrent, conflicting accesses.
2. Collectives: implemented as MPI user-defined collectives.
3. Mutexes: MPI-2 RMA does not support atomic RMW operations, and read-modify-write is forbidden within an epoch. We use the queueing mutex implementation of Latham, Ross, and Thakur [IJHPCA '07], which requires O(P) space.
4. One-sided atomic swap and fetch-and-add (atomic with respect to other atomics): attach an RMW mutex to each GMR, then Mutex_lock, get, modify, put, Mutex_unlock. Slow, but the best we can do in MPI-2 (see the sketch below).
5. Non-blocking operations: implemented as blocking.
6. Non-collective processor groups: recursive intercommunicator merging algorithm (Dinan et al. [EuroMPI '11]), O(log² P) communication cost.
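A sketch of the MPI-2 emulation of fetch-and-add in item 4: the read-modify-write is serialized through a mutex because MPI-2 forbids a get and a put on the same location within one epoch. gmr_mutex_lock/gmr_mutex_unlock stand in for the queueing mutex of Latham, Ross, and Thakur; they are assumed helpers, not MPI calls.

#include <mpi.h>

/* Assumed helpers implementing the per-GMR RMW mutex (not part of MPI). */
void gmr_mutex_lock(int rank);
void gmr_mutex_unlock(int rank);

long fetch_and_add_mpi2(MPI_Win win, int win_rank, MPI_Aint disp, long inc) {
    long old, new_val;

    gmr_mutex_lock(win_rank);                          /* serialize all RMW on this GMR */

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, win_rank, 0, win);
    MPI_Get(&old, 1, MPI_LONG, win_rank, disp, 1, MPI_LONG, win);
    MPI_Win_unlock(win_rank, win);                     /* epoch 1: read */

    new_val = old + inc;                               /* modify locally */

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, win_rank, 0, win);
    MPI_Put(&new_val, 1, MPI_LONG, win_rank, disp, 1, MPI_LONG, win);
    MPI_Win_unlock(win_rank, win);                     /* epoch 2: write */

    gmr_mutex_unlock(win_rank);
    return old;                                        /* value before the add */
}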

19. New Capabilities and Features in MPI-3 RMA
- Unified memory model: takes advantage of coherent hardware; relaxed synchronization will yield better performance.
- Conflicting accesses: localized to the locations accessed and relaxed to undefined; load/store does not corrupt the window.
- Atomic CAS, fetch-and-add, and new accumulate operations (sketched below); mutex space overhead drops to O(1) in MPI-3.
- Request-based non-blocking operations.
- Shared memory (IPC) windows.
[Figure: the unified memory model merges the public and private copies into a single unified copy.]
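With the MPI-3 atomics mentioned above, fetch-and-add and compare-and-swap each become a single RMA operation instead of a mutex plus two epochs. A sketch, assuming the window and displacement have already been set up:

#include <mpi.h>

long fetch_and_add_mpi3(MPI_Win win, int rank, MPI_Aint disp, long inc) {
    long old;
    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);       /* shared lock: atomics order themselves */
    MPI_Fetch_and_op(&inc, &old, MPI_LONG, rank, disp, MPI_SUM, win);
    MPI_Win_unlock(rank, win);
    return old;
}

int compare_and_swap_mpi3(MPI_Win win, int rank, MPI_Aint disp,
                          long expected, long desired) {
    long prev;
    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
    MPI_Compare_and_swap(&desired, &expected, &prev, MPI_LONG, rank, disp, win);
    MPI_Win_unlock(rank, win);
    return prev == expected;                           /* 1 if the swap took effect */
}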

20. Implementation of MPI-3 RMA in MPICH
- Extensive remodeling of the CH3 RMA implementation: new window types, new communication operations, new synchronization operations and modes, passive target at multiple targets.
- Created new infrastructure to enable performance research: tracking of operations and window synchronization state; more flexible passive-target synchronization and completion of operations.
- MPI-3 RMA is available in MPICH 3.0rc1, released Nov. 13, 2012. Please try it out!

21. ARMCI on MPI-3: Implementation (ongoing work)
1. One-sided operations: translation provided by GMR; no need to avoid concurrent, conflicting accesses (results are merely undefined).
2. Collectives: implemented as MPI user-defined collectives.
3. Mutexes: space-scalable, using a distributed MCS lock.
4. One-sided atomic swap and fetch-and-add (atomic with respect to other atomics): implemented using MPI-3 RMW operations.
5. Non-blocking operations: implemented using MPI-3 request-generating operations.
6. Non-collective processor groups: implemented using MPI-3 MPI_Comm_create_group() (see the sketch below).
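A sketch of item 6: MPI_Comm_create_group builds a communicator collectively over only the member processes, which is exactly what a non-collective ARMCI group creation needs. The function name, the caller-supplied list of world ranks, and the tag value are illustrative.

#include <mpi.h>

MPI_Comm create_armci_group(const int *members, int n) {
    MPI_Group world_grp, sub_grp;
    MPI_Comm group_comm;

    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
    MPI_Group_incl(world_grp, n, members, &sub_grp);

    /* Only the processes listed in 'members' call this; ranks outside the
     * group are not involved, unlike MPI_Comm_create. */
    MPI_Comm_create_group(MPI_COMM_WORLD, sub_grp, /* tag = */ 0, &group_comm);

    MPI_Group_free(&world_grp);
    MPI_Group_free(&sub_grp);
    return group_comm;
}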

22. MPI-3 RMA Discussion
Opportunities for collaboration and involvement in ongoing work:
- MPI-3 RMA: performance optimization (shared memory, RDMA); development of performance and correctness debugging tools.
- Global-view: applying MPI-3 as a substrate for additional existing and new global-view models.

23. Conclusions
- ARMCI-MPI: a complete, portable runtime system for GA and NWChem; a performance driver for MPI-2 and a feature driver for MPI-3.
- Mechanisms to overcome interface and semantic mismatch.
- Performance is pretty good, though dependent on implementation tuning.
- Available for download with MPICH.
- MPI-3: exciting new capabilities and increased flexibility; available in MPICH 3.0.
Contact: Jim Dinan <dinan@mcs.anl.gov>
