Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER
MPI-3.0 RMA
MPI-3.0 supports RMA (MPI One Sided), designed to react to hardware trends: the majority of HPC networks support RDMA [1]. Communication is one-sided (no involvement of the destination), and RMA decouples communication from synchronization. This differs from two-sided message passing: a two-sided transfer pairs a send on Proc A with a recv on Proc B, coupling communication and synchronization, whereas a one-sided transfer issues a put from Proc A followed by a separate sync, with Proc B uninvolved.
PRESENTATION OVERVIEW
1. Overview of three MPI-3 RMA concepts
2. MPI window creation
3. Communication
4. Synchronization
5. Application evaluation
MPI-3 RMA COMMUNICATION OVERVIEW
Non-atomic communication calls (Put, Get): e.g., an active Process B puts into the MPI window of a passive Process A.
Atomic communication calls (Accumulate, Get & Accumulate, Compare-and-swap, Fetch-and-op): e.g., active Processes C and D issue atomic gets and updates on a window.
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Active target mode (both the active and the passive process participate in synchronization): Fence, Post/Start/Complete/Wait.
Passive target mode (only the active process synchronizes): Lock, Lock All.
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Scalable & generic protocols for window creation, communication and synchronization; can be used on any RDMA network (e.g., OFED/IB).
foMPI, a fully functional MPI-3 RMA implementation, built on:
DMAPP: the lowest-level networking API for Cray Gemini/Aries systems
XPMEM: a portable Linux kernel module
PART 1: SCALABLE WINDOW CREATION
Traditional windows: each process exposes an existing local buffer, so base addresses differ across processes (e.g., 0x111, 0x123, 0x120). Backwards compatible (MPI-2).
Allocated windows: MPI allocates the memory itself, which allows identical base addresses across processes (e.g., 0x123 everywhere).
Dynamic windows: memory regions can be locally attached to and detached from the window at any time. Most flexible.
PART 2: COMMUNICATION
Put and Get: direct DMAPP put and get operations, or local (blocking) memcpy (XPMEM).
Accumulate: DMAPP atomic operations for 64-bit types, or fall back to a remote locking protocol.
MPI datatype handling with the MPITypes library [1]. Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE): MPI_Put maps to dmapp_put_nbi, MPI_Compare_and_swap to dmapp_acswap_qw_nbi on contiguous memory at the remote process.
[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM '09
PERFORMANCE INTER-NODE: LATENCY
Put inter-node: 80% faster. Get inter-node: 20% faster. Measured with a half ping-pong: Proc 0 puts into Proc 1's memory, then syncs.
PERFORMANCE INTRA-NODE: LATENCY
Put/Get intra-node: 3x faster. Measured with the same half ping-pong.
PERFORMANCE: OVERLAP
Inter-node overlap in %: Proc 0 issues a put, computes, then syncs. Useful for, e.g., scientific codes: AWM-Olsen seismic, 3D FFT, MILC.
PERFORMANCE: MESSAGE RATE
Proc 0 issues many puts to Proc 1's memory, then syncs; measured inter-node and intra-node.
PERFORMANCE: ATOMICS
Hardware-accelerated protocol (proprietary, 64-bit integers): lower latency. Fall-back protocol: higher bandwidth.
PART 3: SYNCHRONIZATION
Active target mode: Fence, Post/Start/Complete/Wait. Passive target mode: Lock, Lock All.
SCALABLE FENCE IMPLEMENTATION
Collective call; completes all outstanding memory operations in three steps: local completion (XPMEM), remote completion (DMAPP), then global completion via a barrier. Sketch of the implementation:

int MPI_Win_fence( ) {
    asm("mfence");        /* local completion (XPMEM) */
    dmapp_gsync_wait();   /* remote completion (DMAPP) */
    MPI_Barrier(...);     /* global completion */
    return MPI_SUCCESS;
}
SCALABLE FENCE PERFORMANCE
90% faster. (Plot shows measured fence time against the time bound and memory bound.)
PSCW SYNCHRONIZATION
The posting process opens an exposure epoch (post ... wait), allowing access from other processes; the starting process opens an access epoch (start ... complete), which allows it to access other processes. Post and start are paired by a matching algorithm.
(Figure: six processes issuing post/start and complete/wait in arbitrary interleavings; each start matches a post, and each wait returns once the matching completes arrive.)
PSCW SCALABLE POST/START MATCHING
In general there can be n posting and m starting processes. In this example there is one posting process i (which opens its window) and four starting processes j1, ..., j4 (which access the remote window). Each starting process maintains a local list.
Posting process i adds its rank i to the list at each starting process j1, ..., j4. Each starting process j waits until the rank of the posting process i is present in its local list.
PSCW SCALABLE COMPLETE/WAIT MATCHING
Each starting process increments a counter stored at the posting process (0 -> 1 -> 2 -> 3 -> 4 in the example). When the counter equals the number of starting processes, the posting process returns from wait.
PSCW PERFORMANCE
Ring topology. (Plot shows measured time against the time bound and memory bound.)
SCALABLE LOCK SYNCHRONIZATION
Lock/Unlock (shared or exclusive); Lock All (always shared). Two-level lock hierarchy: a global lock at a master process (a shared counter and an exclusive counter), plus a local lock at every process (a shared counter and an exclusive bit).
EXCLUSIVE LOCAL LOCK: TWO PHASES
Example: Proc 2 wants to lock Proc 1 exclusively via MPI_Win_lock( EXCL, 1 ).
PHASE 1: increment the global exclusive counter with a fetch-add (Invariant 1: no global shared lock held concurrently).
PHASE 2: acquire the local lock at the target with a compare & swap (Invariant 2: no local shared/exclusive lock held concurrently).
SHARED GLOBAL LOCK: ONE PHASE
Example: Proc 2 wants to lock the whole window via MPI_Win_lock_all(). Increment the global shared counter with a fetch-add (Invariant: no local exclusive lock is held concurrently).
FLUSH SYNCHRONIZATION
Guarantees remote completion: issues a remote bulk synchronization and an x86 mfence. One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path. Example: Process 0 issues three inc(counter) operations on Process 2 (counter: 0 -> 3), then a flush. (Plot shows measured time against the time bound and memory bound.)
PERFORMANCE
Evaluation on the Blue Waters system: 22,640 Cray XE6 compute nodes, 724,480 schedulable cores. All microbenchmarks, 4 applications, and one nearly full-scale run.
PERFORMANCE: MOTIF APPLICATIONS
Key/value store: random inserts per second. Dynamic Sparse Data Exchange (DSDE) with 6 neighbors.
PERFORMANCE: APPLICATIONS
Annotations represent the performance gain of foMPI over Cray MPI-1. NAS 3D FFT [1]: performance, scaling to 65k processes. MILC [2]: application execution time, scaling to 512k processes.
[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS '09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS '12
CONCLUSIONS & SUMMARY
1. MPI window creation routines
2. Non-atomic & atomic communication
3. Fence / PSCW
4. Locks
5. foMPI reference implementation
6. Application implementation & evaluation
ACKNOWLEDGMENTS
Try foMPI yourself: Parallel_Programming/foMPI
Ongoing work on DDT (see ACM Student Research Competition poster).
Thanks to: Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team), the MPI Forum RMA WG, and the supporting institutions.
Thank you for your attention
More informationLAPI on HPS Evaluating Federation
LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of
More informationOPENSHMEM AS AN EFFECTIVE COMMUNICATION LAYER FOR PGAS MODELS
OPENSHMEM AS AN EFFECTIVE COMMUNICATION LAYER FOR PGAS MODELS A Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for
More informationSystems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications
Systems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications William Gropp University of Illinois, Urbana- Champaign Pavan Balaji, Rajeev Thakur
More informationChallenges and Opportunities for HPC Interconnects and MPI
Challenges and Opportunities for HPC Interconnects and MPI Ron Brightwell, R&D Manager Scalable System Software Department Sandia National Laboratories is a multi-mission laboratory managed and operated
More informationHigh Performance MPI-2 One-Sided Communication over InfiniBand
High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University
More informationMULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA
MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network
More informationDmitry Durnov 15 February 2017
Cовременные тенденции разработки высокопроизводительных приложений Dmitry Durnov 15 February 2017 Agenda Modern cluster architecture Node level Cluster level Programming models Tools 2/20/2017 2 Modern
More informationImplementing MPI-IO Atomic Mode Without File System Support
1 Implementing MPI-IO Atomic Mode Without File System Support Robert Ross Robert Latham William Gropp Rajeev Thakur Brian Toonen Mathematics and Computer Science Division Argonne National Laboratory Argonne,
More informationOn-the-Fly Data Race Detection in MPI One-Sided Communication
On-the-Fly Data Race Detection in MPI One-Sided Communication Presentation Master Thesis Simon Schwitanski (schwitanski@itc.rwth-aachen.de) Joachim Protze (protze@itc.rwth-aachen.de) Prof. Dr. Matthias
More informationOpen MPI for Cray XE/XK Systems
Open MPI for Cray XE/XK Systems Samuel K. Gutierrez LANL Nathan T. Hjelm LANL Manjunath Gorentla Venkata ORNL Richard L. Graham - Mellanox Cray User Group (CUG) 2012 May 2, 2012 U N C L A S S I F I E D
More informationCOSC 6374 Parallel Computation. Remote Direct Memory Acces
COSC 6374 Parallel Computation Remote Direct Memory Acces Edgar Gabriel Fall 2013 Communication Models A P0 receive send B P1 Message Passing Model: Two-sided communication A P0 put B P1 Remote Memory
More informationLecture 36: MPI, Hybrid Programming, and Shared Memory. William Gropp
Lecture 36: MPI, Hybrid Programming, and Shared Memory William Gropp www.cs.illinois.edu/~wgropp Thanks to This material based on the SC14 Tutorial presented by Pavan Balaji William Gropp Torsten Hoefler
More informationAdvanced MPI. Andrew Emerson
Advanced MPI Andrew Emerson (a.emerson@cineca.it) Agenda 1. One sided Communications (MPI-2) 2. Dynamic processes (MPI-2) 3. Profiling MPI and tracing 4. MPI-I/O 5. MPI-3 22/02/2017 Advanced MPI 2 One
More informationCS4961 Parallel Programming. Lecture 18: Introduction to Message Passing 11/3/10. Final Project Purpose: Mary Hall November 2, 2010.
Parallel Programming Lecture 18: Introduction to Message Passing Mary Hall November 2, 2010 Final Project Purpose: - A chance to dig in deeper into a parallel programming model and explore concepts. -
More informationComparing UPC and one-sided MPI: A distributed hash table for GAP
Comparing UPC and one-sided MPI: A distributed hash table for GAP C.M. Maynard EPCC, School of Physics and Astronomy, University of Edinburgh, JCMB, Kings Buildings, Mayfield Road, Edinburgh, EH9 3JZ,
More informationParallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008
Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared
More informationHigh Performance MPI-2 One-Sided Communication over InfiniBand
High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University
More informationMPI: A Message-Passing Interface Standard
MPI: A Message-Passing Interface Standard Version 2.1 Message Passing Interface Forum June 23, 2008 Contents Acknowledgments xvl1 1 Introduction to MPI 1 1.1 Overview and Goals 1 1.2 Background of MPI-1.0
More informationOptimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication
Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based
More informationAltix UV HW/SW! SGI Altix UV utilizes an array of advanced hardware and software feature to offload:!
Altix UV HW/SW! SGI Altix UV utilizes an array of advanced hardware and software feature to offload:!! thread synchronization!! data sharing!! massage passing overhead from CPUs.! This system has a rich
More informationThe Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011
The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities
More informationInterconnect Your Future
Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer
More information6.1 Multiprocessor Computing Environment
6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,
More informationAdvanced MPI. Andrew Emerson
Advanced MPI Andrew Emerson (a.emerson@cineca.it) Agenda 1. One sided Communications (MPI-2) 2. Dynamic processes (MPI-2) 3. Profiling MPI and tracing 4. MPI-I/O 5. MPI-3 11/12/2015 Advanced MPI 2 One
More informationPerformance-Centric System Design
Performance-Centric System Design Torsten Hoefler Swiss Federal Institute of Technology, ETH Zürich Performance-Centric System Design - modeling, programming, and architecture - Torsten Hoefler ETH Zürich
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More informationOperational Robustness of Accelerator Aware MPI
Operational Robustness of Accelerator Aware MPI Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014 Computing Systems @ CSCS http://www.cscs.ch/computers
More informationOne-Sided Append: A New Communication Paradigm For PGAS Models
One-Sided Append: A New Communication Paradigm For PGAS Models James Dinan and Mario Flajslik Intel Corporation {james.dinan, mario.flajslik}@intel.com ABSTRACT One-sided append represents a new class
More informationToward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies
Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies François Tessier, Venkatram Vishwanath, Paul Gressier Argonne National Laboratory, USA Wednesday
More informationCray RS Programming Environment
Cray RS Programming Environment Gail Alverson Cray Inc. Cray Proprietary Red Storm Red Storm is a supercomputer system leveraging over 10,000 AMD Opteron processors connected by an innovative high speed,
More informationFault Tolerance for Remote Memory Access Programming Models
Fault Tolerance for Remote Memory Access Programming Models ABSTRACT Maciej Besta Dept. of Computer Science ETH Zurich Universitätstr. 6, 8092 Zurich, Switzerland maciej.besta@inf.ethz.ch Remote Memory
More informationFriday, May 25, User Experiments with PGAS Languages, or
User Experiments with PGAS Languages, or User Experiments with PGAS Languages, or It s the Performance, Stupid! User Experiments with PGAS Languages, or It s the Performance, Stupid! Will Sawyer, Sergei
More informationAn In-place Algorithm for Irregular All-to-All Communication with Limited Memory
An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de
More informationOptimization of MPI Applications Rolf Rabenseifner
Optimization of MPI Applications Rolf Rabenseifner University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Optimization of MPI Applications Slide 1 Optimization and Standardization
More information