Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

1 Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER

5 MPI-3.0 REMOTE MEMORY ACCESS
MPI-3.0 supports RMA ("MPI One Sided"), designed to react to hardware trends: the majority of HPC networks support RDMA [1]. Communication is one sided, with no involvement of the destination, and RMA decouples communication from synchronization. This differs from two-sided message passing: a send/recv pair combines communication and synchronization, whereas a one-sided put transfers the data and a separate sync call provides the synchronization.
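
To make the decoupling concrete, here is a minimal sketch (not from the talk; buffer and window names are illustrative) in which rank 0 puts a value into rank 1's window and synchronization happens through separate lock/unlock calls. Run it with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double *buf;                       /* window memory, one double per process */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1.0;
    MPI_Barrier(MPI_COMM_WORLD);       /* window initialized everywhere */

    if (rank == 0) {                   /* origin: communication ...            */
        double val = 42.0;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);        /* ... and separate synchronization     */
    }
    MPI_Barrier(MPI_COMM_WORLD);       /* rank 1 reads only after the put      */

    if (rank == 1) {
        /* lock/unlock on the local window makes the remote update visible to a
           plain load, also under the separate memory model */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        printf("rank 1 received %.1f\n", *buf);
        MPI_Win_unlock(1, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}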

6 PRESENTATION OVERVIEW
1. Overview of three MPI-3 RMA concepts
2. MPI window creation
3. Communication
4. Synchronization
5. Application evaluation

7 MPI-3 RMA COMMUNICATION OVERVIEW
[Figure: processes A through D expose memory through MPI windows; process A is passive while B, C and D are active. Non-atomic communication calls: Put, Get. Atomic communication calls: Acc, Get & Acc, CAS, FAO.]
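
For reference, the atomic calls abbreviated on the slide are MPI_Accumulate (Acc), MPI_Get_accumulate (Get & Acc), MPI_Compare_and_swap (CAS) and MPI_Fetch_and_op (FAO). A small sketch exercising all four on a per-process 64-bit counter (variable names and the neighbor pattern are illustrative):

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    int64_t *counter;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    *counter = 0;

    int target = (rank + 1) % nprocs;
    int64_t one = 1, old, expected = 0, desired = 7, fetched;

    MPI_Win_fence(0, win);                                /* open the epoch        */
    /* Acc: accumulate into the remote counter */
    MPI_Accumulate(&one, 1, MPI_INT64_T, target, 0, 1, MPI_INT64_T, MPI_SUM, win);
    /* Get & Acc: fetch the old value while adding */
    MPI_Get_accumulate(&one, 1, MPI_INT64_T, &old, 1, MPI_INT64_T,
                       target, 0, 1, MPI_INT64_T, MPI_SUM, win);
    /* FAO: single-element fetch-and-op (here an atomic read via MPI_NO_OP) */
    MPI_Fetch_and_op(&one, &fetched, MPI_INT64_T, target, 0, MPI_NO_OP, win);
    /* CAS: compare-and-swap the remote counter */
    MPI_Compare_and_swap(&desired, &expected, &old, MPI_INT64_T, target, 0, win);
    MPI_Win_fence(0, win);                                /* complete everything   */

    if (rank == 0) printf("counter at rank 0: %lld\n", (long long)*counter);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}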

12 MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
[Figure: synchronization overview. Active target mode: Fence and Post/Start/Complete/Wait. Passive target mode: Lock and Lock All.]

20 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Scalable and generic protocols for window creation, communication and synchronization; they can be used on any RDMA network (e.g., OFED/IB). foMPI, a fully functional MPI-3 RMA implementation, builds on two transports: DMAPP, the lowest-level networking API for Cray Gemini/Aries systems, and XPMEM, a portable Linux kernel module for cross-process memory mapping.

23 PART 1: SCALABLE WINDOW CREATION (Traditional windows)
[Figure: processes A, B and C each expose memory at a different local address, e.g., 0x111, 0x123, 0x120.] Traditional window creation is backwards compatible with MPI-2; every process has to know the remote window addresses of all others. With p = total number of processes: time bound O(p), memory bound O(p).

24 PART 1: SCALABLE WINDOW CREATION (Allocated windows)
[Figure: processes A, B and C expose memory at the same address, 0x123.] Allocated windows let MPI allocate the window memory, e.g., at the same address on every process. With p = total number of processes: time bound O(log p) with high probability, memory bound O(1).
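
A sketch contrasting the two creation paths (sizes and names illustrative): MPI_Win_create wraps application-provided memory and must therefore track a remote base address per process, while MPI_Win_allocate lets MPI place the memory, which is what permits the O(1) metadata claimed above.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const MPI_Aint size = 1024 * sizeof(double);
    MPI_Win win_trad, win_alloc;

    /* Traditional window: the application allocates; the implementation has
       to remember every remote base address (the O(p) variant above). */
    double *mine = malloc(size);
    MPI_Win_create(mine, size, sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win_trad);

    /* Allocated window: MPI allocates the memory itself
       (the O(log p) time / O(1) memory variant above). */
    double *allocd;
    MPI_Win_allocate(size, sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &allocd, &win_alloc);

    /* ... RMA epochs on either window ... */

    MPI_Win_free(&win_alloc);
    MPI_Win_free(&win_trad);
    free(mine);
    MPI_Finalize();
    return 0;
}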

25 PART 1: SCALABLE WINDOW CREATION (Dynamic windows)
[Figure: processes A, B and C attach differing sets of memory regions at local addresses such as 0x111, 0x123, 0x129.] Dynamic windows support local attach and detach of memory and are the most flexible variant. With p = total number of processes: time bound O(p), memory bound O(p).
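
A sketch of the dynamic variant (buffer names and sizes illustrative): memory is attached and detached after window creation, and origins address it through absolute addresses obtained with MPI_Get_address and exchanged explicitly.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* every process attaches a local region after the window exists */
    double *region = malloc(64 * sizeof(double));
    region[0] = (double)rank;
    MPI_Win_attach(win, region, 64 * sizeof(double));

    /* dynamic windows are addressed by absolute address, so the targets'
       base addresses must be made known to the origins */
    MPI_Aint myaddr, *addrs = malloc(nprocs * sizeof(MPI_Aint));
    MPI_Get_address(region, &myaddr);
    MPI_Allgather(&myaddr, 1, MPI_AINT, addrs, 1, MPI_AINT, MPI_COMM_WORLD);

    if (rank == 0 && nprocs > 1) {        /* read region[0] of rank 1 */
        double remote;
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Get(&remote, 1, MPI_DOUBLE, 1, addrs[1], 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);
        printf("rank 1 exposes %.1f\n", remote);
    }

    MPI_Barrier(MPI_COMM_WORLD);          /* detach only after all accesses */
    MPI_Win_detach(win, region);
    MPI_Win_free(&win);
    free(addrs);
    free(region);
    MPI_Finalize();
    return 0;
}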

26 PART 2: COMMUNICATION
Put and Get: direct DMAPP put and get operations, or a local (blocking) memcpy through XPMEM. Accumulate: DMAPP atomic operations for 64-bit types, or a fall back to a remote locking protocol. MPI datatype handling uses the MPITypes library [1], with a fast path for contiguous transfers of common intrinsic datatypes (e.g., MPI_DOUBLE). [Figure: MPI_Put maps to dmapp_put_nbi and MPI_Compare_and_swap maps to dmapp_acswap_qw_nbi, both acting on contiguous memory at the remote process.]
[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09
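
A sketch of the two communication classes on contiguous MPI_DOUBLE data, the case eligible for the fast path (array size and names illustrative):

#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank, nprocs;
    double *winbuf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(N * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &winbuf, &win);
    for (int i = 0; i < N; i++) winbuf[i] = 0.0;

    double src[N];
    for (int i = 0; i < N; i++) src[i] = (double)i;
    double one = 1.0;
    int target = (rank + 1) % nprocs;

    MPI_Win_fence(0, win);
    /* contiguous MPI_DOUBLE transfer: eligible for the fast path */
    MPI_Put(src, N, MPI_DOUBLE, target, 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);
    /* 64-bit accumulate: may map to a network atomic, otherwise the
       implementation falls back to a (slower) remote locking protocol */
    MPI_Accumulate(&one, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_fence(0, win);

    if (rank == 0) printf("winbuf[0] = %.1f\n", winbuf[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}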

27 PERFORMANCE INTER-NODE: LATENCY
[Plots: Put inter-node latency, 80% faster; Get inter-node latency, 20% faster. Benchmark: half ping-pong (Proc 0 puts into Proc 1's memory, then synchronizes).]

28 PERFORMANCE INTRA-NODE: LATENCY
[Plot: Put/Get intra-node latency, 3x faster. Benchmark: half ping-pong.]

29 PERFORMANCE: OVERLAP
[Plot: inter-node communication/computation overlap in %. Benchmark: Proc 0 puts into Proc 1's memory, computes, then synchronizes.] Overlap is useful for scientific codes such as AWM-Olsen (seismic), 3D FFT and MILC.

30 PERFORMANCE: MESSAGE RATE
[Plots: inter-node and intra-node message rates. Benchmark: Proc 0 issues many puts into Proc 1's memory, then synchronizes.]

31 PERFORMANCE: ATOMICS
[Plot: atomic operation performance on 64-bit integers, compared against a proprietary implementation.] The hardware-accelerated protocol gives lower latency; the fall-back protocol gives higher bandwidth.

32 PART 3: SYNCHRONIZATION
[Figure: synchronization overview. Active target mode: Fence and Post/Start/Complete/Wait. Passive target mode: Lock and Lock All.]

33 SCALABLE FENCE IMPLEMENTATION
Fence is a collective call that completes all outstanding memory operations. [Figure: processes 0-3 on two nodes with puts in flight.] The implementation performs local completion of XPMEM (intra-node) operations with an mfence, local completion of DMAPP (inter-node) operations with dmapp_gsync_wait(), and global completion with a barrier:

int MPI_Win_fence( ... )
{
    asm("mfence");          /* local completion (XPMEM) */
    dmapp_gsync_wait();     /* local completion (DMAPP) */
    MPI_Barrier(...);       /* global completion        */
    return MPI_SUCCESS;
}
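
For usage, fence brackets an epoch on all processes of the window; a sketch (ring pattern and names illustrative) in which every process puts its rank into its right neighbor between two fences:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    int *slot;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &slot, &win);
    *slot = -1;

    int right = (rank + 1) % nprocs;

    MPI_Win_fence(0, win);                    /* opens the access/exposure epoch */
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                    /* completes all outstanding puts  */

    printf("rank %d was written by rank %d\n", rank, *slot);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}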

37 SCALABLE FENCE PERFORMANCE
[Plot: fence latency, 90% faster.] Time bound: O(log p). Memory bound: O(1).

38 PSCW SYNCHRONIZATION
[Figure: Proc 0, the starting process, opens an access epoch with start, issues puts, and closes it with complete; Proc 1, the posting process, opens an exposure epoch with post and closes it with wait. A matching algorithm pairs the epochs.] The starting process is allowed to access other processes; the posting process allows access from other processes.

39 PSCW SYNCHRONIZATION
[Figure: timeline with six processes. Procs 1 and 4 post exposure epochs and end them with wait; procs 0, 2, 3 and 5 start access epochs and end them with complete.]
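
The same pattern in code, as a ring sketch with illustrative names: each process exposes its window to its left neighbor with post/wait and accesses its right neighbor with start/complete.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    int *slot;
    MPI_Win win;
    MPI_Group world_grp, post_grp, start_grp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &slot, &win);
    *slot = -1;

    int left  = (rank - 1 + nprocs) % nprocs;          /* will access my window */
    int right = (rank + 1) % nprocs;                   /* whose window I access */
    MPI_Group_incl(world_grp, 1, &left,  &post_grp);   /* expose to the left    */
    MPI_Group_incl(world_grp, 1, &right, &start_grp);  /* access the right      */

    MPI_Win_post(post_grp, 0, win);        /* exposure epoch: matched by start */
    MPI_Win_start(start_grp, 0, win);      /* access epoch: matched by post    */
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_complete(win);                 /* ends the access epoch            */
    MPI_Win_wait(win);                     /* ends the exposure epoch          */

    printf("rank %d was written by rank %d\n", rank, *slot);

    MPI_Group_free(&post_grp);
    MPI_Group_free(&start_grp);
    MPI_Group_free(&world_grp);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}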

40 PSCW SCALABLE POST/START MATCHING
In general there can be n posting and m starting processes; in this example there is one posting process i, which opens its window, and four starting processes j1, ..., j4, which access the remote window. Each starting process holds a local list. The posting process i adds its rank i to the list at each starting process j1, ..., j4, and each starting process j waits until the rank of the posting process i is present in its local list. [Figure: builds showing rank i being appended to the local lists of j1 through j4.]

50 PSCW SCALABLE COMPLETE/WAIT MATCHING
Each starting process increments a counter stored at the posting process. When the counter equals the number of starting processes, the posting process returns from wait. [Figure: builds showing the counter at process i going from 0 to 4 as j1 through j4 increment it.]
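
The counter idea can be imitated with MPI-level atomics (a sketch only: foMPI performs these steps with DMAPP/XPMEM atomics below the MPI layer, and the rank roles and names here are illustrative). The starting processes atomically increment a counter at the posting process, which polls it until everyone has arrived.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    int *counter;                       /* one completion counter per process */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    const int poster = 0;               /* rank 0 plays the posting process i */
    int one = 1, dummy = 0, value = 0;

    MPI_Win_lock_all(0, win);
    if (rank != poster) {
        /* "complete": atomically increment the counter at the poster */
        MPI_Accumulate(&one, 1, MPI_INT, poster, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_flush(poster, win);     /* remote completion of the increment */
    } else {
        /* "wait": poll the counter atomically until everyone has arrived */
        do {
            MPI_Fetch_and_op(&dummy, &value, MPI_INT, poster, 0, MPI_NO_OP, win);
            MPI_Win_flush(poster, win);
        } while (value < nprocs - 1);
        printf("all %d starting processes completed\n", value);
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}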

55 PSCW PERFORMANCE
[Plot: PSCW synchronization latency, ring topology.] Time bounds: P_start = P_wait = O(1), P_post = P_complete = O(log p). Memory bound: O(log p) for scalable programs.

56 SCALABLE LOCK SYNCHRONIZATION
Lock/Unlock can be shared or exclusive; Lock All is always shared. The implementation uses a two-level lock hierarchy: a master process holds the global state (a global shared counter and a global exclusive counter), and each of the processes 0, ..., P-1 holds local state (a local shared counter and a local exclusive bit) in its memory.
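
For reference, the three locking flavors these protocols implement are used like this (a sketch; window contents and the neighbor pattern are illustrative):

#include <mpi.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    double *buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = 0.0;
    MPI_Barrier(MPI_COMM_WORLD);

    double val = 1.0;
    int target = (rank + 1) % nprocs;

    /* exclusive lock on one process (the two-phase protocol of the next slide) */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);

    /* shared lock on one process (a single fetch-add on its local counter) */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Accumulate(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_unlock(target, win);

    /* lock all: always shared (a single fetch-add on the global counter) */
    MPI_Win_lock_all(0, win);
    MPI_Accumulate(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}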

57 EXCLUSIVE LOCAL LOCK: TWO PHASES
Example: Proc 2 wants to lock Proc 1 exclusively with MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, ...). Phase 1: a fetch-and-add increments the global exclusive counter at the master process (invariant 1: no global shared lock is held concurrently). Phase 2: a compare-and-swap sets the exclusive bit at the target process (invariant 2: no local shared or exclusive lock is held concurrently).
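
The two phases can be sketched with MPI-level atomics on an auxiliary lock window. This is purely schematic: foMPI issues DMAPP atomics and its exact counter layout may differ; all names, displacements and the retry loop below are illustrative. The backoff branch corresponds to the Add(-1) shown on backup slide 86.

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define MASTER        0
#define GLOBAL_DISP   1                    /* second int64, used at MASTER only  */
#define EXCL_BIT      ((int64_t)1)         /* local lock state: exclusive flag   */
#define GLOBAL_EXCL   ((int64_t)1 << 32)   /* global counter: one exclusive unit */

/* Acquire an exclusive lock on 'target' following the two phases above. */
static void lock_excl(MPI_Win lockwin, int target)
{
    int64_t add, neg, zero = 0, excl = EXCL_BIT, old;

    for (;;) {
        /* phase 1: announce the exclusive lock at the master; the returned old
           value tells us whether a global shared lock is currently held */
        add = GLOBAL_EXCL;
        MPI_Fetch_and_op(&add, &old, MPI_INT64_T, MASTER, GLOBAL_DISP, MPI_SUM, lockwin);
        MPI_Win_flush(MASTER, lockwin);
        if ((old & 0xFFFFFFFF) == 0) {
            /* phase 2: try to set the exclusive bit at the target; succeeds
               only if no local shared/exclusive lock is held there */
            MPI_Compare_and_swap(&excl, &zero, &old, MPI_INT64_T, target, 0, lockwin);
            MPI_Win_flush(target, lockwin);
            if (old == 0) return;          /* lock acquired */
        }
        /* backoff: undo the global announcement, then retry */
        neg = -GLOBAL_EXCL;
        MPI_Fetch_and_op(&neg, &old, MPI_INT64_T, MASTER, GLOBAL_DISP, MPI_SUM, lockwin);
        MPI_Win_flush(MASTER, lockwin);
    }
}

static void unlock_excl(MPI_Win lockwin, int target)
{
    int64_t neg = -EXCL_BIT, negg = -GLOBAL_EXCL, old;
    MPI_Fetch_and_op(&neg, &old, MPI_INT64_T, target, 0, MPI_SUM, lockwin);
    MPI_Fetch_and_op(&negg, &old, MPI_INT64_T, MASTER, GLOBAL_DISP, MPI_SUM, lockwin);
    MPI_Win_flush_all(lockwin);
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    int64_t *state;
    MPI_Win lockwin;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* two int64 per process: [0] local lock state, [1] global counter (master) */
    MPI_Win_allocate(2 * sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &state, &lockwin);
    state[0] = state[1] = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    /* this MPI lock_all only opens a passive-target epoch for the atomics; it
       is not the emulated global lock itself */
    MPI_Win_lock_all(0, lockwin);
    int target = (rank + 1) % nprocs;
    lock_excl(lockwin, target);            /* critical section on 'target' ...  */
    unlock_excl(lockwin, target);
    MPI_Win_unlock_all(lockwin);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("exclusive lock/unlock completed on %d processes\n", nprocs);

    MPI_Win_free(&lockwin);
    MPI_Finalize();
    return 0;
}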

60 SHARED LOCAL LOCK: ONE PHASE
Example: Proc 0 wants to lock Proc 1 in shared mode with MPI_Win_lock(MPI_LOCK_SHARED, 1, ...). A single fetch-and-add increments the local shared counter at the target (invariant: no local exclusive lock on this process is held concurrently).

62 SHARED GLOBAL LOCK: ONE PHASE
Example: Proc 2 wants to lock the whole window with MPI_Win_lock_all(). A single fetch-and-add increments the global shared counter at the master process (invariant: no local exclusive lock is held concurrently). This takes a constant number of operations, independent of the number of processes p.

64 FLUSH SYNCHRONIZATION
Flush guarantees remote completion. It issues a remote bulk synchronization and an x86 mfence. Time bound: O(1); memory bound: O(1). Flush is one of the most performance-critical functions: the implementation adds only 78 x86 CPU instructions to the critical path. [Figure: Process 0 issues three inc(counter) operations on Process 1; after the flush the counter at Process 1 holds the value 3.]
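
A usage sketch (names illustrative): flush_local only permits reuse of the origin buffer, whereas flush guarantees completion at the target, which is what the counter example above relies on.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs, *counter;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    int one = 1, target = (rank + 1) % nprocs;

    MPI_Win_lock_all(0, win);
    for (int i = 0; i < 3; i++) {
        MPI_Accumulate(&one, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_flush_local(target, win);   /* 'one' may be reused/modified now */
    }
    MPI_Win_flush(target, win);             /* all increments completed remotely */
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    /* assumes the unified memory model; otherwise synchronize before reading */
    printf("rank %d: counter = %d (expected 3)\n", rank, *counter);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}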

66 PERFORMANCE
Evaluation on the Blue Waters system: 22,640 Cray XE6 compute nodes, 724,480 schedulable cores. All microbenchmarks, 4 applications, and one nearly full-scale run.

67 PERFORMANCE: MOTIF APPLICATIONS
[Plots: key/value store, random inserts per second; dynamic sparse data exchange (DSDE) with 6 neighbors.]
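
The key/value store motif reduces to hashing a key to a home rank and slot and issuing one atomic update per insert. A minimal sketch of that insert path (hash function, table size and names are illustrative, not the paper's benchmark code):

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define SLOTS_PER_RANK 4096

int main(int argc, char **argv) {
    int rank, nprocs;
    int64_t *table;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(SLOTS_PER_RANK * sizeof(int64_t), sizeof(int64_t),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &table, &win);
    for (int i = 0; i < SLOTS_PER_RANK; i++) table[i] = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_lock_all(0, win);                      /* one epoch for the whole run */
    for (int i = 0; i < 1000; i++) {
        uint64_t key  = (uint64_t)rank * 1000003u + i;      /* toy key stream */
        uint64_t hash = key * 0x9E3779B97F4A7C15ull;        /* toy hash       */
        int      home = (int)(hash % (uint64_t)nprocs);
        MPI_Aint slot = (MPI_Aint)((hash / (uint64_t)nprocs) % SLOTS_PER_RANK);
        int64_t  value = (int64_t)key;

        /* insert = one atomic replace into the home slot */
        MPI_Accumulate(&value, 1, MPI_INT64_T, home, slot, 1, MPI_INT64_T,
                       MPI_REPLACE, win);
        MPI_Win_flush(home, win);                  /* value is visible at the home */
    }
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("inserted 1000 keys per rank\n");

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}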

68 PERFORMANCE: APPLICATIONS
[Plots: NAS 3D FFT [1] performance, scaling to 65k processes; MILC [2] application execution time, scaling to 512k processes. Annotations represent the performance gain of foMPI over Cray MPI-1.]
[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS'12

69 CONCLUSIONS & SUMMARY
1. MPI window creation routines
2. Non-atomic & atomic communication
3. Fence / PSCW
4. Locks
5. foMPI reference implementation

77 ACKNOWLEDGMENTS
Thanks to: Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team), the MPI Forum RMA WG, and the supporting institutions.

78 Parallel_Programming/foMPI
Thank you for your attention

79 Backup slides

80 DYNAMIC WINDOW CREATION (backup)
[Figure: builds showing processes A, B and C attaching memory regions at local addresses such as 0x111, 0x120, 0x123, 0x129. To access the window, process A gets the current region id from process B and compares it with its cached id: if it matches, A accesses the window directly; otherwise A updates its cached list of attached regions and then accesses the window.]

84 PERFORMANCE INTER-NODE: LATENCY (SHMEM)
[Plots: Put and Get inter-node latency compared with SHMEM. Benchmark: half ping-pong.]

85 PERFORMANCE INTRA-NODE: LATENCY (SHMEM)
[Plot: Put/Get intra-node latency compared with SHMEM. Benchmark: half ping-pong.]

86 EXCLUSIVE LOCAL LOCK: TWO PHASES (BACKOFF)
If an acquisition attempt fails, the process backs off: it decrements the global exclusive counter (Add(-1) at the master process) and then retries. [Figure: Proc 2, trying to lock Proc 1 exclusively, issues Add(-1).]

87 PERFORMANCE MODELING
Performance functions for the synchronization protocols:
  Fence: P_fence = 2.9 μs * log2(p)
  PSCW:  P_start = 0.7 μs, P_wait = 1.8 μs, P_post = P_complete = 350 ns * k
  Locks: P_lock,excl = 5.4 μs; P_lock,shrd = P_lock_all = 2.7 μs; P_unlock = P_unlock_all = 0.4 μs; P_flush = 76 ns; P_sync = 17 ns
Performance functions for the communication protocols (message size s):
  Put/Get: P_put = 0.16 ns * s + 1 μs, P_get = 0.17 ns * s + 1.9 μs
  Atomics: P_acc,sum = 28 ns * s + 2.4 μs, P_acc,min = 0.8 ns * s + 7.3 μs
