Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

1 Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER

5 MPI-3.0 REMOTE MEMORY ACCESS
MPI-3.0 supports RMA ("MPI One Sided"), designed to react to hardware trends: the majority of HPC networks support RDMA [1]. Communication is one sided, with no involvement of the destination, and RMA decouples communication from synchronization. This differs from two-sided message passing: a send/recv pair combines communication and synchronization, whereas a one-sided put transfers the data and a separate sync call provides the synchronization.
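
To make the decoupling concrete, here is a minimal sketch (not from the talk; buffer and window names are illustrative) in which rank 0 puts a value into rank 1's window and synchronization happens through separate lock/unlock calls. Run it with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double *buf;                       /* window memory, one double per process */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1.0;
    MPI_Barrier(MPI_COMM_WORLD);       /* window initialized everywhere */

    if (rank == 0) {                   /* origin: communication ...            */
        double val = 42.0;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);        /* ... and separate synchronization     */
    }
    MPI_Barrier(MPI_COMM_WORLD);       /* rank 1 reads only after the put      */

    if (rank == 1) {
        /* lock/unlock on the local window makes the remote update visible to a
           plain load, also under the separate memory model */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        printf("rank 1 received %.1f\n", *buf);
        MPI_Win_unlock(1, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}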

6 PRESENTATION OVERVIEW
1. Overview of three MPI-3 RMA concepts
2. MPI window creation
3. Communication
4. Synchronization
5. Application evaluation

7 MPI-3 RMA COMMUNICATION OVERVIEW
[Figure: processes A through D expose memory through MPI windows; process A is passive while B, C and D are active. Non-atomic communication calls: Put, Get. Atomic communication calls: Acc, Get & Acc, CAS, FAO.]
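
For reference, the atomic calls abbreviated on the slide are MPI_Accumulate (Acc), MPI_Get_accumulate (Get & Acc), MPI_Compare_and_swap (CAS) and MPI_Fetch_and_op (FAO). A small sketch exercising all four on a per-process 64-bit counter (variable names and the neighbor pattern are illustrative):

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    int64_t *counter;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    *counter = 0;

    int target = (rank + 1) % nprocs;
    int64_t one = 1, old, expected = 0, desired = 7, fetched;

    MPI_Win_fence(0, win);                                /* open the epoch        */
    /* Acc: accumulate into the remote counter */
    MPI_Accumulate(&one, 1, MPI_INT64_T, target, 0, 1, MPI_INT64_T, MPI_SUM, win);
    /* Get & Acc: fetch the old value while adding */
    MPI_Get_accumulate(&one, 1, MPI_INT64_T, &old, 1, MPI_INT64_T,
                       target, 0, 1, MPI_INT64_T, MPI_SUM, win);
    /* FAO: single-element fetch-and-op (here an atomic read via MPI_NO_OP) */
    MPI_Fetch_and_op(&one, &fetched, MPI_INT64_T, target, 0, MPI_NO_OP, win);
    /* CAS: compare-and-swap the remote counter */
    MPI_Compare_and_swap(&desired, &expected, &old, MPI_INT64_T, target, 0, win);
    MPI_Win_fence(0, win);                                /* complete everything   */

    if (rank == 0) printf("counter at rank 0: %lld\n", (long long)*counter);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}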

12 MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
[Figure: synchronization overview. Active target mode: Fence and Post/Start/Complete/Wait. Passive target mode: Lock and Lock All.]

20 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Scalable and generic protocols for window creation, communication and synchronization; they can be used on any RDMA network (e.g., OFED/IB). foMPI, a fully functional MPI-3 RMA implementation, builds on two transports: DMAPP, the lowest-level networking API for Cray Gemini/Aries systems, and XPMEM, a portable Linux kernel module for cross-process memory mapping.

23 PART 1: SCALABLE WINDOW CREATION (Traditional windows)
[Figure: processes A, B and C each expose memory at a different local address, e.g., 0x111, 0x123, 0x120.] Traditional window creation is backwards compatible with MPI-2; every process has to know the remote window addresses of all others. With p = total number of processes: time bound O(p), memory bound O(p).

24 PART 1: SCALABLE WINDOW CREATION (Allocated windows)
[Figure: processes A, B and C expose memory at the same address, 0x123.] Allocated windows let MPI allocate the window memory, e.g., at the same address on every process. With p = total number of processes: time bound O(log p) with high probability, memory bound O(1).
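
A sketch contrasting the two creation paths (sizes and names illustrative): MPI_Win_create wraps application-provided memory and must therefore track a remote base address per process, while MPI_Win_allocate lets MPI place the memory, which is what permits the O(1) metadata claimed above.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const MPI_Aint size = 1024 * sizeof(double);
    MPI_Win win_trad, win_alloc;

    /* Traditional window: the application allocates; the implementation has
       to remember every remote base address (the O(p) variant above). */
    double *mine = malloc(size);
    MPI_Win_create(mine, size, sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win_trad);

    /* Allocated window: MPI allocates the memory itself
       (the O(log p) time / O(1) memory variant above). */
    double *allocd;
    MPI_Win_allocate(size, sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &allocd, &win_alloc);

    /* ... RMA epochs on either window ... */

    MPI_Win_free(&win_alloc);
    MPI_Win_free(&win_trad);
    free(mine);
    MPI_Finalize();
    return 0;
}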

25 PART 1: SCALABLE WINDOW CREATION (Dynamic windows)
[Figure: processes A, B and C attach differing sets of memory regions at local addresses such as 0x111, 0x123, 0x129.] Dynamic windows support local attach and detach of memory and are the most flexible variant. With p = total number of processes: time bound O(p), memory bound O(p).
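
A sketch of the dynamic variant (buffer names and sizes illustrative): memory is attached and detached after window creation, and origins address it through absolute addresses obtained with MPI_Get_address and exchanged explicitly.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* every process attaches a local region after the window exists */
    double *region = malloc(64 * sizeof(double));
    region[0] = (double)rank;
    MPI_Win_attach(win, region, 64 * sizeof(double));

    /* dynamic windows are addressed by absolute address, so the targets'
       base addresses must be made known to the origins */
    MPI_Aint myaddr, *addrs = malloc(nprocs * sizeof(MPI_Aint));
    MPI_Get_address(region, &myaddr);
    MPI_Allgather(&myaddr, 1, MPI_AINT, addrs, 1, MPI_AINT, MPI_COMM_WORLD);

    if (rank == 0 && nprocs > 1) {        /* read region[0] of rank 1 */
        double remote;
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Get(&remote, 1, MPI_DOUBLE, 1, addrs[1], 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);
        printf("rank 1 exposes %.1f\n", remote);
    }

    MPI_Barrier(MPI_COMM_WORLD);          /* detach only after all accesses */
    MPI_Win_detach(win, region);
    MPI_Win_free(&win);
    free(addrs);
    free(region);
    MPI_Finalize();
    return 0;
}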

26 PART 2: COMMUNICATION
Put and Get: direct DMAPP put and get operations, or a local (blocking) memcpy through XPMEM. Accumulate: DMAPP atomic operations for 64-bit types, or a fall back to a remote locking protocol. MPI datatype handling uses the MPITypes library [1], with a fast path for contiguous transfers of common intrinsic datatypes (e.g., MPI_DOUBLE). [Figure: MPI_Put maps to dmapp_put_nbi and MPI_Compare_and_swap maps to dmapp_acswap_qw_nbi, both acting on contiguous memory at the remote process.]
[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09
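
A sketch of the two communication classes on contiguous MPI_DOUBLE data, the case eligible for the fast path (array size and names illustrative):

#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank, nprocs;
    double *winbuf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(N * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &winbuf, &win);
    for (int i = 0; i < N; i++) winbuf[i] = 0.0;

    double src[N];
    for (int i = 0; i < N; i++) src[i] = (double)i;
    double one = 1.0;
    int target = (rank + 1) % nprocs;

    MPI_Win_fence(0, win);
    /* contiguous MPI_DOUBLE transfer: eligible for the fast path */
    MPI_Put(src, N, MPI_DOUBLE, target, 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);
    /* 64-bit accumulate: may map to a network atomic, otherwise the
       implementation falls back to a (slower) remote locking protocol */
    MPI_Accumulate(&one, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_fence(0, win);

    if (rank == 0) printf("winbuf[0] = %.1f\n", winbuf[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}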

27 PERFORMANCE INTER-NODE: LATENCY
[Plots: Put inter-node latency, 80% faster; Get inter-node latency, 20% faster. Benchmark: half ping-pong (Proc 0 puts into Proc 1's memory, then synchronizes).]

28 PERFORMANCE INTRA-NODE: LATENCY
[Plot: Put/Get intra-node latency, 3x faster. Benchmark: half ping-pong.]

29 PERFORMANCE: OVERLAP
[Plot: inter-node communication/computation overlap in %. Benchmark: Proc 0 puts into Proc 1's memory, computes, then synchronizes.] Overlap is useful for scientific codes such as AWM-Olsen (seismic), 3D FFT and MILC.

30 PERFORMANCE: MESSAGE RATE
[Plots: inter-node and intra-node message rates. Benchmark: Proc 0 issues many puts into Proc 1's memory, then synchronizes.]

31 PERFORMANCE: ATOMICS
[Plot: atomic operation performance on 64-bit integers, compared against a proprietary implementation.] The hardware-accelerated protocol gives lower latency; the fall-back protocol gives higher bandwidth.

32 PART 3: SYNCHRONIZATION
[Figure: synchronization overview. Active target mode: Fence and Post/Start/Complete/Wait. Passive target mode: Lock and Lock All.]

33 SCALABLE FENCE IMPLEMENTATION
Fence is a collective call that completes all outstanding memory operations. [Figure: processes 0-3 on two nodes with puts in flight.] The implementation performs local completion of XPMEM (intra-node) operations with an mfence, local completion of DMAPP (inter-node) operations with dmapp_gsync_wait(), and global completion with a barrier:

int MPI_Win_fence( ... )
{
    asm("mfence");          /* local completion (XPMEM) */
    dmapp_gsync_wait();     /* local completion (DMAPP) */
    MPI_Barrier(...);       /* global completion        */
    return MPI_SUCCESS;
}
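
For usage, fence brackets an epoch on all processes of the window; a sketch (ring pattern and names illustrative) in which every process puts its rank into its right neighbor between two fences:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    int *slot;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &slot, &win);
    *slot = -1;

    int right = (rank + 1) % nprocs;

    MPI_Win_fence(0, win);                    /* opens the access/exposure epoch */
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                    /* completes all outstanding puts  */

    printf("rank %d was written by rank %d\n", rank, *slot);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}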

37 SCALABLE FENCE PERFORMANCE
[Plot: fence latency, 90% faster.] Time bound: O(log p). Memory bound: O(1).

38 PSCW SYNCHRONIZATION
[Figure: Proc 0, the starting process, opens an access epoch with start, issues puts, and closes it with complete; Proc 1, the posting process, opens an exposure epoch with post and closes it with wait. A matching algorithm pairs the epochs.] The starting process is allowed to access other processes; the posting process allows access from other processes.

39 PSCW SYNCHRONIZATION
[Figure: timeline with six processes. Procs 1 and 4 post exposure epochs and end them with wait; procs 0, 2, 3 and 5 start access epochs and end them with complete.]
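
The same pattern in code, as a ring sketch with illustrative names: each process exposes its window to its left neighbor with post/wait and accesses its right neighbor with start/complete.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    int *slot;
    MPI_Win win;
    MPI_Group world_grp, post_grp, start_grp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &slot, &win);
    *slot = -1;

    int left  = (rank - 1 + nprocs) % nprocs;          /* will access my window */
    int right = (rank + 1) % nprocs;                   /* whose window I access */
    MPI_Group_incl(world_grp, 1, &left,  &post_grp);   /* expose to the left    */
    MPI_Group_incl(world_grp, 1, &right, &start_grp);  /* access the right      */

    MPI_Win_post(post_grp, 0, win);        /* exposure epoch: matched by start */
    MPI_Win_start(start_grp, 0, win);      /* access epoch: matched by post    */
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_complete(win);                 /* ends the access epoch            */
    MPI_Win_wait(win);                     /* ends the exposure epoch          */

    printf("rank %d was written by rank %d\n", rank, *slot);

    MPI_Group_free(&post_grp);
    MPI_Group_free(&start_grp);
    MPI_Group_free(&world_grp);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}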

40 PSCW SCALABLE POST/START MATCHING
In general there can be n posting and m starting processes; in this example there is one posting process i, which opens its window, and four starting processes j1, ..., j4, which access the remote window. Each starting process holds a local list. The posting process i adds its rank i to the list at each starting process j1, ..., j4, and each starting process j waits until the rank of the posting process i is present in its local list. [Figure: builds showing rank i being appended to the local lists of j1 through j4.]

50 PSCW SCALABLE COMPLETE/WAIT MATCHING
Each starting process increments a counter stored at the posting process. When the counter equals the number of starting processes, the posting process returns from wait. [Figure: builds showing the counter at process i going from 0 to 4 as j1 through j4 increment it.]
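
The counter idea can be imitated with MPI-level atomics (a sketch only: foMPI performs these steps with DMAPP/XPMEM atomics below the MPI layer, and the rank roles and names here are illustrative). The starting processes atomically increment a counter at the posting process, which polls it until everyone has arrived.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    int *counter;                       /* one completion counter per process */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    const int poster = 0;               /* rank 0 plays the posting process i */
    int one = 1, dummy = 0, value = 0;

    MPI_Win_lock_all(0, win);
    if (rank != poster) {
        /* "complete": atomically increment the counter at the poster */
        MPI_Accumulate(&one, 1, MPI_INT, poster, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_flush(poster, win);     /* remote completion of the increment */
    } else {
        /* "wait": poll the counter atomically until everyone has arrived */
        do {
            MPI_Fetch_and_op(&dummy, &value, MPI_INT, poster, 0, MPI_NO_OP, win);
            MPI_Win_flush(poster, win);
        } while (value < nprocs - 1);
        printf("all %d starting processes completed\n", value);
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}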

55 PSCW PERFORMANCE
[Plot: PSCW synchronization latency, ring topology.] Time bounds: P_start = P_wait = O(1), P_post = P_complete = O(log p). Memory bound: O(log p) for scalable programs.

56 SCALABLE LOCK SYNCHRONIZATION
Lock/Unlock can be shared or exclusive; Lock All is always shared. The implementation uses a two-level lock hierarchy: a master process holds the global state (a global shared counter and a global exclusive counter), and each of the processes 0, ..., P-1 holds local state (a local shared counter and a local exclusive bit) in its memory.
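
For reference, the three locking flavors these protocols implement are used like this (a sketch; window contents and the neighbor pattern are illustrative):

#include <mpi.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    double *buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = 0.0;
    MPI_Barrier(MPI_COMM_WORLD);

    double val = 1.0;
    int target = (rank + 1) % nprocs;

    /* exclusive lock on one process (the two-phase protocol of the next slide) */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);

    /* shared lock on one process (a single fetch-add on its local counter) */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Accumulate(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_unlock(target, win);

    /* lock all: always shared (a single fetch-add on the global counter) */
    MPI_Win_lock_all(0, win);
    MPI_Accumulate(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}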

57 EXCLUSIVE LOCAL LOCK: TWO PHASES
Example: Proc 2 wants to lock Proc 1 exclusively with MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, ...). Phase 1: a fetch-and-add increments the global exclusive counter at the master process (invariant 1: no global shared lock is held concurrently). Phase 2: a compare-and-swap sets the exclusive bit at the target process (invariant 2: no local shared or exclusive lock is held concurrently).
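
The two phases can be sketched with MPI-level atomics on an auxiliary lock window. This is purely schematic: foMPI issues DMAPP atomics and its exact counter layout may differ; all names, displacements and the retry loop below are illustrative. The backoff branch corresponds to the Add(-1) shown on backup slide 86.

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define MASTER        0
#define GLOBAL_DISP   1                    /* second int64, used at MASTER only  */
#define EXCL_BIT      ((int64_t)1)         /* local lock state: exclusive flag   */
#define GLOBAL_EXCL   ((int64_t)1 << 32)   /* global counter: one exclusive unit */

/* Acquire an exclusive lock on 'target' following the two phases above. */
static void lock_excl(MPI_Win lockwin, int target)
{
    int64_t add, neg, zero = 0, excl = EXCL_BIT, old;

    for (;;) {
        /* phase 1: announce the exclusive lock at the master; the returned old
           value tells us whether a global shared lock is currently held */
        add = GLOBAL_EXCL;
        MPI_Fetch_and_op(&add, &old, MPI_INT64_T, MASTER, GLOBAL_DISP, MPI_SUM, lockwin);
        MPI_Win_flush(MASTER, lockwin);
        if ((old & 0xFFFFFFFF) == 0) {
            /* phase 2: try to set the exclusive bit at the target; succeeds
               only if no local shared/exclusive lock is held there */
            MPI_Compare_and_swap(&excl, &zero, &old, MPI_INT64_T, target, 0, lockwin);
            MPI_Win_flush(target, lockwin);
            if (old == 0) return;          /* lock acquired */
        }
        /* backoff: undo the global announcement, then retry */
        neg = -GLOBAL_EXCL;
        MPI_Fetch_and_op(&neg, &old, MPI_INT64_T, MASTER, GLOBAL_DISP, MPI_SUM, lockwin);
        MPI_Win_flush(MASTER, lockwin);
    }
}

static void unlock_excl(MPI_Win lockwin, int target)
{
    int64_t neg = -EXCL_BIT, negg = -GLOBAL_EXCL, old;
    MPI_Fetch_and_op(&neg, &old, MPI_INT64_T, target, 0, MPI_SUM, lockwin);
    MPI_Fetch_and_op(&negg, &old, MPI_INT64_T, MASTER, GLOBAL_DISP, MPI_SUM, lockwin);
    MPI_Win_flush_all(lockwin);
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    int64_t *state;
    MPI_Win lockwin;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* two int64 per process: [0] local lock state, [1] global counter (master) */
    MPI_Win_allocate(2 * sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &state, &lockwin);
    state[0] = state[1] = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    /* this MPI lock_all only opens a passive-target epoch for the atomics; it
       is not the emulated global lock itself */
    MPI_Win_lock_all(0, lockwin);
    int target = (rank + 1) % nprocs;
    lock_excl(lockwin, target);            /* critical section on 'target' ...  */
    unlock_excl(lockwin, target);
    MPI_Win_unlock_all(lockwin);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("exclusive lock/unlock completed on %d processes\n", nprocs);

    MPI_Win_free(&lockwin);
    MPI_Finalize();
    return 0;
}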

60 SHARED LOCAL LOCK: ONE PHASE
Example: Proc 0 wants to lock Proc 1 in shared mode with MPI_Win_lock(MPI_LOCK_SHARED, 1, ...). A single fetch-and-add increments the local shared counter at the target (invariant: no local exclusive lock on this process is held concurrently).

62 SHARED GLOBAL LOCK: ONE PHASE
Example: Proc 2 wants to lock the whole window with MPI_Win_lock_all(). A single fetch-and-add increments the global shared counter at the master process (invariant: no local exclusive lock is held concurrently). This takes a constant number of operations, independent of the number of processes p.

64 FLUSH SYNCHRONIZATION
Flush guarantees remote completion. It issues a remote bulk synchronization and an x86 mfence. Time bound: O(1); memory bound: O(1). Flush is one of the most performance-critical functions: the implementation adds only 78 x86 CPU instructions to the critical path. [Figure: Process 0 issues three inc(counter) operations on Process 1; after the flush the counter at Process 1 holds the value 3.]
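
A usage sketch (names illustrative): flush_local only permits reuse of the origin buffer, whereas flush guarantees completion at the target, which is what the counter example above relies on.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs, *counter;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    int one = 1, target = (rank + 1) % nprocs;

    MPI_Win_lock_all(0, win);
    for (int i = 0; i < 3; i++) {
        MPI_Accumulate(&one, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_flush_local(target, win);   /* 'one' may be reused/modified now */
    }
    MPI_Win_flush(target, win);             /* all increments completed remotely */
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    /* assumes the unified memory model; otherwise synchronize before reading */
    printf("rank %d: counter = %d (expected 3)\n", rank, *counter);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}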

66 PERFORMANCE
Evaluation on the Blue Waters system: 22,640 Cray XE6 compute nodes, 724,480 schedulable cores. All microbenchmarks, 4 applications, and one nearly full-scale run.

67 PERFORMANCE: MOTIF APPLICATIONS
[Plots: key/value store, random inserts per second; dynamic sparse data exchange (DSDE) with 6 neighbors.]
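
The key/value store motif reduces to hashing a key to a home rank and slot and issuing one atomic update per insert. A minimal sketch of that insert path (hash function, table size and names are illustrative, not the paper's benchmark code):

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define SLOTS_PER_RANK 4096

int main(int argc, char **argv) {
    int rank, nprocs;
    int64_t *table;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(SLOTS_PER_RANK * sizeof(int64_t), sizeof(int64_t),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &table, &win);
    for (int i = 0; i < SLOTS_PER_RANK; i++) table[i] = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_lock_all(0, win);                      /* one epoch for the whole run */
    for (int i = 0; i < 1000; i++) {
        uint64_t key  = (uint64_t)rank * 1000003u + i;      /* toy key stream */
        uint64_t hash = key * 0x9E3779B97F4A7C15ull;        /* toy hash       */
        int      home = (int)(hash % (uint64_t)nprocs);
        MPI_Aint slot = (MPI_Aint)((hash / (uint64_t)nprocs) % SLOTS_PER_RANK);
        int64_t  value = (int64_t)key;

        /* insert = one atomic replace into the home slot */
        MPI_Accumulate(&value, 1, MPI_INT64_T, home, slot, 1, MPI_INT64_T,
                       MPI_REPLACE, win);
        MPI_Win_flush(home, win);                  /* value is visible at the home */
    }
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("inserted 1000 keys per rank\n");

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}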

68 PERFORMANCE: APPLICATIONS
[Plots: NAS 3D FFT [1] performance, scaling to 65k processes; MILC [2] application execution time, scaling to 512k processes. Annotations represent the performance gain of foMPI over Cray MPI-1.]
[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS'12

69 CONCLUSIONS & SUMMARY
1. MPI window creation routines
2. Non-atomic & atomic communication
3. Fence / PSCW
4. Locks
5. foMPI reference implementation

77 ACKNOWLEDGMENTS
Thanks to: Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team), the MPI Forum RMA WG, and the supporting institutions.

78 Parallel_Programming/foMPI
Thank you for your attention

79 Backup slides

80 DYNAMIC WINDOW CREATION (backup)
[Figure: builds showing processes A, B and C attaching memory regions at local addresses such as 0x111, 0x120, 0x123, 0x129. To access the window, process A gets the current region id from process B and compares it with its cached id: if it matches, A accesses the window directly; otherwise A updates its cached list of attached regions and then accesses the window.]

84 PERFORMANCE INTER-NODE: LATENCY (SHMEM)
[Plots: Put and Get inter-node latency compared with SHMEM. Benchmark: half ping-pong.]

85 PERFORMANCE INTRA-NODE: LATENCY (SHMEM)
[Plot: Put/Get intra-node latency compared with SHMEM. Benchmark: half ping-pong.]

86 EXCLUSIVE LOCAL LOCK: TWO PHASES (BACKOFF)
If an acquisition attempt fails, the process backs off: it decrements the global exclusive counter (Add(-1) at the master process) and then retries. [Figure: Proc 2, trying to lock Proc 1 exclusively, issues Add(-1).]

87 PERFORMANCE MODELING
Performance functions for the synchronization protocols:
  Fence: P_fence = 2.9 μs * log2(p)
  PSCW:  P_start = 0.7 μs, P_wait = 1.8 μs, P_post = P_complete = 350 ns * k
  Locks: P_lock,excl = 5.4 μs; P_lock,shrd = P_lock_all = 2.7 μs; P_unlock = P_unlock_all = 0.4 μs; P_flush = 76 ns; P_sync = 17 ns
Performance functions for the communication protocols (message size s):
  Put/Get: P_put = 0.16 ns * s + 1 μs, P_get = 0.17 ns * s + 1.9 μs
  Atomics: P_acc,sum = 28 ns * s + 2.4 μs, P_acc,min = 0.8 ns * s + 7.3 μs
