Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

1 Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER

5 MPI-3.0 RMA MPI-3.0 supports RMA ("MPI One Sided"), designed to react to hardware trends: the majority of HPC networks support RDMA [1]. Communication is one sided (no involvement of the destination), and RMA decouples communication from synchronization. This differs from message passing. [Diagram: two sided, Proc A send / Proc B recv, communication and synchronization combined; one sided, Proc A put + sync targeting Proc B, communication and synchronization separate.]
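
To make the contrast concrete, here is a minimal sketch of the same transfer written two sided and one sided. It assumes a window win that exposes the receive buffer and uses fence synchronization; buffer and variable names are illustrative.

    #include <mpi.h>

    /* Two sided: sender and receiver are both involved. */
    void transfer_two_sided(double *buf, int n, int rank) {
        if (rank == 0)
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* One sided: only the origin issues the transfer; the fence
       (a collective) carries the synchronization separately. */
    void transfer_one_sided(double *buf, int n, int rank, MPI_Win win) {
        MPI_Win_fence(0, win);                  /* open access/exposure epoch */
        if (rank == 0)
            MPI_Put(buf, n, MPI_DOUBLE, 1 /* target */, 0 /* disp */,
                    n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                  /* complete all RMA operations */
    }

In the one-sided version the target never posts a matching call; it only participates in the (separate) synchronization.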

6 PRESENTATION OVERVIEW 1. Overview of the three MPI-3 RMA concepts 2. MPI window creation 3. Communication 4. Synchronization 5. Application evaluation

7 MPI-3 RMA COMMUNICATION OVERVIEW Non-atomic communication calls (Put, Get) and atomic communication calls (Acc, Get & Acc, CAS, FAO). [Diagram: active processes issue Put and atomic Get operations into MPI windows in the memory of other, possibly passive, processes.]
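
For reference, a brief sketch with one call from each family; it assumes a window win of 64-bit integers with an open access epoch, and the target rank and displacements are illustrative.

    #include <mpi.h>
    #include <stdint.h>

    /* 'win' is assumed to expose an array of int64_t on every process
       (disp_unit = sizeof(int64_t)); an access epoch is open. */
    void rma_call_zoo(int target, MPI_Win win) {
        int64_t src = 42, dst, add = 1, fetched, compare = 0, swap = 1, old;

        /* non-atomic bulk data movement */
        MPI_Put(&src, 1, MPI_INT64_T, target, 0, 1, MPI_INT64_T, win);
        MPI_Get(&dst, 1, MPI_INT64_T, target, 1, 1, MPI_INT64_T, win);

        /* atomic calls (atomic per element at the target) */
        MPI_Accumulate(&add, 1, MPI_INT64_T, target, 2, 1, MPI_INT64_T,
                       MPI_SUM, win);                                   /* Acc       */
        MPI_Get_accumulate(&add, 1, MPI_INT64_T, &fetched, 1, MPI_INT64_T,
                           target, 2, 1, MPI_INT64_T, MPI_SUM, win);    /* Get & Acc */
        MPI_Compare_and_swap(&swap, &compare, &old, MPI_INT64_T,
                             target, 3, win);                           /* CAS       */
        MPI_Fetch_and_op(&add, &fetched, MPI_INT64_T, target, 3,
                         MPI_SUM, win);                                 /* FAO       */
    }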

12 MPI-3.0 RMA SYNCHRONIZATION OVERVIEW Active target mode: Fence and Post/Start/Complete/Wait (the target participates in synchronization). Passive target mode: Lock and Lock All (the target stays passive). [Diagram: active and passive processes with separate synchronization and communication phases.]

22 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION Scalable & generic protocols for window creation, communication and synchronization; they can be used on any RDMA network (e.g., OFED/IB). fompi, a fully functional MPI-3 RMA implementation, builds on DMAPP (the lowest-level networking API for Cray Gemini/Aries systems) and XPMEM (a portable Linux kernel module).

23 PART 1: SCALABLE WINDOW CREATION Traditional windows: each process exposes memory it already owns, at generally different local addresses (e.g., 0x111, 0x123, 0x120); backwards compatible with MPI-2.

24 PART 1: SCALABLE WINDOW CREATION Allocated windows: MPI allocates the window memory itself, which allows identical base addresses on all processes (e.g., 0x123 everywhere).

25 PART 1: SCALABLE WINDOW CREATION Dynamic windows: memory regions are attached and detached locally at runtime (the figure shows regions at 0x111, 0x120, 0x123 and 0x129); the most flexible variant. A sketch of all three creation variants follows.
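
A minimal sketch of the three creation paths; the sizes, info arguments and communicator are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    void create_windows(void) {
        MPI_Win w1, w2, w3;
        MPI_Aint size = 4096;

        /* Traditional window: expose memory the application already owns. */
        void *buf = malloc(size);
        MPI_Win_create(buf, size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &w1);

        /* Allocated window: MPI allocates the memory itself. */
        void *abuf;
        MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &abuf, &w2);

        /* Dynamic window: attach/detach regions locally at any time. */
        MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &w3);
        void *dbuf = malloc(size);
        MPI_Win_attach(w3, dbuf, size);
        /* ... communicate using addresses obtained via MPI_Get_address ... */
        MPI_Win_detach(w3, dbuf);

        MPI_Win_free(&w1); MPI_Win_free(&w2); MPI_Win_free(&w3);
        free(buf); free(dbuf);
    }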

26 PART 2: COMMUNICATION Put and Get: direct DMAPP put and get operations, or a local (blocking) memcpy (XPMEM). Accumulate: DMAPP atomic operations for 64-bit types, or fall back to a remote locking protocol. MPI datatype handling with the MPITypes library [1]; fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE). [Diagram: MPI_Put maps to dmapp_put_nbi, MPI_Compare_and_swap maps to dmapp_acswap_qw_nbi, both targeting contiguous memory at the remote process.] [1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09
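
For illustration, a contiguous transfer of an intrinsic type (the fast path) next to a strided transfer described by a derived datatype; the window win, the counts and the stride are illustrative.

    #include <mpi.h>

    void put_examples(double *local, int target, MPI_Win win) {
        /* Fast path: contiguous buffer of an intrinsic type. */
        MPI_Put(local, 1024, MPI_DOUBLE, target, 0, 1024, MPI_DOUBLE, win);

        /* Derived datatype: every 4th double, 256 elements in total;
           non-contiguous layouts go through the datatype-processing path. */
        MPI_Datatype strided;
        MPI_Type_vector(256, 1, 4, MPI_DOUBLE, &strided);
        MPI_Type_commit(&strided);
        MPI_Put(local, 1, strided, target, 0, 1, strided, win);
        MPI_Type_free(&strided);
    }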

27 PERFORMANCE INTER-NODE: LATENCY Put inter-node: 80% faster. Get inter-node: 20% faster. Benchmark: half ping-pong (Proc 0 puts into Proc 1's memory, then synchronizes).

28 PERFORMANCE INTRA-NODE: LATENCY Put/Get intra-node: 3x faster. Benchmark: half ping-pong (Proc 0 puts into Proc 1's memory, then synchronizes).

29 PERFORMANCE: OVERLAP Inter-node overlap of communication and computation (in %), measured with a put followed by computation and a final synchronization. Useful for, e.g., scientific codes: AWM-Olsen seismic, 3D FFT, MILC.

30 PERFORMANCE: MESSAGE RATE Benchmark: many puts from Proc 0 into Proc 1's memory followed by one synchronization; measured inter-node and intra-node.

31 PERFORMANCE: ATOMICS Hardware-accelerated protocol (64-bit integers): lower latency. Fall-back protocol: higher bandwidth.

32 PART 3: SYNCHRONIZATION Active target mode: Fence and Post/Start/Complete/Wait. Passive target mode: Lock and Lock All.

33 SCALABLE FENCE IMPLEMENTATION Collective call; completes all outstanding memory operations in three steps: local completion (XPMEM), remote completion (DMAPP), and global completion via a barrier. [Diagram: Procs 0-3 on Nodes 0 and 1 issue puts that the fence completes.]

    int MPI_Win_fence(...) {
        asm("mfence");        /* local completion (XPMEM)  */
        dmapp_gsync_wait();   /* remote completion (DMAPP) */
        MPI_Barrier(...);     /* global completion         */
        return MPI_SUCCESS;
    }
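
From the user's side the fence brackets an epoch, and the assert arguments defined by MPI can tell an implementation when parts of this protocol are unnecessary; a minimal usage sketch (buffer name and neighbor rank are illustrative):

    #include <mpi.h>

    void halo_push(double *halo, int n, int right, MPI_Win win) {
        /* no RMA operations precede this fence */
        MPI_Win_fence(MPI_MODE_NOPRECEDE, win);

        MPI_Put(halo, n, MPI_DOUBLE, right, 0, n, MPI_DOUBLE, win);

        /* no RMA operations follow this fence; it completes the put
           locally, remotely and globally as sketched above */
        MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
    }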

37 SCALABLE FENCE PERFORMANCE 90% faster. [Plot: fence performance together with the time bound and memory bound.]

38 PSCW SYNCHRONIZATION The posting process opens an exposure epoch (post), allowing other processes to access its window, and closes it with wait; the starting process opens an access epoch (start), issues its Puts, and closes it with complete. Post/start and complete/wait each require a matching algorithm. [Diagram: Proc 0 starts, puts and completes while Proc 1 posts and waits.]

39 PSCW SYNCHRONIZATION [Diagram: six processes; each starting process starts and completes an access epoch while each posting process posts and waits, with general many-to-many matching between the two sides.]
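
The four calls in context; a minimal usage sketch in which the origin and target groups are assumed to have been built with the MPI_Group routines, and buffer names are illustrative.

    #include <mpi.h>

    /* Origin side: access the target's window. */
    void origin_side(double *buf, int n, int target_rank,
                     MPI_Group target_grp, MPI_Win win) {
        MPI_Win_start(target_grp, 0, win);   /* begin access epoch   */
        MPI_Put(buf, n, MPI_DOUBLE, target_rank, 0, n, MPI_DOUBLE, win);
        MPI_Win_complete(win);               /* end access epoch     */
    }

    /* Target side: expose the window to the origin group. */
    void target_side(MPI_Group origin_grp, MPI_Win win) {
        MPI_Win_post(origin_grp, 0, win);    /* begin exposure epoch */
        MPI_Win_wait(win);                   /* block until all origins
                                                called MPI_Win_complete */
    }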

40 PSCW SCALABLE POST/START MATCHING In general, there can be n posting and m starting processes; in this example there is one posting process i (which opens its window) and four starting processes j1, ..., j4 (which access the remote window).

42 PSCW SCALABLE POST/START MATCHING Each starting process holds a local list. The posting process i adds its rank to the list at each starting process j1, ..., j4; each starting process j waits until the rank of the posting process i is present in its local list.

50 PSCW SCALABLE COMPLETE/WAIT MATCHING Each starting process increments a counter stored at the posting process. When the counter equals the number of starting processes, the posting process returns from wait. [Diagram: the counter at i goes from 0 to 4 as j1, ..., j4 complete.]
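
fompi drives both matching steps with remote atomics directly over DMAPP/XPMEM; the sketch below only illustrates the complete/wait counter at the MPI level (the window layout, polling loop and names are assumptions, not the actual implementation). The post/start list can be filled analogously with atomic updates.

    #include <mpi.h>
    #include <stdint.h>

    /* sync_win exposes one int64_t counter per process, initialized to 0;
       an MPI_Win_lock_all epoch is assumed to be open on it. */

    /* Starting process: signal completion to the posting process. */
    void notify_complete(int posting_rank, MPI_Win sync_win) {
        int64_t one = 1, old;
        MPI_Fetch_and_op(&one, &old, MPI_INT64_T, posting_rank, 0,
                         MPI_SUM, sync_win);       /* counter++ at the poster */
        MPI_Win_flush(posting_rank, sync_win);
    }

    /* Posting process: poll its own counter until all origins completed. */
    void wait_for_all(int my_rank, int num_starting, MPI_Win sync_win) {
        int64_t val = 0, dummy = 0;
        do {
            MPI_Fetch_and_op(&dummy, &val, MPI_INT64_T, my_rank, 0,
                             MPI_NO_OP, sync_win); /* atomic read of counter */
            MPI_Win_flush(my_rank, sync_win);
        } while (val < num_starting);
    }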

55 PSCW PERFORMANCE [Plot: PSCW performance with time bound and memory bound; ring topology.]

56 SCALABLE LOCK SYNCHRONIZATION Lock/Unlock (shared or exclusive, passive target) and Lock All (always shared). Two-level lock hierarchy: global state at a master process (a shared counter and an exclusive counter) plus local state at every process 0, ..., P-1 (a shared counter and an exclusive bit).

57 EXCLUSIVE LOCAL LOCK: TWO PHASES Example: Proc 2 wants to lock Proc 1 exclusively with MPI_Win_lock(EXCL, 1). Phase 1: increment the global exclusive counter with a fetch-add (Invariant 1: no global shared lock is held concurrently). Phase 2: acquire the local lock at the target with a compare & swap (Invariant 2: no local shared or exclusive lock is held concurrently).

60 SHARED GLOBAL LOCK: ONE PHASE Example: Proc 2 wants to lock the whole window with MPI_Win_lock_all(). Increment the global shared counter with a fetch-add (Invariant: no local exclusive lock is held concurrently).
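
At the MPI level the two passive-target flavours are used as follows; the target rank, displacements and buffers are illustrative, and the comments refer to the fompi protocols sketched above.

    #include <mpi.h>

    void passive_target_examples(double *buf, int target, int nprocs, MPI_Win win) {
        /* Exclusive lock of one target (two-phase protocol internally). */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(buf, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);          /* also completes the put */

        /* Shared lock of the whole window (one-phase protocol internally);
           buf is assumed to hold nprocs elements. */
        MPI_Win_lock_all(0, win);
        for (int r = 0; r < nprocs; r++)
            MPI_Get(&buf[r], 1, MPI_DOUBLE, r, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock_all(win);
    }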

62 FLUSH SYNCHRONIZATION Guarantees remote completion: issues a remote bulk synchronization and an x86 mfence. One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path. [Diagram: Process 0 increments a counter at Process 2 three times and then flushes; plot shows time bound and memory bound.]
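
MPI-3 distinguishes local from remote completion inside a passive-target epoch; a brief sketch of where flush fits (window and names are illustrative):

    #include <mpi.h>
    #include <stdint.h>

    /* 'win' is accessed inside an MPI_Win_lock_all epoch. */
    void flush_variants(int64_t *val, int target, MPI_Win win) {
        MPI_Put(val, 1, MPI_INT64_T, target, 0, 1, MPI_INT64_T, win);

        MPI_Win_flush_local(target, win); /* local completion: *val may be
                                             reused, but the data may not
                                             have arrived yet             */

        MPI_Win_flush(target, win);       /* remote completion: the update
                                             is visible in the target's
                                             window                       */
    }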

64 PERFORMANCE Evaluation on the Blue Waters system: 22,640 Cray XE6 compute nodes, 724,480 schedulable cores. All microbenchmarks, 4 applications, and one nearly full-scale run.

65 PERFORMANCE: MOTIF APPLICATIONS Key/value store (random inserts per second) and dynamic sparse data exchange (DSDE) with 6 neighbors.
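
To give a flavour of the key/value motif, a deliberately simplified insert path over RMA; the hash function, table layout and collision handling are assumptions for illustration, not the evaluated store.

    #include <mpi.h>
    #include <stdint.h>

    #define TABLE_SLOTS 65536   /* key/value slots per process (assumption) */

    /* kv_win exposes an array of 2*TABLE_SLOTS int64_t on every process
       (disp_unit = sizeof(int64_t)); slot s holds its key at 2*s and its
       value at 2*s+1. An MPI_Win_lock_all epoch is open on kv_win. */
    void kv_insert(int64_t key, int64_t value, int nprocs, MPI_Win kv_win) {
        uint64_t h = (uint64_t)key * 0x9e3779b97f4a7c15ULL;     /* toy hash */
        int owner = (int)(h % (uint64_t)nprocs);
        MPI_Aint slot = (MPI_Aint)((h / (uint64_t)nprocs) % TABLE_SLOTS);

        int64_t entry[2] = { key, value };
        /* last-writer-wins: MPI_Accumulate with MPI_REPLACE is atomic per
           64-bit element (a whole-pair update would need extra care). */
        MPI_Accumulate(entry, 2, MPI_INT64_T, owner, 2 * slot, 2, MPI_INT64_T,
                       MPI_REPLACE, kv_win);
        MPI_Win_flush(owner, kv_win);
    }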

66 PERFORMANCE: APPLICATIONS Annotations represent the performance gain of fompi over Cray MPI-1. NAS 3D FFT [1]: performance, scaling to 65k processes. MILC [2]: application execution time, scaling to 512k processes. [1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09. [2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS'12.

67 CONCLUSIONS & SUMMARY 1. MPI window creation routines 2. Non-atomic & atomic communication 3. Fence / PSCW 4. Locks 5. fompi reference implementation 6. Application implementation & evaluation

75 ACKNOWLEDGMENTS Try fompi yourself: Parallel_Programming/foMPI. Ongoing work on DDT (see the ACM Student Research Competition poster). Thanks to Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team), the MPI Forum RMA WG, and the supporting institutions.

76 Thank you for your attention 76
