Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER
MPI-3.0 RMA
MPI-3.0 supports RMA (MPI One Sided), designed to react to hardware trends: the majority of HPC networks support RDMA [1]. Communication is one-sided (no involvement of the destination), and RMA decouples communication from synchronization. This differs from two-sided message passing: a two-sided transfer pairs a send on Proc A with a recv on Proc B, coupling communication and synchronization, whereas a one-sided transfer issues a put from Proc A followed by a separate sync, with Proc B uninvolved.
PRESENTATION OVERVIEW
1. Overview of three MPI-3 RMA concepts
2. MPI window creation
3. Communication
4. Synchronization
5. Application evaluation
MPI-3 RMA COMMUNICATION OVERVIEW
Non-atomic communication calls (Put, Get): e.g., an active Process B puts into the MPI window of a passive Process A.
Atomic communication calls (Accumulate, Get & Accumulate, Compare-and-swap, Fetch-and-op): e.g., active Processes C and D issue atomic gets and updates on a window.
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Active target mode (both the active and the passive process participate in synchronization): Fence, Post/Start/Complete/Wait.
Passive target mode (only the active process synchronizes): Lock, Lock All.
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Scalable & generic protocols for window creation, communication and synchronization; can be used on any RDMA network (e.g., OFED/IB).
foMPI, a fully functional MPI-3 RMA implementation, built on:
DMAPP: the lowest-level networking API for Cray Gemini/Aries systems
XPMEM: a portable Linux kernel module
PART 1: SCALABLE WINDOW CREATION
Traditional windows: each process exposes an existing local buffer, so base addresses differ across processes (e.g., 0x111, 0x123, 0x120). Backwards compatible (MPI-2).
Allocated windows: MPI allocates the memory itself, which allows identical base addresses across processes (e.g., 0x123 everywhere).
Dynamic windows: memory regions can be locally attached to and detached from the window at any time. Most flexible.
PART 2: COMMUNICATION
Put and Get: direct DMAPP put and get operations, or local (blocking) memcpy (XPMEM).
Accumulate: DMAPP atomic operations for 64-bit types, or fall back to a remote locking protocol.
MPI datatype handling with the MPITypes library [1]. Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE): MPI_Put maps to dmapp_put_nbi, MPI_Compare_and_swap to dmapp_acswap_qw_nbi on contiguous memory at the remote process.
[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM '09
PERFORMANCE INTER-NODE: LATENCY
Put inter-node: 80% faster. Get inter-node: 20% faster. Measured with a half ping-pong: Proc 0 puts into Proc 1's memory, then syncs.
PERFORMANCE INTRA-NODE: LATENCY
Put/Get intra-node: 3x faster. Measured with the same half ping-pong.
PERFORMANCE: OVERLAP
Inter-node overlap in %: Proc 0 issues a put, computes, then syncs. Useful for, e.g., scientific codes: AWM-Olsen seismic, 3D FFT, MILC.
PERFORMANCE: MESSAGE RATE
Proc 0 issues many puts to Proc 1's memory, then syncs; measured inter-node and intra-node.
PERFORMANCE: ATOMICS
Hardware-accelerated protocol (proprietary, 64-bit integers): lower latency. Fall-back protocol: higher bandwidth.
PART 3: SYNCHRONIZATION
Active target mode: Fence, Post/Start/Complete/Wait. Passive target mode: Lock, Lock All.
SCALABLE FENCE IMPLEMENTATION
Collective call; completes all outstanding memory operations in three steps: local completion (XPMEM), remote completion (DMAPP), then global completion via a barrier. Sketch of the implementation:

int MPI_Win_fence( ) {
    asm("mfence");        /* local completion (XPMEM) */
    dmapp_gsync_wait();   /* remote completion (DMAPP) */
    MPI_Barrier(...);     /* global completion */
    return MPI_SUCCESS;
}
SCALABLE FENCE PERFORMANCE
90% faster. (Plot shows measured fence time against the time bound and memory bound.)
PSCW SYNCHRONIZATION
The posting process opens an exposure epoch (post ... wait), allowing access from other processes; the starting process opens an access epoch (start ... complete), which allows it to access other processes. Post and start are paired by a matching algorithm.
(Figure: six processes issuing post/start and complete/wait in arbitrary interleavings; each start matches a post, and each wait returns once the matching completes arrive.)
PSCW SCALABLE POST/START MATCHING
In general there can be n posting and m starting processes. In this example there is one posting process i (which opens its window) and four starting processes j1, ..., j4 (which access the remote window). Each starting process maintains a local list.
Posting process i adds its rank i to the list at each starting process j1, ..., j4. Each starting process j waits until the rank of the posting process i is present in its local list.
PSCW SCALABLE COMPLETE/WAIT MATCHING
Each starting process increments a counter stored at the posting process (0 -> 1 -> 2 -> 3 -> 4 in the example). When the counter equals the number of starting processes, the posting process returns from wait.
PSCW PERFORMANCE
Ring topology. (Plot shows measured time against the time bound and memory bound.)
SCALABLE LOCK SYNCHRONIZATION
Lock/Unlock (shared or exclusive); Lock All (always shared). Two-level lock hierarchy: a global lock at a master process (a shared counter and an exclusive counter), plus a local lock at every process (a shared counter and an exclusive bit).
EXCLUSIVE LOCAL LOCK: TWO PHASES
Example: Proc 2 wants to lock Proc 1 exclusively via MPI_Win_lock( EXCL, 1 ).
PHASE 1: increment the global exclusive counter with a fetch-add (Invariant 1: no global shared lock held concurrently).
PHASE 2: acquire the local lock at the target with a compare & swap (Invariant 2: no local shared/exclusive lock held concurrently).
SHARED GLOBAL LOCK: ONE PHASE
Example: Proc 2 wants to lock the whole window via MPI_Win_lock_all(). Increment the global shared counter with a fetch-add (Invariant: no local exclusive lock is held concurrently).
FLUSH SYNCHRONIZATION
Guarantees remote completion: issues a remote bulk synchronization and an x86 mfence. One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path. Example: Process 0 issues three inc(counter) operations on Process 2 (counter: 0 -> 3), then a flush. (Plot shows measured time against the time bound and memory bound.)
PERFORMANCE
Evaluation on the Blue Waters system: 22,640 Cray XE6 compute nodes, 724,480 schedulable cores. All microbenchmarks, 4 applications, and one nearly full-scale run.
PERFORMANCE: MOTIF APPLICATIONS
Key/value store: random inserts per second. Dynamic Sparse Data Exchange (DSDE) with 6 neighbors.
PERFORMANCE: APPLICATIONS
Annotations represent the performance gain of foMPI over Cray MPI-1. NAS 3D FFT [1]: performance, scaling to 65k processes. MILC [2]: application execution time, scaling to 512k processes.
[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS '09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS '12
CONCLUSIONS & SUMMARY
1. MPI window creation routines
2. Non-atomic & atomic communication
3. Fence / PSCW
4. Locks
5. foMPI reference implementation
6. Application implementation & evaluation
ACKNOWLEDGMENTS
Try foMPI yourself: Parallel_Programming/foMPI
Ongoing work on DDT (see ACM Student Research Competition poster).
Thanks to: Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team), the MPI Forum RMA WG, and the supporting institutions.
Thank you for your attention
More informationLAPI on HPS Evaluating Federation
LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of
More informationOPENSHMEM AS AN EFFECTIVE COMMUNICATION LAYER FOR PGAS MODELS
OPENSHMEM AS AN EFFECTIVE COMMUNICATION LAYER FOR PGAS MODELS A Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for
More informationSystems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications
Systems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications William Gropp University of Illinois, Urbana- Champaign Pavan Balaji, Rajeev Thakur
More informationChallenges and Opportunities for HPC Interconnects and MPI
Challenges and Opportunities for HPC Interconnects and MPI Ron Brightwell, R&D Manager Scalable System Software Department Sandia National Laboratories is a multi-mission laboratory managed and operated
More informationHigh Performance MPI-2 One-Sided Communication over InfiniBand
High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University
More informationMULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA
MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network
More informationDmitry Durnov 15 February 2017
Cовременные тенденции разработки высокопроизводительных приложений Dmitry Durnov 15 February 2017 Agenda Modern cluster architecture Node level Cluster level Programming models Tools 2/20/2017 2 Modern
More informationImplementing MPI-IO Atomic Mode Without File System Support
1 Implementing MPI-IO Atomic Mode Without File System Support Robert Ross Robert Latham William Gropp Rajeev Thakur Brian Toonen Mathematics and Computer Science Division Argonne National Laboratory Argonne,
More informationOn-the-Fly Data Race Detection in MPI One-Sided Communication
On-the-Fly Data Race Detection in MPI One-Sided Communication Presentation Master Thesis Simon Schwitanski (schwitanski@itc.rwth-aachen.de) Joachim Protze (protze@itc.rwth-aachen.de) Prof. Dr. Matthias
More informationOpen MPI for Cray XE/XK Systems
Open MPI for Cray XE/XK Systems Samuel K. Gutierrez LANL Nathan T. Hjelm LANL Manjunath Gorentla Venkata ORNL Richard L. Graham - Mellanox Cray User Group (CUG) 2012 May 2, 2012 U N C L A S S I F I E D
More informationCOSC 6374 Parallel Computation. Remote Direct Memory Acces
COSC 6374 Parallel Computation Remote Direct Memory Acces Edgar Gabriel Fall 2013 Communication Models A P0 receive send B P1 Message Passing Model: Two-sided communication A P0 put B P1 Remote Memory
More informationLecture 36: MPI, Hybrid Programming, and Shared Memory. William Gropp
Lecture 36: MPI, Hybrid Programming, and Shared Memory William Gropp www.cs.illinois.edu/~wgropp Thanks to This material based on the SC14 Tutorial presented by Pavan Balaji William Gropp Torsten Hoefler
More informationAdvanced MPI. Andrew Emerson
Advanced MPI Andrew Emerson (a.emerson@cineca.it) Agenda 1. One sided Communications (MPI-2) 2. Dynamic processes (MPI-2) 3. Profiling MPI and tracing 4. MPI-I/O 5. MPI-3 22/02/2017 Advanced MPI 2 One
More informationCS4961 Parallel Programming. Lecture 18: Introduction to Message Passing 11/3/10. Final Project Purpose: Mary Hall November 2, 2010.
Parallel Programming Lecture 18: Introduction to Message Passing Mary Hall November 2, 2010 Final Project Purpose: - A chance to dig in deeper into a parallel programming model and explore concepts. -
More informationComparing UPC and one-sided MPI: A distributed hash table for GAP
Comparing UPC and one-sided MPI: A distributed hash table for GAP C.M. Maynard EPCC, School of Physics and Astronomy, University of Edinburgh, JCMB, Kings Buildings, Mayfield Road, Edinburgh, EH9 3JZ,
More informationParallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008
Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared
More informationHigh Performance MPI-2 One-Sided Communication over InfiniBand
High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University
More informationMPI: A Message-Passing Interface Standard
MPI: A Message-Passing Interface Standard Version 2.1 Message Passing Interface Forum June 23, 2008 Contents Acknowledgments xvl1 1 Introduction to MPI 1 1.1 Overview and Goals 1 1.2 Background of MPI-1.0
More informationOptimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication
Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based
More informationAltix UV HW/SW! SGI Altix UV utilizes an array of advanced hardware and software feature to offload:!
Altix UV HW/SW! SGI Altix UV utilizes an array of advanced hardware and software feature to offload:!! thread synchronization!! data sharing!! massage passing overhead from CPUs.! This system has a rich
More informationThe Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011
The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities
More informationInterconnect Your Future
Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer
More information6.1 Multiprocessor Computing Environment
6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,
More informationAdvanced MPI. Andrew Emerson
Advanced MPI Andrew Emerson (a.emerson@cineca.it) Agenda 1. One sided Communications (MPI-2) 2. Dynamic processes (MPI-2) 3. Profiling MPI and tracing 4. MPI-I/O 5. MPI-3 11/12/2015 Advanced MPI 2 One
More informationPerformance-Centric System Design
Performance-Centric System Design Torsten Hoefler Swiss Federal Institute of Technology, ETH Zürich Performance-Centric System Design - modeling, programming, and architecture - Torsten Hoefler ETH Zürich
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More informationOperational Robustness of Accelerator Aware MPI
Operational Robustness of Accelerator Aware MPI Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014 Computing Systems @ CSCS http://www.cscs.ch/computers
More informationOne-Sided Append: A New Communication Paradigm For PGAS Models
One-Sided Append: A New Communication Paradigm For PGAS Models James Dinan and Mario Flajslik Intel Corporation {james.dinan, mario.flajslik}@intel.com ABSTRACT One-sided append represents a new class
More informationToward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies
Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies François Tessier, Venkatram Vishwanath, Paul Gressier Argonne National Laboratory, USA Wednesday
More informationCray RS Programming Environment
Cray RS Programming Environment Gail Alverson Cray Inc. Cray Proprietary Red Storm Red Storm is a supercomputer system leveraging over 10,000 AMD Opteron processors connected by an innovative high speed,
More informationFault Tolerance for Remote Memory Access Programming Models
Fault Tolerance for Remote Memory Access Programming Models ABSTRACT Maciej Besta Dept. of Computer Science ETH Zurich Universitätstr. 6, 8092 Zurich, Switzerland maciej.besta@inf.ethz.ch Remote Memory
More informationFriday, May 25, User Experiments with PGAS Languages, or
User Experiments with PGAS Languages, or User Experiments with PGAS Languages, or It s the Performance, Stupid! User Experiments with PGAS Languages, or It s the Performance, Stupid! Will Sawyer, Sergei
More informationAn In-place Algorithm for Irregular All-to-All Communication with Limited Memory
An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de
More informationOptimization of MPI Applications Rolf Rabenseifner
Optimization of MPI Applications Rolf Rabenseifner University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Optimization of MPI Applications Slide 1 Optimization and Standardization
More information