Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures


Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Rolf Rabenseifner, rabenseifner@hlrs.de, University of Stuttgart, High Performance Computing Center Stuttgart (HLRS), www.hlrs.de
Gerhard Wellein, gerhard.wellein@rrze.uni-erlangen.de, University of Erlangen, Regionales Rechenzentrum Erlangen (RRZE)
EWOMP 2002, Rome, Italy, Sep. 18-20, 2002
Cluster Programming Models, Slide 1 / 23, Höchstleistungsrechenzentrum Stuttgart

Motivation & Goals
HPC systems are often clusters of SMP nodes, i.e., hybrid architectures. Hybrid programming (MPI+OpenMP) is often slower than pure MPI: why?
Goals:
- optimal usage of the communication bandwidth of the hardware
- minimizing synchronization, i.e., idle time of the hardware
- appropriate parallel programming models: pros & cons, work horses for legacy & new parallel applications
- optimization of the middleware
Slide 2 / 23

Parallel Programming Models on Hybrid Platforms
- Pure MPI: one MPI process on each CPU.
- Hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside of each SMP node.
  - Masteronly: MPI only outside of parallel regions.
  - Overlapping communication and computation: MPI communication by one or a few threads while the other threads are computing.
    - Funneled: MPI only on the master thread (Funneled & Reserved: one reserved communication thread; Funneled with Full Load Balancing).
    - Multiple: more than one thread may communicate (Multiple & Reserved: reserved communication threads; Multiple with Full Load Balancing).
- OpenMP only: distributed virtual shared memory.
Comparison I (2 experiments): pure MPI versus hybrid masteronly. Comparison II (theory + experiment): masteronly versus overlapping. Comparison III: the overlapping variants.
Slide 3 / 23

MPI + OpenMP
MPI_Init_thread(..., required, &provided); MPI-2.0 defines the categories of thread support returned in provided:
- MPI_THREAD_SINGLE: no thread support by MPI.
- MPI_THREAD_SINGLE also covers: the MPI process may sometimes be multi-threaded (parallel regions), and MPI is called only while only the master thread exists.
- Same, but the other threads may sleep: MPI may be called only outside of OpenMP parallel regions (MPI_THREAD_MASTERONLY, a proposal for a later MPI version).
- MPI_THREAD_FUNNELED: same, but all other threads may compute while the master thread calls MPI.
- MPI_THREAD_SERIALIZED: multiple threads may call MPI, but only one thread may execute an MPI routine at a time.
- MPI_THREAD_MULTIPLE: MPI may be called from any thread.
Slide 4 / 23
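To make the thread-support categories above concrete, here is a minimal C sketch (not part of the original slides) of starting a hybrid code with MPI_Init_thread; the requested level and the fallback message are illustrative assumptions.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int required = MPI_THREAD_FUNNELED;  /* only the master thread will call MPI */
        int provided, rank;

        /* MPI-2.0 replacement for MPI_Init when the process uses threads */
        MPI_Init_thread(&argc, &argv, required, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The library may provide less than requested; check before relying on it. */
        if (provided < required && rank == 0)
            fprintf(stderr, "only thread level %d provided, "
                            "restricting MPI calls to code outside parallel regions\n",
                            provided);

        /* ... hybrid MPI+OpenMP work ... */

        MPI_Finalize();
        return 0;
    }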

MPI + OpenMP: calling MPI inside parallel regions
- Using OMP MASTER: MPI_THREAD_FUNNELED is needed. Caution: OMP MASTER implies no barrier and no cache flush!
- Using OMP SINGLE: MPI_THREAD_SERIALIZED is needed.
Schematic pattern, with the explicit barriers that MASTER does not provide (see the C sketch below):
    A[i] = ...                      (compute the values to be sent)
    !$OMP BARRIER
    !$OMP MASTER    (or !$OMP SINGLE)
        MPI_Send(A, ...)
    !$OMP END MASTER    (or !$OMP END SINGLE)
    !$OMP BARRIER
    A[i] = new value                (A may be overwritten only after the send)
Same problem as with MPI_THREAD_MASTERONLY: all other application threads are sleeping while MPI is executing.
Slide 5 / 23

Pure MPI on hybrid architectures: optimizing the communication
Choose the best ranking of the MPI processes on the cluster, based on the MPI virtual topology:
- sequential ranking: 0, 1, 2, ..., 7 on the first node; 8, 9, 10, ..., 15 on the second; 16, 17, 18, ..., 23 on the third; ...
- round-robin ranking: 0, N, 2N, ..., 7N on the first node; 1, N+1, ..., 7N+1 on the second; 2, N+2, ..., 7N+2 on the third; ...
(Example: N = number of used nodes, 8 MPI processes per node.)
Slide 6 / 23
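The MASTER pattern of slide 5 in C/OpenMP form, as a minimal sketch that is not part of the original slides; the buffer size and the destination rank are made up, and the explicit barriers mark exactly the synchronization that OMP MASTER does not provide.

    #include <mpi.h>

    #define N 1024

    /* Send buffer A from inside a parallel region; requires at least
       MPI_THREAD_FUNNELED because only the master thread calls MPI. */
    void compute_and_send(double *A, int dest)
    {
        #pragma omp parallel
        {
            /* ... all threads fill their part of A ... */

            #pragma omp barrier          /* A is complete (and flushed) before the send */

            #pragma omp master
            {
                MPI_Send(A, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            }                            /* note: no implied barrier after END MASTER */

            #pragma omp barrier          /* nobody overwrites A before it has been sent */

            /* ... threads may now assign new values to A ... */
        }
    }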

Pure MPI on hybrid architectures (continued)
Pure MPI causes additional message transfers inside of each node, compared with hybrid MPI+OpenMP. Example: 3-D (or 2-D) domain decomposition, e.g., on 8-way SMP nodes: one (or three) additional cutting plane(s) in each dimension. Expecting the same message size on each plane (outer node boundary and inner intra-node planes with pure MPI, outer boundary only with MPI+OpenMP), pure MPI only doubles the total amount of transferred bytes compared with hybrid MPI+OpenMP; see the worked check below.
Slide 7 / 23

Benchmark results
b_eff benchmark 1) on the Hitachi SR8000, 12 nodes (8 CPUs per node):

                                           b_eff    b_eff at Lmax 2)   3-d-cyclic average   3-d-cyclic at Lmax 2)
  hybrid: aggregated bandwidth [MB/s]      1535     5565               1604                 5638
          (per node) [MB/s]                (128)    (464)              (134)                (470)
  pure MPI: aggregated bandwidth [MB/s]    5299     16624              5000                 18458
          (per process) [MB/s]             (55)     (173)              (52)                 (192)
  bw_pureMPI / bw_hybrid (measured)        3.45     2.99               3.12                 3.27
  size_pureMPI / size_hybrid (assumed)     2        2                  2                    2      (based on the previous slide)
  T_hybrid / T_pureMPI (concluded)         1.73     1.49               1.56                 1.64

Communication with the pure MPI model is about 60% faster than with the hybrid masteronly model.
1) www.hlrs.de/mpi/b_eff/   2) Lmax: message size = 8 MB
Slide 8 / 23
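A short worked check of the "only doubling" statement above, not from the slides, assuming a cubic per-node subdomain of edge length L and a 2 x 2 x 2 process grid inside each 8-way node:

    % hybrid MPI+OpenMP: one process per node exchanges the 6 node faces
    V_{\mathrm{hybrid}} = 6 L^2

    % pure MPI: 8 processes per node, each exchanging 6 quarter-size faces
    V_{\mathrm{pureMPI}} = 8 \cdot 6 \cdot \left(\tfrac{L}{2}\right)^2 = 12 L^2 = 2\, V_{\mathrm{hybrid}}

Half of V_pureMPI crosses the node boundary (the same bytes as in the hybrid case, just split into smaller messages); the other half is the additional intra-node traffic.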

2nd experiment: orthogonal parallel communication
- Hybrid MPI+OpenMP (masteronly): only vertical (inter-node) messages.
- Pure MPI: vertical AND horizontal (additionally intra-node) messages.
Hitachi SR8000, 8 nodes, each node with 8 CPUs, MPI_Sendrecv:
- pure MPI, intra-node, 8 x 8 x 1 MB: 2.0 ms
- pure MPI, inter-node, 8 x 8 x 1 MB: 9.6 ms, i.e., pure MPI in total: 11.6 ms
- hybrid, 8 x 8 MB: 19.2 ms
Slide 9 / 23

Results of the experiment
- Pure MPI is better for message sizes above about 32 kB; hybrid MPI+OpenMP (masteronly) is faster below.
- For long messages: T_hybrid / T_pureMPI > 1.6 (e.g., 19.2 ms / 11.6 ms = 1.66).
- The OpenMP master thread cannot saturate the inter-node network bandwidth.
[Figure: two panels over message size [kB]. Left: transfer time [ms] for T_hybrid (message size x 8), T_pureMPI inter+intra-node, T_pureMPI inter-node only, T_pureMPI intra-node only. Right: ratio T_hybrid / T_pureMPI (inter+intra node).]
Slide 10 / 23

Interpretation of the benchmark results
If the inter-node bandwidth cannot be consumed by using only one processor of each node, the pure MPI model can achieve a better aggregate bandwidth:
- If bw_pureMPI / bw_hybrid > 2 and size_pureMPI / size_hybrid < 2: faster communication with pure MPI.
- If bw_pureMPI = bw_hybrid and size_pureMPI > size_hybrid: faster communication with hybrid MPI+OpenMP.
Slide 11 / 23

Optimizing the hybrid masteronly model
By the MPI library, using multiple threads:
- multiple memory paths (e.g., for strided data)
- multiple floating-point units (e.g., for reduction operations)
- multiple communication links (if one link cannot saturate the hardware inter-node bandwidth)
- this requires knowledge about free CPUs, e.g., via the proposed MPI_THREAD_MASTERONLY category.
By the user application: unburden MPI from work that can be done by the application, e.g. (see the sketch below):
- concatenate strided data in parallel regions
- reduction operations (MPI_Reduce / MPI_Allreduce): numerical operations by user-defined multi-threaded callback routines; however, the MPI standard has no rules about multi-threading of such callback routines.
Slide 12 / 23
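A minimal C/OpenMP sketch (not from the slides) of the "concatenate strided data in parallel regions" idea: the application packs a strided column into a contiguous buffer with all threads and only then calls MPI; the matrix layout, sizes and names are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    /* Pack column 'col' of an n x n row-major matrix in parallel, then send it
       as one contiguous message. MPI is called outside the parallel region,
       so the masteronly style (no special thread support) is sufficient. */
    void send_column(const double *matrix, int n, int col, int dest, MPI_Comm comm)
    {
        double *buf = malloc((size_t)n * sizeof(double));

        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            buf[i] = matrix[(size_t)i * n + col];   /* strided read, contiguous write */

        MPI_Send(buf, n, MPI_DOUBLE, dest, 0, comm);
        free(buf);
    }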

Other advantages of hybrid MPI+OpenMP
- No communication overhead inside of the SMP nodes.
- Larger message sizes on the domain boundaries, hence reduced latency-based overheads.
- Reduced number of MPI processes, hence better speedup (Amdahl's law) and faster convergence, e.g., if a multigrid numeric is computed only on a partial grid.
Slide 13 / 23

Parallel Programming Models on Hybrid Platforms (roadmap slide repeated, see slide 3; next topic: Comparison II, overlapping communication and computation)
Slide 14 / 23

Overlapping computation & communication funneled & reserved The model: communication funneled through master thread Hybrid MPI+OpenMP Requires at least MPI_THREAD_FUNNELED While master thread calls MPI routines: all other threads are computing! The implications: no communication overhead inside of the SMP nodes better CPU usage although inter-node bandwidth may be used only partially 2 levels of parallelism (MPI and OpenMP): additional synchronization overhead Major drawback: load balancing necessary alternative: reserved thread next slide Slide 5 / 23 Höchstleistungsrechenzentrum Stuttgart Overlapping computation & communication funneled & reserved Alternative: reserved tasks on threads: master thread: communication all other threads: computation cons: bad load balance, if T communication n communication_threads T computation n computation_threads pros: more easy programming scheme than with full load balancing chance for good performance! Slide 6 / 23 Höchstleistungsrechenzentrum Stuttgart 8

Performance ratio (theory): funneled & reserved
Performance ratio: ε = T_hybrid,masteronly / T_hybrid,funneled&reserved, with
T_hybrid,masteronly = (f_comm + f_comp,non-overlap + f_comp,overlap) * T_hybrid,masteronly,
n = number of threads per SMP node, m = number of threads reserved for MPI communication.
- ε > 1: funneled & reserved is faster; ε < 1: masteronly is faster.
- Good chance of funneled & reserved: ε_max = 1 + m (1 - 1/n).
- Small risk of funneled & reserved: ε_min = 1 - m/n.
(A worked example follows below.)
[Figure: performance ratio ε versus the communication fraction f_comm [%].]
Slide 17 / 23

Experiment: matrix-vector multiply (MVM), funneled & reserved
CG solver on the Hitachi SR8000, 8 CPUs per SMP node; JDS (jagged diagonal storage) matrix format, vectorizing; n_proc = number of SMP nodes; matrix dimension D_Mat = 512 * 512 * (n_k,loc * n_proc). Varying n_k,loc varies 1/f_comm; f_comp,non-overlap = f_comp,overlap. The experiments confirm the theory: funneled & reserved is faster than masteronly over the measured range.
[Figure: performance ratio ε (experiments and theory) versus 1/f_comm, with the regions "funneled & reserved is faster" and "masteronly is faster".]
Slide 18 / 23
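Evaluating the two bounds above for the configuration of the experiment, n = 8 CPUs per SR8000 node and m = 1 reserved communication thread (a worked example, not from the slides); it reproduces the "performance chance ε < 2" quoted on slide 19:

    \varepsilon_{\max} = 1 + m\left(1 - \tfrac{1}{n}\right) = 1 + 1\cdot\left(1 - \tfrac{1}{8}\right) = 1.875 < 2
    \qquad
    \varepsilon_{\min} = 1 - \frac{m}{n} = 1 - \tfrac{1}{8} = 0.875

So with one reserved thread the best case is a speedup of almost 1.9 over masteronly, and the worst case (negligible communication) is losing 1/8 of the compute power.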

Comparison I & II Some conclusions one MPI process on each CPU Comparison I. (2 experiments) hybrid MPI+OpenMP MPI only outside of parallel regions Comparison II. (theory + experiment) Overlapping Comm. + Comp. Funneled & Reserved reserved thread Slide 9 / 23 Höchstleistungsrechenzentrum Stuttgart One MPI thread should achieve full inter-node bandwidth For MPI 2.: MPI_THREAD_MASTERONLY allows MPI-internal multi-threaded optimizations: e.g., handling of derived data reduction operations Application should overlap computation & communication Performance chance ε < 2 (with one communication thread per SMP) ~50% performance benefit with real CG solver on Hitachi SR8000 Parallel Programming Models on Hybrid Platforms Comparison I. (2 experiments) one MPI process on each CPU MPI only outside of parallel regions hybrid MPI+OpenMP MPI: inter-node communication OpenMP: inside of each SMP node OpenMP only distributed virtual shared memory Comparison III. Overlapping Comm. + Comp. MPI communication by one or a few threads while other threads are computing Comparison II. (theory + experiment) Funneled MPI only on master-thread Multiple more than one thread may communicate Funneled & Reserved reserved thread Funneled with Full Load Balancing Multiple & Reserved reserved threads Multiple with Full Load Balancing Slide 20 / 23 Höchstleistungsrechenzentrum Stuttgart 0

Hybrid MPI+OpenMP versus OpenMP only: comparing other methods
Memory copies when moving data from remote memory to a local CPU register (and vice versa); a worked example of the bandwidth model follows below:
- 2-sided MPI: 2 copies (internal MPI buffer + application receive buffer); bandwidth b(size) = b_inf / (1 + b_inf * T_latency / size).
- 1-sided MPI: 1 copy (application receive buffer); same formula, but probably better b_inf and T_latency.
- Compiler based, i.e., OpenMP on DSM (distributed shared memory) or with cluster extensions, UPC, Co-Array Fortran, HPF:
  - page-based transfer: 0 copies; latency hiding with pre-fetch; extremely poor if only parts of a page are needed.
  - word-based access: 0 copies; latency hiding with buffering; 8 byte / T_latency, e.g., 8 byte / 0.33 µs = 24 MB/s; with buffering, see 1-sided communication.
Slide 21 / 23

Compilation and optimization
Library-based communication (e.g., MPI): clearly separated optimization of (1) the communication (MPI library) and (2) the computation (compiler). The compiler is essential for the success of MPI.
Compiler-based parallelization, including the communication, should follow a similar strategy. Example with the OMNI compiler:
  OpenMP source (Fortran / C) with optimization directives
    -> (1) OMNI compiler (is the original language preserved? are the optimization directives preserved?)
    -> C code + calls to the communication & thread library
    -> (2) optimizing native compiler
    -> executable
Optimization of the computation is more important than optimization of the communication.
Slide 22 / 23
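A worked example (not from the slides) plugging illustrative numbers into the latency-bandwidth model from the table above; the values b_inf = 1000 MB/s and T_latency = 10 µs are assumptions, only the 8 byte / 0.33 µs = 24 MB/s figure is taken from the slide:

    % effective bandwidth model from the table
    b(\mathrm{size}) = \frac{b_\infty}{1 + b_\infty \, T_{\mathrm{latency}} / \mathrm{size}}

    % assumed example: b_\infty = 1000 MB/s, T_latency = 10 µs
    b(8\,\mathrm{kB}) = \frac{1000}{1 + 1.25} \approx 444\ \mathrm{MB/s}
    \qquad
    b(1\,\mathrm{MB}) = \frac{1000}{1 + 0.01} \approx 990\ \mathrm{MB/s}

    % word-based remote access without latency hiding (from the slide)
    8\ \mathrm{byte} / 0.33\,\mu\mathrm{s} \approx 24\ \mathrm{MB/s}

The model shows why page- or message-based transfers need large transfer sizes to approach b_inf, while un-hidden word-based access stays at a few tens of MB/s.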

Conclusions
Pure MPI versus the hybrid masteronly model:
- The communication volume is bigger with pure MPI, but the communication may nevertheless be faster.
- On the other hand, communication is typically only some percent of the total time: relax.
Efficient hybrid programming:
- One should overlap communication and computation, which is hard to do!
- Using the simple hybrid funneled & reserved model, you may be up to 50% faster (compared to the masteronly model).
- Coming from pure MPI, one should try to implement the hybrid funneled & reserved model.
If you want to use pure OpenMP (based on virtual shared memory):
- Try to still use the full bandwidth of the inter-node network (keep pressure on your compiler / DSM writer).
- Be sure that you do not lose any computational optimization, e.g., the best Fortran compiler and its optimization directives should still be usable.
Slide 23 / 23

Summary
- Programming models on hybrid architectures (clusters of SMP nodes): pure MPI, hybrid MPI+OpenMP (masteronly, funneled & reserved), compiler-based parallelization (e.g., OpenMP on clusters).
- Communication difficulties with hybrid programming: multi-threading with MPI, bandwidth of the inter-node network, overlapping of computation and communication (real chance: 50% performance benefit), latency hiding with compiler-based parallelization.
- Optimization and compilation: separation of the optimizations; we must not lose the optimization of the computation.
- Conclusion: pure MPI, hybrid MPI+OpenMP, OpenMP on clusters: a roadmap full of stones!
Slide 24 / 23