Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen
Slide 1/26: Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen

Rolf Rabenseifner, Universität Stuttgart, Höchstleistungsrechenzentrum Stuttgart (HLRS)
Gerhard Wellein, Friedrich-Alexander-Universität Erlangen-Nürnberg, Regionales Rechenzentrum Erlangen (RRZE)
ZKI, AK Supercomputing, RWTH Aachen, 3. April 2003

Slide 2/26: Motivation
- HPC systems are often clusters of SMP nodes, i.e., hybrid architectures: the CPUs of a node share a common memory, and the nodes are coupled by a node interconnect.
- Goals: use the communication bandwidth of the hardware optimally, and minimize synchronization (= idle time of the hardware).
- Which parallel programming models are appropriate? Pros & cons follow.
Slide 3/26: Major programming models on hybrid systems
- Pure MPI: one MPI process on each CPU.
- Hybrid MPI+OpenMP: OpenMP (shared memory) inside the SMP nodes, MPI (distributed memory) between the nodes via the node interconnect.
- Others: virtual shared memory systems, HPF, ...
- Often hybrid programming (MPI+OpenMP) is slower than pure MPI — why?
- MPI model: a sequential program on each CPU, local data in each process, explicit message passing by calling MPI_Send & MPI_Recv.
- OpenMP model: shared data; the master thread runs the serial parts while the other threads sleep:

      some_serial_code
      #pragma omp parallel for
      for (j = 0; j < n; j++)
          block_to_be_parallelized
      again_some_serial_code

Slide 4/26: Example from SC2001 — pure MPI versus hybrid MPI+OpenMP
- What is better? It depends!
[Figure: integration rate (years per day) vs. number of processors; left panel: explicit SEAM spectral element model (NPACI results with 7 or 8 processes or threads per node); right panel: explicit/semi-implicit SEAM vs. PSTSWM (NCAR). From R. D. Loft, S. J. Thomas, J. M. Dennis: Terascale Spectral Element Dynamical Core for Atmospheric General Circulation Models. Proceedings of SC2001, Denver, USA, Nov. 2001, Figs. 9 and 10.]
Slide 5/26: Parallel programming models on hybrid platforms — overview
- Pure MPI: one MPI process on each CPU.
- Hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside each SMP node. Sub-models:
  - Masteronly: MPI only outside of parallel regions.
  - Overlapping communication + computation: MPI communication by one or a few threads while the other threads are computing; either funneled (MPI only on the master thread) or multiple (more than one thread may communicate), and in each case either with reserved thread(s) for communication or with full load balancing.
- OpenMP only: distributed virtual shared memory.
- Three comparisons follow: I. pure MPI vs. hybrid masteronly (experiment); II. masteronly vs. overlapping (theory + experiment); III. the overlapping variants among each other.

Slide 6/26: Pure MPI — one MPI process on each CPU
- Advantages: no modifications to existing MPI codes; the MPI library need not support multiple threads.
- Problem: fitting the application topology onto the hardware topology, e.g., choosing the ranks in MPI_COMM_WORLD round robin (rank 0 on node 0, rank 1 on node 1, ...) or sequentially (ranks 0-7 on the 1st node, ranks 8-15 on the 2nd). Which placement is optimal depends on which messages then cross the slow inter-node link. (Example: SMP nodes with 8 CPUs/node.)
- Disadvantage: message-passing overhead inside the SMP nodes, instead of simply accessing the data via the shared memory — the main reason for using hybrid programming models.
Slide 7/26: Hybrid masteronly — MPI only outside of parallel regions

      for (iteration = ...) {
          #pragma omp parallel
          {
              /* numerical code */
          } /* end omp parallel */
          /* on master thread only: */
          MPI_Send(/* original data to halo areas in other SMP nodes */);
          MPI_Recv(/* halo data from the neighbors */);
      } /* end for loop */

- Advantages: no message passing inside the SMP nodes; no topology problem.
- Problem: the MPI library must support MPI_THREAD_FUNNELED.
- Disadvantage: all other threads are sleeping while the master thread communicates — the reason for implementing overlapping of communication & computation.

Slide 8/26: Experiment — orthogonal parallel communication
- Pure MPI: vertical AND horizontal messages; hybrid MPI+OpenMP: only vertical messages.
- Hitachi SR8000, 8 nodes, each node with 8 CPUs, MPI_Sendrecv.
- Pure MPI: intra-node 8*8*1 MB: 2.0 ms; inter-node 8*8*1 MB: 9.6 ms; total Σ = 11.6 ms.
- Hybrid: 8*8 MB: 19.2 ms.
- The hybrid version is about 1.6x slower than pure MPI, although it transfers only half of the bytes and incurs fewer latencies due to 8x longer messages.
Slide 9/26: Results of the experiment
- Pure MPI is better for message sizes > 32 kB; for long messages, T_hybrid / T_pureMPI > 1.6.
- The OpenMP master thread cannot saturate the inter-node network bandwidth, so pure MPI is faster than hybrid MPI+OpenMP (masteronly).
[Figure: transfer time [ms] and ratio T_hybrid / T_pureMPI (inter+intra node) vs. message size, with curves for T_hybrid (size*8), T_pureMPI inter+intra, inter-node only, and intra-node only.]

Slide 10/26: Ratio T_hybrid(MPI+OpenMP) / T_pureMPI — IBM RS/6000 SP
- IBM RS/6000 SP at NERSC, 128 SMP nodes, each SMP node with 16 POWER3+ CPUs; 16*16 = 256 CPUs used.
[Figure: transfer time and ratio T_hybrid / T_pureMPI vs. message size for 256 CPUs: hybrid MPI+OpenMP vs. pure MPI with horizontal+vertical, vertical-only, and horizontal-only messages.]
- Measurements: thanks to Gerhard Wellein, RRZE, and Horst Simon, NERSC.
Slide 11/26: Ratio T_hybrid(MPI+OpenMP) / T_pureMPI — IBM RS/6000 SP (cont.)
- Same system: IBM RS/6000 SP at NERSC, 128 SMP nodes, each with 16 POWER3+ CPUs.
[Figure: ratio T_hybrid / T_pureMPI vs. message size for 32, 64, 128, 192, and 256 CPUs; pure MPI vs. hybrid MPI+OpenMP (masteronly). Measurements: thanks to Gerhard Wellein, RRZE, and Horst Simon, NERSC.]

Slide 12/26: Possible reasons
- Hardware: is one CPU able to saturate the inter-node network?
- Software: internal MPI buffering may cause additional memory traffic — memory bandwidth may be the real restricting factor.
Slide 13/26: Optimizing the hybrid masteronly model
- By the MPI library, using multiple threads:
  - multiple memory paths (e.g., for strided data),
  - multiple floating point units (e.g., for reductions),
  - multiple communication links (if one link cannot saturate the hardware inter-node bandwidth);
  - this requires knowledge about free CPUs, e.g., via a new MPI_THREAD_MASTERONLY.
- By the user application: unburden MPI from work that can be done by the application, e.g., concatenate strided data in parallel regions instead of using MPI derived datatypes.

Slide 14/26: PROs & CONs of hybrid masteronly versus pure MPI
PROs (hybrid):
- easier to fit the hybrid hardware topology;
- no communication overhead inside the SMP nodes;
- fewer MPI messages and larger message sizes, i.e., reduced latency-based overheads;
- saves memory;
- fewer MPI processes: better speedup (Amdahl's law) and faster convergence, e.g., if a multigrid numeric is computed only on a partial grid.
CONs (hybrid):
- you may lose your cache (cache flushes inside of OpenMP);
- requires an additional level of parallelism in the problem/program;
- programming complexity increases, and there is additional synchronization overhead;
- requires support by the MPI library and a mature OpenMP compiler;
- problem of saturating the inter-node network;
- idling CPUs in the masteronly scheme.
Slide 15/26: Parallel programming models on hybrid platforms — overview (repeated)
- Comparison II (theory + experiment) now looks inside the hybrid model: masteronly versus overlapping communication and computation, the latter either funneled (MPI only on the master thread) or multiple (more than one thread may communicate), each either with reserved thread(s) for communication or with full load balancing.

Slide 16/26: Overlapping computation & communication — funneled & reserved
- Implication: all other threads are computing while the master thread communicates — better CPU usage.
- The MPI library must provide at least MPI_THREAD_FUNNELED thread safety.
- The inter-node bandwidth may still be used only partially.
- Major drawback: load balancing becomes necessary; the alternative is a reserved thread for communication (next slide).
Slide 17/26: Overlapping computation & communication — funneled & reserved (cont.)
- Alternative: reserve tasks on threads — master thread: communication; all other threads: computation.
- Cons: bad load balance if T_communication / n_communication_threads ≠ T_computation / n_computation_threads.
- Pros: an easier programming scheme than full load balancing, with a chance for good performance!

Slide 18/26: Performance ratio (theory), funneled & reserved
- Performance ratio ε = T_hybrid,masteronly / T_hybrid,funneled&reserved; ε > 1 means funneled & reserved is faster, ε < 1 means masteronly is faster.
- Normalization: f_comm + f_comp,non-overlap + f_comp,overlap = 1, as fractions of T_hybrid,masteronly; n = # threads per SMP node, m = # threads reserved for MPI communication.
- Good chance of funneled & reserved: ε_max = 1 + m(1 - 1/n).
- Small risk of funneled & reserved: ε_min = 1 - m/n.
[Figure: ε as a function of f_comm [%], between these two bounds.]
Slide 19/26: Experiment — matrix-vector multiply (MVM), funneled & reserved
- Jacobi-Davidson solver on a Hitachi SR8000, 8 CPUs per SMP node; JDS (jagged diagonal storage), vectorizing; n_proc = # SMP nodes.
- Matrix dimension D_Mat scales with n_loc,k * n_proc; n_loc,k and 1/f_comm are varied; f_comp,non-overlap = f_comp,overlap.
[Figure: measured performance ratio ε of funneled & reserved over masteronly, compared with the theory curves.]
- Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models. EWOMP 2002, Rome, Italy, Sep. 2002.

Slide 20/26: MVM experiment on IBM SP
- Same experiment on IBM SP Power3 nodes with 16 CPUs per node: funneled & reserved is always faster in these experiments.
- Reason: the memory bandwidth is already saturated by 5 CPUs (see inset: speedup on one SMP node using different numbers of threads).
- Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications, Vol. 17, No. 1, 2003, Sage Science Press.
Slide 21/26: Hybrid MVM implementation
- MASTERONLY mode: automatic parallelization of the inner i loop (data parallel); single-threaded MPI calls outside the OMP PARALLEL region.
- FUNNELED mode: functional parallelism — asynchronous data transfer is simulated with OpenMP; inside one OMP PARALLEL region, the MPI thread communicates (single-threaded MPI calls) and releases finished blocks via a LOCK-protected release list, while the other threads compute on released blocks.

Slide 22/26: Parallel programming models on hybrid platforms — overview (repeated)
- Road map for Comparison III: the overlapping schemes (funneled vs. multiple, reserved threads vs. full load balancing) in the context of pure MPI, hybrid masteronly, and OpenMP only.
Slide 23/26: Comparing other methods
Memory copies from remote memory to local CPU register and vice versa:

    Access method                  | Copies                          | Remarks                       | Bandwidth b(message size)
    -------------------------------+---------------------------------+-------------------------------+---------------------------------------------
    2-sided MPI                    | 2: internal MPI buffer +        | latency hiding with buffering | b(size) = b_inf / (1 + b_inf*T_latency/size)
                                   |    application receive buffer   |                               |
    1-sided MPI                    | 1: application receive buffer   |                               | same formula, but probably better b_inf
                                   |                                 |                               | and T_latency
    Compiler based: OpenMP on DSM  | page based transfer             |                               | extremely poor, if only parts are needed
    (distributed shared memory) or | word based access               | latency hiding with pre-fetch | 8 byte / T_latency,
    with cluster extensions, UPC,  |                                 |                               | e.g., 8 byte / 0.33 µs = 24 MB/s;
    Co-Array Fortran, HPF          |                                 |                               | b_inf: see 1-sided communication

Slide 24/26: Compilation and optimization
- Library-based communication (e.g., MPI): clearly separated optimization of (1) the communication, in the MPI library, and (2) the computation, in the compiler; the compiler is essential for the success of MPI.
- Compiler-based parallelization (including the communication) follows a similar strategy: OpenMP source (Fortran / C) with optimization directives → (1) OMNI compiler → C code + calls to a communication & thread library → (2) optimizing native compiler → executable.
- Open questions: is the original language preserved? are the optimization directives preserved?
- Optimization of the computation is more important than optimization of the communication.
Slide 25/26: Comparison I & II — some conclusions
- Hybrid masteronly may not saturate the inter-node bandwidth: experiments show 1.5-1.7 times faster message transfer with pure MPI (looking at the communication part only!), but pure MPI has a hard topology problem.
- With overlapping computation & communication and one communication thread per SMP node, the performance chance is ε < 2; up to 25% total performance benefit was seen with a real matrix-vector multiply.
- Cons: overlapping is hard & tricky.

Slide 26/26: Conclusions
- Pure MPI versus the hybrid masteronly model: the topology problem of pure MPI may be hard, e.g., with unstructured grids; the communication volume is bigger with pure MPI, but it may nevertheless be faster. On the other hand, communication is typically only some percent of the runtime — relax.
- Efficient hybrid programming: one may overlap communication and computation — hard to do! Using the simple hybrid funneled & reserved model, you may be up to 25% faster than with the masteronly model.
- If you want to use pure OpenMP (based on virtual shared memory): try to still use the full bandwidth of the inter-node network (keep pressure on your compiler/DSM writer), and be sure that you do not lose any computational optimization — e.g., the best Fortran compiler & optimization directives should still be usable.
- The optimal parallel programming model on clusters of SMP nodes depends on many factors — no general recommendation.
More informationLoad Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops
Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops Nikolaos Drosinos and Nectarios Koziris National Technical University of Athens School of Electrical and Computer Engineering
More informationParallel Computing Why & How?
Parallel Computing Why & How? Xing Cai Simula Research Laboratory Dept. of Informatics, University of Oslo Winter School on Parallel Computing Geilo January 20 25, 2008 Outline 1 Motivation 2 Parallel
More informationLoad Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes
Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes Agnieszka Debudaj-Grabysz 1 and Rolf Rabenseifner 2 1 Silesian University of Technology, Gliwice, Poland; 2 High-Performance Computing
More informationScalasca performance properties The metrics tour
Scalasca performance properties The metrics tour Markus Geimer m.geimer@fz-juelich.de Scalasca analysis result Generic metrics Generic metrics Time Total CPU allocation time Execution Overhead Visits Hardware
More informationMessage Passing with MPI
Message Passing with MPI PPCES 2016 Hristo Iliev IT Center / JARA-HPC IT Center der RWTH Aachen University Agenda Motivation Part 1 Concepts Point-to-point communication Non-blocking operations Part 2
More informationAdvanced Message-Passing Interface (MPI)
Outline of the workshop 2 Advanced Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Morning: Advanced MPI Revision More on Collectives More on Point-to-Point
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor
More informationHybrid programming with MPI and OpenMP On the way to exascale
Institut du Développement et des Ressources en Informatique Scientifique www.idris.fr Hybrid programming with MPI and OpenMP On the way to exascale 1 Trends of hardware evolution Main problematic : how
More informationMPI & OpenMP Mixed Hybrid Programming
MPI & OpenMP Mixed Hybrid Programming Berk ONAT İTÜ Bilişim Enstitüsü 22 Haziran 2012 Outline Introduc/on Share & Distributed Memory Programming MPI & OpenMP Advantages/Disadvantages MPI vs. OpenMP Why
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More informationPerformance Engineering - Case study: Jacobi stencil
Performance Engineering - Case study: Jacobi stencil The basics in two dimensions (2D) Layer condition in 2D From 2D to 3D OpenMP parallelization strategies and layer condition in 3D NT stores Prof. Dr.
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More information1 Overview. 2 A Classification of Parallel Hardware. 3 Parallel Programming Languages 4 C+MPI. 5 Parallel Haskell
Table of Contents Distributed and Parallel Technology Revision Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 1 Overview 2 A Classification of Parallel
More informationHigh Performance Computing. Introduction to Parallel Computing
High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationCISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan
CISC 879 Software Support for Multicore Architectures Spring 2008 Student Presentation 6: April 8 Presenter: Pujan Kafle, Deephan Mohan Scribe: Kanik Sem The following two papers were presented: A Synchronous
More informationHybrid Model Parallel Programs
Hybrid Model Parallel Programs Charlie Peck Intermediate Parallel Programming and Cluster Computing Workshop University of Oklahoma/OSCER, August, 2010 1 Well, How Did We Get Here? Almost all of the clusters
More informationCSE5351: Parallel Processing Part III
CSE5351: Parallel Processing Part III -1- Performance Metrics and Benchmarks How should one characterize the performance of applications and systems? What are user s requirements in performance and cost?
More informationHigh Performance Computing
High Performance Computing ADVANCED SCIENTIFIC COMPUTING Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research Group Leader, Juelich
More informationCopyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 18. Combining MPI and OpenMP
Chapter 18 Combining MPI and OpenMP Outline Advantages of using both MPI and OpenMP Case Study: Conjugate gradient method Case Study: Jacobi method C+MPI vs. C+MPI+OpenMP Interconnection Network P P P
More informationNUMA-aware OpenMP Programming
NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationMULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA
MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network
More informationCMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)
CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI
More informationMPI+OpenMP hybrid computing
MPI+OpenMP hybrid computing (on modern multicore systems) Georg Hager Gerhard Wellein Holger Stengel Jan Treibig Markus Wittmann Michael Meier Erlangen Regional Computing Center (RRZE), Germany 39th Speedup
More informationPerformance and Software-Engineering Considerations for Massively Parallel Simulations
Performance and Software-Engineering Considerations for Massively Parallel Simulations Ulrich Rüde (ruede@cs.fau.de) Ben Bergen, Frank Hülsemann, Christoph Freundl Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de
More informationClusters of SMP s. Sean Peisert
Clusters of SMP s Sean Peisert What s Being Discussed Today SMP s Cluters of SMP s Programming Models/Languages Relevance to Commodity Computing Relevance to Supercomputing SMP s Symmetric Multiprocessors
More informationImproving the interoperability between MPI and OmpSs-2
Improving the interoperability between MPI and OmpSs-2 Vicenç Beltran Querol vbeltran@bsc.es 19/04/2018 INTERTWinE Exascale Application Workshop, Edinburgh Why hybrid MPI+OmpSs-2 programming? Gauss-Seidel
More informationA brief introduction to OpenMP
A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism
More informationHPC Fall 2010 Final Project 3 2D Steady-State Heat Distribution with MPI
HPC Fall 2010 Final Project 3 2D Steady-State Heat Distribution with MPI Robert van Engelen Due date: December 10, 2010 1 Introduction 1.1 HPC Account Setup and Login Procedure Same as in Project 1. 1.2
More informationPerformance Tuning and OpenMP
Performance Tuning and OpenMP mueller@hlrs.de University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Höchstleistungsrechenzentrum Stuttgart Outline Motivation Performance
More informationThe Icosahedral Nonhydrostatic (ICON) Model
The Icosahedral Nonhydrostatic (ICON) Model Scalability on Massively Parallel Computer Architectures Florian Prill, DWD + the ICON team 15th ECMWF Workshop on HPC in Meteorology October 2, 2012 ICON =
More informationHigh-Performance Computing-Center. EuroPVM/MPI 2004, Budapest, Sep , q 2
olf abenseifner (HLS), rabenseifner@hlrsde Jesper Larsson Träff (EC) traff@ccrl-necede University of Stuttgart, High-Performance Computing-Center Stuttgart (HLS), wwwhlrsde C&C esearch Laboratories EC
More informationIntroduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014
Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational
More informationAdaptive Transpose Algorithms for Distributed Multicore Processors
Adaptive Transpose Algorithms for Distributed Multicore Processors John C. Bowman and Malcolm Roberts University of Alberta and Université de Strasbourg April 15, 2016 www.math.ualberta.ca/ bowman/talks
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationParallel Computing and the MPI environment
Parallel Computing and the MPI environment Claudio Chiaruttini Dipartimento di Matematica e Informatica Centro Interdipartimentale per le Scienze Computazionali (CISC) Università di Trieste http://www.dmi.units.it/~chiarutt/didattica/parallela
More informationHPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser
HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationShared Memory Parallelism - OpenMP
Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openmp/#introduction) OpenMP sc99 tutorial
More information