Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen
Slide 1/26: Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen

Rolf Rabenseifner, Universität Stuttgart, Höchstleistungsrechenzentrum Stuttgart (HLRS)
Gerhard Wellein, Friedrich-Alexander-Universität Erlangen-Nürnberg, Regionales Rechenzentrum Erlangen (RRZE)
ZKI, AK Supercomputing, RWTH Aachen, 3. April 2003

Slide 2/26: Motivation
- HPC systems are often clusters of SMP nodes, i.e., hybrid architectures: the CPUs of a node share a common memory, and the nodes are coupled by a node interconnect.
- Goals: use the communication bandwidth of the hardware optimally, and minimize synchronization (= idle time of the hardware).
- Which parallel programming models are appropriate? Pros & cons follow.
Slide 3/26: Major programming models on hybrid systems
- Pure MPI: one MPI process on each CPU.
- Hybrid MPI+OpenMP: OpenMP (shared memory) inside the SMP nodes, MPI (distributed memory) between the nodes via the node interconnect.
- Others: virtual shared memory systems, HPF, ...
- Often hybrid programming (MPI+OpenMP) is slower than pure MPI — why?
- MPI model: a sequential program on each CPU, local data in each process, explicit message passing by calling MPI_Send & MPI_Recv.
- OpenMP model: shared data; the master thread runs the serial parts while the other threads sleep:

      some_serial_code
      #pragma omp parallel for
      for (j = 0; j < n; j++)
          block_to_be_parallelized
      again_some_serial_code

Slide 4/26: Example from SC2001 — pure MPI versus hybrid MPI+OpenMP
- What is better? It depends!
[Figure: integration rate (years per day) vs. number of processors; left panel: explicit SEAM spectral element model (NPACI results with 7 or 8 processes or threads per node); right panel: explicit/semi-implicit SEAM vs. PSTSWM (NCAR). From R. D. Loft, S. J. Thomas, J. M. Dennis: Terascale Spectral Element Dynamical Core for Atmospheric General Circulation Models. Proceedings of SC2001, Denver, USA, Nov. 2001, Figs. 9 and 10.]
Slide 5/26: Parallel programming models on hybrid platforms — overview
- Pure MPI: one MPI process on each CPU.
- Hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside each SMP node. Sub-models:
  - Masteronly: MPI only outside of parallel regions.
  - Overlapping communication + computation: MPI communication by one or a few threads while the other threads are computing; either funneled (MPI only on the master thread) or multiple (more than one thread may communicate), and in each case either with reserved thread(s) for communication or with full load balancing.
- OpenMP only: distributed virtual shared memory.
- Three comparisons follow: I. pure MPI vs. hybrid masteronly (experiment); II. masteronly vs. overlapping (theory + experiment); III. the overlapping variants among each other.

Slide 6/26: Pure MPI — one MPI process on each CPU
- Advantages: no modifications to existing MPI codes; the MPI library need not support multiple threads.
- Problem: fitting the application topology onto the hardware topology, e.g., choosing the ranks in MPI_COMM_WORLD round robin (rank 0 on node 0, rank 1 on node 1, ...) or sequentially (ranks 0-7 on the 1st node, ranks 8-15 on the 2nd). Which placement is optimal depends on which messages then cross the slow inter-node link. (Example: SMP nodes with 8 CPUs/node.)
- Disadvantage: message-passing overhead inside the SMP nodes, instead of simply accessing the data via the shared memory — the main reason for using hybrid programming models.
Slide 7/26: Hybrid masteronly — MPI only outside of parallel regions

      for (iteration = ...) {
          #pragma omp parallel
          {
              /* numerical code */
          } /* end omp parallel */
          /* on master thread only: */
          MPI_Send(/* original data to halo areas in other SMP nodes */);
          MPI_Recv(/* halo data from the neighbors */);
      } /* end for loop */

- Advantages: no message passing inside the SMP nodes; no topology problem.
- Problem: the MPI library must support MPI_THREAD_FUNNELED.
- Disadvantage: all other threads are sleeping while the master thread communicates — the reason for implementing overlapping of communication & computation.

Slide 8/26: Experiment — orthogonal parallel communication
- Pure MPI: vertical AND horizontal messages; hybrid MPI+OpenMP: only vertical messages.
- Hitachi SR8000, 8 nodes, each node with 8 CPUs, MPI_Sendrecv.
- Pure MPI: intra-node 8*8*1 MB: 2.0 ms; inter-node 8*8*1 MB: 9.6 ms; total Σ = 11.6 ms.
- Hybrid: 8*8 MB: 19.2 ms.
- The hybrid version is about 1.6x slower than pure MPI, although it transfers only half of the bytes and incurs fewer latencies due to 8x longer messages.
Slide 9/26: Results of the experiment
- Pure MPI is better for message sizes > 32 kB; for long messages, T_hybrid / T_pureMPI > 1.6.
- The OpenMP master thread cannot saturate the inter-node network bandwidth, so pure MPI is faster than hybrid MPI+OpenMP (masteronly).
[Figure: transfer time [ms] and ratio T_hybrid / T_pureMPI (inter+intra node) vs. message size, with curves for T_hybrid (size*8), T_pureMPI inter+intra, inter-node only, and intra-node only.]

Slide 10/26: Ratio T_hybrid(MPI+OpenMP) / T_pureMPI — IBM RS/6000 SP
- IBM RS/6000 SP at NERSC, 128 SMP nodes, each SMP node with 16 POWER3+ CPUs; 16*16 = 256 CPUs used.
[Figure: transfer time and ratio T_hybrid / T_pureMPI vs. message size for 256 CPUs: hybrid MPI+OpenMP vs. pure MPI with horizontal+vertical, vertical-only, and horizontal-only messages.]
- Measurements: thanks to Gerhard Wellein, RRZE, and Horst Simon, NERSC.
Slide 11/26: Ratio T_hybrid(MPI+OpenMP) / T_pureMPI — IBM RS/6000 SP (cont.)
- Same system: IBM RS/6000 SP at NERSC, 128 SMP nodes, each with 16 POWER3+ CPUs.
[Figure: ratio T_hybrid / T_pureMPI vs. message size for 32, 64, 128, 192, and 256 CPUs; pure MPI vs. hybrid MPI+OpenMP (masteronly). Measurements: thanks to Gerhard Wellein, RRZE, and Horst Simon, NERSC.]

Slide 12/26: Possible reasons
- Hardware: is one CPU able to saturate the inter-node network?
- Software: internal MPI buffering may cause additional memory traffic — memory bandwidth may be the real restricting factor.
Slide 13/26: Optimizing the hybrid masteronly model
- By the MPI library, using multiple threads:
  - multiple memory paths (e.g., for strided data),
  - multiple floating point units (e.g., for reductions),
  - multiple communication links (if one link cannot saturate the hardware inter-node bandwidth);
  - this requires knowledge about free CPUs, e.g., via a new MPI_THREAD_MASTERONLY.
- By the user application: unburden MPI from work that can be done by the application, e.g., concatenate strided data in parallel regions instead of using MPI derived datatypes.

Slide 14/26: PROs & CONs of hybrid masteronly versus pure MPI
PROs (hybrid):
- easier to fit the hybrid hardware topology;
- no communication overhead inside the SMP nodes;
- fewer MPI messages and larger message sizes, i.e., reduced latency-based overheads;
- saves memory;
- fewer MPI processes: better speedup (Amdahl's law) and faster convergence, e.g., if a multigrid numeric is computed only on a partial grid.
CONs (hybrid):
- you may lose your cache (cache flushes inside of OpenMP);
- requires an additional level of parallelism in the problem/program;
- programming complexity increases, and there is additional synchronization overhead;
- requires support by the MPI library and a mature OpenMP compiler;
- problem of saturating the inter-node network;
- idling CPUs in the masteronly scheme.
Slide 15/26: Parallel programming models on hybrid platforms — overview (repeated)
- Comparison II (theory + experiment) now looks inside the hybrid model: masteronly versus overlapping communication and computation, the latter either funneled (MPI only on the master thread) or multiple (more than one thread may communicate), each either with reserved thread(s) for communication or with full load balancing.

Slide 16/26: Overlapping computation & communication — funneled & reserved
- Implication: all other threads are computing while the master thread communicates — better CPU usage.
- The MPI library must provide at least MPI_THREAD_FUNNELED thread safety.
- The inter-node bandwidth may still be used only partially.
- Major drawback: load balancing becomes necessary; the alternative is a reserved thread for communication (next slide).
Slide 17/26: Overlapping computation & communication — funneled & reserved (cont.)
- Alternative: reserve tasks on threads — master thread: communication; all other threads: computation.
- Cons: bad load balance if T_communication / n_communication_threads ≠ T_computation / n_computation_threads.
- Pros: an easier programming scheme than full load balancing, with a chance for good performance!

Slide 18/26: Performance ratio (theory), funneled & reserved
- Performance ratio ε = T_hybrid,masteronly / T_hybrid,funneled&reserved; ε > 1 means funneled & reserved is faster, ε < 1 means masteronly is faster.
- Normalization: f_comm + f_comp,non-overlap + f_comp,overlap = 1, as fractions of T_hybrid,masteronly; n = # threads per SMP node, m = # threads reserved for MPI communication.
- Good chance of funneled & reserved: ε_max = 1 + m(1 - 1/n).
- Small risk of funneled & reserved: ε_min = 1 - m/n.
[Figure: ε as a function of f_comm [%], between these two bounds.]
Slide 19/26: Experiment — matrix-vector multiply (MVM), funneled & reserved
- Jacobi-Davidson solver on a Hitachi SR8000, 8 CPUs per SMP node; JDS (jagged diagonal storage), vectorizing; n_proc = # SMP nodes.
- Matrix dimension D_Mat scales with n_loc,k * n_proc; n_loc,k and 1/f_comm are varied; f_comp,non-overlap = f_comp,overlap.
[Figure: measured performance ratio ε of funneled & reserved over masteronly, compared with the theory curves.]
- Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models. EWOMP 2002, Rome, Italy, Sep. 2002.

Slide 20/26: MVM experiment on IBM SP
- Same experiment on IBM SP Power3 nodes with 16 CPUs per node: funneled & reserved is always faster in these experiments.
- Reason: the memory bandwidth is already saturated by 5 CPUs (see inset: speedup on one SMP node using different numbers of threads).
- Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications, Vol. 17, No. 1, 2003, Sage Science Press.
Slide 21/26: Hybrid MVM implementation
- MASTERONLY mode: automatic parallelization of the inner i loop (data parallel); single-threaded MPI calls outside the OMP PARALLEL region.
- FUNNELED mode: functional parallelism — asynchronous data transfer is simulated with OpenMP; inside one OMP PARALLEL region, the MPI thread communicates (single-threaded MPI calls) and releases finished blocks via a LOCK-protected release list, while the other threads compute on released blocks.

Slide 22/26: Parallel programming models on hybrid platforms — overview (repeated)
- Road map for Comparison III: the overlapping schemes (funneled vs. multiple, reserved threads vs. full load balancing) in the context of pure MPI, hybrid masteronly, and OpenMP only.
Slide 23/26: Comparing other methods
Memory copies from remote memory to local CPU register and vice versa:

    Access method                  | Copies                          | Remarks                       | Bandwidth b(message size)
    -------------------------------+---------------------------------+-------------------------------+---------------------------------------------
    2-sided MPI                    | 2: internal MPI buffer +        | latency hiding with buffering | b(size) = b_inf / (1 + b_inf*T_latency/size)
                                   |    application receive buffer   |                               |
    1-sided MPI                    | 1: application receive buffer   |                               | same formula, but probably better b_inf
                                   |                                 |                               | and T_latency
    Compiler based: OpenMP on DSM  | page based transfer             |                               | extremely poor, if only parts are needed
    (distributed shared memory) or | word based access               | latency hiding with pre-fetch | 8 byte / T_latency,
    with cluster extensions, UPC,  |                                 |                               | e.g., 8 byte / 0.33 µs = 24 MB/s;
    Co-Array Fortran, HPF          |                                 |                               | b_inf: see 1-sided communication

Slide 24/26: Compilation and optimization
- Library-based communication (e.g., MPI): clearly separated optimization of (1) the communication, in the MPI library, and (2) the computation, in the compiler; the compiler is essential for the success of MPI.
- Compiler-based parallelization (including the communication) follows a similar strategy: OpenMP source (Fortran / C) with optimization directives → (1) OMNI compiler → C code + calls to a communication & thread library → (2) optimizing native compiler → executable.
- Open questions: is the original language preserved? are the optimization directives preserved?
- Optimization of the computation is more important than optimization of the communication.
Slide 25/26: Comparison I & II — some conclusions
- Hybrid masteronly may not saturate the inter-node bandwidth: experiments show 1.5-1.7 times faster message transfer with pure MPI (looking at the communication part only!), but pure MPI has a hard topology problem.
- With overlapping computation & communication and one communication thread per SMP node, the performance chance is ε < 2; up to 25% total performance benefit was seen with a real matrix-vector multiply.
- Cons: overlapping is hard & tricky.

Slide 26/26: Conclusions
- Pure MPI versus the hybrid masteronly model: the topology problem of pure MPI may be hard, e.g., with unstructured grids; the communication volume is bigger with pure MPI, but it may nevertheless be faster. On the other hand, communication is typically only some percent of the runtime — relax.
- Efficient hybrid programming: one may overlap communication and computation — hard to do! Using the simple hybrid funneled & reserved model, you may be up to 25% faster than with the masteronly model.
- If you want to use pure OpenMP (based on virtual shared memory): try to still use the full bandwidth of the inter-node network (keep pressure on your compiler/DSM writer), and be sure that you do not lose any computational optimization — e.g., the best Fortran compiler & optimization directives should still be usable.
- The optimal parallel programming model on clusters of SMP nodes depends on many factors — no general recommendation.
More informationLoad Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops
Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops Nikolaos Drosinos and Nectarios Koziris National Technical University of Athens School of Electrical and Computer Engineering
More informationParallel Computing Why & How?
Parallel Computing Why & How? Xing Cai Simula Research Laboratory Dept. of Informatics, University of Oslo Winter School on Parallel Computing Geilo January 20 25, 2008 Outline 1 Motivation 2 Parallel
More informationLoad Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes
Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes Agnieszka Debudaj-Grabysz 1 and Rolf Rabenseifner 2 1 Silesian University of Technology, Gliwice, Poland; 2 High-Performance Computing
More informationScalasca performance properties The metrics tour
Scalasca performance properties The metrics tour Markus Geimer m.geimer@fz-juelich.de Scalasca analysis result Generic metrics Generic metrics Time Total CPU allocation time Execution Overhead Visits Hardware
More informationMessage Passing with MPI
Message Passing with MPI PPCES 2016 Hristo Iliev IT Center / JARA-HPC IT Center der RWTH Aachen University Agenda Motivation Part 1 Concepts Point-to-point communication Non-blocking operations Part 2
More informationAdvanced Message-Passing Interface (MPI)
Outline of the workshop 2 Advanced Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Morning: Advanced MPI Revision More on Collectives More on Point-to-Point
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor
More informationHybrid programming with MPI and OpenMP On the way to exascale
Institut du Développement et des Ressources en Informatique Scientifique www.idris.fr Hybrid programming with MPI and OpenMP On the way to exascale 1 Trends of hardware evolution Main problematic : how
More informationMPI & OpenMP Mixed Hybrid Programming
MPI & OpenMP Mixed Hybrid Programming Berk ONAT İTÜ Bilişim Enstitüsü 22 Haziran 2012 Outline Introduc/on Share & Distributed Memory Programming MPI & OpenMP Advantages/Disadvantages MPI vs. OpenMP Why
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More informationPerformance Engineering - Case study: Jacobi stencil
Performance Engineering - Case study: Jacobi stencil The basics in two dimensions (2D) Layer condition in 2D From 2D to 3D OpenMP parallelization strategies and layer condition in 3D NT stores Prof. Dr.
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More information1 Overview. 2 A Classification of Parallel Hardware. 3 Parallel Programming Languages 4 C+MPI. 5 Parallel Haskell
Table of Contents Distributed and Parallel Technology Revision Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 1 Overview 2 A Classification of Parallel
More informationHigh Performance Computing. Introduction to Parallel Computing
High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationCISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan
CISC 879 Software Support for Multicore Architectures Spring 2008 Student Presentation 6: April 8 Presenter: Pujan Kafle, Deephan Mohan Scribe: Kanik Sem The following two papers were presented: A Synchronous
More informationHybrid Model Parallel Programs
Hybrid Model Parallel Programs Charlie Peck Intermediate Parallel Programming and Cluster Computing Workshop University of Oklahoma/OSCER, August, 2010 1 Well, How Did We Get Here? Almost all of the clusters
More informationCSE5351: Parallel Processing Part III
CSE5351: Parallel Processing Part III -1- Performance Metrics and Benchmarks How should one characterize the performance of applications and systems? What are user s requirements in performance and cost?
More informationHigh Performance Computing
High Performance Computing ADVANCED SCIENTIFIC COMPUTING Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research Group Leader, Juelich
More informationCopyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 18. Combining MPI and OpenMP
Chapter 18 Combining MPI and OpenMP Outline Advantages of using both MPI and OpenMP Case Study: Conjugate gradient method Case Study: Jacobi method C+MPI vs. C+MPI+OpenMP Interconnection Network P P P
More informationNUMA-aware OpenMP Programming
NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationMULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA
MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network
More informationCMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)
CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI
More informationMPI+OpenMP hybrid computing
MPI+OpenMP hybrid computing (on modern multicore systems) Georg Hager Gerhard Wellein Holger Stengel Jan Treibig Markus Wittmann Michael Meier Erlangen Regional Computing Center (RRZE), Germany 39th Speedup
More informationPerformance and Software-Engineering Considerations for Massively Parallel Simulations
Performance and Software-Engineering Considerations for Massively Parallel Simulations Ulrich Rüde (ruede@cs.fau.de) Ben Bergen, Frank Hülsemann, Christoph Freundl Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de
More informationClusters of SMP s. Sean Peisert
Clusters of SMP s Sean Peisert What s Being Discussed Today SMP s Cluters of SMP s Programming Models/Languages Relevance to Commodity Computing Relevance to Supercomputing SMP s Symmetric Multiprocessors
More informationImproving the interoperability between MPI and OmpSs-2
Improving the interoperability between MPI and OmpSs-2 Vicenç Beltran Querol vbeltran@bsc.es 19/04/2018 INTERTWinE Exascale Application Workshop, Edinburgh Why hybrid MPI+OmpSs-2 programming? Gauss-Seidel
More informationA brief introduction to OpenMP
A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism
More informationHPC Fall 2010 Final Project 3 2D Steady-State Heat Distribution with MPI
HPC Fall 2010 Final Project 3 2D Steady-State Heat Distribution with MPI Robert van Engelen Due date: December 10, 2010 1 Introduction 1.1 HPC Account Setup and Login Procedure Same as in Project 1. 1.2
More informationPerformance Tuning and OpenMP
Performance Tuning and OpenMP mueller@hlrs.de University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Höchstleistungsrechenzentrum Stuttgart Outline Motivation Performance
More informationThe Icosahedral Nonhydrostatic (ICON) Model
The Icosahedral Nonhydrostatic (ICON) Model Scalability on Massively Parallel Computer Architectures Florian Prill, DWD + the ICON team 15th ECMWF Workshop on HPC in Meteorology October 2, 2012 ICON =
More informationHigh-Performance Computing-Center. EuroPVM/MPI 2004, Budapest, Sep , q 2
olf abenseifner (HLS), rabenseifner@hlrsde Jesper Larsson Träff (EC) traff@ccrl-necede University of Stuttgart, High-Performance Computing-Center Stuttgart (HLS), wwwhlrsde C&C esearch Laboratories EC
More informationIntroduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014
Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational
More informationAdaptive Transpose Algorithms for Distributed Multicore Processors
Adaptive Transpose Algorithms for Distributed Multicore Processors John C. Bowman and Malcolm Roberts University of Alberta and Université de Strasbourg April 15, 2016 www.math.ualberta.ca/ bowman/talks
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationParallel Computing and the MPI environment
Parallel Computing and the MPI environment Claudio Chiaruttini Dipartimento di Matematica e Informatica Centro Interdipartimentale per le Scienze Computazionali (CISC) Università di Trieste http://www.dmi.units.it/~chiarutt/didattica/parallela
More informationHPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser
HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationShared Memory Parallelism - OpenMP
Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openmp/#introduction) OpenMP sc99 tutorial
More information