Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen (Communication and Optimization Aspects of Parallel Programming Models on Hybrid HPC Platforms)


Slide 1 / 26: Title page
Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen
Rolf Rabenseifner, Universität Stuttgart, Höchstleistungsrechenzentrum Stuttgart (HLRS)
Gerhard Wellein, Friedrich-Alexander-Universität Erlangen-Nürnberg, Regionales Rechenzentrum Erlangen (RRZE)
ZKI, AK Supercomputing, RWTH Aachen, April 2003

Slide 2 / 26: Motivation
- HPC systems are often clusters of SMP nodes, i.e., hybrid architectures: the CPUs of a node share its memory, and the nodes are coupled by a node interconnect.
- Goals: use the communication bandwidth of the hardware optimally and minimize synchronization, i.e., idle time of the hardware.
- Topic of this talk: appropriate parallel programming models and their pros & cons.

Slide 3 / 26: Major programming models on hybrid systems
- Pure MPI: one MPI process on each CPU; a sequential program per CPU with local data in each process and explicit message passing by calling MPI_Send & MPI_Recv.
- Hybrid MPI+OpenMP: OpenMP (shared memory) inside the SMP nodes, MPI (distributed memory) between the nodes via the node interconnect. OpenMP pattern: some_serial_code; #pragma omp parallel for over block_to_be_parallelized; again_some_serial_code — outside parallel regions the master thread runs and the other threads are sleeping.
- Other: virtual shared memory systems, HPF, ...
- Observation: hybrid programming (MPI+OpenMP) is often slower than pure MPI — why?
- (A minimal code sketch contrasting the two models is given below, after slide 4.)

Slide 4 / 26: Example from SC 2001 — Pure MPI versus hybrid MPI+OpenMP
- What is better? It depends!
- [Figures: integration rate (years per day) vs. number of processors — (a) explicit SEAM, NPACI results with 7 or 8 processes or threads per node; (b) explicit/semi-implicit SEAM vs. PSTSWM, NCAR.]
- Figures from: Richard D. Loft, Stephen J. Thomas, John M. Dennis: Terascale Spectral Element Dynamical Core for Atmospheric General Circulation Models. Proceedings of SC2001, Denver, USA, Nov. 2001, Figs. 9 and 10.
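
To make the contrast concrete, here is a minimal, hedged sketch (not from the slides) of the same trivial global sum in both styles: under pure MPI it is started as one single-threaded process per CPU; under hybrid MPI+OpenMP it is started as one process per SMP node, and the node-local loop is spread over OpenMP threads. All names and sizes are illustrative.

/* Minimal sketch (not from the slides): pure MPI vs. hybrid MPI+OpenMP on a
 * trivial global sum.  "local_n" and the work in the loop are placeholders. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);          /* pure MPI: one process per CPU        */
                                     /* hybrid:   one process per SMP node   */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int local_n = 1000000;     /* local share of the data              */
    double local_sum = 0.0;

    /* Hybrid model: the node-local work is spread over OpenMP threads.
     * Pure MPI model: this loop simply runs single-threaded in each process. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int j = 0; j < local_n; j++)
        local_sum += 1.0;            /* stands in for real numerical work    */

    double global_sum;               /* message passing between processes    */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}

Compiled with, e.g., mpicc -fopenmp, the same source behaves as the pure MPI model when launched with one process per CPU, and as the hybrid model when launched with one process per node and OMP_NUM_THREADS set to the node's CPU count.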

Slide 5 / 26: Parallel programming models on hybrid platforms
- Pure MPI: one MPI process on each CPU.
- Hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside each SMP node. Two main flavors:
  - masteronly: MPI only outside of parallel regions;
  - overlapping communication + computation: MPI communication by one or a few threads while the other threads are computing; either funneled (MPI only on the master thread) or multiple (more than one thread may communicate), each with reserved thread(s) for communication or with full load balancing.
- OpenMP only: distributed virtual shared memory.
- Comparison I (experiment): pure MPI vs. hybrid masteronly.
- Comparison II (theory + experiment): masteronly vs. overlapping communication + computation.
- Comparison III: funneled vs. multiple, reserved thread(s) vs. full load balancing.

Slide 6 / 26: Pure MPI — one MPI process on each CPU
- Example: SMP nodes with 8 CPUs per node and a (slow) inter-node link; ranks may be placed round-robin, sequentially, or in some optimal way.
- Advantages: no modifications of existing MPI codes; the MPI library need not support multiple threads.
- Problem: fitting the application topology onto the hardware topology, e.g., choosing ranks in MPI_COMM_WORLD round-robin (rank 0 on node 0, rank 1 on node 1, ...) or sequentially (ranks 0-7 on the 1st node, ranks 8-15 on the 2nd node).
- Disadvantage: message-passing overhead inside the SMP nodes, instead of simply accessing the data via the shared memory — the main reason for using hybrid programming models.
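
As a hedged illustration of the placement issue on slide 6, the sketch below only computes which node a rank would land on under the two placements, assuming 8 CPUs per node; in practice the mapping is chosen by the batch system or MPI launcher, not by the application, and all variable names here are purely illustrative.

/* Illustrative only: where does a rank land under sequential vs. round-robin
 * placement, assuming 8 CPUs per SMP node?                                   */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int cpus_per_node = 8;                 /* assumption from slide 6   */
    int num_nodes = size / cpus_per_node;
    if (num_nodes == 0) num_nodes = 1;           /* guard for small test runs */

    /* Sequential placement: ranks 0-7 on node 0, ranks 8-15 on node 1, ...   */
    int node_sequential = rank / cpus_per_node;

    /* Round-robin placement: rank 0 on node 0, rank 1 on node 1, ...         */
    int node_roundrobin = rank % num_nodes;

    /* Whether a neighbour exchange (rank <-> rank+1) stays inside a node or
     * crosses the slow inter-node link depends entirely on this mapping.     */
    printf("rank %d: sequential -> node %d, round-robin -> node %d\n",
           rank, node_sequential, node_roundrobin);

    MPI_Finalize();
    return 0;
}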

Slide 7 / 26: Hybrid masteronly — MPI only outside of parallel regions
- Skeleton of the scheme:
  for (iteration = ...; ...; ...) {
      #pragma omp parallel
      {
          /* numerical code */
      } /* end omp parallel */
      /* on master thread only: */
      MPI_Send(...);   /* original data to halo areas in other SMP nodes */
      MPI_Recv(...);   /* halo data from the neighbors */
  } /* end for loop */
- Advantages: no message passing inside the SMP nodes; no topology problem.
- Problem: the MPI library must support MPI_THREAD_FUNNELED.
- Disadvantage: all other threads are sleeping while the master thread communicates — the reason for implementing overlapping of communication & computation.
- (A compilable sketch of this scheme is given below, after slide 8.)

Slide 8 / 26: Experiment — orthogonal parallel communication
- Hybrid MPI+OpenMP sends only vertical messages; pure MPI sends vertical AND horizontal messages.
- Platform: Hitachi SR8000, 8 nodes, each node with 8 CPUs; transfers measured with MPI_Sendrecv (intra-node plus inter-node message sets for pure MPI, inter-node only for hybrid).
- Result: the hybrid transfer was about 1.6x slower than pure MPI (inter-node + intra-node combined), although the hybrid scheme transfers only half of the bytes and incurs fewer latencies because its messages are 8x longer.
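
Returning to the masteronly scheme of slide 7, a hedged, compilable sketch for a 1-D halo exchange might look as follows; the array size, iteration count, and neighbour logic are illustrative and not taken from the slides. Note the request for MPI_THREAD_FUNNELED, the thread-support level the slide says the MPI library must provide.

/* Hedged sketch of the masteronly scheme for a 1-D halo exchange.            */
#include <mpi.h>

#define N 1024                       /* local points per process (example)    */

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* masteronly needs at least MPI_THREAD_FUNNELED from the MPI library     */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[N + 2] = {0.0};         /* u[0] and u[N+1] are halo cells         */
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int iter = 0; iter < 100; iter++) {
        /* computation: all threads of the node work inside the parallel region */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            u[i] = u[i] + 1.0;       /* stands in for the numerical kernel     */

        /* communication: outside the parallel region, master thread only      */
        MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  1,
                     &u[N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}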

Slide 9 / 26: Results of the experiment
- Pure MPI is faster above a crossover message size; for long messages T_hybrid / T_pureMPI > 1.6.
- Reason: the OpenMP master thread cannot saturate the inter-node network bandwidth.
- Below the crossover, the hybrid MPI+OpenMP (masteronly) transfer is faster.
- [Figure: transfer time and ratio T_hybrid / T_pureMPI vs. message size, with curves for T_hybrid (message size x8), T_pureMPI (inter+intra), T_pureMPI (inter-node only), and T_pureMPI (intra-node only).]

Slide 10 / 26: Ratio T_hybrid (MPI+OpenMP, masteronly) / T_pureMPI
- Platform: IBM RS/6000 SP at NERSC, SMP nodes with 16 POWER3+ CPUs each; 256 CPUs used.
- [Figure: ratio T_hybrid / T_pureMPI vs. message size for 256 CPUs — hybrid MPI+OpenMP vs. pure MPI (horizontal+vertical, vertical only, horizontal only).]
- Measurements: thanks to Gerhard Wellein, RRZE, and Horst Simon, NERSC.

Slide 11 / 26: Ratio T_hybrid (MPI+OpenMP, masteronly) / T_pureMPI — additional measurements at NERSC
- Platform: IBM RS/6000 SP at NERSC, SMP nodes with 16 POWER3+ CPUs each.
- [Figure: ratio T_hybrid / T_pureMPI vs. message size for several CPU counts from 32 up to 256 CPUs.]
- Measurements: thanks to Gerhard Wellein, RRZE, and Horst Simon, NERSC.

Slide 12 / 26: Possible reasons
- Hardware: is one CPU able to saturate the inter-node network?
- Software: internal MPI buffering may cause additional memory traffic — memory bandwidth may be the real restricting factor.

Slide 13 / 26: Optimizing the hybrid masteronly model
- By the MPI library, using multiple threads internally:
  - multiple memory paths (e.g., for strided data),
  - multiple floating-point units (e.g., for reductions),
  - multiple communication links (if one link cannot saturate the hardware inter-node bandwidth);
  - this requires knowledge about free CPUs, e.g., via a new MPI_THREAD_MASTERONLY level.
- By the user application: unburden MPI from work that can be done by the application itself, e.g., concatenate strided data in parallel regions instead of using MPI derived datatypes (a sketch is given below, after slide 14).

Slide 14 / 26: Pros & cons of hybrid MPI+OpenMP
- Pros:
  - easier to fit the application onto the hybrid hardware topology;
  - no communication overhead inside the SMP nodes;
  - fewer MPI messages and larger message sizes — reduced latency-based overheads;
  - saves memory (fewer MPI processes);
  - better speedup (Amdahl's law);
  - faster convergence, e.g., if a multigrid scheme is computed only on a partial grid.
- Cons:
  - you may lose your cache (cache flushes inside of OpenMP);
  - requires an additional level of parallelism in the problem/program;
  - programming complexity increases, and there is additional synchronization overhead;
  - requires support by the MPI library and a mature OpenMP compiler;
  - problem of saturating the inter-node network;
  - idling CPUs in the masteronly scheme.
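
A hedged sketch of the "unburden MPI" idea from slide 13: a strided column of a 2-D array is packed into a contiguous send buffer inside an OpenMP parallel region, so MPI only has to transfer contiguous data instead of a derived (vector) datatype. Array sizes, the message tag, and the function name are illustrative.

/* Illustrative only: pack a strided column with all threads of the node,
 * then hand MPI a contiguous buffer.                                         */
#include <mpi.h>

#define NX 1024
#define NY 1024

static double a[NX][NY];             /* a[i][j]; column j has stride NY       */
static double sendbuf[NX];

void send_last_column(int dest, MPI_Comm comm)
{
    /* packing done in parallel by the threads of the SMP node ...            */
    #pragma omp parallel for
    for (int i = 0; i < NX; i++)
        sendbuf[i] = a[i][NY - 1];

    /* ... so that MPI only transfers a contiguous buffer                     */
    MPI_Send(sendbuf, NX, MPI_DOUBLE, dest, 99, comm);
}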

Slide 15 / 26: Parallel programming models on hybrid platforms (overview, repeated)
- Comparison II (theory + experiment): hybrid masteronly vs. overlapping communication + computation.
- Overlapping variants: funneled (MPI only on the master thread) vs. multiple (more than one thread may communicate); each either with reserved thread(s) for communication or with full load balancing.

Slide 16 / 26: Overlapping computation & communication
- Idea: MPI communication by one or a few threads while the other threads are computing (hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside each SMP node).
- Implications:
  - all other threads are computing while the master thread communicates — better CPU usage;
  - the MPI library must provide at least MPI_THREAD_FUNNELED thread safety;
  - the inter-node bandwidth may still be used only partially.
- Major drawback: load balancing becomes necessary; alternative: a reserved thread for communication (next slide).

Slide 17 / 26: Overlapping computation & communication — funneled & reserved
- Alternative to full load balancing: a reserved thread for communication, i.e., the tasks on the threads are: master thread — communication; all other threads — computation.
- Con: bad load balance if T_communication / n_communication_threads does not match T_computation / n_computation_threads.
- Pros: much easier programming scheme than full load balancing, and a chance for good performance.

Slide 18 / 26: Performance ratio (theory), funneled & reserved
- Performance ratio ε of the funneled & reserved scheme relative to masteronly: ε = T_hybrid,masteronly / T_hybrid,funneled&reserved; ε > 1 means funneled & reserved is faster, ε < 1 means masteronly is faster.
- Good chance of funneled & reserved: ε_max = 1 + m (1 - 1/n).
- Small risk of funneled & reserved: ε_min = 1 - m/n.
- Model: T_hybrid,masteronly = (f_comm + f_comp,non-overlap + f_comp,overlap) * T_hybrid,masteronly, i.e., the communication and computation fractions are defined relative to the masteronly time and sum to 1.
- n = number of threads per SMP node, m = number of threads reserved for MPI communication.
- [Figure: ε as a function of the communication fraction f_comm.]
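
As a small numerical illustration of the two bounds on slide 18 (using the formulas as given there), the snippet below evaluates ε_min and ε_max for n = 8 threads per node and m = 1 reserved communication thread; the values are chosen only as an example.

/* Plugging example numbers into the two bounds from slide 18.                */
#include <stdio.h>

int main(void)
{
    int n = 8;                                   /* threads per SMP node      */
    int m = 1;                                   /* reserved comm. threads    */

    double eps_max = 1.0 + m * (1.0 - 1.0 / n);  /* best case: communication  */
                                                 /* fully hidden              */
    double eps_min = 1.0 - (double)m / n;        /* worst case: communication */
                                                 /* negligible, m threads lost */
    printf("n=%d, m=%d: eps_min=%.3f, eps_max=%.3f\n", n, m, eps_min, eps_max);
    return 0;                                    /* prints 0.875 and 1.875    */
}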

Slide 19 / 26: Experiment — matrix-vector multiply (MVM), funneled & reserved
- [Figure: performance ratio ε from the experiments together with the theoretical curves, funneled & reserved vs. masteronly.]
- Setup: sparse MVM kernel of a Jacobi-Davidson solver on a Hitachi SR8000 (8 CPUs per SMP node); JDS (jagged diagonal storage) format, vectorizing; n_proc = number of SMP nodes; the matrix dimension scales with n_loc,k * n_proc, and varying n_loc,k varies 1/f_comm; f_comp,non-overlap = f_comp,overlap.
- Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models. EWOMP 2002, Rome, Italy, September 2002.

Slide 20 / 26: Same experiment on the IBM SP
- Platform: IBM SP with POWER3 nodes, 16 CPUs per node.
- funneled & reserved is always faster in these experiments. Reason: the memory bandwidth of a node is already saturated by 5 CPUs (see inset).
- Inset: speedup on one SMP node using different numbers of threads.
- Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications, Vol. 17, No. 1, 2003, Sage Science Press.

Slide 21 / 26: Hybrid MVM implementation
- MASTERONLY mode: automatic parallelization of the inner i loop (data parallel); single-threaded MPI calls.
- FUNNELED mode: functional parallelism — one OpenMP thread simulates asynchronous data transfer and releases finished blocks via a lock-protected release list; single-threaded MPI calls. (A simplified sketch of this pattern is given below, after slide 22.)
- [Diagram: time lines of the MPI calls and the OpenMP threads for the MASTERONLY mode (MPI outside OMP PARALLEL) and the FUNNELED mode (MPI thread running concurrently with the compute threads inside OMP PARALLEL, synchronized via "LOCK: release list").]

Slide 22 / 26: Parallel programming models on hybrid platforms (overview, repeated)
- Comparison III, within the overlapping schemes: funneled vs. multiple communicating threads, each either with reserved thread(s) for communication or with full load balancing.
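
A hedged, simplified sketch of the FUNNELED idea from slide 21 (this is not the authors' JDS matrix-vector code): inside one parallel region, thread 0 performs the halo exchange while the remaining threads update interior points; the loop bounds are split by hand because an OpenMP worksharing loop may not be entered by only part of the team. Names, sizes, and the kernel are illustrative.

/* Illustrative FUNNELED pattern: thread 0 communicates, the others compute.  */
#include <mpi.h>
#include <omp.h>

#define N 1024

void funneled_iteration(double u[N + 2], int left, int right, MPI_Comm comm)
{
    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        if (tid == 0) {
            /* reserved communication thread: exchange halo cells             */
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                         &u[0], 1, MPI_DOUBLE, left,  0,
                         comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  1,
                         &u[N + 1], 1, MPI_DOUBLE, right, 1,
                         comm, MPI_STATUS_IGNORE);
        } else {
            /* compute threads: split interior points i = 2 .. N-1 by hand    */
            int nwork = nthreads - 1;
            int lo = 2 + (N - 2) * (tid - 1) / nwork;
            int hi = 2 + (N - 2) * tid / nwork;
            for (int i = lo; i < hi; i++)
                u[i] = u[i] + 1.0;           /* stands in for the real kernel */
        }
        #pragma omp barrier                  /* halo cells now up to date     */
        /* u[1] and u[N], which need halo data, would be updated here         */
    }
}

With MPI_THREAD_FUNNELED this is legal because only the master thread (thread 0 of the outermost team, i.e., the thread that initialized MPI) makes MPI calls; the barrier marks the point after which the halo-dependent points may be computed.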

Slide 23 / 26: Comparing other methods (hybrid MPI+OpenMP vs. OpenMP only)
- Compared: memory copies from remote memory to a local CPU register and vice versa. (A small bandwidth-model example is given below, after slide 24.)
- 2-sided MPI: 2 copies (internal MPI buffer + application receive buffer); latency hiding with buffering; bandwidth b(size) = b_max / (1 + b_max * T_latency / size).
- 1-sided MPI: 1 copy (application receive buffer); same formula, but probably better b_max and T_latency.
- Compiler based (OpenMP on DSM — distributed shared memory — or with cluster extensions, UPC, Co-Array Fortran, HPF):
  - page-based transfer: extremely poor if only parts of a page are needed;
  - word-based access: latency hiding with pre-fetch; bandwidth about 8 byte / T_latency, e.g., 8 byte / 0.33 us = 24 MB/s;
  - b_max: see 1-sided communication.

Slide 24 / 26: Compilation and optimization
- Library-based communication (e.g., MPI): the optimization of (1) the communication (MPI library) and (2) the computation (compiler) is clearly separated — the compiler is essential for the success of MPI.
- Compiler-based parallelization (including the communication) should follow a similar strategy: OpenMP source (Fortran / C) with optimization directives -> (1) OMNI compiler -> C code + calls to a communication & thread library -> (2) optimizing native compiler -> executable.
- Open questions: is the original language preserved? are the optimization directives preserved?
- Optimization of the computation is more important than optimization of the communication.
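
The bandwidth model quoted on slide 23 can be evaluated directly; the little program below does so for made-up values of the asymptotic bandwidth b_max and the latency T_latency, just to show how the effective bandwidth approaches b_max for large messages.

/* Small illustration of the model  b(size) = b_max / (1 + b_max*T_lat/size),
 * which is the same as  effective bandwidth = size / (T_lat + size/b_max).
 * The parameter values are made up for the example.                          */
#include <stdio.h>

static double b_eff(double size_bytes, double b_max, double t_latency)
{
    return b_max / (1.0 + b_max * t_latency / size_bytes);
}

int main(void)
{
    double b_max = 500e6;            /* asymptotic bandwidth: 500 MB/s (example) */
    double t_lat = 10e-6;            /* latency: 10 microseconds (example)       */

    for (double size = 1e3; size <= 1e7; size *= 10.0)
        printf("size = %8.0f B  ->  b_eff = %6.1f MB/s\n",
               size, b_eff(size, b_max, t_lat) / 1e6);
    return 0;
}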

Slide 25 / 26: Comparison I & II — some conclusions
- Comparison I (experiment), pure MPI vs. hybrid masteronly:
  - hybrid masteronly may not saturate the inter-node bandwidth; in the experiments the message transfer (communication part only!) was 1.5 to 1.7 times faster with pure MPI;
  - but pure MPI has the hard topology problem.
- Comparison II (theory + experiment), masteronly vs. overlapping communication & computation (funneled & reserved, one communication thread per SMP node):
  - performance chance ε < 2;
  - up to 5% total performance benefit with the real matrix-vector multiply;
  - con: overlapping is hard & tricky.

Slide 26 / 26: Conclusions
- Pure MPI versus the hybrid masteronly model:
  - the topology problem with pure MPI may be hard, e.g., with unstructured grids;
  - the communication volume is bigger with pure MPI, but it may nevertheless be faster;
  - on the other hand, communication is typically only some percent of the total time — relax.
- Efficient hybrid programming: one may overlap communication and computation, but this is hard to do; with the simple hybrid funneled & reserved model you may be up to 5% faster than with the masteronly model.
- If you want to use pure OpenMP (based on virtual shared memory):
  - try to still use the full bandwidth of the inter-node network (keep pressure on your compiler / DSM writer);
  - be sure that you do not lose any computational optimization, e.g., the best Fortran compiler & optimization directives should still be usable.
- The optimal parallel programming model on clusters of SMP nodes depends on many factors — there is no general recommendation.
