Research on Programming Models to foster Programmer Productivity
Slide 1: Research on Programming Models to foster Programmer Productivity. Christian Terboven, April 5th, 2017
Slides 2-4: Where is Aachen?
Slide 5: Agenda
- Our Research Activities
- Some Thoughts on Productivity
- Example 1: Thread Affinity
- Example 2: Transactional Memory
- Correctness Checking
- Summary
Slide 6: Research Activities in HPC
- Focus on Efficient Parallel Programming for HPC
- Topics:
  → Parallel Programming Paradigms (OpenMP and others)
    → Affinity, tasking, nesting, NUMA, object-oriented parallel programming
    → Member of the OpenMP Language Committee and ARB
  → Correctness Checking (MPI, MPI+OpenMP and other paradigms)
  → Total Cost of Ownership (Energy Efficiency, Programmability, Performance)
  → Analysis of parallel architectures
    → Member of SPEC
    → Large shared-memory machines
    → Programming for accelerators (GPUs, Intel MIC, prototype architectures)
Slide 7: Some Thoughts on Productivity
Slide 8: Case Study: KegelSpan
- 3D simulation of the bevel gear cutting process [1]
- Computes key values (i.a. chip thickness) to analyze tool load and tool wear
- Fortran code (chip thickness computation): loop nest with dependencies in the inner loop (minimum computation)
- Implementation variants (basis: serial Fortran code; source: BMW, ZF, Klingelnberg):
  → OpenMP-simp: straightforward OpenMP parallelization (no code tuning), data affinity
  → OpenMP-vec: restructuring for a good data access pattern (SoA), vectorization, alignment to vector registers, loop interchanges, inlining, data affinity
  → OpenMP+LEO: OpenMP-vec (adapted to KNC), LEO directives for offloading kernels
  → OpenACC: restructuring for a good data access pattern (SoA), coalescing
  → OpenCL: restructuring for a good data access pattern (SoA), coalescing, shared memory
[1] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, VDI-Berichte, Düsseldorf, VDI Verlag.
Slide 9: KegelSpan Effort & Performance
[Table: modified lines of code, total runtime [s], and effort [days] per variant (OpenMP-simp, OpenMP-vec, OpenMP+LEO, OpenACC, OpenCL); the individual values are not recoverable from the transcription]
Test systems:
- Host: Intel Sandy Bridge 16-core node (2x Intel), Scientific Linux 6.3, Intel compiler
- GPU: NVIDIA Tesla C2050 (ECC on), CUDA Toolkit 5.0/4.1, host Intel Westmere 4-core (1x Intel), Scientific Linux 6.3; OpenCL (GPU): Intel, OpenACC (GPU): Intel / PGI 12.9
Slide 10: What is Productivity?
- Productivity = output / cost = #app. runs / TCO
  → View on productivity may differ between scientist and HPC provider
  → TCO: topic of active research
- We believe: abstractions can foster programmer productivity
  → Several studies showed that using pragmas (i.e. OpenMP) is more productive than using lower-level APIs (i.e. POSIX Threads)
  → less programming effort
  → easier to learn and grasp important concepts
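A small worked example of this definition, with purely hypothetical numbers: if the TCO of a system over its lifetime is 1,000,000 EUR and an application can be run 50,000 times in that period, productivity is 0.05 runs per euro; an optimization that halves the runtime per run while leaving the TCO unchanged doubles this to 0.1 runs per euro.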
Slide 11: Example: Thread Affinity
Slide 12: Motivation
- 2004: Sun Fire E25k server
  → 144 Sun UltraSPARC IV cores
  → max. memory bandwidth: ca. 170 GB/s
- 2012: Bull BCS system (in our cluster)
  → 128 Intel Nehalem-EX cores
  → max. memory bandwidth: ca. 230 GB/s
  → 6 to 10 systems per rack possible
- 2016: Intel Broadwell with Cluster-on-Die
- The memory hierarchy becomes more and more complex: at least two NUMA levels. This is a challenge to program for!
Slide 13: The OpenMP Places Concept
- Specification of thread affinity has to happen within the machine abstraction
- Consider the following system: c0 c1 c2 c3 c4 c5 c6 c7
  → 2 sockets, 4 cores per socket, 4 hyper-threads per core
- Place: set of execution units
- Place list: (ordered) list of places
- The OpenMP place list is defined by the OMP_PLACES environment variable:
  → specification of an explicit place list, or
  → specification of an abstract name, such as:
    → threads: one place per hyper-thread
    → cores: one place per core (contains multiple hyper-threads)
    → sockets: one place per socket (contains multiple cores)
- Reduction of the complex architecture to the relevant performance-critical properties
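To illustrate how a place list is consumed by the runtime, here is a minimal C sketch (assuming an OpenMP 4.5 compiler; the OMP_PLACES value and program name in the comment are examples, not taken from the slides) that queries the place list and reports where each thread runs:

    #include <stdio.h>
    #include <omp.h>

    /* Example run (hypothetical):
     *   OMP_PLACES=cores OMP_NUM_THREADS=8 ./places_demo
     * With the abstract name "cores", each place corresponds to one core
     * with all of its hyper-threads. */
    int main(void)
    {
        printf("Number of places: %d\n", omp_get_num_places());

        #pragma omp parallel
        {
            int tid   = omp_get_thread_num();
            int place = omp_get_place_num();            /* place this thread is bound to */
            int nproc = omp_get_place_num_procs(place); /* execution units in that place */
            printf("Thread %d runs on place %d (%d proc ids)\n", tid, place, nproc);
        }
        return 0;
    }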
Slide 14: Illustration of Thread Affinity
- Selection of an application-specific strategy:
  → spread: separation of threads across the place list
  → close: placement of threads closely together
  → master: co-location of threads on a single place
- Example (nested parallelism): separation in the outer region, nearness in the inner region:
  OMP_PLACES="{0,1,2,3},{4,5,6,7},..." = "{0:4}:8:4" = cores
  places: p0 p1 p2 p3 p4 p5 p6 p7
  #pragma omp parallel proc_bind(spread) num_threads(4)
  #pragma omp parallel proc_bind(close) num_threads(4)
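A minimal sketch of this nested spread/close pattern (assuming an OpenMP 4.0+ compiler; the environment settings in the comment are an example matching the 2-socket, 4-core, 4-hyper-thread system above):

    #include <stdio.h>
    #include <omp.h>

    /* Example environment (hypothetical):
     *   OMP_PLACES=cores OMP_NESTED=true ./affinity_demo
     * Outer team threads are spread across the place list,
     * inner team threads stay close to their outer thread. */
    int main(void)
    {
        #pragma omp parallel proc_bind(spread) num_threads(4)
        {
            int outer = omp_get_thread_num();

            #pragma omp parallel proc_bind(close) num_threads(4)
            {
                int inner = omp_get_thread_num();
                printf("outer %d / inner %d on place %d\n",
                       outer, inner, omp_get_place_num());
            }
        }
        return 0;
    }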
Slide 15: Analysis for a SpMV Kernel
- Absolute performance: ca. 38 GFlops (roofline model: 39.4 GFlops as upper limit)
- Application of the OpenMP thread affinity model
- NUMA-specific memory management with a C++ allocator
- Integration via the expression template mechanism and the adapter pattern
- Exploitation of the matrix structure for load balancing and data placement
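The slide relies on a C++ allocator for NUMA-aware placement; the underlying first-touch idea can be sketched in plain C as follows (a hypothetical helper, not the allocator from the slide):

    #include <stdlib.h>

    /* First-touch initialization: each thread touches the part of the vector it
     * will later work on, so the pages end up on that thread's NUMA node.
     * Sketch of the placement idea only, not the C++ allocator from the slide. */
    double *alloc_numa_aware(size_t n)
    {
        double *x = malloc(n * sizeof(double));
        if (!x) return NULL;

        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; ++i)
            x[i] = 0.0;   /* page is mapped on the touching thread's node */

        return x;
    }

Used together with the same static schedule in the SpMV loop, data and computation then reside on the same NUMA node.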
Slide 16: Example: Transactional Memory
Slide 17: Motivation
- Processor-level hardware support for speculative lock elision has been introduced by IBM, Intel and others
  → potential for significant performance improvement if used in the right way
  → danger of tremendous penalties if used inappropriately
- No standardized way in OpenMP to select a lock implementation
  → vendor-specific approaches are neither portable nor satisfying
  → a global setting, such as an environment variable, is not sensible
- This work proposes an extended OpenMP locking API and an extension of the critical construct
  → to support the selection of lock implementations on a per-lock basis
  → to offer backwards compatibility for existing application codes
Slide 18: Extended Locking API (1/2)
- Fundamental requirement: do not break any existing code
  → new functionality is introduced as hints
- Three options were considered:
  → pragmas to prefix existing lock routines with the desired hint
  → a complete set of new locking routines and lock types
  → new lock initialization routines to use with the existing lock API
    → minimal code modification, allows for incremental code adoption
- OpenMP lock review:
  → variable of type omp_lock_t or omp_nest_lock_t
  → must be initialized before first use with omp_init[_nest]_lock()
  → routines to initialize, set, unset, and test a lock, and finally to destroy it
Slide 19: Extended Locking API (2/2)
- Two new lock init functions provide hints to the runtime system:
  → void omp_init[_nest]_lock_hinted(omp[_nest]_lock_t*, omp_lock_hint)
- The omp_lock_hint type lists high-level optimization criteria:
  → omp_lock_hint_none
  → omp_lock_hint_uncontended: optimize for an uncontended lock
  → omp_lock_hint_contended: optimize for a contended lock
  → omp_lock_hint_nonspeculative: do not use hardware speculation
  → omp_lock_hint_speculative: do use hardware speculation
  → omp_lock_hint_adaptive: adaptively use hardware speculation
  → plus room for vendor-specific extensions
- Similarly: extended critical construct
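A sketch of how the proposed per-lock hint would be used, following the routine names as given on the slide (OpenMP 4.5 later standardized this functionality as omp_init_lock_with_hint; the counter example itself is hypothetical):

    #include <omp.h>

    static omp_lock_t counter_lock;
    static long counter = 0;

    void counter_init(void)
    {
        /* Ask the runtime for a speculative (lock-elision-capable) implementation. */
        omp_init_lock_hinted(&counter_lock, omp_lock_hint_speculative);
    }

    void counter_increment(void)
    {
        omp_set_lock(&counter_lock);   /* existing set/unset API is unchanged */
        counter++;
        omp_unset_lock(&counter_lock);
    }

    void counter_destroy(void)
    {
        omp_destroy_lock(&counter_lock);
    }

Only the initialization call changes; all other lock routines and existing code remain valid, which is the backwards compatibility argued for on the previous slide.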
Slide 20: Evaluation with NPB UA
- Naive use of HLE locks is not successful
- The more threads are used, the more profitable is the clever use of HLE locks
Test system: Intel Xeon E5-2697v3 (code name Haswell), 2.6 GHz (no turbo), single socket, Red Hat* Enterprise Linux* 7.0 (kernel 123-el7.x86_64), Intel Composer XE for C/C++ (SP) with -O3 optimization. * Other names and brands may be the property of others.
Slide 21: Correctness Checking
Slide 22: How many errors can you spot in this tiny example?

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char** argv)
    {
        int rank, size, buf[8];
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);

        MPI_Datatype type;
        MPI_Type_contiguous (2, MPI_INTEGER, &type);

        MPI_Recv (buf, 2, MPI_INT, size - rank, 123, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send (buf, 2, type, size - rank, 123, MPI_COMM_WORLD);

        printf ("Hello, I am rank %d of %d.\n", rank, size);
        return 0;
    }

At least 8 issues in this code example.
Slide 23: How many errors can you spot in this tiny example?
The same code, annotated with the issues:
- No MPI_Init before the first MPI call
- Fortran type (MPI_INTEGER) used in C
- Recv-recv deadlock (both ranks receive first)
- Rank 0: src = size (out of range)
- Type not committed before use
- Type not freed before the end of main
- Send 4 ints, recv 2 ints: truncation
- No MPI_Finalize before the end of main
MUST detects these issues and pinpoints you to the source.
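For reference, one possible corrected version (a sketch: the recv-recv deadlock is resolved here with MPI_Sendrecv and the out-of-range rank by pairing rank r with rank size-1-r; other fixes are equally valid):

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char** argv)
    {
        int rank, size, buf[8] = {0};

        MPI_Init (&argc, &argv);                   /* was missing */
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);

        MPI_Datatype type;
        MPI_Type_contiguous (2, MPI_INT, &type);   /* C type instead of MPI_INTEGER */
        MPI_Type_commit (&type);                   /* was missing */

        int partner = size - rank - 1;             /* stays within 0..size-1 */

        /* Matching counts (2 x contiguous{2 int} = 4 int received) and no
           recv-recv deadlock: combined send/receive instead of Recv before Send. */
        MPI_Sendrecv (buf, 2, type, partner, 123,
                      buf + 4, 4, MPI_INT, partner, 123,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf ("Hello, I am rank %d of %d.\n", rank, size);

        MPI_Type_free (&type);                     /* was missing */
        MPI_Finalize ();                           /* was missing */
        return 0;
    }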
Slide 24: Now what about accelerated systems?
- Hybrid parallel programming (MPI + OpenMP) is even more complex
- Including accelerators (i.e. OpenMP target, OpenACC) even more so
- Recent work made MUST support
  → threading
  → offloading
Slide 25: Example: Race between Host and Device
- Result not deterministic
- Race detection only possible with memory tracing (Pintool)
- OMPT mapping information required

    double result = 0;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections
        {
            #pragma omp section
            #pragma omp target map(tofrom:result)
            {
                result += compute();
            }
            #pragma omp section
            {
                result += compute();
            }
        }
    }
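One way to make this example deterministic (a sketch, not taken from the slides: each section writes its own accumulator and the combination happens after the parallel region; compute() is a stub standing in for the function assumed in the slide):

    #include <stdio.h>

    /* Stub for the (hypothetical) compute() used on the slide. */
    #pragma omp declare target
    static double compute(void) { return 1.0; }
    #pragma omp end declare target

    int main(void)
    {
        double result_dev = 0.0, result_host = 0.0;

        #pragma omp parallel num_threads(2)
        {
            #pragma omp sections
            {
                #pragma omp section
                #pragma omp target map(tofrom: result_dev)
                {
                    result_dev += compute();   /* device-side contribution */
                }

                #pragma omp section
                {
                    result_host += compute();  /* host-side contribution */
                }
            }
        }

        double result = result_dev + result_host;  /* combined after the region */
        printf("result = %f\n", result);
        return 0;
    }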
Slide 26: Status Correctness Checking
Comparison of correctness checking capabilities. Columns: Insp(clang), Insp(Phi), MUST.

    FK1  data_missing_accelerator        x
    FK1  data_missing_host               (x)  x*
    FK1  data_outdated_accelerator       x
    FK1  data_outdated_host              x*
    FK2  datarace_inside_devkernel       x
    FK2  datarace_across_devkernels
    FK3  race_between_host_and_device
    FK4  only_some_thread_pass_barrier   x
    FK4  deadlock_with_locks             x
    FK4  simd_misalign                   x**
    FK5  thread_pass_different_barriers  x
    FK5  uninitialized_locks             x
    FK6  dev_allocation_fails            x    x

    *  check directly implemented in the Pintool
    ** only in a specialized version for x86
Slide 27: Summary
Slide 28: Influence on OpenMP (Loop-level Parallelization, Tasking, Heterogeneous Architectures)
- OpenMP 3.0 and 3.1: C++
  → extension of the canonical form of parallelizable loops + iterator loops
  → definition of object behavior in the context of data scoping
- OpenMP 4.0: Thread Affinity
  → integration of the OpenMP thread affinity model, support for nested parallelism
- OpenMP 4.5:
  → taskloop construct: loop parallelization by means of tasks (composability)
  → locks with hints: support for different lock types, e.g. for transactional memory
- OpenMP TR5 / 5.0:
  → memory management
  → OpenMP tools and debugging interface
Slide 29: Summary
- Research interests of our group in Aachen:
  → Parallel Programming Paradigms
  → Correctness Checking
  → Total Cost of Ownership
  → Analysis of parallel architectures
- Abstractions can foster programmer productivity
- The development of programming languages has to go along with the development of tools
  → focus not only on performance, but also on correctness
Slide 30: Thank you for your attention. Christian Terboven, Matthias Müller