Hybrid programming with MPI and OpenMP: On the way to exascale
1 Institut du Développement et des Ressources en Informatique Scientifique (IDRIS)
Hybrid programming with MPI and OpenMP: on the way to exascale
2 Trends of hardware evolution
Main problem: how to deal with power consumption?
- Simplification and multiplication of cores:
  - Many-core processors (like Intel Xeon Phi, IBM BG/Q)
  - Accelerators (like NVIDIA Tesla or AMD Fusion)
  - ARM-based microprocessors
- Common characteristics that impact users and applications:
  - Huge number of threads of execution; remember that exascale = 1 billion threads of execution!
  - Intensive use of SMT or Hyper-Threading to get good performance (at least 2 to 4 threads per core!)
  - Vectorization (SIMD) is required to use the hardware efficiently; the compiler tries its best, but that is not enough (yet)
  - Memory per execution thread shrinks
3 Introduction to hybrid MPI+OpenMP parallelization
For homogeneous architectures without accelerators, there are two well-recognized and mature standards to parallelize applications:
- OpenMP, for shared-memory architectures: a directive-based API supporting C/C++ and Fortran, used to create threads (via parallel regions), to choose the data-sharing attribute of variables (PRIVATE or SHARED), to share work among the threads (DO, SECTION and TASK) and to synchronize threads (BARRIER, ATOMIC, CRITICAL, FLUSH). Latest official OpenMP specification: version 3.1 (July 2011); version 4.0 is awaited (error model, NUMA support, accelerators and tasking extensions).
- MPI, for all kinds of architectures: a message-passing library supporting C/C++ and Fortran, used to manage one-sided, point-to-point or collective communications between processes, to define topologies and derived datatypes, to deal with parallel I/O and to synchronize processes. Latest official MPI specification: MPI 3.0, released September 21, 2012.
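As a minimal illustration (our sketch, not taken from the slides), the OpenMP constructs listed above can be combined in a few lines of C; the function name is ours:

```c
/* A worksharing loop with explicit data-sharing attributes: the arrays and
   the bound are SHARED, the loop index is implicitly PRIVATE, and the
   partial sums are combined with a reduction. Compile with -fopenmp;
   without it the pragma is ignored and the loop simply runs serially,
   producing the same result. */
long dot(const long *a, const long *b, int n)
{
    long sum = 0;
    #pragma omp parallel for default(none) shared(a, b, n) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The reduction clause is what makes the shared accumulation race-free: each thread sums privately and the partial results are combined at the end of the loop.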
4 Introduction to hybrid MPI+OpenMP parallelization
The majority of codes are parallelized with MPI or OpenMP alone. Nevertheless, for some applications this approach begins to show its limitations on the latest generation of massively parallel architectures, for various reasons:
- Granularity of the code (it is decreasing)
- Memory consumption of the application (it no longer fits in what is available)
- Algorithmic and hardware limitations (visible only beyond a certain threshold)
- Huge load imbalance (very hard to deal with)
- Overheads (they increase with the number of cores)
All this leads to disappointing performance and very limited scalability! Solving all these issues is far from simple.
5 Introduction to hybrid MPI+OpenMP parallelization
The main problem is simple: too many MPI processes to manage, each with too little work to execute. How to reduce the number of MPI processes? Replace MPI processes with OpenMP threads! That is what is called hybrid programming with MPI and OpenMP (OpenMP could be replaced by any threading library). Take the best of both approaches:
- MPI to exchange data between nodes
- OpenMP to benefit from the shared memory inside a node
Mixing MPI and OpenMP in a two-level parallelization seems natural: it fits perfectly the hardware characteristics of various machines (either fat or thin nodes) and has a lot of advantages, but also some drawbacks, so be careful.
6 Introduction to hybrid MPI+OpenMP parallelization (figure-only slide)
7 Thread support in MPI
For a multithreaded MPI application, replace MPI_INIT(...) with MPI_INIT_THREAD(REQUIRED, PROVIDED, IERROR). Four levels of support can be requested:
- MPI_THREAD_SINGLE: only one thread per MPI process; OpenMP cannot be used
- MPI_THREAD_FUNNELED: multiple threads per MPI process, but only the main thread can make MPI calls; MPI calls must be made outside OpenMP parallel regions or by the main thread (the one which made the MPI_INIT_THREAD call)
- MPI_THREAD_SERIALIZED: all threads can make MPI calls, but only one at a time; in an OpenMP parallel region, MPI calls have to be made in critical sections
- MPI_THREAD_MULTIPLE: completely multithreaded, without restrictions (except for MPI collective calls using the same communicator)
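A minimal C sketch of this initialization (ours, not from the slides) requests the FUNNELED level and checks what the library actually provides; it needs an MPI installation (compile with mpicc, run under mpirun):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: ask MPI for MPI_THREAD_FUNNELED and verify the provided level.
   The C binding takes the required level in and returns the provided
   level out, instead of the Fortran (REQUIRED, PROVIDED, IERROR) form. */
int main(int argc, char **argv)
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        /* The library cannot guarantee what we need: stop cleanly. */
        fprintf(stderr, "MPI_THREAD_FUNNELED not supported (got %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("thread support level provided: %d\n", provided);

    MPI_Finalize();
    return 0;
}
```

Note that the constants are ordered (SINGLE < FUNNELED < SERIALIZED < MULTIPLE), which is why the `provided < MPI_THREAD_FUNNELED` comparison is the standard way to check the result.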
8 Introduction to hybrid MPI+OpenMP parallelization
Drawbacks of the hybrid MPI+OpenMP approach:
- Complexity of the application (especially with MPI_THREAD_MULTIPLE) and the high level of expertise required from developers
- Performance improvements are not guaranteed; good MPI and OpenMP performance and efficiency are mandatory (Amdahl's law applies to both approaches)
- Memory affinity, mapping, binding, etc. have to be carefully managed (the same problems arise with flat MPI or OpenMP codes)
- Data races, race conditions, deadlocks and wrong data-sharing attributes: all the pitfalls of MPI and OpenMP combine, leading to very complex debugging
- No really mature and robust tools exist to debug hybrid MPI+OpenMP applications at scale (or even on a small number of cores)
So, is there still an interest in hybrid MPI+OpenMP parallelization? Yes, of course: fortunately, the advantages are even greater.
9 Memory saving
Memory per thread of execution is scarce, and the hybrid approach optimizes its usage. But why does it save memory?
- Hybrid programming allows adapting the code to the target architecture, which is generally composed of shared-memory (SMP) nodes linked by an interconnect network. The interest of the shared memory inside a node is that it is not necessary to duplicate data in order to exchange them: every thread can read and write SHARED data.
- The ghost (or halo) cells, introduced to simplify the programming of MPI codes using domain decomposition, are no longer required within an SMP node; only the ghost cells associated with inter-node communications remain. This saving is far from negligible. It depends heavily on the order of the method, the domain type (2D or 3D), the domain decomposition (along one or several dimensions) and the number of cores of the SMP node.
- The memory footprint of the system buffers associated with MPI is not negligible either, and it increases with the number of processes. For example, on a large InfiniBand network, the memory footprint of system buffers reaches 300 MB per process, almost 20 TB in total!
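The ghost-cell saving can be estimated with a back-of-the-envelope calculation (our sketch with illustrative sizes, not from the slides):

```c
/* Ghost (halo) cells for one nx x ny subdomain with halo width h:
   the padded block minus the interior. */
long halo_cells(long nx, long ny, long h)
{
    return (nx + 2*h) * (ny + 2*h) - nx * ny;
}

/* Total halo cells when a global nxg x nyg grid is split into px x py
   subdomains (one per MPI process). */
long total_halo(long nxg, long nyg, long px, long py, long h)
{
    return px * py * halo_cells(nxg / px, nyg / py, h);
}
```

For a 1024x1024 grid on 1024 cores, a pure MPI run (32x32 processes) carries 1024 x 132 = 135168 halo cells, while a hybrid run with 64 MPI processes of 16 threads each (8x8 decomposition) carries only 33024: roughly a 4x saving on halo storage alone, before even counting MPI system buffers.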
10 Memory saving
OK in theory, but in real life, do we actually observe gains in the memory consumption of applications?
11 Memory saving
Source: "Mixed Mode Programming on HECToR", Anastasios Stathopoulos, August 22, 2010, MSc in High Performance Computing, EPCC.
Target machine: HECToR (Cray XT6).
Results: a table comparing the memory per node (in MiB) of the pure MPI version (MPI processes, Mem./node) and the hybrid version (MPI x threads, Mem./node), with the resulting memory saving, for the CPMD, BQCD, SP-MZ, IRS and Jacobi codes (the numeric values did not survive transcription).
12 Memory saving
Source: "Performance evaluations of gyrokinetic Eulerian code GT5D on massively parallel multi-core platforms", Yasuhiro Idomura and Sébastien Jolliet, SC11.
Executions on 4096 cores on:
- Fujitsu BX900 with Nehalem-EP processors at 2.93 GHz (8 cores and 24 GiB per node)
- Fujitsu FX1 with SPARC64 VII processors at 2.5 GHz (4 cores and 32 GiB per node)
Results: a table comparing the total memory usage (code + system, in TiB) of the pure MPI version against hybrid versions with 4 and 8 threads per process; most numeric values did not survive transcription, but on the FX1 the total drops from 5.4 TiB with pure MPI to 2.83 TiB with 4 threads per process.
13 Conclusions on memory saving
Too often, this aspect is forgotten when talking about hybrid programming. However, the potential gains are very significant and could be exploited to increase the size of the problems being simulated! The gap in memory usage between the pure MPI and hybrid approaches will continue to grow rapidly on the next generation of machines, due to:
- the increase in the total number of cores,
- the rapid increase in the number of cores within an SMP node,
- the general use of Hyper-Threading or SMT (the possibility to run multiple threads simultaneously on one core),
- the general use of high-order numerical methods (with nearly free computational cost thanks to hardware accelerators).
This will make the transition to hybrid programming almost mandatory...
14 Overcoming algorithmic limitations
Some applications are limited in terms of scalability by a physical parameter (the dimension in one direction, for example). In the NAS Parallel Benchmarks, the problem size defines the notion of zone, and the number of MPI processes cannot exceed the number of zones (limited to 1024 for the class D and 2048 for the class E problem sizes). The hybrid version of the code is still limited in its number of MPI processes, but each MPI process can manage multiple OpenMP threads: the total number of threads of execution is the number of MPI processes times the number of OpenMP threads per MPI process. On a BG/P you can gain up to a factor of 4, and a factor of 16 on a BG/Q, with excellent scalability!
15 Overcoming algorithmic limitations (figure-only slide)
16 Performance and scalability
Many factors contribute to increasing the performance and scalability of applications using a hybrid MPI+OpenMP parallelization:
- Better MPI granularity: hybridization uses the same number of execution cores, but with a reduced number of MPI processes. Each MPI process has much more work to manage, which improves the granularity of the application.
- Better load balancing: in a pure MPI application, dynamic load balancing is very complex to implement and time consuming (it requires heavy use of message passing). In a hybrid application, you can easily manage dynamic load balancing inside each MPI process (with the DYNAMIC or GUIDED schedules for parallel loops, or directly by hand using the shared memory). Load balancing is a critical factor for massive parallelism, impacting the scalability of the code.
- Optimization of communications: reducing the number of MPI processes minimizes the number of communications and increases the size of messages. Hence, the impact of latency is reduced and the throughput of communications improves (even more important for applications that use collective communications heavily).
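The dynamic load-balancing point can be sketched in a few lines of C (our example, not HYDRO code): with schedule(dynamic), chunks of iterations are handed to threads as they become free, so irregular per-iteration costs even out.

```c
/* Stand-in for an iteration whose cost varies with i (in a real code the
   imbalance would come from physics, mesh density, etc.). */
long work(int i)
{
    return (long)i * i;
}

long total_work(int n)
{
    long total = 0;
    /* Chunks of 8 iterations are distributed on demand instead of being
       fixed in advance; the reduction keeps the sum race-free. Without
       -fopenmp the pragma is ignored and the result is unchanged. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);
    return total;
}
```

Inside one MPI process this costs nothing to set up, whereas the equivalent work redistribution between MPI processes would require explicit message passing.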
17 Performance and scalability
- Improved convergence of certain iterative algorithms: if your iterative algorithm uses information local to the domain of each MPI process, then reducing the number of MPI processes results in bigger local domains with much more information. Hence the convergence rate of the iterative algorithm improves, leading to a better time to solution.
- Optimization of I/O: reducing the number of MPI processes leads to fewer simultaneous disk accesses and increases the size of records. As a consequence, the metadata servers are less loaded and the record size is much better adapted to the disk system.
- An approach that fits new architectures (many-cores, ...) perfectly: with hybrid parallelization, you naturally create and manage many threads, which can be used to overload the cores (SMT or Hyper-Threading) and use the hardware efficiently.
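The communication-optimization argument can also be quantified for a 2D Cartesian decomposition (our sketch with illustrative numbers):

```c
/* Point-to-point messages per halo-exchange step for a px x py Cartesian
   decomposition: two messages (one each way) across every interior edge,
   counted along both dimensions. */
long messages_per_exchange(long px, long py)
{
    return 2 * (px - 1) * py + 2 * px * (py - 1);
}
```

Going from 32x32 MPI processes to 8x8 processes (with 16 threads each, same core count) cuts the messages per exchange step from 3968 to 224, while each message becomes 4 times longer: the cost shifts from latency-bound to bandwidth-bound, which is exactly where interconnects perform best.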
18 Performance and scalability
The potential gains in performance are all the more important as the number of execution cores is large. If the hybrid parallelization is well done, the scalability limit of the hybrid version of a code, compared to the flat MPI version, can be improved by a factor of up to the number of cores of the SMP node! Let's have a look at a real-life application named HYDRO.
19 Application HYDRO
HYDRO is a 2D Computational Fluid Dynamics code (~1500 lines of Fortran 90) that solves Euler's equations on a regular mesh with a finite-volume method, using Godunov's scheme and a Riemann solver at each interface. It was selected as the PRACE application benchmark for the assessment of WP9 prototypes. Thanks to many contributors, various versions of HYDRO have been developed:
- Sequential versions: Fortran 90, C99
- Accelerated versions: HMPP, CUDA, OpenCL
- Parallel versions: OpenMP (fine and coarse grain), MPI, hybrid MPI+OpenMP
- Other versions: Cilk cache-oblivious version, X10, ...
20 HYDRO results
Characteristics of the hybrid version of HYDRO:
- MPI_THREAD_FUNNELED level of thread support (MPI calls are made inside the parallel region, but only by the master thread)
- The MPI parallelization relies on a 2D domain decomposition, with MPI derived datatypes and synchronous communications with the neighbours
- The OpenMP parallelization relies on another 2D domain decomposition (coarse-grain approach), with fine synchronization among threads managed by the FLUSH directive to cope with dependencies
We will compare the pure MPI version and the hybrid MPI+OpenMP version of HYDRO. All timings are in seconds (s) and correspond to the elapsed time of the full application.
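The FUNNELED organisation described above can be sketched as follows (our C sketch, not HYDRO source; the real halo exchange would be an MPI call, replaced here by a stub copy so the control flow is self-contained):

```c
#include <string.h>

/* Stand-in for the halo exchange the master thread would perform with MPI
   (e.g. a synchronous MPI_Sendrecv in a real code). */
void exchange_stub(double *halo, const double *field, int n)
{
    memcpy(halo, field, n * sizeof(double));
}

/* One time step in the MPI_THREAD_FUNNELED pattern: all threads compute,
   only the master thread communicates, and barriers order the two phases. */
void funneled_step(double *u, double *halo, int n)
{
    #pragma omp parallel
    {
        /* 1. All threads update their share of the local block; the
           implicit barrier at the end of the for ensures u is complete. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            u[i] += 1.0;

        /* 2. Only the master thread makes the (stubbed) MPI call. */
        #pragma omp master
        exchange_stub(halo, u, n);

        /* 3. Explicit barrier: no thread may read the halo before the
           master is done (the master construct has no implicit barrier). */
        #pragma omp barrier
    }
}
```

Compiled without -fopenmp, the pragmas are ignored and the step runs serially with the same result, which makes the pattern easy to debug before going parallel.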
21 HYDRO results
Goal: is the hybrid approach interesting on a moderate number of execution cores?
Target architecture: 2 IBM SP6 nodes (64 cores). The total number of threads of execution is fixed at 64; the number of OpenMP threads per MPI process varies from 1 (pure MPI version) to 32.
Results: a table of elapsed times (in s) on 64 execution cores for the different MPI x OpenMP combinations per node, from 32 x 1 down to 1 x 32 (the timing values did not survive transcription).
22 HYDRO results
Goal: determine whether the hybrid approach is more scalable than pure MPI.
Target architecture: IBM BG/P (10 racks). Strong scaling on a high number of execution cores, starting from 4096 cores. All timings are in seconds (s).
Results: a table of elapsed times at five core counts, comparing the pure MPI version against the hybrid version with 4 threads per MPI process (the core counts beyond 4096 and the timing values did not survive transcription).
23 HYDRO results
Scalability limit of the pure MPI version: 8192 cores. Scalability limit of the hybrid version: optimal at 16384 cores, sub-optimal beyond. At the largest core count, the hybrid version is more than 6 times faster than the pure MPI version, and the best hybrid version is 2.6 times faster than the best pure MPI version (on 8192 cores).
24 Conclusions
- There is no need for hybrid parallelization if you face no scalability or memory-consumption problem with your MPI application.
- It is a sustainable approach, based on recognized, mature and widely available standards (MPI and OpenMP); it is a long-term investment.
- The advantages of the hybrid approach compared to the pure MPI approach are many:
  - significant memory savings;
  - gains in performance (at a fixed number of execution cores), through a better adaptation of the code to the target architecture;
  - gains in scalability, pushing the scalability limit of a code by a factor of up to the number of cores of the shared-memory node.
- These gains are proportional to the number of cores of the shared-memory node, a number that will increase significantly in the short term (general use of multi/many-core processors).
- It is a durable solution that allows an efficient usage of the next massively parallel architectures (multi-petascale, exascale, ...), but it still has to evolve to take accelerators into account (OpenCL, OpenACC, OpenMP 4.0, ...).
Hybrid MPI-OpenMP Programming
Pierre-François Lavallée (Pierre-Francois.Lavallee@idris.fr), Philippe Wautelet (Philippe.Wautelet@aero.obs-mip.fr), CNRS IDRIS / LA. Version 3.0.1, 1 December 2017.
More informationTECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 11th CALL (T ier-0)
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 11th CALL (T ier-0) Contributing sites and the corresponding computer systems for this call are: BSC, Spain IBM System X idataplex CINECA, Italy The site selection
More informationConvergence of Parallel Architecture
Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty
More informationGenerating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation
More informationHPC future trends from a science perspective
HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively
More informationIntroduction II. Overview
Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationIntroducing OpenMP Tasks into the HYDRO Benchmark
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Introducing OpenMP Tasks into the HYDRO Benchmark Jérémie Gaidamour a, Dimitri Lecas a, Pierre-François Lavallée a a 506,
More informationA Lightweight OpenMP Runtime
Alexandre Eichenberger - Kevin O Brien 6/26/ A Lightweight OpenMP Runtime -- OpenMP for Exascale Architectures -- T.J. Watson, IBM Research Goals Thread-rich computing environments are becoming more prevalent
More informationCMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationHow to Write Code that Will Survive the Many-Core Revolution
How to Write Code that Will Survive the Many-Core Revolution Write Once, Deploy Many(-Cores) Guillaume BARAT, EMEA Sales Manager CAPS worldwide ecosystem Customers Business Partners Involved in many European
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationComparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster G. Jost*, H. Jin*, D. an Mey**,F. Hatay*** *NASA Ames Research Center **Center for Computing and Communication, University of
More informationMPI+X on The Way to Exascale. William Gropp
MPI+X on The Way to Exascale William Gropp http://wgropp.cs.illinois.edu Some Likely Exascale Architectures Figure 1: Core Group for Node (Low Capacity, High Bandwidth) 3D Stacked Memory (High Capacity,
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationPRACE Autumn School Basic Programming Models
PRACE Autumn School 2010 Basic Programming Models Basic Programming Models - Outline Introduction Key concepts Architectures Programming models Programming languages Compilers Operating system & libraries
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationGPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationParallel and Distributed Computing
Parallel and Distributed Computing NUMA; OpenCL; MapReduce José Monteiro MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer Science and Engineering
More informationGeneral introduction: GPUs and the realm of parallel architectures
General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years
More informationReal Parallel Computers
Real Parallel Computers Modular data centers Background Information Recent trends in the marketplace of high performance computing Strohmaier, Dongarra, Meuer, Simon Parallel Computing 2005 Short history
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationHow to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture
How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture Dirk Schmidl, Christian Terboven, Andreas Wolf, Dieter an Mey, Christian Bischof IEEE Cluster 2010 / Heraklion September 21, 2010
More informationThe IBM Blue Gene/Q: Application performance, scalability and optimisation
The IBM Blue Gene/Q: Application performance, scalability and optimisation Mike Ashworth, Andrew Porter Scientific Computing Department & STFC Hartree Centre Manish Modani IBM STFC Daresbury Laboratory,
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationHPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances)
HPC and IT Issues Session Agenda Deployment of Simulation (Trends and Issues Impacting IT) Discussion Mapping HPC to Performance (Scaling, Technology Advances) Discussion Optimizing IT for Remote Access
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too
More informationKommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen
Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen Rolf Rabenseifner rabenseifner@hlrs.de Universität Stuttgart, Höchstleistungsrechenzentrum Stuttgart (HLRS)
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationReal Parallel Computers
Real Parallel Computers Modular data centers Overview Short history of parallel machines Cluster computing Blue Gene supercomputer Performance development, top-500 DAS: Distributed supercomputing Short
More information