How to Write Code that Will Survive the Many-Core Revolution


Write Once, Deploy Many(-Cores)
Guillaume BARAT, EMEA Sales Manager

CAPS Worldwide Ecosystem
- Customers
- Business partners
- Involved in many European research projects
- OpenMP ARB, Accelerator program subcommittee
- Open standard initiative: HMPP
- HMPP Competence Centers in Europe and Asia

Foreword
"How to write code that will survive the many-core revolution?" is being set up as a collective initiative of the HMPP Competence Centers:
- Gathers competences from universities across the world (Asia, Europe, USA) and from R&D projects: H4H, Autotune, TeraFlux
- Visit http://competencecenter.hmpp.org for more information

Trend from Top500: Cores per Socket
[Chart: number of Top500 systems by cores per socket (1, 2, 4, 6, 8, 9, 10, 12, 16 cores), 2002-2011]

Trend from Top500: Accelerators
[Chart: number of Top500 systems with accelerators, 2006-2011; categories: ClearSpeed CSX600, ATI GPU, IBM PowerXCell 8i, NVIDIA 2050, NVIDIA 2070, NVIDIA 2090]

IDC @ SC11: Study Results
- Many HPC codes are not seeing a speed-up on new hardware systems (due to many-core designs, lower bandwidth, less memory per core, etc.)
- Many applications will need a major redesign
- Multi-core will cause many applications to hit the wall
- GPUs can offer a speed-up, BUT:
  - GPUs still need to become easier to program
  - Future portability is a key concern

SC11: Consensus About Directives?
OpenACC initiative (2011)
- CAPS, Cray, NVIDIA, PGI
- A first common syntax for accelerator regions
- Visit http://www.openacc-standard.com for more information
OpenHMPP initiative (2010)
- Open directive standard for many-core programming
- Compliant with the OpenACC syntax
- Covers topics not addressed by OpenACC: data-flow extension, tracing interface, auto-tuning APIs
- Visit http://www.openhmpp.org for more information

Where Are We Going?

What to Expect From the Architecture?
- Strong Non-Uniform Memory Access (NUMA) effects
  - Not all memories may be coherent
  - Multiple address spaces
- Data transfers between CPU and GPUs across multiple address spaces
- Data/stream/vector parallelism to be exploited by GPUs, e.g. via CUDA / OpenCL
- A node is not a homogeneous piece of hardware
  - Not only one device
  - The balance between small and fat cores changes over time
- Many forms of parallelism are needed to deal with:
  - An increasing number of processing units
  - Some form of vector computing (AVX or SSE instructions)
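
As a minimal illustration of the multiple-address-space issue, the OpenACC sketch below makes the CPU-to-GPU copies explicit with a data region; the routine name scale_on_gpu is hypothetical, not from the original slides.

    /* Minimal OpenACC sketch (assumption: an OpenACC compiler).
       The data region makes the transfers between the host and
       device address spaces explicit. */
    void scale_on_gpu(int n, float *v, float alpha)
    {
        #pragma acc data copy(v[0:n])     /* host -> device, then back */
        {
            #pragma acc parallel loop     /* runs in the GPU address space */
            for (int i = 0; i < n; ++i)
                v[i] = alpha * v[i];
        }
    }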

What to Consider When Writing Your Code
- Computing power comes from parallelism
  - A shift from hardware (frequency increases) to software (parallel codes)
  - Driven by energy consumption
- Heterogeneity is the source of efficiency
  - A few large, fast out-of-order cores combined with many smaller cores (e.g. APUs)
- Hardware targets are a fast-moving environment
  - e.g. fast GPU improvements (RT and HW), new massively parallel CPUs
  - Write code that will last many architecture generations
- Keeping a unique version of the code, preferably in a single language, is a necessity
  - Reduces maintenance cost
  - Directive-based approaches are suitable
  - Preserves code assets

Software Main Driving Forces
- ALF: Amdahl's Law is Forever
  - A high percentage of the execution time has to be parallel
  - Many algorithms/methods/techniques will have to be reviewed to scale
- Data locality is expected to be the main issue
  - Moving data will always suffer latency
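
For reference, Amdahl's law bounds the speedup on N processing units when a fraction p of the execution time is parallel; the worked example (p = 0.95, N = 100) shows why the serial fraction dominates at scale:

    S(N) = \frac{1}{(1 - p) + p/N}, \qquad
    S(100)\big|_{p=0.95} = \frac{1}{0.05 + 0.0095} \approx 16.8

Even with 95% of the time parallelized, 100 cores deliver less than a 17x speedup.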

One MPI Process per Core Approach
Benefit:
- No need to change existing code
Issues:
- Extra latency compared to shared-memory use
  - MPI semantics imply some copying (even if efficient MPI implementations tend to reduce it)
- Excessive memory utilization
  - Partitioning into separate address spaces requires replicating parts of the data; with domain decomposition, the sub-grid may be so small that most points are replicated (i.e. ghost zones), as quantified in the sketch below
  - Memory replication puts more stress on the memory bandwidth, which ultimately results in weak scaling
- Cache thrashing between MPI processes
- Heterogeneity management
  - How are MPI processes mapped to accelerator resources? How to deal with different core speeds?
  - The balance may be hard to find between an MPI process count that makes good use of the CPU cores and accelerators that may be used more efficiently with a different ratio
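
A hedged sketch of the ghost-zone replication cost mentioned above: for a cubic sub-grid of side n with a halo of width h, each process stores (n+2h)^3 points to own n^3, so small sub-grids are mostly replica (helper name ghost_fraction is hypothetical).

    #include <stdio.h>

    /* Fraction of a process's storage spent on ghost (halo) points
       for a cubic n^3 sub-grid with halo width h. */
    static double ghost_fraction(int n, int h)
    {
        double total = (double)(n + 2*h) * (n + 2*h) * (n + 2*h);
        double owned = (double)n * n * n;
        return (total - owned) / total;
    }

    int main(void)
    {
        /* As the sub-grid shrinks, replication dominates. */
        printf("n=64: %.0f%% ghosts\n", 100.0 * ghost_fraction(64, 1));
        printf("n=8:  %.0f%% ghosts\n", 100.0 * ghost_fraction(8, 1));
        return 0;   /* prints roughly 9% and 49% */
    }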

Thread-Based Parallelism Approach
Benefits:
- Some codes are already ported to OpenMP/thread APIs
- Well-known APIs
Issues:
- Data locality and affinity management
  - Data locality and load balancing (the main target of thread APIs) are in general two antagonistic objectives
  - Usual thread APIs make it difficult, or at least indirect, to express data locality and affinity
- Reaching a tradeoff between vector parallelism (e.g. using the AVX instruction set), thread parallelism and MPI parallelism (see the sketch below)
  - Vector parallelism is expected to have a growing impact on performance
  - Current thread APIs were not designed to simplify the implementation of such a tradeoff
- Thread granularity has to be tuned to core characteristics (e.g. SMT, heterogeneity)
  - The usual thread coding style does not make this easy to tune
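
A hedged sketch of combining thread and vector parallelism in one loop nest; note that the omp simd construct is from the later OpenMP 4.0 standard and is used here purely for illustration.

    /* Threads across rows, SIMD lanes across columns. Tuning the
       (threads x vector-lanes) split is exactly the tradeoff
       discussed above. */
    void saxpy2d(int rows, int cols, float a,
                 const float *x, float *y)
    {
        #pragma omp parallel for            /* thread parallelism */
        for (int j = 0; j < rows; ++j) {
            #pragma omp simd                /* vector parallelism (OpenMP 4.0) */
            for (int i = 0; i < cols; ++i)
                y[j*cols + i] += a * x[j*cols + i];
        }
    }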

An Approach for Portable Many-Core Codes
1. Try to expose the node's massive data parallelism in a target-independent way
2. Do not hide parallelism by awkward coding
3. Keep the code debuggable
4. Track performance issues
5. Exploit libraries
6. When allowed, do not target 100% of the possible performance; the last 20% is usually very intrusive in the code

1 - Express Parallelism, not Implementation
- Rely on code generation for implementation details
  - It is usually not easy to go from one low-level API to another
  - Tuning has to be possible from the high level
  - But avoid relying on advanced compiler techniques for parallelism discovery
  - You may have to change the algorithm!
An example with HMPP:

    #pragma hmppcg gridify(j,i)
    #pragma hmppcg unroll(4), jam(2)
    for( j = 0 ; j < p ; j++ ) {
        for( i = 0 ; i < m ; i++ ) {
            for (k = ...) { ... }
            vout[j][i] = alpha * ...;
        }
    }
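
For comparison, the same high-level expression of parallelism can be written with OpenACC directives; this is a hedged sketch in which the elided loop body from the slide is replaced by a hypothetical dot-product-style kernel.

    /* OpenACC counterpart of the HMPP example above (sketch).
       The directive expresses the parallelism; the compiler owns
       the implementation details. */
    void compute(int p, int m, int n, float alpha,
                 const float vin1[p][n], const float vin2[n][m],
                 float vout[p][m])
    {
        #pragma acc parallel loop collapse(2)
        for (int j = 0; j < p; ++j) {
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)   /* sequential per thread */
                    acc += vin1[j][k] * vin2[k][i];
                vout[j][i] = alpha * acc;
            }
        }
    }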

HMPP Hybrid Compiler
- Directive-based approach for many-core
- Targets CUDA & OpenCL devices, and soon Intel MIC
Features:
- With one source code, target multiple many-core architectures
- Distribute computation over CPU & GPU cores (multi-GPU)
- High many-core performance with optimized data management
- Library interoperability with user code
- Protect your software investment by using an open standard
Rapidly develop many-core accelerated applications

A Rich Set of Directives for Performance
Many-core programming directives in legacy code:
- Declare and generate GPU versions of computations (codelets)
- Optimize data movement
- Distribute computations over CPU cores & GPUs
Tuning of GPU kernels:
- Advanced code optimizations
- Control the mapping of computations
- Fully exploit the GPU stream architecture

Performance Does Matter
On average, HMPP reaches CUDA performance +/- 10%
[Benchmark setup: 2 x Intel Xeon X5560 @ 2.80 GHz (8 cores) with MKL; NVIDIA Tesla C2050, ECC activated; HMPP vs CUBLAS and MAGMA, October 2011]

[Guest slide: Guillaume Colin de Verdière, ONERA XtremCFD Workshop, October 7, 2011]

2 - Do Not Hide Parallelism
Do not hide parallelism by awkward coding (see the aliasing sketch below):
- Data structure aliasing
- Deep routine-calling sequences
- Separate concerns (functionality coding versus performance coding)
Use data parallelism when possible:
- A simple form of parallelism, easy to manage
- Favors data locality
- But sometimes too static
At the kernel level:
- Expose massive parallelism
- Ensure that data affinity can be controlled
- Make sure it is easy to tune the vector/thread parallelism ratio
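
A small illustration of how aliasing hides parallelism: without restrict, the compiler must assume dst and src may overlap and cannot safely vectorize or parallelize the loop (a hedged C99 sketch; function names are hypothetical).

    /* Potential aliasing: the compiler must assume dst and src
       may overlap, so it cannot freely reorder or vectorize. */
    void scale_may_alias(int n, float *dst, const float *src, float a)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = a * src[i];
    }

    /* restrict promises no overlap: the parallelism is visible again. */
    void scale_no_alias(int n, float *restrict dst,
                        const float *restrict src, float a)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = a * src[i];
    }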

Data Structure Management
Data locality:
- Makes it easy to move data from one address space to another
- Makes it easy to keep data coherent
Do not waste memory:
- The memory-per-core ratio is not improving
Choose simple data structures (see the layout sketch below):
- Enable vector/SIMD computing
- Use library-friendly data structures
- They may come in multiple forms, e.g. sparse matrix representations
Consider, for instance, data collections to deal with multiple address spaces, multiple devices, or parts of a device:
- Gives a level of adaptation for dealing with heterogeneity
- Load distribution over the different devices is simple to express
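
As an illustration of "simple, vector-friendly data structures", compare an array-of-structures layout with the structure-of-arrays layout that SIMD units and GPUs prefer (a hedged sketch; the particle types are hypothetical).

    /* Array of structures (AoS): x, y, z interleaved in memory,
       so a vector load of consecutive x values must gather. */
    struct particle_aos { float x, y, z; };

    /* Structure of arrays (SoA): each field is contiguous, so the
       loop below maps directly onto SIMD lanes or GPU threads. */
    struct particles_soa {
        float *x, *y, *z;
        int    n;
    };

    void push_x(struct particles_soa *p, float dt, const float *vx)
    {
        for (int i = 0; i < p->n; ++i)
            p->x[i] += dt * vx[i];   /* contiguous, vectorizable */
    }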

HMPP 3.0: Map Operation on a Data Collection

    #pragma hmpp <mgrp> parallel
    for(k=0;k<n;k++) {
        #pragma hmpp <mgrp> f1 callsite
        myparallelfunc(d[k],n);
    }

[Diagram: the collection d0..d3 resides in CPU main memory; d1, d2 and d3 are mapped to the device memories of GPU 0, GPU 1 and GPU 2]

3 - Debugging Issues
- Avoid assuming you won't have to debug!
- Keep the serial semantics
  - For instance, keep serial libraries in the application code
  - Directive-based programming makes this easy
- Ensure validation is possible even with rounding errors (see the sketch below)
  - Reductions, ...
  - Aggressive compiler optimizations
- Use defensive coding practices
  - Event logging, parameterized parallelism, added synchronization points, ...
  - Use debuggers (e.g. Allinea DDT)
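
One defensive practice suggested above, sketched here: validate the accelerated kernel against the retained serial version with a relative tolerance rather than bitwise equality, since reordered reductions legitimately change the last bits (helper name validate is hypothetical).

    #include <math.h>
    #include <stdio.h>

    /* Compare accelerated output against the serial reference,
       tolerating rounding differences. Returns 1 on success. */
    static int validate(int n, const float *ref, const float *gpu,
                        float rel_tol)
    {
        for (int i = 0; i < n; ++i) {
            float err = fabsf(gpu[i] - ref[i]);
            float mag = fmaxf(fabsf(ref[i]), 1e-30f);
            if (err / mag > rel_tol) {
                fprintf(stderr, "mismatch at %d: %g vs %g\n",
                        i, ref[i], gpu[i]);
                return 0;
            }
        }
        return 1;
    }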

Allinea DDT: Debugging Tool for CPU/GPU
Debug your kernels:
- Debug CPU and GPU concurrently
- Examine thread data, display variables
Integrated with HMPP:
- Allows breakpoints on HMPP directives
- Step into HMPP codelets

4 - Performance Measurements
Parallelism is about performance!
Track Amdahl's Law issues:
- Serial execution is a killer
- Check scalability
- Use performance tools
Add performance measurement in the code (see the timing sketch below):
- Detect bottlenecks as soon as possible
- Make it part of the validation process
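
A minimal sketch of building measurement into the code, as suggested: wall-clock timing around a kernel, logged so regressions show up during validation (the TIME_KERNEL macro is hypothetical; clock_gettime is POSIX).

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    /* Wall-clock seconds via the POSIX monotonic clock. */
    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    /* Time a kernel call and log achieved GFLOP/s, so that a
       performance regression fails loudly during validation. */
    #define TIME_KERNEL(label, flops, call)                    \
        do {                                                   \
            double t0 = now_sec();                             \
            call;                                              \
            double dt = now_sec() - t0;                        \
            printf("%s: %.3f s, %.2f GFLOP/s\n",               \
                   label, dt, (flops) / dt / 1e9);             \
        } while (0)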

HMPP Wizard
- Analyzes the code to help migrate to many-core architectures within an incremental process
- Provides tuning advice
  - Checks the coalescing of memory accesses
  - Helps improve parallelism
Make your code many-core friendly

HMPP Wizard (continued)
- GPU library usage detection
- Access to MyDevDeck web online resources

HMPP Performance Analyzer
- Dynamic analyses
- Synthesizes metrics based on the GPU execution profile
Tune your codelet for your hardware

5 - Dealing with Libraries
Library calls can usually only be partially replaced:
- There is no one-to-one mapping between libraries (e.g. BLAS, FFTW, CuFFT, CULA, LibJacket)
- No access to all the code (i.e. avoid side effects)
- Do not create dependencies on a specific target library as much as possible (see the dispatch sketch below)
- You still want a unique source code
Deal with multiple address spaces / multi-GPUs:
- Data locations may not be unique (copies)
- Usual library calls assume shared memory
- Library efficiency depends on where the up-to-date data is located (a long-term effect)
Libraries can be written in many different languages:
- CUDA, OpenCL, HMPP, etc.
There is no single binding choice; it depends on the application and its users:
- The binding needs to adapt to how it interacts with the remainder of the code
- Choices depend on the development methodology
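
One way to avoid hard dependencies on a specific target library, sketched below: a thin wrapper keeps a single call site and dispatches a GEMM-like call to a CPU or GPU backend depending on where the up-to-date data lives. All names are hypothetical, and this is not the HMPP hmppalt mechanism itself.

    /* Hypothetical dispatch wrapper: one call site in the
       application, backend selected from the data's location. */
    enum location { ON_CPU, ON_GPU };

    struct matrix {
        float        *data;
        int           rows, cols;
        enum location loc;      /* where the up-to-date copy lives */
    };

    /* Backend entry points, assumed to be provided elsewhere
       (e.g. wrapping a CPU BLAS and a GPU BLAS). */
    void cpu_sgemm(const struct matrix *a, const struct matrix *b,
                   struct matrix *c);
    void gpu_sgemm(const struct matrix *a, const struct matrix *b,
                   struct matrix *c);

    void sgemm(const struct matrix *a, const struct matrix *b,
               struct matrix *c)
    {
        /* Dispatch on data location to avoid useless transfers. */
        if (a->loc == ON_GPU && b->loc == ON_GPU)
            gpu_sgemm(a, b, c);
        else
            cpu_sgemm(a, b, c);
    }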

Library Interoperability in HMPP 3.0
[Diagram: application code calls libroutine1, libroutine2 and libroutine3; the call to libroutine2, marked with #pragma hmppalt, is routed through the HMPP runtime API (proxy2) and the native GPU runtime API to a GPU library (gpulib), while libroutine1 and libroutine3 go directly to the CPU library (cpulib1, cpulib3)]

Conclusion
- Software has to expose massive parallelism
  - The key to success is the algorithm!
  - The implementation just has to keep the parallelism flexible and easy to exploit
- Directive-based approaches are currently one of the most promising tracks
  - They preserve code assets
  - They can separate the exposure of parallelism from its implementation (at the node level)
- Remember that even if you are very careful in your choices, you may still have to rewrite parts of the code
  - The code architecture is the main asset here

Global Solutions for Many-Core Programming
http://www.caps-entreprise.com