How to write code that will survive the many-core revolution
Write once, deploy many(-cores)
F. Bodin, CTO, CAPS Entreprise


Foreword
"How to write code that will survive the many-core revolution?" is being set up as a collective initiative of the HMPP Competence Centers
o Gathers competences from universities across the world (Asia, Europe, USA) and from R&D projects: H4H, Autotune, TeraFlux
o Visit http://competencecenter.hmpp.org for more information

Introduction
Computing power now comes from parallelism
o A shift from hardware (frequency increases) to software (parallel codes)
o Driven by energy consumption
Keeping a unique version of the code, preferably in a single language, is a necessity
o Directive-based approaches are well suited
o Write code that will last across many architecture generations
o A context of fast-evolving hardware targets
Addressing the many-core programming challenge implies
o Massively parallel algorithms
o New development methodologies / code architectures
o New programming environments

Many-Cores
Massively parallel devices
o Thousands of threads needed
o Data transfers between CPU and GPUs, multiple address spaces
o Data/stream/vector parallelism to be exploited by GPUs, e.g. via CUDA / OpenCL

Where Are We Going?
(Slide: hardware roadmap / forecast chart)

HMPP: Heterogeneous Multicore Parallel Programming
A directive-based approach for many-core
o CUDA devices, OpenCL devices
o One annotated source produces both the CPU version and the GPU version

    #pragma hmpp f1 codelet
    myfunc(...) {
        ...
        for () for () for () ...
        ...
    }

    main() {
        ...
        #pragma hmpp f1 callsite
        myfunc(v1[k], v2[k]);
        ...
    }

Available Q1 2012
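A slightly fuller sketch of the same codelet/callsite pattern (illustrative only; the function name, array names, sizes and the target/io clauses are assumptions, and the exact clause syntax depends on the HMPP Workbench version):

    #pragma hmpp f1 codelet, target=CUDA, args[vout].io=inout
    void myfunc(int n, float vin[n], float vout[n])
    {
        /* data-parallel loop nest generated for the GPU, or run on the
           CPU when no accelerator is available */
        for (int k = 0; k < n; k++)
            vout[k] = vout[k] + 2.0f * vin[k];
    }

    int main(void)
    {
        float v1[1024], v2[1024];
        for (int k = 0; k < 1024; k++) { v1[k] = (float)k; v2[k] = 0.0f; }
        #pragma hmpp f1 callsite
        myfunc(1024, v1, v2);
        return 0;
    }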

Software: Main Driving Forces
ALF - Amdahl's Law is Forever
o A high percentage of the execution time has to be parallel
o Many algorithms/methods/techniques will have to be revisited to scale
Data locality is expected to be the main issue
o The speed of light is a hard limit
o Moving data will always incur latency
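(Note, not on the slide: Amdahl's law bounds the achievable speedup. With a parallel fraction p of the execution time and N cores,

    speedup(N) = 1 / ((1 - p) + p / N)

so even with p = 0.95 the speedup on N = 1000 cores is about 1 / (0.05 + 0.00095) ≈ 19.6x, and never more than 20x however many cores are added.)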

What to Expect From the Architecture?
Strong Non-Uniform Memory Access (NUMA) effects
o Not all memories may be coherent
o Multiple address spaces
The node is not a homogeneous piece of hardware
o Not just one device
o The balance between small and fat cores changes over time
Many forms of parallelism
o Increasing number of processing units
o Some form of vector computing

One MPI Process per Core Approach
Benefit
o No need to change existing code
Issues
o Extra latency compared to shared-memory use: MPI semantics imply some copying (even if efficient MPI implementations tend to reduce it)
o Excessive memory utilization: partitioning into separate address spaces requires replicating parts of the data. With domain decomposition, the per-process sub-grid may be so small that most points are replicated (ghost zones); see the arithmetic sketch below. Memory replication puts more stress on memory bandwidth, which ultimately yields weak scaling
o Cache thrashing between MPI processes
o Heterogeneity management: how are MPI processes mapped to accelerator resources, and how to deal with different core speeds? The balance may be hard to find between an MPI process count that makes good use of the CPU cores and the accelerators, which may be used more efficiently with a different ratio
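(Illustrative arithmetic, not on the slide: with a 3D domain decomposition and a one-point ghost layer, a 20x20x20 sub-grid per process grows to 22x22x22 = 10648 stored points, i.e. about 33% replicated data; a 10x10x10 sub-grid grows to 12x12x12 = 1728 points, i.e. about 73% overhead. The more processes per node, the smaller the sub-grid and the larger the replication and bandwidth penalty.)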

Thread-Based Parallelism Approach
Benefits
o Some codes are already ported to OpenMP/thread APIs
o Well-known APIs
Issues
o Data locality and affinity management: data locality and load balancing (the main target of thread APIs) are in general two antagonistic objectives, and the usual thread APIs make it difficult or indirect to express data locality and affinity
o Reaching a tradeoff between vector parallelism (e.g. using the AVX instruction set), thread parallelism and MPI parallelism: vector parallelism is expected to matter more and more for performance, and current thread APIs were not designed to simplify the implementation of such a tradeoff
o Thread granularity has to be tuned to the core characteristics (e.g. SMT, heterogeneity), and the usual thread coding style does not make this easy

An Approach for Portable Many-Core Codes
1. Try to expose the node's massive data parallelism in a target-independent way
2. Do not hide parallelism with awkward coding
3. Keep the code debuggable
4. Track performance issues
5. Exploit libraries
6. When allowed, do not target 100% of the possible performance; the last 20% is usually very intrusive in the code

Express Parallelism, not Implementation
Rely on code generation for implementation details
o It is usually not easy to go from one low-level API to another low-level one
o Tuning has to remain possible from the high level
o But avoid relying on advanced compiler techniques for parallelism discovery
o You may have to change the algorithm!
An example with HMPP (code generation targets OpenMP/Threads, CUDA/OpenCL and the vector ISA; there is no automatic translation between the low-level targets):

    #pragma hmppcg gridify(j,i)
    #pragma hmppcg unroll(4), jam(2)
    for (j = 0; j < p; j++) {
        for (i = 0; i < m; i++) {
            for (k = ...) { ... }
            vout[j][i] = alpha * ...;
        }
    }

HMPP 3.0: Map Operation on a Data Collection

    #pragma hmpp <mgrp> parallel
    for (k = 0; k < n; k++) {
        #pragma hmpp <mgrp> f1 callsite
        myparallelfunc(d[k], n);
    }

(Slide diagram: the collection elements d0..d3 reside in main memory shared by CPU 0 and CPU 1, while d1, d2 and d3 are mirrored in the device memories of GPU 0, GPU 1 and GPU 2; the map operation distributes the callsites over the devices.)

Exposing Massive Parallelism
Do not hide parallelism with awkward coding (see the sketch after this list)
o Data structure aliasing
o Deep routine calling sequences
o Separate concerns (functionality coding versus performance coding)
Use data parallelism when possible
o A simple form of parallelism, easy to manage
o Favors data locality
o But sometimes too static
At kernel level
o Expose massive parallelism
o Ensure that data affinity can be controlled
o Make sure the vector / thread parallelism ratio is easy to tune
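A small C illustration (not from the slides) of how coding style can hide parallelism: potential pointer aliasing forces the compiler to be conservative, while restrict-qualified arrays leave a clean data-parallel kernel that a directive, a vectorizer or a GPU code generator can exploit.

    /* Parallelism obscured: the compiler cannot prove that x and y do
       not overlap, so it must assume a dependence between iterations. */
    void axpy_may_alias(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Parallelism exposed: restrict asserts non-aliasing, so the loop is
       an explicitly independent, data-parallel kernel. */
    void axpy_parallel(int n, float a, const float *restrict x,
                       float *restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }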

Data Structure Management
Data locality
o Makes it easy to move data from one address space to another
o Makes it easy to keep data coherent and to track affinity
Do not waste memory
o The memory-per-core ratio is not improving
Choose simple data structures (see the sketch after this list)
o Enable vector/SIMD computing
o Use library-friendly data structures
o They may come in multiple forms, e.g. sparse matrix representations
Consider data collections to deal with multiple address spaces, multiple devices, or parts of a device
o Gives a level of adaptation for dealing with heterogeneity
o Load distribution over the different devices is simple to express
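An illustrative C sketch (not from the slides) of the "simple, SIMD- and library-friendly data structure" point: an array of structures scatters each field in memory, whereas a structure of arrays keeps each field contiguous and easy to vectorize or to move between address spaces.

    /* Array of structures: the fields of one particle are contiguous,
       but the x components of successive particles are strided, which
       hinders SIMD loads and coalesced GPU accesses. */
    typedef struct { float x, y, z, mass; } ParticleAoS;

    /* Structure of arrays: each field is a plain contiguous array,
       easy to vectorize, to pass to libraries, and to transfer as a
       whole to another address space. */
    typedef struct {
        int    n;
        float *x, *y, *z, *mass;
    } ParticlesSoA;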

Collection of Data in HMPP 3.0

    struct {
        float *data;
        int    size;
    } Vec;

(Slide diagram: the collection elements d[0] and d[1] live in CPU memory; their d[0].data and d[1].data fields are mirrored into the memories of GPU 1, GPU 2 and GPU 3.)

Performance Measurements
Parallelism is about performance!
Track Amdahl's Law issues
o Serial execution is a killer
o Check scalability
o Use performance tools
Add performance measurement in the code (see the sketch after this list)
o Detect bottlenecks as soon as possible
o Make it part of the validation process
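A minimal sketch (assuming a POSIX system; names are illustrative, not from the slides) of embedding performance measurement into the code so that timing and scalability checks become part of the validation process.

    #include <stdio.h>
    #include <time.h>

    /* Wall-clock timer in seconds, based on the POSIX monotonic clock. */
    static double wallclock(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    /* Time an arbitrary code section and log it, so bottlenecks are
       detected early and tracked across runs. */
    static void timed_section(const char *label, void (*section)(void))
    {
        double t0 = wallclock();
        section();
        double t1 = wallclock();
        fprintf(stderr, "[perf] %s: %.3f s\n", label, t1 - t0);
    }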

HMPP Wizard
o GPU library usage detection
o Access to MyDevDeck online web resources

Debugging Issues
Avoid assuming you won't have to debug!
Keep the serial semantics
o For instance, this implies keeping the serial libraries in the application code
o Directive-based programming makes this easy
Ensure that validation is possible even with rounding errors (see the sketch after this list)
o Reductions
o Aggressive compiler optimizations
Use defensive coding practices
o Event logging, parameterized parallelism, added synchronization points
o Use debuggers (e.g. Allinea DDT)
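A small C sketch (illustrative, not from the slides) of validating accelerator results against the serial reference with a rounding-error tolerance rather than a bit-exact comparison.

    #include <math.h>
    #include <stdio.h>

    /* Compare the accelerated result against the serial reference with
       relative and absolute tolerances, since reductions and aggressive
       compiler optimizations legitimately change the last bits. */
    int validate(const double *ref, const double *acc, int n,
                 double rtol, double atol)
    {
        int errors = 0;
        for (int i = 0; i < n; i++) {
            double diff = fabs(ref[i] - acc[i]);
            if (diff > atol + rtol * fabs(ref[i])) {
                if (errors < 10)  /* log only the first few mismatches */
                    fprintf(stderr, "mismatch at %d: ref=%g acc=%g\n",
                            i, ref[i], acc[i]);
                errors++;
            }
        }
        return errors == 0;
    }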

Allinea DDT: Debugging Tool for CPU/GPU
Debug your kernels
o Debug CPU and GPU concurrently
Examine thread data
o Display variables
Integrated with HMPP
o Allows breakpoints on HMPP directives
o Step into HMPP codelets

Dealing with Libraries
Library calls can usually only be partially replaced
o No one-to-one mapping between libraries (e.g. BLAS, FFTW, CuFFT, CULA, LibJacket)
o No access to all the code (i.e. avoid side effects)
o Avoid creating dependencies on a specific target library as much as possible
o Still want a unique source code
Dealing with multiple address spaces / multiple GPUs
o Data location may not be unique (copies)
o Usual library calls assume shared memory
o Library efficiency depends on where the up-to-date data is located (a long-term effect)
Libraries can be written in many different languages
o CUDA, OpenCL, HMPP, etc.
There is no single binding choice; it depends on applications and users
o The binding needs to adapt to how it interacts with the remainder of the code
o Choices depend on the development methodology

Example of Library Use with HMPP 3.0 (Fortran, FFTE proxy)

          CALL INIT(A,N)
    C     This call is needed to initialize FFTE and allows mixing user
    C     GPU code with library code
          CALL ZFFT1D(A,N,0,B)
          CALL DUMP(A,N)
    C     Tell HMPP to use the proxy function from the FFTE proxy group;
    C     the directive replaces the call by a proxy that handles the GPU
    C     and checks for execution errors on the GPU
    !$hmppalt ffte call, name="zfft1d", error="proxy_err"
          CALL ZFFT1D(A,N,-1,B)
          CALL DUMP(A,N)
    C     Same here
    !$hmppalt ffte call, name="zfft1d", error="proxy_err"
          CALL ZFFT1D(A,N,1,B)
          CALL DUMP(A,N)

Library Interoperability in HMPP 3.0
(Slide diagram: application code calls libroutine1(), libroutine2() and libroutine3(); libroutine2() is preceded by a #pragma hmppalt directive, so the HMPP runtime API redirects it to proxy2(), which uses the native GPU runtime API and a C/CUDA GPU library (gpulib()), while libroutine1() and libroutine3() go directly to the CPU library (cpulib1(), cpulib3()).)

Conclusion
Software has to expose massive parallelism
o The key to success is the algorithm!
o The implementation just has to keep parallelism flexible and easy to exploit
Directive-based approaches are currently one of the most promising tracks
o They preserve code assets
o They may separate parallelism exposure from the implementation (at node level)
Remember that even if you are very careful in your choices, you may have to rewrite parts of the code
o Code architecture is the main asset here
See us on Booth #650

Keywords: accelerator programming model, parallelization, directive-based programming, GPGPU, manycore programming, hybrid manycore programming, HPC community, petaflops, parallel computing, HPC open standard, multicore programming, exaflops, NVIDIA CUDA, code speedup, hardware accelerator programming, high performance computing, parallel programming interface, massively parallel, OpenCL

http://www.caps-entreprise.com
http://twitter.com/capsentreprise
http://www.openhmpp.org