Parallel Hybrid Computing Stéphane Bihan, CAPS


Introduction
- Mainstream applications will rely on new multicore/manycore architectures
- It is about performance, not parallelism
- Various heterogeneous hardware: general-purpose cores, application-specific cores, GPUs (HWAs)
- HPC and embedded applications increasingly share characteristics

An Overview of Hybrid Parallel Computing

Manycore Architectures
- General-purpose processor cores share a main memory; the core ISA provides fast SIMD instructions
- Hardware accelerators (HWAs): streaming engines / DSPs / FPGAs, application-specific ("narrow band") architectures; the host uploads application data, invokes the accelerator's stream cores like a remote procedure call, then downloads the results (a sketch of this pattern follows)
- Vector/SIMD units can be extremely fast (hundreds of GigaOps) but are not easy to take advantage of
- One platform type cannot satisfy everyone; operations per watt is the efficiency scale
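
A minimal sketch of the upload / remote-call / download pattern described above, written against the CUDA runtime purely for illustration; the kernel, names, and sizes are invented, and other HWAs follow the same shape.

/* Illustrative sketch (not from the slides): the "remote procedure call"
 * pattern for a hardware accelerator -- upload the application data,
 * run the accelerator code, download the result. */
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void saxpy_on_hwa(int n, float a, const float *x, float *y) {
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);  /* upload */
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);
    saxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);                 /* remote call */
    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);  /* download */
    cudaFree(dx);
    cudaFree(dy);
}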

Multicore/Manycore Workload
- Multiple applications share the hardware: multimedia, games, encryption, security, health, ...
- An unfriendly environment with much competition for resources: global resource allocation, no guarantee of availability
- This must be taken into account when programming and compiling
- Applications cannot always be recompiled: most are distributed as binaries, and a binary will have to run on many platforms
- Forward scalability, or "write once, run faster on new hardware"
- Losing performance is not an option

Multiple Parallelism Levels
- Amdahl's law is forever: all levels of parallelism need to be exploited
- Programming the various hardware components of a node cannot be done separately
- [Figure: a cluster of multi-GPU nodes; MPI spans the nodes over the network, OpenMP spans the cores within a node, and HMPP or CUDA drives the GPUs attached through PCIe gen2]
- A sketch combining these levels is shown below
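
A minimal, hypothetical sketch (not from the slides) of the levels named above working together: MPI between nodes, OpenMP across a node's cores, and a stub standing in for the HMPP/CUDA accelerator level. All names and sizes are invented.

/* Hypothetical sketch: MPI across nodes, OpenMP across cores,
 * and a stubbed accelerator call for the GPU level. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)

/* Stand-in for an HMPP codelet or CUDA kernel launched on this node's GPU. */
static void gpu_offload_chunk(const double *in, double *out, int n) {
    for (int i = 0; i < n; i++) out[i] = 2.0 * in[i];   /* placeholder work */
}

static double in[N], out[N];

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                      /* cluster level: MPI */
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < N; i++) in[i] = (double)i;

    /* Node level: OpenMP threads share this rank's portion of the work. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) out[i] = in[i] + rank;

    /* Accelerator level: offload one chunk to the HWA (stubbed here). */
    gpu_offload_chunk(in, out, N / nprocs);

    printf("rank %d of %d done\n", rank, nprocs);
    MPI_Finalize();
    return 0;
}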

The Past of Parallel Computing, the Future of Manycores?
- The past: focused on scientific computing; microprocessor- or vector-based, homogeneous architectures; trained programmers willing to pay the effort for performance; fixed execution environments
- The future: new applications (multimedia, medical, ...); thousands of heterogeneous system configurations; unfriendly execution environments

Programming Multicores/Manycores
- Physical-architecture-oriented approaches:
  - Shared-memory architectures: OpenMP, Cilk, TBB, automatic parallelization, vectorization, ...
  - Distributed-memory architectures: message passing, PGAS (Partitioned Global Address Space), ...
  - Hardware accelerators / GPUs: CUDA, OpenCL, Brook+, HMPP, ...
- Different styles (a small illustration follows this list):
  - Libraries: MPI, pthread, TBB, SSE intrinsic functions, ...
  - Directives: OpenMP, HMPP, ...
  - Language constructs: UPC, Cilk, Co-array Fortran, Fortress, Titanium, ...
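
As a generic illustration of two of these styles (not taken from the slides), here is the same vector addition written once with SSE intrinsics (library/intrinsics style) and once with an OpenMP directive.

/* Intrinsics style: the programmer spells out the 128-bit SIMD operations. */
#include <xmmintrin.h>   /* SSE intrinsics */

void vadd_sse(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++) c[i] = a[i] + b[i];       /* scalar remainder */
}

/* Directive style: a sequential loop annotated so the compiler/runtime
 * spreads the iterations over the cores of the node. */
void vadd_omp(const float *a, const float *b, float *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}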

Multi-Language Programming
- Happens when programmers need to deal with multiple programming languages, e.g. Fortran and CUDA, Java and OpenCL, ...
- It impacts programmer expertise, program maintenance and correctness, and long-term technology availability
- Performance programming versus domain-specific programming
- Libraries and parallel components should be provided to divide the issues

Manycore = Multiple µ-Architectures
- Each µ-architecture requires different code generation/optimization strategies: not one compiler in many cases
- High performance variance between implementations (ILP, GPcore/TLP, HWA)
- Dramatic effect of tuning: bad decisions have a strong effect on performance; efficiency is very input-parameter dependent; data transfers for HWAs add a lot of overhead
- How to organize the compilation flow?

CAPS Compiler Flow for Heterogeneous Targets
- Deals with various ISAs; not all code generation can be performed in the same framework
- The HMPP preprocessor splits the HMPP-annotated application into the main (host) function and the annotated native codelets
- The host code is built by a generic host compiler and linked against the HMPP runtime interface, producing the binary host application
- Each codelet goes through the HMPP template and target generators, and the resulting target codelet is compiled by the hardware vendor compiler into a dynamic library loaded by the HMPP runtime

Can the Hardware be Hidden?
- Programming style is usually hardware independent, but programmers need to take into account the available hardware resources
- Quantitative decisions are as important as parallel programming: performance is about quantity, and tuning is specific to a configuration
- Runtime adaptation is a key feature: algorithm and implementation choice, programming/computing decisions

Varying Available Resources
- Available hardware resources change over the execution time
- Not all resources are time-shared, e.g. a HWA may not be available
- Data affinity must be respected
- How to ensure that conflicts in resource usage do not lead to global performance degradation?

Difficult Decision Making with Alternative Codes (Multiversioning)
- Various implementations of routines are available or can be generated for a given target: CUBLAS, MKL, ATLAS, SIMD instructions, GPcore, HWA, hybrid, ...
- There is no strict performance order: each implementation has a different performance profile, and the best choice depends on the platform and on runtime parameters
- Decision making is a complex issue: how to produce the decision? (A minimal dispatch sketch follows.)
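
A hypothetical dispatch sketch, not CAPS' mechanism: several implementations of the same routine coexist and one is picked at run time from the problem size and the available hardware. The variant names and the size threshold are invented for illustration.

/* Hypothetical multiversioning sketch: runtime selection among variants. */
typedef void (*sgemm_fn)(int n, const float *a, const float *b, float *c);

/* Placeholder variants; real ones would call MKL/ATLAS/CUBLAS or an HWA codelet. */
static void sgemm_simd(int n, const float *a, const float *b, float *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++) acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}
static void sgemm_hwa(int n, const float *a, const float *b, float *c) {
    sgemm_simd(n, a, b, c);            /* stand-in: would run on the HWA */
}

/* Decide from runtime parameters: small problems stay on the CPU because
 * data-transfer overhead dominates; large ones go to the HWA if present.
 * The threshold 800 is a made-up tuning knob. */
static sgemm_fn select_sgemm(int n, int hwa_available) {
    return (hwa_available && n >= 800) ? sgemm_hwa : sgemm_simd;
}

void sgemm_dispatch(int n, const float *a, const float *b, float *c, int hwa_available) {
    select_sgemm(n, hwa_available)(n, a, b, c);
}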

Research Directions
- New languages: X10, Fortress, Chapel, PGAS languages, OpenCL, MS Axum, ...
- Libraries: ATLAS, MKL, Global Arrays, Spiral, telescoping languages, TBB, ...
- Compilers: the classical compiler flow needs to be revisited; acknowledge the lack of a static performance model; adaptive code generation
- OS: virtualization/hypervisors
- Architectures: integration of the accelerators on the chip (AMD Fusion, ...) to alleviate data transfer costs; PCIe Gen 3, ...; key for the short/mid term

HMPP Approach

HMPP Objectives
- Efficiently orchestrate CPU/GPU computations in legacy code, with OpenMP-like directives
- Automatically produce tunable manycore applications: a C and Fortran to CUDA data-parallel code generator
- Make use of available compilers to produce the binary
- Ease application deployment
- HMPP: a high-level abstraction for manycore programming

HMPP 1.5 Simple Example

#pragma hmpp sgemmlabel codelet, target=cuda, args[vout].io=inout
extern void sgemm( int m, int n, int k, float alpha,
                   const float vin1[n][n], const float vin2[n][n],
                   float beta, float vout[n][n] );

int main(int argc, char **argv) {
  for( j = 0 ; j < 2 ; j++ ) {
#pragma hmpp sgemmlabel callsite
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
  }
}

#pragma hmpp label codelet, target=cuda:brook, args[v1].io=out
#pragma hmpp label2 codelet, target=sse, args[v1].io=out, cond="n<800"
void MyCodelet(int n, float v1[n], float v2[n], float v3[n]) {
  int i;
  for (i = 0 ; i < n ; i++) {
    v1[i] = v2[i] + v3[i];
  }
}

Group of Codelets (HMPP 2.0)
- Declare a group of codelets to optimize data transfers
- Codelets can share variables
- Keep data on the GPU between two codelets and avoid useless data transfers
- Map arguments of different functions to the same GPU memory location (similar to a Fortran EQUIVALENCE declaration)
- Flexibility and performance

Optimizing Communications
- Exploit two properties: communication/computation overlap, and temporal locality of parameters
- Various techniques: advancedload and delegatedstore, constant parameters, resident data, actual argument mapping

Hiding Data Transfers
- Pipeline GPU kernel execution with the data transfers
- Split a single function call into two codelets (C1, C2) and overlap their execution with the data transfers
- [Figure: timelines comparing a sequential transfer-then-compute schedule of Vin/Min/Vout with a pipelined schedule where C1 and C2 alternate while the next Vin is transferred]
- A plain CUDA-streams sketch of the same overlap idea follows
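
The same overlap idea shown as a hypothetical plain-CUDA-streams sketch rather than HMPP codelets: chunk i+1 is transferred while chunk i computes. The kernel, chunk sizes, and names are invented, and the host buffers would need to be pinned (cudaMallocHost) for the asynchronous copies to actually overlap.

/* Illustrative sketch only: overlap transfers and compute with two streams. */
#include <cuda_runtime.h>

__global__ void scale(float *v, int n) {             /* stand-in for C1/C2 work */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

/* Assumes n is a multiple of nchunks and host buffers are pinned. */
void process_pipelined(const float *host_in, float *host_out, int n, int nchunks) {
    int chunk = n / nchunks;
    float *dev[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; b++) {                     /* two buffers, two streams */
        cudaMalloc(&dev[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int c = 0; c < nchunks; c++) {
        int b = c & 1;                                /* alternate buffers/streams */
        cudaMemcpyAsync(dev[b], host_in + c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        scale<<<(chunk + 255) / 256, 256, 0, s[b]>>>(dev[b], chunk);
        cudaMemcpyAsync(host_out + c * chunk, dev[b], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();                          /* wait for all chunks */
    for (int b = 0; b < 2; b++) {
        cudaStreamDestroy(s[b]);
        cudaFree(dev[b]);
    }
}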

Advancedload Directive
Avoid reloading constant data:

int main(int argc, char **argv) {
#pragma hmpp simple advancedload, args[v2], const
  for (j=0; j<n; j++) {
#pragma hmpp simple callsite, args[v2].advancedload=true
    simplefunc1(n, t1[j], t2, t3[j], alpha);
  }
#pragma hmpp simple release
}

Here t2, the actual argument corresponding to v2, is not reloaded at each loop iteration.

Actual Argument Mapping
- Allocate arguments of various codelets to the same memory space
- Allows exploiting argument reuse to reduce communications
- Close to EQUIVALENCE in Fortran

#pragma hmpp <mygp> group, target=cuda
#pragma hmpp <mygp> map, args[f1::inm; f2::inm]

#pragma hmpp <mygp> f1 codelet, args[outv].io=inout
static void matvec1(int sn, int sm, float inv[sn],
                    float inm[sn][sm], float outv[sm]) {
  ...
}

#pragma hmpp <mygp> f2 codelet, args[v2].io=inout
static void otherfunc2(int sn, int sm, float v2[sn],
                       float inm[sn][sm]) {
  ...
}

The inm arguments share the same space on the HWA.

HMPP Tuning

!$HMPP sgemm3 codelet, target=cuda, args[vout].io=inout
SUBROUTINE sgemm(m, n, k2, alpha, vin1, vin2, beta, vout)
  INTEGER, INTENT(IN)    :: m, n, k2
  REAL,    INTENT(IN)    :: alpha, beta
  REAL,    INTENT(IN)    :: vin1(n,n), vin2(n,n)
  REAL,    INTENT(INOUT) :: vout(n,n)
  REAL    :: prod
  INTEGER :: i, j, k
!$HMPPCG unroll(X), jam(2), noremainder
!$HMPPCG parallel
  DO j = 1, n
!$HMPPCG unroll(X), splitted, noremainder
!$HMPPCG parallel
    DO i = 1, n
      prod = 0.0
      DO k = 1, n
        prod = prod + vin1(i,k) * vin2(k,j)
      END DO
      vout(i,j) = alpha * prod + beta * vout(i,j)
    END DO
  END DO
END SUBROUTINE sgemm

Effect of the unroll factor X: with X > 8 the GPU compiler fails; X = 8 gives about 200 Gigaflops; X = 4 gives about 100 Gigaflops.

Conclusion
- Multicore ubiquity is going to have a large impact on the software industry: new applications, but many new issues
- Will one parallel model fit all? Surely not, but multi-language programming should be avoided
- Directive-based programming is a safe approach; ideally OpenMP will be extended to HWAs
- Toward adaptive parallel programming:
  - The compiler alone cannot solve it; it must interact with the runtime environment
  - Programming must help express global strategies and patterns
  - The compiler acts as a provider of basic implementations
  - Offline/online compilation has to be revisited