The Heterogeneous Programming Jungle. Service d'Expérimentation et de Développement, Centre Inria Bordeaux Sud-Ouest

The Heterogeneous Programming Jungle
Service d'Expérimentation et de Développement, Centre Inria Bordeaux Sud-Ouest
June 19, 2012

Outline
1. Introduction
2. Heterogeneous System Zoo
3. Similarities
4. Programming Model Goal
5. Howto
6. Conclusion
Francois Rue - The Heterogeneous Programming Jungle, June 19, 2012

1. Introduction
Francois Rue - The Heterogeneous Programming Jungle, October 4, 2012

Motivations
- The main goal is to focus on heterogeneous programming.
- This presentation is based on an article written by Michael Wolfe, Compiler Engineer, The Portland Group, Inc.
- Several approaches have been developed to program heterogeneous systems... which one is the right approach?

Reminder
Why Moore? Step-by-step video...

Accelerators: Context
Main idea: heterogeneous system = normal system + coprocessor
Accelerator:
- specialized in one type of architecture
- exhibits internal parallelism

Accelerators: Objectives
Parallel programming is intrinsically hard:
- create parallel activities
- insert synchronisation between them
- manage data locality
Programming a heterogeneous system is more complex:
- manage concurrent activities between host and device(s)
- manage data locality between host and device(s)

2. Heterogeneous System Zoo

The range of heterogeneous systems: Most popular
Intel/AMD x86 host + NVIDIA GPUs (x86+GPU):
- 35 of the Top 500 supercomputers in the November 2011 list
- the GPU has its own memory and is connected to the host by PCI
- the host manages memory and kernels

The range of heterogeneous systems: Full AMD
AMD Opteron + AMD GPUs:
- NVIDIA GPU replaced by AMD FireStream
- ...another x86+GPU option

The range of heterogeneous systems: Full AMD, but...
AMD Opteron + AMD APU:
Figure: AMD APU
- integrated on the same chip
- physical memory shared... but partitioned

The range of heterogeneous systems: And now for Intel
Intel Core + Intel Ivy Bridge integrated GPU:
Figure: Intel Ivy Bridge
- on-chip GPU
- OpenCL programmable

The range of heterogeneous systems: Other?
- Intel Core + Intel MIC
- NVIDIA Denver: ARM + NVIDIA GPU
- Texas Instruments: ARM + TI DSPs
- Convey: Intel x86 + FPGA-implemented reconfigurable vector unit
- Tilera multicore or the Chinese FeiTeng FT64
- GP core + FPGA fabric
- IBM Power + Cell

The range of heterogeneous systems: Other?
In short...

3. Similarities

Surprising similarity: In general
- All the systems allow the attached device to execute asynchronously with the host.
- All the systems exhibit several levels of parallelism within the coprocessor:
  - the coprocessor has several execution units
  - each execution unit typically has SIMD or vector execution
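The two levels of parallelism listed above can be pictured as a strip-mined loop nest. The following is only a sketch under stated assumptions: `NUM_UNITS`, `SIMD_WIDTH` and the function name `saxpy_two_level` are hypothetical and do not come from the slides, and all loops run sequentially on the host here. On a real coprocessor the outer loop would be spread over the execution units and the innermost loop over the SIMD or vector lanes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
#define NUM_UNITS  4   /* execution units in the coprocessor */
#define SIMD_WIDTH 8   /* SIMD/vector lanes per execution unit */

/* y[i] += alpha * x[i], written as a two-level loop nest. */
void saxpy_two_level(size_t n, float alpha, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    /* Level 1: each execution unit takes strips in round-robin order. */
    for (size_t unit = 0; unit < NUM_UNITS; ++unit) {
        for (size_t base = unit * SIMD_WIDTH; base < n;
             base += (size_t)NUM_UNITS * SIMD_WIDTH) {
            /* Level 2: one strip, i.e. one vector operation's worth of lanes. */
            for (size_t lane = 0; lane < SIMD_WIDTH && base + lane < n; ++lane) {
                y[base + lane] += alpha * x[base + lane];
            }
        }
    }
}
```

Every strip of `SIMD_WIDTH` elements is visited exactly once, so the result matches a plain loop over `i`; only the iteration order differs.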

Surprising similarity: Same problem
- Devices process large blocks of data: memory latency is an issue when the dataset is larger than the cache
  - use large memory bandwidth
  - add multithreading
- Devices have their own path to memory
  - separate physical memory
  - partitioned physical memory

4. Programming Model Goal

Programming language: Main goal
A programming strategy that preserves:
- portability
- performance across all the devices
That is, a method that allows the application writer to write a program once and lets the compiler or runtime optimize for each target.

Programming language: Main goal
Two standards:
- high-level programming languages
  - Pascal, Fortran, C, C++, Java, etc.
  - the same program gives the same results on any number of different processors and operating systems
- vector computing
  - Cray, NEC, Fujitsu, IBM, Convex, etc.
  - vectorizing compilers generate pretty good vector code from the loops in your program
Vectorization advantages:
- compilers give feedback when they fail
- this feedback slowly trained the programmers
- the style of programming that vectorizing compilers promoted gave good performance across a wide range of machines

Programming language: Programming strategy
We need a programming strategy, model, or language:
- a style that will give good performance across a wide range of heterogeneous systems
- a set of coding rules that will allow compilers and tools to exploit the parallelism effectively

5. Howto

Programming language: Why it's hard...
- parallelism on the CPU
- dealing with an attached asynchronous device
- parallelism on the device(s)
- optimizing locality and synchronization
- managing the distinct host and device memory spaces
  - data movement problem
  - data distribution problem
  - load balancing issues
- taking advantage of the features of the coprocessor to get performance

Programming language: Some solutions
The challenge is: what to virtualize? What to expose?
- Vectorization with intrinsics, such as the Intel SSE intrinsics: no portability
- Vector library routines, such as BLAS or the C++ STL
- Vector (or array) extensions of the language, such as Fortran array syntax or Intel Array Notation for C

Some solutions: SSE
Compiling and vectorizing the following loop for SSE:

    do i = 1,n
      x = a(i) + b(i)
      c(i) = exp(x) + 1/x
    enddo

Figure: SSE register
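To make the SSE approach concrete, here is a minimal hand-vectorized sketch in C using SSE intrinsics. It is an illustration under stated assumptions, not the code a vectorizing compiler would emit: the function name `add_recip` is hypothetical, and only the `a(i) + b(i)` and `1/x` parts of the loop are covered, because SSE has no exponential instruction and a vectorized `exp` would have to come from a separate math library. The `#if defined(__SSE__)` guard with a scalar fallback is what the intrinsics version needs as soon as the target is not x86.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics, available on x86 targets only */
#endif

/* c[i] = 1 / (a[i] + b[i]), four floats at a time when SSE is available. */
void add_recip(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
#if defined(__SSE__)
    const __m128 one = _mm_set1_ps(1.0f);
    for (; i + 4 <= n; i += 4) {
        /* One 128-bit SSE register holds four single-precision values. */
        __m128 x = _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        _mm_storeu_ps(c + i, _mm_div_ps(one, x));
    }
#endif
    /* Scalar tail; also the whole loop on targets without SSE. */
    for (; i < n; ++i) {
        float x = a[i] + b[i];
        c[i] = 1.0f / x;
    }
}
```

The vector width (4 floats), the register type and the intrinsic names are all tied to one instruction set, so the same source would have to be rewritten for a wider vector unit or a different device.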

Some solutions: Array
Portability? The equivalent array code...

    x(:) = a(:) + b(:)
    c(:) = exp(x(:)) + 1/x(:)

Some solutions: Array
In Fortran and C, we've got:

    forall (i = 1:n)
      x(i) = a(i) + b(i)
      c(i) = exp(x(i)) + 1/x(i)
    end forall

- the whole right-hand side is computed first...
- ...then all the stores are done
- the compiler decides whether to fuse the two loops... or not

Some solutions: Array
For the forall version:
- Is the array x needed at all? At first, x was a scalar...
- At best: the generated code is as good as the vectorized loop
- Probably: more memory accesses are generated for large datasets, hence more cache misses
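The difference between the array-style evaluation and the fused loop can be sketched in C as follows. This is an illustration under stated assumptions: the function names are hypothetical, and `exp_stand_in` (a second-order Taylor polynomial) replaces `exp` only so that the sketch needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for exp(x): an assumption made to keep the sketch free of
   math-library dependencies, not part of the original loop. */
static float exp_stand_in(float x) { return 1.0f + x + 0.5f * x * x; }

/* Array-style evaluation: x is a full temporary array and memory is
   traversed twice, once per array statement. */
void array_style(size_t n, const float *a, const float *b, float *x, float *c)
{
    assert(a != NULL && b != NULL && x != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i)      /* x(:) = a(:) + b(:) */
        x[i] = a[i] + b[i];
    for (size_t i = 0; i < n; ++i)      /* c(:) = exp(x(:)) + 1/x(:) */
        c[i] = exp_stand_in(x[i]) + 1.0f / x[i];
}

/* Fused loop: x shrinks back to a scalar and memory is traversed once. */
void fused_loop(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        float x = a[i] + b[i];
        c[i] = exp_stand_in(x) + 1.0f / x;
    }
}
```

Both functions produce the same `c`, but `array_style` writes and re-reads `n` extra floats. When `n` exceeds the cache size, that extra traffic is where the additional cache misses come from, unless the compiler manages to fuse the two loops and eliminate the temporary.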

Programming model
The programming model should virtualize those aspects that are different among target systems.

Programming model zoo: What's up?
- some directive-based programming models
- OpenCL
- Microsoft C++ AMP
- Google RenderScript
- OpenACC (consortium...)
- StarPU...
"The two big challenges in parallel computing are getting it correct and getting it to scale, and Ct directly takes aim at both," said James Reinders (Director, Software Products and Multi-core Evangelist, Intel Corporation).
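As an illustration of what a directive-based model from the list above looks like, here is a minimal OpenACC-style sketch in C. It is only a sketch under stated assumptions: the function name `vec_add` is hypothetical and the loop is a plain vector addition rather than code from the presentation. The loop body stays ordinary C; the directive asks the compiler to offload it and states which arrays move between host and device memory.

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] + b[i], offloaded when built with an OpenACC compiler. */
void vec_add(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    /* copyin: host -> device before the loop;
       copyout: device -> host after the loop. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

A compiler without OpenACC support ignores the pragma and runs the loop on the host, so one source serves both cases. The data movement and the mapping onto the device's parallelism are left to the compiler and runtime instead of being written out by hand.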

6. Conclusion

Questions?
Inria Bordeaux Sud-Ouest, www.inria.fr