The Heterogeneous Programming Jungle
Service d'Expérimentation et de Développement, Centre Inria Bordeaux Sud-Ouest
June 19, 2012

Outline
1. Introduction
2. Heterogeneous System Zoo
3. Similarities
4. Programming Model Goal
5. Howto
6. Conclusion

Francois Rue - The Heterogeneous Programming Jungle June 19, 2012

1. Introduction

Francois Rue - The Heterogeneous Programming Jungle October 4, 2012

Motivations
- The main goal is to focus on heterogeneous programming.
- This presentation is based on an article written by Michael Wolfe, Compiler Engineer, The Portland Group, Inc.
- Several approaches have been developed to program heterogeneous systems... which is the right approach?

Reminder: why Moore?
Step-by-step video...

Accelerators: context
Main idea: heterogeneous system = normal system + coprocessor
An accelerator:
- is specialized in one type of architecture
- exhibits internal parallelism

Accelerators: objectives
Parallel programming is intrinsically hard:
- create parallel activities
- insert synchronization between them
- manage data locality
Programming a heterogeneous system is more complex still:
- manage concurrent activities between host and device(s)
- manage data locality between host and device(s)

2. Heterogeneous System Zoo

The range of heterogeneous systems: most popular
Intel/AMD x86 host + NVIDIA GPUs (x86+GPU):
- 35 of the Top 500 supercomputers in the November 2011 list
- the GPU has its own memory and is connected to the host over PCI Express
- the host manages memory and kernels

The range of heterogeneous systems: full AMD
AMD Opteron + AMD GPUs:
- the NVIDIA GPU is replaced by an AMD FireStream...
- another x86+GPU option

The range of heterogeneous systems: full AMD, but...
AMD Opteron + AMD APU:
[Figure: AMD APU]
- integrated on the same chip
- physical memory is shared... but partitioned

The range of heterogeneous systems: and now for Intel
Intel Core + Intel Ivy Bridge integrated GPU:
[Figure: Intel Ivy Bridge]
- on-chip GPU
- programmable with OpenCL

The range of heterogeneous systems: others?
- Intel Core + Intel MIC
- NVIDIA Denver: ARM + NVIDIA GPU
- Texas Instruments: ARM + TI DSPs
- Convey: Intel x86 + FPGA-implemented reconfigurable vector unit
- Tilera multicore, or the Chinese FeiTeng FT64
- General-purpose core + FPGA fabric
- IBM Power + Cell

The range of heterogeneous systems: others?
In short...

3. Similarities

Surprising similarity: in general
- All the systems allow the attached device to execute asynchronously with the host.
- All the systems exhibit several levels of parallelism within the coprocessor:
  - the coprocessor has several execution units
  - each execution unit typically has SIMD or vector execution

Surprising similarity: same problem
- Devices process large blocks of data, so memory latency matters: the dataset is larger than the cache
  - use large memory bandwidth
  - add multithreading
- Devices have their own path to memory
  - separate physical memory
  - partitioned physical memory

4. Programming Model Goal

Programming language: main goal
A programming strategy that preserves:
- portability
- performance across all the devices
A method that allows the application writer to write a program once, and lets the compiler or runtime optimize for each target.

Programming language: main goal
Two standards:
- high-level programming languages
  - Pascal, Fortran, C, C++, Java, etc.
  - the same program gives the same results on any number of different processors and operating systems
- vector computing
  - Cray, NEC, Fujitsu, IBM, Convex, etc.
  - vectorizing compilers generate pretty good vector code from the loops in your program
Vectorization advantages:
- compilers gave feedback when they failed
- this feedback slowly trained the programmer
- the style of programming that vectorizing compilers promoted gave good performance across a wide range of machines

Programming language: programming strategy
We need a programming strategy, model, or language:
- a style that will give good performance across a wide range of heterogeneous systems
- a set of coding rules that will allow compilers and tools to exploit the parallelism effectively

5. Howto

Programming language: why it's hard...
- parallelism on the CPU
- dealing with an attached asynchronous device
- parallelism on the device(s)
- optimizing locality and synchronization
- managing the distinct host and device memory spaces
  - data movement problem
  - data distribution problem
  - load balancing issues
- taking advantage of the features of the coprocessor to get this performance

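The data movement problem listed above can be illustrated with a small simulation in plain C. This is only a sketch under loud assumptions: there is no real device here, "device memory" is just a separate heap block, and every name (`device_buf`, `device_alloc`, `copy_to_device`, `copy_to_host`, `device_kernel_add`, `host_add`) is hypothetical and does not belong to any real accelerator API. It only shows the bookkeeping the host has to do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for device memory: a separate heap block. */
typedef struct { float *mem; size_t n; } device_buf;

static device_buf device_alloc(size_t n)
{
    device_buf d = { malloc(n * sizeof(float)), n };
    return d; /* d.mem is NULL if the allocation failed */
}

static void device_free(device_buf d) { free(d.mem); }

/* Explicit transfers: the host decides when data crosses the boundary. */
static void copy_to_device(device_buf d, const float *host)
{
    memcpy(d.mem, host, d.n * sizeof(float));
}

static void copy_to_host(float *host, device_buf d)
{
    memcpy(host, d.mem, d.n * sizeof(float));
}

/* The "kernel" only ever touches device buffers. */
static void device_kernel_add(device_buf out, device_buf x, device_buf y)
{
    assert(out.n == x.n && x.n == y.n);
    for (size_t i = 0; i < out.n; i++)
        out.mem[i] = x.mem[i] + y.mem[i];
}

/* Host-side driver: allocate, copy in, compute, copy out, release.
   Returns 0 on success, -1 if a device allocation failed. */
int host_add(size_t n, const float *a, const float *b, float *c)
{
    device_buf da = device_alloc(n), db = device_alloc(n), dc = device_alloc(n);
    int status = -1;
    if (da.mem && db.mem && dc.mem) {
        copy_to_device(da, a);
        copy_to_device(db, b);
        device_kernel_add(dc, da, db);
        copy_to_host(c, dc);
        status = 0;
    }
    device_free(da);
    device_free(db);
    device_free(dc);
    return status;
}
```

Even for a one-line computation, the host code is dominated by allocation and transfers, which is the part a programming model may choose to virtualize.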
Programming language: some solutions
The challenge is: what to virtualize? What to expose?
- Vector intrinsics, such as the Intel SSE intrinsics: no portability
- Vector library routines, such as BLAS or the C++ STL
- Vector (or array) extensions of the language, such as Fortran array syntax or Intel Array Notation for C

Some solutions: SSE
Compiling and vectorizing the following loop for SSE:

    do i = 1,n
      x = a(i) + b(i)
      c(i) = exp(x) + 1/x
    enddo

[Figure: SSE register]

Some solutions
Portability?

Some solutions: array
The equivalent array code...

    x(:) = a(:) + b(:)
    c(:) = exp(x(:)) + 1/x(:)

Some solutions: array
In Fortran and C, we've got:

    forall (i = 1:n) x(i) = a(i) + b(i)
    forall (i = 1:n) c(i) = exp(x(i)) + 1/x(i)

- the whole right-hand side is computed first...
- then all the stores are done
- the compiler decides whether to fuse the two loops... or not

Some solutions: array

    forall (i = 1:n) x(i) = a(i) + b(i)
    forall (i = 1:n) c(i) = exp(x(i)) + 1/x(i)

- Is the array x needed at all? At first, x was a scalar...
- At best: the generated code is as good as the vectorized loop
- Probably: more memory accesses are generated and, for large datasets, more cache misses

Solution... programming model
The programming model should virtualize those aspects that differ among target systems.

Programming model zoo: what's up?
- some directive-based programming models
- OpenCL
- Microsoft C++ AMP
- Google RenderScript
- OpenACC (consortium...)
- StarPU...
"The two big challenges in parallel computing are getting it correct and getting it to scale, and Ct directly takes aim at both," said James Reinders (Director, Software Products and Multi-core Evangelist, Intel Corporation).

6. Conclusion

Questions?
Inria Bordeaux Sud-Ouest
www.inria.fr