HETEROGENEOUS COMPUTING IN THE MODERN AGE. 1 HiPEAC January, 2012 Public



2 A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.
[Chart: single-thread performance vs. time, with a "we are here" marker]

3 A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.
Multi-Core Era. Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW scalability.
[Charts: single-thread performance vs. time; throughput performance vs. number of processors; "we are here" markers]

4 A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.
Multi-Core Era. Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW scalability.
Heterogeneous Systems Era. Enabled by: abundant data parallelism, power-efficient GPUs. Temporarily constrained by: programming models, communication overhead.
[Charts: single-thread performance vs. time; throughput performance vs. number of processors; modern application performance vs. data-parallel exploitation; "we are here" markers]

5 A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity. Programming: Assembly, C/C++, Java.
Multi-Core Era. Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW scalability. Programming: pthreads, OpenMP/TBB.
Heterogeneous Systems Era. Enabled by: abundant data parallelism, power-efficient GPUs. Temporarily constrained by: programming models, communication overhead. Programming: Shader, CUDA, OpenCL.
[Charts: single-thread performance vs. time; throughput performance vs. number of processors; modern application performance vs. data-parallel exploitation; "we are here" markers]

6 WHAT DOES THIS MEAN FROM AMD'S POINT OF VIEW?
Everyone will have seen AMD's efforts in developing the APU
A core feature of the current consumer product line
Cost and power reduction for mainstream users

7 APU: ACCELERATED PROCESSING UNIT
The APU has arrived and it is a great advance over AMD's previous platforms
Combines limited parallelism on the CPU with throughput computing on the GPU and high-bandwidth access to memory
How do we make it even better?
Easier to program
Easier to optimize
Easier to load balance
Higher performance
Lower power

8 WHAT ABOUT OTHER COMPANIES?
Obviously AMD isn't alone in this
Embedded systems are already highly integrated and heterogeneous, as any ARM representative will gladly remind us
HPC systems have been mixing node types for a while now
Roadrunner with its Cell-based accelerators
Something like three of the latest top-five supercomputers are GPU-accelerated
But what is the utilization like with such systems? Are all the components being used anywhere close to headline performance rates?

9 HETEROGENEITY IN THE APU

10 TAKING A REALISTIC LOOK AT THE APU
GPUs are not magic
We've often heard about 100x performance improvements
These are usually the result of poor CPU code
The APU offers a balance:
GPU cores optimized for arithmetic workloads and latency hiding
CPU cores to deal with the branchy code for which branch prediction and out-of-order execution are so valuable
A tight integration such that shorter, more tightly bound sections of code targeted at the different styles of device can be linked together
However, not without costs!

11 CPUS AND GPUS
Different design goals:
CPU design is based on maximizing performance of a single thread
GPU design aims to maximize throughput, at the cost of lower performance for each thread
CPU use of area: transistors are dedicated to branch prediction, out-of-order logic and caching to reduce latency to memory, and to allow efficient instruction prefetching and deep pipelines (fast clocks)
GPU use of area: transistors are concentrated on ALUs and registers; registers store thread state and allow fast switching between threads to cover (rather than reduce) latency

12 HIDING OR REDUCING LATENCY
Any instruction issued might stall waiting on memory, or at least in the computation pipeline
[Diagram: a SIMD operation followed by a stall]

13 HIDING OR REDUCING LATENCY
Any instruction issued might stall waiting on memory, or at least in the computation pipeline
Out-of-order logic attempts to issue as many instructions as possible, as tightly as possible, filling small stalls with other instructions from the same thread
[Diagram: SIMD operations with small stalls filled by out-of-order issue]

14 REDUCING MEMORY LATENCY ON THE CPU
Larger memory stalls are harder to cover
Memory can be some way from the ALU
Many cycles to access
[Diagram: lanes 0-3 execute instruction 0, stall on a memory access, then execute instruction 1]

15 REDUCING MEMORY LATENCY ON THE CPU
Larger memory stalls are harder to cover
Memory can be some way from the ALU
Many cycles to access
The CPU solution is to introduce an intermediate cache memory
Minimizes the latency of most accesses
Reduces stalls
[Diagram: a cache between the lanes and memory shortens the stall between instructions 0 and 1]

16 HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
Issue an instruction over multiple cycles
Interleave instructions from multiple threads
AMD Radeon HD 6970: 64-wide wavefronts interleaved on a 16-wide vector unit
[Diagram: wave lanes 0-15, 16-31, 32-47 and 48-63 mapped in turn onto SIMD lanes 0-15]

17 HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
Issue an instruction over multiple cycles
Interleave instructions from multiple threads
AMD Radeon HD 6970: 64-wide wavefronts interleaved on a 16-wide vector unit
Multi-cycling a large vector on a smaller vector unit:
Reduces instruction decode overhead
Improves throughput
[Diagram: wave 1 instruction 0 issued over the four 16-lane quarters, followed by wave 1 instruction 1]

18 HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
Issue an instruction over multiple cycles
Interleave instructions from multiple threads
AMD Radeon HD 6970: 64-wide wavefronts interleaved on a 16-wide vector unit
Multi-cycling a large vector on a smaller vector unit:
Reduces instruction decode overhead
Improves throughput
Multiple threads fill gaps in the instruction stream
[Diagram: wave 2 instruction 0 issued between wave 1 instructions 0 and 1]

19 HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
Issue an instruction over multiple cycles
Interleave instructions from multiple threads
AMD Radeon HD 6970: 64-wide wavefronts interleaved on a 16-wide vector unit
Multi-cycling a large vector on a smaller vector unit:
Reduces instruction decode overhead
Improves throughput
Multiple threads fill gaps in the instruction stream
Single-thread latency actually INCREASES
[Diagram: wave 2 instruction 0 issued between wave 1 instructions 0 and 1]

20 GPU CACHES
GPUs also have caches
The goal is generally to improve spatial locality rather than temporal locality
Different parts of a wide SIMD thread and different threads may require similar locations, and share through the cache
[Diagram: wave 1 instruction 0 accessing memory through the cache]

21 GPU CACHES
GPUs also have caches
The goal is generally to improve spatial locality rather than temporal locality
Different parts of a wide SIMD thread and different threads may require similar locations, and share through the cache
[Diagram: wave 1 instruction 0 and wave 2 instruction 0 sharing data through the cache]

22 COSTS
The CPU approach:
Requires large caches to maximize the number of memory operations caught
Requires considerable transistor logic to support the out-of-order control
The GPU approach:
Requires wide hardware vectors; not all code is easily vectorized
Requires considerable state storage to support active threads
Note: we need not pretend that OpenCL or CUDA are not vectorization; the entire point of the design is hand vectorization
These two approaches suit different algorithm designs, and they require different hardware implementations

23 THE TRADEOFFS IN PICTURES
AMD Phenom™ II X6: 6 cores, 4-way SIMD (ALUs)
A single set of registers
Registers are backed out to memory on a thread switch, and this is performed by the OS
AMD Radeon™ HD 6970: 24 cores, 16-way SIMD (plus VLIW issue, but the Phenom processor does multi-issue too), 64-wide SIMD state
Multiple register sets (somewhat dynamic): 8 to 16 threads per core
[Diagrams: die area split between instruction decode, register state and ALUs for each design]

24 COMBINING THE TWO
So what did we see? The diagrams were vague, but:
A large amount of orange! Lots of register state
Also more green on the GPU cores
The APU combines the two styles of core:
The E350 has two Bobcat cores and two Cedar-like cores, for example
2- and 8-wide physical SIMD

25 THINKING ABOUT PROGRAMMING

26 OPENCL, CUDA: ARE THESE GOOD MODELS?
Designed for wide data-parallel computation
Pretty low level
There is a lack of good scheduling and coherence control
We see cheating all the time: the lane-wise programming model only becomes efficient when we program it like the vector model it really is, making assumptions about wave or warp synchronicity
However: they're better than SSE!
We have proper scatter/gather memory access
The lane-wise programming does help: we still have to think about vectors, but it's much easier to do than in a vector language
We can even program lazily and pretend that a single work-item is a thread, and yet it still (sort of) works

27 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?

28 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array

29 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array
Block it based on the number of threads (one per core usually; maybe 4 or 8 cores)

30 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array
Block it based on the number of threads (one per core usually; maybe 4 or 8 cores)
Iterate to produce a sum in each block

float sum( 0 )
for( i = n to n + b )
    sum += input[i]

31 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array
Block it based on the number of threads (one per core usually; maybe 4 or 8 cores)
Iterate to produce a sum in each block
Reduce across threads

float sum( 0 )
for( i = n to n + b )
    sum += input[i]

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

32 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array
Block it based on the number of threads (one per core usually; maybe 4 or 8 cores)
Iterate to produce a sum in each block
Reduce across threads
Vectorize

float4 sum( 0, 0, 0, 0 )
for( i = n/4 to (n + b)/4 )
    sum += input[i]
float scalarsum = sum.x + sum.y + sum.z + sum.w

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

33 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?

34 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array

35 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)

36 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)
Iterate to produce a sum in each block

float sum( 0 )
for( i = n to n + b )
    sum += input[i]

37 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)
Iterate to produce a sum in each block
Reduce across threads

float sum( 0 )
for( i = n to n + b )
    sum += input[i]

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

38 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)
Iterate to produce a sum in each block
Reduce across threads
Vectorize (this bit may be a different kernel dispatch given current models)

float64 sum( 0, ..., 0 )
for( i = n/64 to (n + b)/64 )
    sum += input[i]
float scalarsum = wavereduce(sum)

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

39 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)
Iterate to produce a sum in each block
Reduce across threads
Vectorize (this bit may be a different kernel dispatch given current models)

float64 sum( 0, ..., 0 )
for( i = n/64 to (n + b)/64 )
    sum += input[i]
float scalarsum = wavereduce(sum)

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

Current models ease programming by viewing the vector as a set of scalar ALUs, apparently though not really independent, with varying degrees of hardware assistance (and hence overhead):

float sum( 0 )
for( i = n/64 to (n + b)/64; i += 64 )
    sum += input[i]
float scalarsum = wavereducevialocalmemory(sum)

40 THEY DON'T SEEM SO DIFFERENT!
More blocks of data, more cores, more threads, wider threads:
64-wide on high-end AMD GPUs
4/8-wide on current CPUs
Hard to develop efficiently for wide threads
Lots of state makes context switching and stacks problematic

CPU version:
float4 sum( 0, 0, 0, 0 )
for( i = n/4 to (n + b)/4 )
    sum += input[i]
float scalarsum = sum.x + sum.y + sum.z + sum.w

GPU version:
float64 sum( 0, ..., 0 )
for( i = n/64 to (n + b)/64 )
    sum += input[i]
float scalarsum = wavereduce(sum)

41 THAT WAS TRIVIAL. MORE GENERALLY, WHAT WORKS WELL?
On GPU cores:
We need a lot of data parallelism
Algorithms that can be mapped to multiple cores and multiple threads per core
Approaches that map efficiently to wide SIMD units
So a nice simple functional map operation is great!
[Diagram: Array.map(Op) applying Op independently to each element]
This is basically the OpenCL™ model

42 THAT WAS TRIVIAL. MORE GENERALLY, WHAT WORKS WELL?
On CPU cores:
Some data parallelism for multiple cores
Narrow SIMD units simplify the problem: pixels work fine, rather than data-parallel pixel clusters
Does AVX change this?
High clock rates and caches make serial execution efficient
So in addition to the simple map (which boils down to a for loop on the CPU) we can do complex task graphs
[Diagram: a graph of dependent Op nodes]

43 SO TO SUMMARIZE THAT
Fine-grain data-parallel code: maps very well to integrated SIMD dataflow (i.e. SSE). Loop 16 times for 16 pieces of data.
    i=0; i++; load x(i); fmul; store; cmp i (16); bc
Coarse-grain data-parallel code: maps very well to throughput-oriented data-parallel engines. Loop 1M times for 1M pieces of data.
    i=0; i++; load x(i); fmul; store; cmp i ( ); bc
Nested data-parallel code: lots of conditional data parallelism; benefits from closer coupling between CPU & GPU. 2D array representing a very large dataset.
    i,j=0; i++; j++; load x(i,j); fmul; store; cmp j (100000); bc; cmp i (100000); bc

44 SO TO SUMMARIZE THAT
Fine-grain data-parallel code: maps very well to integrated SIMD dataflow (i.e. SSE). Loop 16 times for 16 pieces of data.
    i=0; i++; load x(i); fmul; store; cmp i (16); bc
Coarse-grain data-parallel code: maps very well to throughput-oriented data-parallel engines. Loop 1M times for 1M pieces of data.
    i=0; i++; load x(i); fmul; store; cmp i ( ); bc
Nested data-parallel code: lots of conditional data parallelism; benefits from closer coupling between CPU & GPU. 2D array representing a very large dataset.
    i,j=0; i++; j++; load x(i,j); fmul; store; cmp j (100000); bc; cmp i (100000); bc
Discrete GPU configurations suffer from communication latency. Nested data-parallel/braided-parallel code benefits from close coupling, and discrete GPUs don't provide it well. But each individual core isn't great at certain types of algorithm.

45 CUE APU
Tight integration of narrow and wide vector kernels
Combination of high and low degrees of threading
Fast turnaround:
Negligible kernel launch time
Communication between kernels
Shared buffers
[Diagram: data shared between a CPU kernel and a GPU kernel]
For example:
Generating a tree structure on the CPU cores, processing the scene on the GPU cores
Mixed-scale particle simulations (see a later talk)


47 HOW DO WE USE THESE DEVICES?
Heterogeneous programming isn't easy
Particularly if you want performance
To date:
CPUs with visible vector ISAs
GPUs with mostly lane-wise (implicit vector) ISAs
Clunky, separate programming models with explicit data movement
How can we target both?
With a fair degree of efficiency
True shared memory with passable pointers
Let's talk about the future of programming models

48 PROGRAMMING MODELS IN THE PRESENT

49 INDUSTRY STANDARD API STRATEGY
OpenCL
Open development platform for multi-vendor heterogeneous architectures
The power of AMD Fusion: leverages CPUs and GPUs for a balanced system approach
Broad industry support: created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
Momentum: enthusiasm from mainstream developers and application software partners
C++AMP
Microsoft's C++AMP: Microsoft distribution
Part of Visual Studio 11
Based on C++11

50 Moving Past Proprietary Solutions for Ease of Cross-Platform Programming
[Diagram: software stack; high-level language compilers, open and custom tools, high-level tools and application-specific libraries sit above the industry-standard interfaces C++AMP, OpenCL and OpenGL, which target AMD GPUs, AMD CPUs and other CPUs/GPUs]
Cross-platform development
Interoperability with OpenGL and DX
CPU/GPU backends enable a balanced platform approach

51 TODAY'S EXECUTION MODEL
Single program multiple data (SPMD)
The same kernel runs on:
All compute units
All processing elements
Purely data-parallel mode
Device model:
The device runs a single kernel simultaneously
Separation between compute units is relevant for the memory model only

52 TODAY'S EXECUTION MODEL
Single program multiple data (SPMD)
The same kernel runs on:
All compute units
All processing elements
Purely data-parallel mode
Device model:
The device runs a single kernel simultaneously
Separation between compute units is relevant for the memory model only
[Diagram: a work-group (WG) made up of work-items (WI)]

53 TODAY'S EXECUTION MODEL
Single program multiple data (SPMD)
The same kernel runs on:
All compute units
All processing elements
Purely data-parallel mode
Device model:
The device runs a single kernel simultaneously
Separation between compute units is relevant for the memory model only
[Diagram: a work-group (WG) made up of work-items (WI)]
Modern CPUs & GPUs can support more!

54 MODERN GPU (& CPU) CAPABILITIES
Modern GPUs can execute a different instruction stream per core
Some even have a few HW threads per core (each runs a separate stream)
This is still a highly parallelized machine!
Each HW thread executes N-wide vector instructions (8-64 wide)
A scheduler switches HW threads on the fly to hide memory misses

55 MODERN GPU (& CPU) CAPABILITIES
Modern GPUs can execute a different instruction stream per core
Some even have a few HW threads per core (each runs a separate stream)
This is still a highly parallelized machine!
Each HW thread executes N-wide vector instructions (8-64 wide)
A scheduler switches HW threads on the fly to hide memory misses
[Diagram: four shader cores time-share the graphics pipeline stages: input assembly, vertex shading, geometry assembly, rasterization, pixel shading and output blending]

56 PERSISTENT THREADS
People emulate more flexible MPMD models using persistent threads
Each core executes a scheduler loop:
Takes tasks off a queue
Branches to the particular code for that task
This bypasses the hardware scheduler
Gives flexibility without necessarily significant overhead
Reduces the ability of the hardware to flexibly schedule for power reduction in the absence of context switching
The more cores we add, the worse this is
[Diagram: a queue of tasks A, B and C feeding a GPU core]
Ben will go into a bit more depth on this topic later



59 CUDA, OPENCL AND C++AMP
Heterogeneous programming with GPUs has become relatively widespread
Relatively easy access to the system through:
CUDA
OpenCL
C++AMP: access to DirectX 11 compute shaders
We have to be able to deal with the complexities of the programming model

60 THE WORLD IS MOVING FORWARD
These models are not static (obviously there would be serious downsides to that)
OpenCL 1.2 was recently released; 2.0 is in development and making good progress
NVIDIA has been adding features to CUDA with some persistence

61 C++AMP
Microsoft has decided on a single-source model called C++AMP
Sits on top of DirectX
Highly limited as a result of this (some way behind CUDA and OpenCL currently)
However, from an ease-of-access point of view there are clear benefits. For example:

void ampsquareexample(const std::vector<int> &in, std::vector<int> &out)
{
    concurrency::array_view<const int> avin(in.size(), in);
    concurrency::array_view<int> avout(out.size(), out);
    concurrency::parallel_for_each(avout.extent,
        [=](concurrency::index<1> idx) restrict(amp)
        {
            avout[idx] = avin[idx] * avin[idx];
        });
    avout.synchronize();
}

62 OPENACC
Looking at it from the OpenMP pragma-directive angle, we see OpenACC
Initially developed by PGI, NVIDIA, Cray and CAPS
Clear benefits for developers starting with C or Fortran source
Though it's a shame not to be integrated cleanly with the language and type system
The aim is to merge this technology with OpenMP; quoting from the OpenACC web site: "The intent is that the lessons learned from use and implementations of the OpenACC API will lead to a more complete and robust OpenMP heterogeneous computing standard"

#pragma acc parallel

63 HOW DO WE MOVE FORWARD?
We should concentrate on ways to abstract the algorithms over the features that differ
A difficult challenge!
For now we seem to be stuck with inefficient raw data parallelism or expertly coded SIMD algorithms
So, instead, how should we approach simplifying the model for the platform?
Assume we have the optimized kernels for the compute nodes

64 EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility, from poor to excellent]
Proprietary Drivers Era: graphics & proprietary driver-based APIs (CUDA, Brook+, etc.)
Adventurous programmers
Exploit early programmable shader cores in the GPU
Make your program look like graphics to the GPU

65 EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility, from poor to excellent]
Proprietary Drivers Era: graphics & proprietary driver-based APIs (CUDA, Brook+, etc.)
Adventurous programmers
Exploit early programmable shader cores in the GPU
Make your program look like graphics to the GPU
Standards Drivers Era: driver-based APIs (OpenCL, DirectCompute)
Expert programmers
C and C++ subsets
Compute-centric APIs and data types
Multiple address spaces with explicit data movement
Specialized work-queue-based structures
Kernel-mode dispatch

66 EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility, from poor to excellent]
Proprietary Drivers Era: graphics & proprietary driver-based APIs (CUDA, Brook+, etc.)
Adventurous programmers
Exploit early programmable shader cores in the GPU
Make your program look like graphics to the GPU
Standards Drivers Era: driver-based APIs (OpenCL, DirectCompute)
Expert programmers
C and C++ subsets
Compute-centric APIs and data types
Multiple address spaces with explicit data movement
Specialized work-queue-based structures
Kernel-mode dispatch
Architected Era: Heterogeneous System Architecture, GPU as a peer processor
Mainstream programmers
Full C++
GPU as a co-processor
Unified coherent address space
Task-parallel runtimes
Nested data-parallel programs
User-mode dispatch
Pre-emption and context switching

67 HETEROGENEOUS SYSTEM ARCHITECTURE
Physical Integration:
Integrate CPU & GPU in silicon
Unified memory controller
Common manufacturing technology

68 HETEROGENEOUS SYSTEM ARCHITECTURE
Physical Integration:
Integrate CPU & GPU in silicon
Unified memory controller
Common manufacturing technology
Optimized Platforms:
GPU compute C++ support
User-mode scheduling
Bi-directional power management between CPU and GPU

69 HETEROGENEOUS SYSTEM ARCHITECTURE
Physical Integration:
Integrate CPU & GPU in silicon
Unified memory controller
Common manufacturing technology
Optimized Platforms:
GPU compute C++ support
User-mode scheduling
Bi-directional power management between CPU and GPU
Architectural Integration:
Unified address space for CPU and GPU
GPU uses pageable system memory via CPU pointers
Fully coherent memory between CPU & GPU

70 HETEROGENEOUS SYSTEM ARCHITECTURE
Physical Integration:
Integrate CPU & GPU in silicon
Unified memory controller
Common manufacturing technology
Optimized Platforms:
GPU compute C++ support
User-mode scheduling
Bi-directional power management between CPU and GPU
Architectural Integration:
Unified address space for CPU and GPU
GPU uses pageable system memory via CPU pointers
Fully coherent memory between CPU & GPU
System Integration:
GPU compute context switch
GPU graphics pre-emption
Quality of service
Extend to discrete GPU

71 HETEROGENEOUS SYSTEM ARCHITECTURE: AN OPEN PLATFORM
Open architecture, published specifications:
HSAIL virtual ISA
HSA memory model
HSA dispatch
ISA-agnostic for both CPU and GPU

72 HSA INTERMEDIATE LAYER: HSAIL
HSAIL is a virtual ISA for parallel programs
Finalized to ISA by a JIT compiler or "Finalizer"
Explicitly parallel: designed for data-parallel programming
Support for exceptions, virtual functions, and other high-level language features
Syscall methods: GPU code can call directly to system services, IO, printf, etc.
Debugging support

73 HSA MEMORY MODEL
Designed to be compatible with the C++11, Java and .NET memory models
Relaxed-consistency memory model for parallel compute performance
Loads and stores can be re-ordered by the finalizer
Visibility controlled by:
Load.Acquire*, Load.Dep, Store.Release*
Barriers
(* sequentially consistent ordering)

74 HSA ENABLES TASK QUEUING RUNTIMES
A popular pattern for task- and data-parallel programming on SMP systems today
Characterized by:
A work queue per core
A runtime library that divides large loops into tasks and distributes them to queues
A work-stealing runtime that keeps the system balanced
HSA is designed to extend this pattern to run on heterogeneous systems

75 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. Advanced Micro Devices, Inc.


More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Anatomy of AMD s TeraScale Graphics Engine

Anatomy of AMD s TeraScale Graphics Engine Anatomy of AMD s TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream

More information

Scientific Computing on GPUs: GPU Architecture Overview

Scientific Computing on GPUs: GPU Architecture Overview Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11

More information

SIGGRAPH Briefing August 2014

SIGGRAPH Briefing August 2014 Copyright Khronos Group 2014 - Page 1 SIGGRAPH Briefing August 2014 Neil Trevett VP Mobile Ecosystem, NVIDIA President, Khronos Copyright Khronos Group 2014 - Page 2 Significant Khronos API Ecosystem Advances

More information

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO Desktop Telepresence Arrived! Sudha Valluru ViVu CEO 3 Desktop Telepresence Arrived! Video Collaboration market Telepresence Telepresence Cost Expensive Expensive HW HW Legacy Apps Interactivity ViVu CONFIDENTIAL

More information