HETEROGENEOUS COMPUTING IN THE MODERN AGE. 1 HiPEAC January, 2012 Public



2 A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.
[Chart: single-thread performance vs. time, with a "we are here" marker]

3 A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.
Multi-Core Era. Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW scalability.
[Charts: single-thread performance vs. time; throughput performance vs. number of processors; "we are here" markers]

4 A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.
Multi-Core Era. Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW scalability.
Heterogeneous Systems Era. Enabled by: abundant data parallelism, power-efficient GPUs. Temporarily constrained by: programming models, communication overhead.
[Charts: single-thread performance vs. time; throughput performance vs. number of processors; modern application performance vs. data-parallel exploitation; "we are here" markers]

5 A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity. Programming: Assembly, C/C++, Java.
Multi-Core Era. Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW scalability. Programming: pthreads, OpenMP/TBB.
Heterogeneous Systems Era. Enabled by: abundant data parallelism, power-efficient GPUs. Temporarily constrained by: programming models, communication overhead. Programming: Shader, CUDA, OpenCL.
[Charts: single-thread performance vs. time; throughput performance vs. number of processors; modern application performance vs. data-parallel exploitation; "we are here" markers]

6 WHAT DOES THIS MEAN FROM AMD'S POINT OF VIEW?
Everyone will have seen AMD's efforts in developing the APU
A core feature of the current consumer product line
Cost and power reduction for mainstream users

7 APU: ACCELERATED PROCESSING UNIT
The APU has arrived and it is a great advance over AMD's previous platforms
Combines limited parallelism on the CPU with throughput computing on the GPU and high-bandwidth access to memory
How do we make it even better?
Easier to program
Easier to optimize
Easier to load balance
Higher performance
Lower power

8 WHAT ABOUT OTHER COMPANIES?
Obviously AMD isn't alone in this
Embedded systems are already highly integrated and heterogeneous, as any ARM representative will gladly remind us
HPC systems have been mixing node types for a while now
Roadrunner with its Cell-based accelerators
Something like three of the latest top-five supercomputers are GPU-accelerated
But what is the utilization like with such systems? Are all the components being used anywhere close to headline performance rates?

9 HETEROGENEITY IN THE APU

10 TAKING A REALISTIC LOOK AT THE APU
GPUs are not magic
We've often heard about 100x performance improvements
These are usually the result of poor CPU code
The APU offers a balance:
GPU cores optimized for arithmetic workloads and latency hiding
CPU cores to deal with the branchy code for which branch prediction and out-of-order execution are so valuable
A tight integration such that shorter, more tightly bound sections of code targeted at the different styles of device can be linked together
However, not without costs!

11 CPUS AND GPUS
Different design goals:
CPU design is based on maximizing performance of a single thread
GPU design aims to maximize throughput, at the cost of lower performance for each thread
CPU use of area: transistors are dedicated to branch prediction, out-of-order logic and caching to reduce latency to memory, and to allow efficient instruction prefetching and deep pipelines (fast clocks)
GPU use of area: transistors are concentrated on ALUs and registers; registers store thread state and allow fast switching between threads to cover (rather than reduce) latency

12 HIDING OR REDUCING LATENCY
Any instruction issued might stall waiting on memory, or at least in the computation pipeline
[Diagram: a SIMD operation followed by a stall]

13 HIDING OR REDUCING LATENCY
Any instruction issued might stall waiting on memory, or at least in the computation pipeline
Out-of-order logic attempts to issue as many instructions as possible, as tightly as possible, filling small stalls with other instructions from the same thread
[Diagram: SIMD operations with small stalls filled by out-of-order issue]

14 REDUCING MEMORY LATENCY ON THE CPU
Larger memory stalls are harder to cover
Memory can be some way from the ALU
Many cycles to access
[Diagram: lanes 0-3 execute instruction 0, stall on a memory access, then execute instruction 1]

15 REDUCING MEMORY LATENCY ON THE CPU
Larger memory stalls are harder to cover
Memory can be some way from the ALU
Many cycles to access
The CPU solution is to introduce an intermediate cache memory
Minimizes the latency of most accesses
Reduces stalls
[Diagram: a cache between the lanes and memory shortens the stall between instructions 0 and 1]

16 HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
Issue an instruction over multiple cycles
Interleave instructions from multiple threads
AMD Radeon HD 6970: 64-wide wavefronts interleaved on a 16-wide vector unit
[Diagram: wave lanes 0-15, 16-31, 32-47 and 48-63 mapped in turn onto SIMD lanes 0-15]

17 HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
Issue an instruction over multiple cycles
Interleave instructions from multiple threads
AMD Radeon HD 6970: 64-wide wavefronts interleaved on a 16-wide vector unit
Multi-cycling a large vector on a smaller vector unit:
Reduces instruction decode overhead
Improves throughput
[Diagram: wave 1 instruction 0 issued over the four 16-lane quarters, followed by wave 1 instruction 1]

18 HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
Issue an instruction over multiple cycles
Interleave instructions from multiple threads
AMD Radeon HD 6970: 64-wide wavefronts interleaved on a 16-wide vector unit
Multi-cycling a large vector on a smaller vector unit:
Reduces instruction decode overhead
Improves throughput
Multiple threads fill gaps in the instruction stream
[Diagram: wave 2 instruction 0 issued between wave 1 instructions 0 and 1]

19 HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
Issue an instruction over multiple cycles
Interleave instructions from multiple threads
AMD Radeon HD 6970: 64-wide wavefronts interleaved on a 16-wide vector unit
Multi-cycling a large vector on a smaller vector unit:
Reduces instruction decode overhead
Improves throughput
Multiple threads fill gaps in the instruction stream
Single-thread latency actually INCREASES
[Diagram: wave 2 instruction 0 issued between wave 1 instructions 0 and 1]

20 GPU CACHES
GPUs also have caches
The goal is generally to improve spatial locality rather than temporal locality
Different parts of a wide SIMD thread and different threads may require similar locations, and share through the cache
[Diagram: wave 1 instruction 0 accessing memory through the cache]

21 GPU CACHES
GPUs also have caches
The goal is generally to improve spatial locality rather than temporal locality
Different parts of a wide SIMD thread and different threads may require similar locations, and share through the cache
[Diagram: wave 1 instruction 0 and wave 2 instruction 0 sharing data through the cache]

22 COSTS
The CPU approach:
Requires large caches to maximize the number of memory operations caught
Requires considerable transistor logic to support the out-of-order control
The GPU approach:
Requires wide hardware vectors; not all code is easily vectorized
Requires considerable state storage to support active threads
Note: we need not pretend that OpenCL or CUDA are not vectorization; the entire point of the design is hand vectorization
These two approaches suit different algorithm designs, and they require different hardware implementations

23 THE TRADEOFFS IN PICTURES
AMD Phenom™ II X6: 6 cores, 4-way SIMD (ALUs)
A single set of registers
Registers are backed out to memory on a thread switch, and this is performed by the OS
AMD Radeon™ HD 6970: 24 cores, 16-way SIMD (plus VLIW issue, but the Phenom processor does multi-issue too), 64-wide SIMD state
Multiple register sets (somewhat dynamic): 8 to 16 threads per core
[Diagrams: die area split between instruction decode, register state and ALUs for each design]

24 COMBINING THE TWO
So what did we see? The diagrams were vague, but:
A large amount of orange! Lots of register state
Also more green on the GPU cores
The APU combines the two styles of core:
The E350 has two Bobcat cores and two Cedar-like cores, for example
2- and 8-wide physical SIMD

25 THINKING ABOUT PROGRAMMING

26 OPENCL, CUDA: ARE THESE GOOD MODELS?
Designed for wide data-parallel computation
Pretty low level
There is a lack of good scheduling and coherence control
We see cheating all the time: the lane-wise programming model only becomes efficient when we program it like the vector model it really is, making assumptions about wave or warp synchronicity
However: they're better than SSE!
We have proper scatter/gather memory access
The lane-wise programming does help: we still have to think about vectors, but it's much easier to do than in a vector language
We can even program lazily and pretend that a single work-item is a thread, and yet it still (sort of) works

27 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?

28 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array

29 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array
Block it based on the number of threads (one per core usually; maybe 4 or 8 cores)

30 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array
Block it based on the number of threads (one per core usually; maybe 4 or 8 cores)
Iterate to produce a sum in each block

float sum( 0 )
for( i = n to n + b )
    sum += input[i]

31 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array
Block it based on the number of threads (one per core usually; maybe 4 or 8 cores)
Iterate to produce a sum in each block
Reduce across threads

float sum( 0 )
for( i = n to n + b )
    sum += input[i]

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

32 THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array
Block it based on the number of threads (one per core usually; maybe 4 or 8 cores)
Iterate to produce a sum in each block
Reduce across threads
Vectorize

float4 sum( 0, 0, 0, 0 )
for( i = n/4 to (n + b)/4 )
    sum += input[i]
float scalarsum = sum.x + sum.y + sum.z + sum.w

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

33 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?

34 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array

35 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)

36 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)
Iterate to produce a sum in each block

float sum( 0 )
for( i = n to n + b )
    sum += input[i]

37 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)
Iterate to produce a sum in each block
Reduce across threads

float sum( 0 )
for( i = n to n + b )
    sum += input[i]

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

38 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)
Iterate to produce a sum in each block
Reduce across threads
Vectorize (this bit may be a different kernel dispatch given current models)

float64 sum( 0, ..., 0 )
for( i = n/64 to (n + b)/64 )
    sum += input[i]
float scalarsum = wavereduce(sum)

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

39 THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array
Block it based on the number of threads (8 or so per core usually, up to 24 cores)
Iterate to produce a sum in each block
Reduce across threads
Vectorize (this bit may be a different kernel dispatch given current models)

float64 sum( 0, ..., 0 )
for( i = n/64 to (n + b)/64 )
    sum += input[i]
float scalarsum = wavereduce(sum)

float reductionvalue( 0 )
for( t in threadcount )
    reductionvalue += t.sum

Current models ease programming by viewing the vector as a set of scalar ALUs, apparently though not really independent, with varying degrees of hardware assistance (and hence overhead):

float sum( 0 )
for( i = n/64 to (n + b)/64; i += 64 )
    sum += input[i]
float scalarsum = wavereducevialocalmemory(sum)

40 THEY DON'T SEEM SO DIFFERENT!
More blocks of data, more cores, more threads, wider threads:
64-wide on high-end AMD GPUs
4/8-wide on current CPUs
Hard to develop efficiently for wide threads
Lots of state makes context switching and stacks problematic

CPU version:
float4 sum( 0, 0, 0, 0 )
for( i = n/4 to (n + b)/4 )
    sum += input[i]
float scalarsum = sum.x + sum.y + sum.z + sum.w

GPU version:
float64 sum( 0, ..., 0 )
for( i = n/64 to (n + b)/64 )
    sum += input[i]
float scalarsum = wavereduce(sum)

41 THAT WAS TRIVIAL. MORE GENERALLY, WHAT WORKS WELL?
On GPU cores:
We need a lot of data parallelism
Algorithms that can be mapped to multiple cores and multiple threads per core
Approaches that map efficiently to wide SIMD units
So a nice simple functional map operation is great!
[Diagram: Array.map(Op) applying Op independently to each element]
This is basically the OpenCL™ model

42 THAT WAS TRIVIAL. MORE GENERALLY, WHAT WORKS WELL?
On CPU cores:
Some data parallelism for multiple cores
Narrow SIMD units simplify the problem: pixels work fine, rather than data-parallel pixel clusters
Does AVX change this?
High clock rates and caches make serial execution efficient
So in addition to the simple map (which boils down to a for loop on the CPU) we can do complex task graphs
[Diagram: a graph of dependent Op nodes]

43 SO TO SUMMARIZE THAT
Fine-grain data-parallel code: maps very well to integrated SIMD dataflow (i.e. SSE). Loop 16 times for 16 pieces of data.
    i=0; i++; load x(i); fmul; store; cmp i (16); bc
Coarse-grain data-parallel code: maps very well to throughput-oriented data-parallel engines. Loop 1M times for 1M pieces of data.
    i=0; i++; load x(i); fmul; store; cmp i ( ); bc
Nested data-parallel code: lots of conditional data parallelism; benefits from closer coupling between CPU & GPU. 2D array representing a very large dataset.
    i,j=0; i++; j++; load x(i,j); fmul; store; cmp j (100000); bc; cmp i (100000); bc

44 SO TO SUMMARIZE THAT
Fine-grain data-parallel code: maps very well to integrated SIMD dataflow (i.e. SSE). Loop 16 times for 16 pieces of data.
    i=0; i++; load x(i); fmul; store; cmp i (16); bc
Coarse-grain data-parallel code: maps very well to throughput-oriented data-parallel engines. Loop 1M times for 1M pieces of data.
    i=0; i++; load x(i); fmul; store; cmp i ( ); bc
Nested data-parallel code: lots of conditional data parallelism; benefits from closer coupling between CPU & GPU. 2D array representing a very large dataset.
    i,j=0; i++; j++; load x(i,j); fmul; store; cmp j (100000); bc; cmp i (100000); bc
Discrete GPU configurations suffer from communication latency. Nested data-parallel/braided-parallel code benefits from close coupling, and discrete GPUs don't provide it well. But each individual core isn't great at certain types of algorithm.

45 CUE APU
Tight integration of narrow and wide vector kernels
Combination of high and low degrees of threading
Fast turnaround:
Negligible kernel launch time
Communication between kernels
Shared buffers
[Diagram: data shared between a CPU kernel and a GPU kernel]
For example:
Generating a tree structure on the CPU cores, processing the scene on the GPU cores
Mixed-scale particle simulations (see a later talk)


47 HOW DO WE USE THESE DEVICES?
Heterogeneous programming isn't easy
Particularly if you want performance
To date:
CPUs with visible vector ISAs
GPUs with mostly lane-wise (implicit vector) ISAs
Clunky, separate programming models with explicit data movement
How can we target both?
With a fair degree of efficiency
True shared memory with passable pointers
Let's talk about the future of programming models

48 PROGRAMMING MODELS IN THE PRESENT

49 INDUSTRY STANDARD API STRATEGY
OpenCL
Open development platform for multi-vendor heterogeneous architectures
The power of AMD Fusion: leverages CPUs and GPUs for a balanced system approach
Broad industry support: created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
Momentum: enthusiasm from mainstream developers and application software partners
C++AMP
Microsoft's C++AMP: Microsoft distribution
Part of Visual Studio 11
Based on C++11

50 Moving Past Proprietary Solutions for Ease of Cross-Platform Programming
[Diagram: software stack; high-level language compilers, open and custom tools, high-level tools and application-specific libraries sit above the industry-standard interfaces C++AMP, OpenCL and OpenGL, which target AMD GPUs, AMD CPUs and other CPUs/GPUs]
Cross-platform development
Interoperability with OpenGL and DX
CPU/GPU backends enable a balanced platform approach

51 TODAY'S EXECUTION MODEL
Single program multiple data (SPMD)
The same kernel runs on:
All compute units
All processing elements
Purely data-parallel mode
Device model:
The device runs a single kernel simultaneously
Separation between compute units is relevant for the memory model only

52 TODAY'S EXECUTION MODEL
Single program multiple data (SPMD)
The same kernel runs on:
All compute units
All processing elements
Purely data-parallel mode
Device model:
The device runs a single kernel simultaneously
Separation between compute units is relevant for the memory model only
[Diagram: a work-group (WG) made up of work-items (WI)]

53 TODAY'S EXECUTION MODEL
Single program multiple data (SPMD)
The same kernel runs on:
All compute units
All processing elements
Purely data-parallel mode
Device model:
The device runs a single kernel simultaneously
Separation between compute units is relevant for the memory model only
[Diagram: a work-group (WG) made up of work-items (WI)]
Modern CPUs & GPUs can support more!

54 MODERN GPU (& CPU) CAPABILITIES
Modern GPUs can execute a different instruction stream per core
Some even have a few HW threads per core (each runs a separate stream)
This is still a highly parallelized machine!
Each HW thread executes N-wide vector instructions (8-64 wide)
A scheduler switches HW threads on the fly to hide memory misses

55 MODERN GPU (& CPU) CAPABILITIES
Modern GPUs can execute a different instruction stream per core
Some even have a few HW threads per core (each runs a separate stream)
This is still a highly parallelized machine!
Each HW thread executes N-wide vector instructions (8-64 wide)
A scheduler switches HW threads on the fly to hide memory misses
[Diagram: four shader cores time-share the graphics pipeline stages: input assembly, vertex shading, geometry assembly, rasterization, pixel shading and output blending]

56 PERSISTENT THREADS
People emulate more flexible MPMD models using persistent threads
Each core executes a scheduler loop:
Takes tasks off a queue
Branches to the particular code for that task
This bypasses the hardware scheduler
Gives flexibility without necessarily significant overhead
Reduces the ability of the hardware to flexibly schedule for power reduction in the absence of context switching
The more cores we add, the worse this is
[Diagram: a queue of tasks A, B and C feeding a GPU core]
Ben will go into a bit more depth on this topic later



59 CUDA, OPENCL AND C++AMP
Heterogeneous programming with GPUs has become relatively widespread
Relatively easy access to the system through:
CUDA
OpenCL
C++AMP: access to DirectX 11 compute shaders
We have to be able to deal with the complexities of the programming model

60 THE WORLD IS MOVING FORWARD
These models are not static (obviously there would be serious downsides to that)
OpenCL 1.2 was recently released; 2.0 is in development and making good progress
NVIDIA has been adding features to CUDA with some persistence

61 C++AMP
Microsoft has decided on a single-source model called C++AMP
Sits on top of DirectX
Highly limited as a result of this (some way behind CUDA and OpenCL currently)
However, from an ease-of-access point of view there are clear benefits. For example:

void ampsquareexample(const std::vector<int> &in, std::vector<int> &out)
{
    concurrency::array_view<const int> avin(in.size(), in);
    concurrency::array_view<int> avout(out.size(), out);
    concurrency::parallel_for_each(avout.extent,
        [=](concurrency::index<1> idx) restrict(amp)
        {
            avout[idx] = avin[idx] * avin[idx];
        });
    avout.synchronize();
}

62 OPENACC
Looking at it from the OpenMP pragma-directive angle, we see OpenACC
Initially developed by PGI, NVIDIA, Cray and CAPS
Clear benefits for developers starting with C or Fortran source
Though it's a shame not to be integrated cleanly with the language and type system
The aim is to merge this technology with OpenMP; quoting from the OpenACC web site: "The intent is that the lessons learned from use and implementations of the OpenACC API will lead to a more complete and robust OpenMP heterogeneous computing standard"

#pragma acc parallel

63 HOW DO WE MOVE FORWARD?
We should concentrate on ways to abstract the algorithms over the features that differ
A difficult challenge!
For now we seem to be stuck with inefficient raw data parallelism or expertly coded SIMD algorithms
So, instead, how should we approach simplifying the model for the platform?
Assume we have the optimized kernels for the compute nodes

64 EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility, from poor to excellent]
Proprietary Drivers Era: graphics & proprietary driver-based APIs (CUDA, Brook+, etc.)
Adventurous programmers
Exploit early programmable shader cores in the GPU
Make your program look like graphics to the GPU

65 EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility, from poor to excellent]
Proprietary Drivers Era: graphics & proprietary driver-based APIs (CUDA, Brook+, etc.)
Adventurous programmers
Exploit early programmable shader cores in the GPU
Make your program look like graphics to the GPU
Standards Drivers Era: driver-based APIs (OpenCL, DirectCompute)
Expert programmers
C and C++ subsets
Compute-centric APIs and data types
Multiple address spaces with explicit data movement
Specialized work-queue-based structures
Kernel-mode dispatch

66 EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility, from poor to excellent]
Proprietary Drivers Era: graphics & proprietary driver-based APIs (CUDA, Brook+, etc.)
Adventurous programmers
Exploit early programmable shader cores in the GPU
Make your program look like graphics to the GPU
Standards Drivers Era: driver-based APIs (OpenCL, DirectCompute)
Expert programmers
C and C++ subsets
Compute-centric APIs and data types
Multiple address spaces with explicit data movement
Specialized work-queue-based structures
Kernel-mode dispatch
Architected Era: Heterogeneous System Architecture, GPU as a peer processor
Mainstream programmers
Full C++
GPU as a co-processor
Unified coherent address space
Task-parallel runtimes
Nested data-parallel programs
User-mode dispatch
Pre-emption and context switching

67 HETEROGENEOUS SYSTEM ARCHITECTURE
Physical Integration:
Integrate CPU & GPU in silicon
Unified memory controller
Common manufacturing technology

68 HETEROGENEOUS SYSTEM ARCHITECTURE
Physical Integration:
Integrate CPU & GPU in silicon
Unified memory controller
Common manufacturing technology
Optimized Platforms:
GPU compute C++ support
User-mode scheduling
Bi-directional power management between CPU and GPU

69 HETEROGENEOUS SYSTEM ARCHITECTURE
Physical Integration:
Integrate CPU & GPU in silicon
Unified memory controller
Common manufacturing technology
Optimized Platforms:
GPU compute C++ support
User-mode scheduling
Bi-directional power management between CPU and GPU
Architectural Integration:
Unified address space for CPU and GPU
GPU uses pageable system memory via CPU pointers
Fully coherent memory between CPU & GPU

70 HETEROGENEOUS SYSTEM ARCHITECTURE
Physical Integration:
Integrate CPU & GPU in silicon
Unified memory controller
Common manufacturing technology
Optimized Platforms:
GPU compute C++ support
User-mode scheduling
Bi-directional power management between CPU and GPU
Architectural Integration:
Unified address space for CPU and GPU
GPU uses pageable system memory via CPU pointers
Fully coherent memory between CPU & GPU
System Integration:
GPU compute context switch
GPU graphics pre-emption
Quality of service
Extend to discrete GPU

71 HETEROGENEOUS SYSTEM ARCHITECTURE: AN OPEN PLATFORM
Open architecture, published specifications:
HSAIL virtual ISA
HSA memory model
HSA dispatch
ISA-agnostic for both CPU and GPU

72 HSA INTERMEDIATE LAYER: HSAIL
HSAIL is a virtual ISA for parallel programs
Finalized to ISA by a JIT compiler or "Finalizer"
Explicitly parallel: designed for data-parallel programming
Support for exceptions, virtual functions, and other high-level language features
Syscall methods: GPU code can call directly to system services, IO, printf, etc.
Debugging support

73 HSA MEMORY MODEL
Designed to be compatible with the C++11, Java and .NET memory models
Relaxed-consistency memory model for parallel compute performance
Loads and stores can be re-ordered by the finalizer
Visibility controlled by:
Load.Acquire*, Load.Dep, Store.Release*
Barriers
(* sequentially consistent ordering)

74 HSA ENABLES TASK QUEUING RUNTIMES
A popular pattern for task- and data-parallel programming on SMP systems today
Characterized by:
A work queue per core
A runtime library that divides large loops into tasks and distributes them to queues
A work-stealing runtime that keeps the system balanced
HSA is designed to extend this pattern to run on heterogeneous systems

75 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. Advanced Micro Devices, Inc.


More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Anatomy of AMD s TeraScale Graphics Engine

Anatomy of AMD s TeraScale Graphics Engine Anatomy of AMD s TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream

More information

Scientific Computing on GPUs: GPU Architecture Overview

Scientific Computing on GPUs: GPU Architecture Overview Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11

More information

SIGGRAPH Briefing August 2014

SIGGRAPH Briefing August 2014 Copyright Khronos Group 2014 - Page 1 SIGGRAPH Briefing August 2014 Neil Trevett VP Mobile Ecosystem, NVIDIA President, Khronos Copyright Khronos Group 2014 - Page 2 Significant Khronos API Ecosystem Advances

More information

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO Desktop Telepresence Arrived! Sudha Valluru ViVu CEO 3 Desktop Telepresence Arrived! Video Collaboration market Telepresence Telepresence Cost Expensive Expensive HW HW Legacy Apps Interactivity ViVu CONFIDENTIAL

More information