HETEROGENEOUS COMPUTING IN THE MODERN AGE. HiPEAC, January 2012. Public.
Slides 2-5: A NEW ERA OF PROCESSOR PERFORMANCE (progressive build across four slides)
Single-Core Era (chart: single-thread performance vs. time)
- Enabled by: Moore's Law, voltage scaling
- Constrained by: power, complexity
Multi-Core Era (chart: throughput performance vs. time, i.e. number of processors)
- Enabled by: Moore's Law, SMP architecture
- Constrained by: power, parallel software scalability
Heterogeneous Systems Era (chart: modern application performance vs. time, i.e. data-parallel exploitation)
- Enabled by: abundant data parallelism, power-efficient GPUs
- Temporarily constrained by: programming models, communication overhead
Programming models over the eras: Assembly -> C/C++ -> Java; pthreads -> OpenMP/TBB; Shader -> CUDA -> OpenCL
(Each chart carries a "we are here" marker.)
Slide 6: WHAT DOES THIS MEAN FROM AMD'S POINT OF VIEW?
- Everyone will have seen AMD's efforts in developing the APU
- A core feature of the current consumer product line
- Cost and power reduction for mainstream users
Slide 7: APU: ACCELERATED PROCESSING UNIT
- The APU has arrived, and it is a great advance over AMD's previous platforms
- Combines limited parallelism on the CPU with throughput computing on the GPU, and high-bandwidth access to memory
- How do we make it even better? Easier to program, easier to optimize, easier to load balance, higher performance, lower power
Slide 8: WHAT ABOUT OTHER COMPANIES?
- Obviously AMD isn't alone in this
- Embedded systems are already highly integrated and heterogeneous, as any ARM representative will gladly remind us
- HPC systems have been mixing node types for a while now: Roadrunner with its Cell-based accelerators, and something like 3 of the latest top-5 supercomputer list are GPU-accelerated
- But what is the utilization like with such systems? Are all the components being used anywhere close to headline performance rates?
Slide 9: HETEROGENEITY IN THE APU (section divider)
Slide 10: TAKING A REALISTIC LOOK AT THE APU
- GPUs are not magic. We've often heard about 100x performance improvements; these are usually the result of poor CPU code
- The APU offers a balance: GPU cores optimized for arithmetic workloads and latency hiding, and CPU cores to deal with the branchy code for which branch prediction and out-of-order execution are so valuable
- A tight integration, such that shorter, more tightly bound sections of code targeted at the different styles of device can be linked together
- However, not without costs!
Slide 11: CPUS AND GPUS
Different design goals:
- CPU design is based on maximizing the performance of a single thread
- GPU design aims to maximize throughput, at the cost of lower performance for each thread
CPU use of area: transistors are dedicated to branch prediction, out-of-order logic, and caching to reduce latency to memory, and to allow efficient instruction prefetching and deep pipelines (fast clocks)
GPU use of area: transistors are concentrated on ALUs and registers; registers store thread state and allow fast switching between threads to cover (rather than reduce) latency
Slides 12-13: HIDING OR REDUCING LATENCY
- Any instruction issued might stall waiting on memory, or at least in the computation pipeline
- Out-of-order logic attempts to issue as many instructions as possible, as tightly as possible, filling small stalls with other instructions from the same thread
(Diagram: a SIMD operation timeline with stalls)
Slides 14-15: REDUCING MEMORY LATENCY ON THE CPU
- Larger memory stalls are harder to cover: memory can be some way from the ALU, taking many cycles to access
- The CPU solution is to introduce an intermediate cache memory, which minimizes the latency of most accesses and reduces stalls
(Diagram: lanes 0-3 executing instructions 0 and 1 around a memory stall, with and without the cache)
Slides 16-19: HIDING LATENCY ON THE GPU
AMD's GPU designs hide latency in two main ways:
- Issue an instruction over multiple cycles
- Interleave instructions from multiple threads
Example: the AMD Radeon HD 6970 interleaves 64-wide wavefronts on a 16-wide vector unit
- Multi-cycling a large vector on a smaller vector unit reduces instruction decode overhead and improves throughput
- Multiple threads fill gaps in the instruction stream
- But note: single-thread latency actually INCREASES
(Diagram: wave lanes 0-63 mapped onto SIMD lanes 0-15, with wave 2 instruction 0 issued between wave 1 instructions 0 and 1)
Slides 20-21: GPU CACHES
- GPUs also have caches; the goal is generally to improve spatial locality rather than temporal locality
- Different parts of a wide SIMD thread, and different threads, may require similar locations and can share through the cache
(Diagram: wave 1 and wave 2 accessing memory through a shared cache)
Slide 22: COSTS
The CPU approach:
- Requires large caches to maximize the number of memory operations caught
- Requires considerable transistor logic to support the out-of-order control
The GPU approach:
- Requires wide hardware vectors; not all code is easily vectorized
- Requires considerable state storage to support active threads
- Note: we need not pretend that OpenCL or CUDA are NOT vectorization; the entire point of the design is hand vectorization
These two approaches suit different algorithm designs, and they require different hardware implementations
Slide 23: THE TRADEOFFS IN PICTURES
- AMD Phenom II X6: 6 cores, 4-way SIMD (ALUs), a single set of registers; register state is backed out to memory on a thread switch, and this is performed by the OS
- AMD Radeon HD 6970: 24 cores, 16-way SIMD (plus VLIW issue, but the Phenom processor does multi-issue too), 64-wide SIMD state; multiple register sets (somewhat dynamic); 8-16 threads per core
(Diagram legend: instruction decode, register state, ALUs)
Slide 24: COMBINING THE TWO
So what did we see? The diagrams were vague, but:
- A large amount of orange (lots of register state) on the GPU cores, and also more green (ALUs)
- The APU combines the two styles of core: the E-350 has two Bobcat cores and two Cedar-like cores, for example, with 2- and 8-wide physical SIMD
Slide 25: THINKING ABOUT PROGRAMMING (section divider)
Slide 26: OPENCL, CUDA: ARE THESE GOOD MODELS?
- Designed for wide data-parallel computation, and pretty low level
- There is a lack of good scheduling and coherence control
- We see cheating all the time: the lane-wise programming model only becomes efficient when we program it like the vector model it really is, making assumptions about wave or warp synchronicity
However, they're better than SSE!
- We have proper scatter/gather memory access
- The lane-wise programming does help: we still have to think about vectors, but it's much easier to do than in a vector language
- We can even program lazily and pretend that a single work-item is a thread, and it still (sort of) works
Slides 27-32: THE CPU PROGRAMMATICALLY: A TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a CPU?
- Take an input array
- Block it based on the number of threads (one per core usually, maybe 4 or 8 cores)
- Iterate to produce a sum in each block:
    float sum( 0 )
    for( i = n to n + b )
        sum += input[i]
- Reduce across threads:
    float reductionValue( 0 )
    for( t in threadCount )
        reductionValue += t.sum
- Vectorize the per-block loop:
    float4 sum( 0, 0, 0, 0 )
    for( i = n/4 to (n + b)/4 )
        sum += input[i]
    float scalarSum = sum.x + sum.y + sum.z + sum.w
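The blocked CPU reduction sketched above can be written as a short runnable program. This is a host-side illustration only: `std::thread` stands in for whatever threading runtime the slides assume, and the `float4` lane sum is emulated with four plain accumulators rather than SSE intrinsics.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Blocked per-thread reduction: split the input into one block per thread,
// sum each block locally (with four accumulators mimicking a float4),
// then reduce the partial sums on the main thread.
float blockedSum(const std::vector<float> &input, unsigned threadCount) {
    std::vector<float> partial(threadCount, 0.0f);
    std::vector<std::thread> threads;
    const size_t block = input.size() / threadCount;
    for (unsigned t = 0; t < threadCount; ++t) {
        threads.emplace_back([&, t] {
            size_t begin = t * block;
            size_t end = (t == threadCount - 1) ? input.size() : begin + block;
            float lane[4] = {0, 0, 0, 0};   // emulated float4 accumulator
            size_t i = begin;
            for (; i + 4 <= end; i += 4)
                for (int l = 0; l < 4; ++l) lane[l] += input[i + l];
            float sum = lane[0] + lane[1] + lane[2] + lane[3];
            for (; i < end; ++i) sum += input[i];  // leftover tail elements
            partial[t] = sum;
        });
    }
    for (auto &th : threads) th.join();
    // Final cross-thread reduction, as in the last step on the slide.
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```

With 1024 ones and 4 threads this returns exactly 1024.0f; a production version would replace the inner loop with SSE/AVX intrinsics.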
Slides 33-39: THE GPU PROGRAMMATICALLY: THE SAME TRIVIAL EXAMPLE
What's the fastest way to perform an associative reduction across an array on a GPU?
- Take an input array
- Block it based on the number of threads (8 or so per core, usually, on up to 24 cores)
- Iterate to produce a sum in each block:
    float sum( 0 )
    for( i = n to n + b )
        sum += input[i]
- Reduce across threads:
    float reductionValue( 0 )
    for( t in threadCount )
        reductionValue += t.sum
- Vectorize (this bit may be a different kernel dispatch given current models):
    float64 sum( 0, ..., 0 )
    for( i = n/64 to (n + b)/64 )
        sum += input[i]
    float scalarSum = waveReduce(sum)
- Current models ease programming by viewing the vector as a set of scalar ALUs, apparently (though not really) independent, with varying degrees of hardware assistance (and hence overhead):
    float sum( 0 )
    for( i = n/64 to (n + b)/64; i += 64 )
        sum += input[i]
    float scalarSum = waveReduceViaLocalMemory(sum)
Slide 40: THEY DON'T SEEM SO DIFFERENT!
- More blocks of data, more cores, more threads, wider threads
- 64-wide on high-end AMD GPUs; 4- or 8-wide on current CPUs
- Hard to develop efficiently for wide threads
- Lots of state makes context switching and stacks problematic
CPU version:
    float4 sum( 0, 0, 0, 0 )
    for( i = n/4 to (n + b)/4 )
        sum += input[i]
    float scalarSum = sum.x + sum.y + sum.z + sum.w
GPU version:
    float64 sum( 0, ..., 0 )
    for( i = n/64 to (n + b)/64 )
        sum += input[i]
    float scalarSum = waveReduce(sum)
Slide 41: THAT WAS TRIVIAL. MORE GENERALLY, WHAT WORKS WELL?
On GPU cores:
- We need a lot of data parallelism
- Algorithms that can be mapped to multiple cores and multiple threads per core
- Approaches that map efficiently to wide SIMD units
- So a nice simple functional map operation is great: Array.map(Op)
- This is basically the OpenCL model
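The `Array.map(Op)` pattern on the slide is one independent `Op` per element; in OpenCL each element would be one work-item, and the host-side C++ equivalent is `std::transform`. A minimal sketch (the squaring `Op` is chosen purely for illustration):

```cpp
#include <algorithm>
#include <vector>

// The slide's Array.map(Op) pattern: apply one independent Op per element.
// On a GPU each element would be a work-item; on the host, std::transform
// expresses the same shape.
std::vector<int> mapSquare(const std::vector<int> &in) {
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [](int x) { return x * x; });  // the per-element Op
    return out;
}
```

Because every `Op` is independent, the same code parallelizes trivially across cores and SIMD lanes, which is exactly why the map is "great" for GPU cores.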
Slide 42: THAT WAS TRIVIAL. MORE GENERALLY, WHAT WORKS WELL?
On CPU cores:
- Some data parallelism for multiple cores
- Narrow SIMD units simplify the problem: pixels work fine, rather than data-parallel pixel clusters (does AVX change this?)
- High clock rates and caches make serial execution efficient
- So in addition to the simple map (which boils down to a for loop on the CPU), we can do complex task graphs
Slide 43: SO TO SUMMARIZE THAT
- Fine-grain data-parallel code: maps very well to integrated SIMD dataflow (i.e. SSE); loop 16 times for 16 pieces of data
- Nested data-parallel code: lots of conditional data parallelism; benefits from closer coupling between CPU and GPU
- Coarse-grain data-parallel code: loop 1M times for 1M pieces of data, a 2D array representing a very large dataset; maps very well to throughput-oriented data-parallel engines
(Diagrams: load/fmul/store/compare/branch loops at trip counts of 16, 100000x100000, and 1M)
Slide 44: SO TO SUMMARIZE THAT (continued)
- Discrete GPU configurations suffer from communication latency
- Nested data-parallel / braided-parallel code benefits from close coupling, and discrete GPUs don't provide it well
- But each individual core isn't great at certain types of algorithm
Slide 45: CUE THE APU
- Tight integration of narrow- and wide-vector kernels
- Combination of high and low degrees of threading
- Fast turnaround: negligible kernel launch time
- Communication between kernels: data buffers shared between CPU and GPU kernels
- For example: generating a tree structure on the CPU cores while processing the scene on the GPU cores; mixed-scale particle simulations (see a later talk)
Slide 47: HOW DO WE USE THESE DEVICES?
- Heterogeneous programming isn't easy, particularly if you want performance
- To date: CPUs with visible vector ISAs; GPUs with mostly lane-wise (implicit vector) ISAs; clunky separate programming models with explicit data movement
- How can we target both, with a fair degree of efficiency, and with true shared memory and passable pointers?
- Let's talk about the future of programming models
Slide 48: PROGRAMMING MODELS IN THE PRESENT (section divider)
Slide 49: INDUSTRY-STANDARD API STRATEGY
OpenCL:
- Open development platform for multi-vendor heterogeneous architectures
- The power of AMD Fusion: leverages CPUs and GPUs for a balanced system approach
- Broad industry support: created by architects from AMD, Apple, IBM, Intel, NVIDIA, Sony, etc.
- Momentum: enthusiasm from mainstream developers and application software partners
Microsoft's C++ AMP:
- Microsoft distribution, part of Visual Studio 11
- Based on C++11
Slide 50: MOVING PAST PROPRIETARY SOLUTIONS FOR EASE OF CROSS-PLATFORM PROGRAMMING
- The stack: high-level language compilers, open and custom tools, high-level tools, application-specific libraries, and C++ AMP, sitting on industry-standard interfaces (OpenCL, OpenGL) that target AMD GPUs, AMD CPUs, and other CPUs/GPUs
- Cross-platform development; interoperability with OpenGL and DirectX
- CPU/GPU backends enable a balanced platform approach
Slides 51-53: TODAY'S EXECUTION MODEL
- Single program, multiple data (SPMD): the same kernel runs on all compute units and all processing elements
- A purely data-parallel mode
- Device model: the device runs a single kernel simultaneously; the separation between compute units is relevant to the memory model only
(Diagram: work-items (WI) grouped into work-groups (WG))
- Modern CPUs and GPUs can support more!
Slides 54-55: MODERN GPU (& CPU) CAPABILITIES
- Modern GPUs can execute a different instruction stream per core; some even have a few HW threads per core, each running separate streams
- This is still a highly parallelized machine: a HW thread executes N-wide vector instructions (8-64 wide), and the scheduler switches HW threads on the fly to hide memory misses
(Diagram: shader cores plus fixed-function input assembly, geometry assembly, rasterization, and output-blending stages, with vertex, geometry, and pixel shader work scheduled across the cores over time)
Slides 56-58: PERSISTENT THREADS
- People emulate more flexible MPMD models using persistent threads
- Each core executes a scheduler loop: it takes tasks off a queue and branches to the particular code for that task
- This bypasses the hardware scheduler: it gives flexibility without necessarily significant overhead, but it reduces the ability of the hardware to flexibly schedule for power reduction in the absence of context switching
- The more cores we add, the worse this is
(Diagram: tasks A, B, C dequeued from a queue onto a GPU core)
- Ben will go into a bit more depth on this topic later
Slide 59: CUDA, OPENCL AND C++ AMP
- Heterogeneous programming with GPUs has become relatively widespread
- Relatively easy access to the system through CUDA, OpenCL, and C++ AMP (which gives access to DirectX 11 compute shaders)
- We have to be able to deal with the complexities of the programming model
Slide 60: THE WORLD IS MOVING FORWARD
- These models are not static; obviously there would be serious downsides to that
- OpenCL 1.2 was recently released, and 2.0 is in development and making good progress
- NVIDIA has been adding features to CUDA with some persistence
Slide 61: C++ AMP
- Microsoft has decided on a single-source model called C++ AMP
- Sits on top of DirectX, and is highly limited as a result (some way behind CUDA and OpenCL currently)
- However, from an ease-of-access point of view there are clear benefits. For example:

    void ampSquareExample(const std::vector<int> &in, std::vector<int> &out) {
      concurrency::array_view<const int> avIn(in.size(), in);
      concurrency::array_view<int> avOut(out.size(), out);
      concurrency::parallel_for_each(avOut.extent,
          [=](concurrency::index<1> idx) restrict(amp) {
            avOut[idx] = avIn[idx] * avIn[idx];
          });
      avOut.synchronize();
    }
Slide 62: OPENACC
- Looking at it from the OpenMP pragma-directive angle, we see OpenACC
- Initially developed by PGI, NVIDIA, Cray and CAPS
- Clear benefits for developers starting with C or Fortran source, though it's a shame it is not integrated cleanly with the language and type system
- The aim is to merge this technology with OpenMP; quoting from the OpenACC web site: "The intent is that the lessons learned from use and implementations of the OpenACC API will lead to a more complete and robust OpenMP heterogeneous computing standard"
- #pragma acc parallel
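To make the directive idea concrete, here is a hedged sketch of an OpenACC-annotated loop (a saxpy, chosen for illustration; it is not from the slides). An OpenACC compiler such as PGI's would offload the loop, while an ordinary compiler simply ignores the unknown pragma and runs it sequentially, which is precisely the appeal of the directive approach for existing C/C++ source.

```cpp
#include <cstddef>

// saxpy with an OpenACC directive: y = a*x + y. The pragma asks an
// OpenACC-capable compiler to parallelize and offload the loop, with
// explicit data-movement clauses; without OpenACC support the pragma
// is ignored and the loop runs as plain sequential code.
void saxpy(std::size_t n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The clauses, not the loop body, carry the heterogeneity: `copyin`/`copy` describe the data movement that CUDA or OpenCL would express as explicit buffer transfers.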
Slide 63: HOW DO WE MOVE FORWARD?
- We should concentrate on ways to abstract the algorithms over the features that differ: a difficult challenge
- For now we seem to be stuck with inefficient raw data parallelism or expertly coded SIMD algorithms
- So instead, how should we approach simplifying the model for the platform, assuming we have the optimized kernels for the compute nodes?
Slides 64-66: EVOLUTION OF HETEROGENEOUS COMPUTING (architecture maturity and programmer accessibility, from poor to excellent)
Proprietary Drivers Era (graphics and proprietary driver-based APIs; CUDA, Brook+, etc.):
- Adventurous programmers
- Exploit early programmable shader cores in the GPU
- Make your program look like graphics to the GPU
Standards Drivers Era (OpenCL, DirectCompute driver-based APIs):
- Expert programmers; C and C++ subsets
- Compute-centric APIs and data types
- Multiple address spaces with explicit data movement
- Specialized work-queue-based structures; kernel-mode dispatch
Architected Era (Heterogeneous System Architecture; the GPU as a peer processor):
- Mainstream programmers; full C++
- GPU as a co-processor with a unified coherent address space
- Task-parallel runtimes; nested data-parallel programs
- User-mode dispatch; pre-emption and context switching
Slides 67-70: HETEROGENEOUS SYSTEM ARCHITECTURE (built up in four stages)
Physical integration:
- Integrate CPU & GPU in silicon
- Unified memory controller
- Common manufacturing technology
Optimized platforms:
- GPU compute C++ support
- User-mode scheduling
- Bi-directional power management between CPU and GPU
Architectural integration:
- Unified address space for CPU and GPU
- GPU uses pageable system memory via CPU pointers
- Fully coherent memory between CPU & GPU
System integration:
- GPU compute context switch
- GPU graphics pre-emption
- Quality of service
- Extend to discrete GPU
Slide 71: HETEROGENEOUS SYSTEM ARCHITECTURE: AN OPEN PLATFORM
- Open architecture with published specifications: the HSAIL virtual ISA, the HSA memory model, and HSA dispatch
- ISA-agnostic for both CPU and GPU
Slide 72: HSA INTERMEDIATE LAYER (HSAIL)
- HSAIL is a virtual ISA for parallel programs, finalized to a native ISA by a JIT compiler or "Finalizer"
- Explicitly parallel; designed for data-parallel programming
- Support for exceptions, virtual functions, and other high-level language features
- Syscall methods: GPU code can call directly to system services, IO, printf, etc.
- Debugging support
Slide 73: HSA MEMORY MODEL
- Designed to be compatible with the C++11, Java and .NET memory models
- Relaxed-consistency memory model for parallel compute performance: loads and stores can be re-ordered by the finalizer
- Visibility controlled by: Load.Acquire*, Load.Dep, Store.Release*, and barriers
- (* sequentially consistent ordering)
Slide 74: HSA ENABLES TASK-QUEUING RUNTIMES
A popular pattern for task- and data-parallel programming on SMP systems today, characterized by:
- A work queue per core
- A runtime library that divides large loops into tasks and distributes them to the queues
- A work-stealing runtime that keeps the system balanced
HSA is designed to extend this pattern to run on heterogeneous systems
Slide 75: DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. Advanced Micro Devices, Inc.