A LOOK INTO THE FUTURE. 1 HiPEAC January, 2012 Public



2 THIS PART OF THE TUTORIAL
GPGPU programming has been around for some time now. Adoption has been reasonable, but it is often quoted as being too hard, too few applications fit the model, and so on. There is a lack of composability: function composition is the true blessing of sequential programming. So:
New and future hardware will address these limitations from a silicon point of view
Programming models are emerging to address the restrictions from a software point of view

3 GPGPU PROGRAMMING UNTIL NOW

4 TODAY'S GPGPU PROGRAMMING EXECUTION MODEL
On GPU cores:
We need a lot of data parallelism
Algorithms that can be mapped to multiple cores and multiple threads per core
Approaches that map efficiently to wide SIMD units
So a nice simple functional map operation is great!
(Diagram: Array -> OP applied per element -> Array.map(Op))
This is basically the OpenCL(TM) 1.x model

5 CPUS FARE A BIT BETTER
On CPU cores:
Some data-parallelism for multiple cores
Narrow SIMD units simplify the problem: individual pixels work fine rather than data-parallel pixel clusters (does AVX change this?)
High clock rates and caches make serial execution efficient
So in addition to the simple map (which boils down to a for loop on the CPU), we can run complex task graphs
(Diagram: a task graph of OP nodes with dependencies)

6 HOW DO THE CPU AND GPU MEET TODAY?
CPU programs can call CPU programs:
  vecadd(arg0, ..., argn-1); // C++11
CPU host programs can create data-parallel work on the GPU:
  vecadd<<<NumThreads, BlockSize>>>(arg0, ..., argn-1); // CUDA

  setKernelArg(vecadd, 0, arg0); // OpenCL 1.x
  ...
  setKernelArg(vecadd, n-1, argn-1);
  clEnqueueNDRangeKernel(vecadd, kernel, NumThreads, BlockSize, ...);

7 HOW DO THE CPU AND GPU MEET TODAY?
GPU programs can call GPU functions:
  vecadd(arg0, ..., argn-1); // OpenCL C, CUDA, C++ AMP
GPU device programs calling HOST code:
  Some research work in this area, cf. Stuart et al. [2], but limited
GPU device programs creating data-parallel work on the GPU:
  Not currently possible. (There is some limited application through persistent threads, which we will come to, but no general solution exists today.)

8 BUT SO MANY APPLICATIONS
Are there really that many interesting data-parallel applications? The answer is probably yes, but they are not necessarily embarrassingly data-parallel:
Nodes in the application execution graph often contain data-parallel work, but with serialization (i.e. communication) between nodes
Nodes in the application execution graph are data-dependent and must be generated dynamically

9 SOME EXAMPLE APPLICATIONS INCLUDE
Irregular data-parallelism in applications:
Poster child: ray tracing
Alternative rendering techniques: rasterization, REYES, etc.
Some workloads in medical imaging
Gesture recognition (think Kinect)
Others?

10 BRAIDED PARALLELISM*
Data-parallelism first class
Task-parallelism first class
(Diagram: a flight simulator task graph — user interface, physics simulation, AI, sound, keyboard, mouse, network, particle systems, graphics, server, player 1, player 2 — mixing task-parallel and data-parallel nodes)
* To our knowledge this term was first used by Lefohn [4]

11 THE IDEAL PROGRAMMING MODEL
Fine-grain data-parallel code: loops 16 times for 16 pieces of data; maps very well to integrated SIMD dataflow (i.e. SSE)
Coarse-grain data-parallel code: loops 1M times for 1M pieces of data (e.g. a 2D array representing a very large dataset); maps very well to throughput-oriented data-parallel engines
Nested data-parallel code: lots of conditional data parallelism; benefits from closer coupling between CPU and GPU
Discrete GPU configurations suffer from communication latency. Nested data-parallel/braided-parallel code benefits from close coupling, which discrete GPUs don't provide well. But each individual core isn't great at certain types of algorithm.

12 FUNCTION COMPOSITION
algorithm = f ∘ g
The ability to decompose a larger problem into sub-components
The ability to replace one sub-component with another sub-component with the same external semantics

13 ARE CURRENT GPGPU PROGRAMMING MODELS COMPOSABLE? No!

14 MEMORY ADDRESS SPACES ARE NOT COMPOSABLE
void foo(global int *) { }

void bar(global int * x)
{
  foo(x); // works fine

  local int a[1];
  a[0] = *x;
  foo(a); // will now not work
}

15 BARRIER ELISION IS NOT COMPOSABLE
Parallel prefix sum (in this case taken from a RadixSort):
for level = 0 to n
  foreach work-item i in buffer range
    if( i >= 2^level )
      temp = buffer[i - 2^level] + buffer[i];
    barrier(); // barrier is not needed within a wavefront or warp;
               // it adds overhead, so it is often dropped for optimization reasons
    buffer[i] = temp;

16 HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)

17 HSA FEATURE ROADMAP
Physical Integration:
  Integrate CPU & GPU in silicon
  Unified Memory Controller
  Common Manufacturing Technology
Optimized Platforms:
  GPU Compute C++ support
  User mode scheduling
  Bi-Directional Power Mgmt between CPU and GPU
Architectural Integration:
  Unified Address Space for CPU and GPU
  GPU uses pageable system memory via CPU pointers
  Fully coherent memory between CPU & GPU
System Integration:
  GPU compute context switch
  GPU graphics pre-emption
  Quality of Service
  Extend to Discrete GPU

18 TASK QUEUING RUNTIMES
A popular pattern for task- and data-parallel programming on SMP systems today. Characterized by:
A work queue per core
A runtime library that divides large loops into tasks and distributes them to queues
A work-stealing runtime that keeps the system balanced
HSA is designed to extend this pattern to run on heterogeneous systems

19 TASK QUEUING RUNTIME ON CPUS
(Diagram: a work-stealing runtime with one queue per worker across four x86 CPUs; CPU threads and shared memory shown)

20 TASK QUEUING RUNTIME ON THE FSA PLATFORM
(Diagram: the same work-stealing runtime with an additional queue served by a GPU manager alongside the four x86 CPU workers, feeding a Radeon GPU; CPU threads and GPU threads share memory)

21 TASK QUEUING RUNTIME ON THE FSA PLATFORM
(Diagram: as before, with the GPU manager's fetch-and-dispatch unit feeding the GPU's SIMD units over shared memory)

22 HSA SOFTWARE EXAMPLE REDUCTION
float foo(float);
float myArray[ ];

Task<float, ReductionBin> sum([myArray](IndexRange<1> index) [[device]] {
  float sum = 0.;
  for (size_t i = index.begin(); i != index.end(); i++) {
    sum += foo(myArray[i]);
  }
  return sum;
});

float sum = task.enqueueWithReduce(
  Partition<1, Auto>(1920),
  [] (int x, int y) restrict(opp) { return x + y; },
  0.);

23 HETEROGENEOUS COMPUTE DISPATCH
How compute dispatch operates today in the driver model
How compute dispatch improves tomorrow under HSA

24-27 TODAY'S COMMAND AND DISPATCH FLOW
(Diagram, built up over four slides: each application — A, B, C — has its own command flow through Direct3D and a user-mode driver into a soft queue, and its own data flow through a kernel-mode driver into a command buffer and DMA buffer. All applications are then serialized into a single hardware queue on the GPU, so work from A, B and C must interleave through one shared queue.)

28 FUTURE COMMAND AND DISPATCH FLOW
Application codes to the hardware
User mode queuing
Hardware scheduling
Low dispatch times
Optional dispatch buffer
No APIs
No soft queues
No user mode drivers
No kernel mode transitions
No overhead!
(Diagram: applications A, B and C each own a hardware queue on the GPU)

29-32 FUTURE COMMAND AND DISPATCH CPU <-> GPU
(Animation over four slides: the application/runtime dispatches work directly between CPU1, CPU2 and the GPU, with no driver in the path.)

33 HETEROGENEOUS SYSTEM ARCHITECTURE PROGRAMMING MODEL (HSAPM)

34 HSAPM - FOUNDATIONS
Leverage developers' existing C++ expertise:
Type checking
C++ abstraction features
C++ compilation model (separate compilation units supported)
Small set of extensions to C++:
Task-parallel algorithms
Data-parallel algorithms

35 PROGRAMMING MODEL CONCEPTS
Kernels are in the C++0x source language; all features are supported, including exceptions
Kernels are marked with a single attribute: restrict(hsa)*
Function calls are to kernel code running on the same device; function calls to the CPU are indirect (i.e. via OPP constructs)
restrict(opencl) - OpenCL C kernel (must follow OpenCL C rules)
restrict(cpu) - CPU target, always generated
OpenCL C kernels are fully supported but not required
* Based on AMD's C++ AMP compiler technology

36 PROGRAMMING MODEL CONCEPTS
Smart pointers are used for the shared virtual address space (easily emulated):
template<class T, ...> class Pointer;
Pointer<T> malloc<T>(size_t);
void free(Pointer<T>);
Example:
Pointer<float> array = malloc<float>(10);

37 PROGRAMMING MODEL CONCEPTS
Range - an N-dimensional index space, where N is one, two or three.
Examples:
Range<1> range(100); // 1D range with extent 0..99
Range<2> range(4,8); // 2D range with extents x = 0..3, y = 0..7
Sub-index spaces are supported (i.e. groups). Example:
Range<1,1> range(100, Range<1>(10)); // 1D range with extent 0..99 and groups of 10

38 PROGRAMMING MODEL CONCEPTS
Index - a single point in the N-dimensional index space of a Range.
Examples:
Index<1> index(10); // 11th element of Range<1>(100), for example
Index<2> index(4,1); // column x = 4 and row y = 1
Sub-index spaces are supported (i.e. local index). Example:
Index<1,1> index(2, Index<1>(3));

39 PROGRAMMING MODEL CONCEPTS
DistArray - a distributed array providing:
A memory hierarchy abstraction
Memory coherency properties
Can be typed
Addresses part 1 of the function composition problem
Example:
DistArray<float> distArray; // no practical use

40 PROGRAMMING MODEL CONCEPTS
Regions - typed arrays allocated against a DistArray.
Example:
DistArray<float> distArray;
distArray.allocRegion<float>(10);
Memory access descriptors are supported (not described in this deck)

41 PROGRAMMING MODEL CONCEPTS
Bound distributed arrays - when bound at dispatch, bound arrays represent a PGAS address space.
Example:
DistArray<float> distArray;
distArray.allocRegion<float>(10);
parallelFor(Range<1>(10), distArray,
  [] (Index<1> index, BoundDistArray<float>) { });

42 PROGRAMMING MODEL CONCEPTS
parallelFor - execute a kernel across a Range; optional DistArray allocation (bound to shared memory).
Example with C++ kernel:
Pointer<float> input = malloc<float>(1024*1024);
Pointer<float> output = malloc<float>(1024*1024);
// initialize input
parallelFor(Range<1>(100), [] (Index<1> index) restrict(hsa) {
  output[index.getX()] = input[index.getX()] * 2; // double input
});
// do something with output

43 PROGRAMMING MODEL CONCEPTS
Example with OpenCL C kernel:
void square(global float * input, global float * output) restrict(opencl)
{
  output[get_global_id(0)] = input[get_global_id(0)] * input[get_global_id(0)];
}

Pointer<float> input = Pointer<float>(1024*1024);
Pointer<float> output = Pointer<float>(1024*1024);
// initialize input
parallelFor(Range<1>(100), square, input, output);
// do something with output

44 PROGRAMMING MODEL CONCEPTS
Task - execute a kernel across a range; optional DistArray allocation. Hmm... similar to parallelFor!
Can be nested on device: addresses part 2 of the function composition problem.
Example (simple):
Task<void> task([] (Index<1> index) restrict(hsa) {
  output[index.getX()] = input[index.getX()] * 2; // double input
});
Future<void> f;
task.enqueue(Range<1>(20), f);
f = task.enqueue(Range<1>(100)); // future is single assignment
task.enqueue(Range<1>(30), f);

45 DATA AFFINITY
We need the ability to describe the affinity of tasks. The default is run anywhere (CPU or GPU), but what if a task depends on a local resource, e.g. local memory? Allow the user to specify the locality of task(s).
Example:
DistArray<float> distArray;
Region<float> region = distArray.allocRegion<float>(10);
parallelFor(Range<1,1>(100, Range<1>(10)), distArray,
  [region] (Index<1,1> i, BoundDistArray<float> da) restrict(hsa) {
    Task<void> task([] (Index<1> index, Region<float> r) { });
    task.enqueue(Range<1>(100), da(region));
  });

46 DEVICE AFFINITY
We need the ability to describe the affinity of tasks with respect to a device: use affinity rather than queuing.
Example:
Task<void> task([] (Index<1> index) restrict(hsa) {
  output[index.getX()] = input[index.getX()] * 2; // double input
});
Future<void> f;
task.enqueue(Range<1>(20), f, CPU0);
f = task.enqueue(Range<1>(100));
task.enqueue<Device0>(Range<1>(30), f);

47 THERE IS A LOT MORE
This is just a short introduction to the concepts that we have been developing for AMD's future programming models. We are currently working on a paper that will (all things being well) be published later this year.
The rest of this part of the tutorial will focus in detail on two interesting (and new) aspects of this system:
Channels, a data-driven model for dynamic work generation
Barrier objects, a solution to the remaining composability problem

48 DYNAMIC TASK EXECUTION AND CHANNELS

49 MOTIVATING EXAMPLE: RAY COMPACTION

50 PRIMARY RAYS - COHERENT
(Diagram: primary rays)

51 SECONDARY RAYS - NOT COHERENT
(Diagram: primary rays spawning reflection, refraction and probe rays)

52 SHADOW RAYS - NOT COHERENT
(Diagram: a primary ray spawning shadow rays toward an area light; the in-shadow region is shown)

53 SIMD AND MEMORY DIVERGENCE
Secondary and shadow rays, for example, introduce both:
SIMD divergence (different kernels run for different types of ray)
Memory divergence (different rays have random memory access patterns)
The persistent thread model provides a solution to SIMD divergence:
Originally developed by the ray tracing group at Nvidia
We look at an example of this approach and its drawbacks
Channels (a data-flow model) provide a solution to both problems:
An alternative approach, addressing limitations of persistent thread models
Developed at AMD
We look at the underlying idea and an example of this approach

54 BUT OF COURSE
All real analyses of Task Parallel Runtimes (TPRs) need to spend a large amount of time and effort on Fib! In fairness, it demonstrates many of the important properties that a TPR must have, while of course fitting on a single slide! We will now continue in this vein!

55 PARALLEL FIB
An excellent example of how a recursive stack-based algorithm might be implemented. The OPP example fib() implementation, compiled to OpenCL with persistent threads using our Kite runtime, produces 3 tasks: TASK_RESULT, TASK_ADD and TASK_FIB. The tasks and scheduler are compiled into an uber-scheduler that generates work into (work-stealing) task queues.
int parallelFib(int n) {
  if (n <= 2) {
    return 1;
  }
  opp::future<int> x = Task<int>(
    [n] () -> int restrict(hsa) { return parallelFib(n - 1); }).enqueue();
  return x.getData() + parallelFib(n - 2);
}

56 OPENCL UBER-SCHEDULER FOR KITE FIB IMPLEMENTATION
kernel void fib(
    global CircularWSDeque * wsdeque,
    global volatile Task * continuations,
    global volatile int * nextFreeContinuationSlot,
    global int * results)
{
  global CircularWSDeque * lwsdeque = wsdeque + get_global_id(0);
  Task t = popBottom(lwsdeque);
  while (!isTaskEmpty(t)) {
    switch (t.id_) {
    case TASK_RESULT:
      results[get_global_id(0)] = t.params_.x_;
      break;
    case TASK_ADD:
      sendArgument(t.continuation_, t.result_, t.params_.x_ + t.params_.y_);
      break;
    case TASK_FIB:
      if (t.params_.x_ < 2) {
        sendArgument(lwsdeque, t.continuation_, t.result_, t.params_.x_);
      } else {
        int freeCont = atomic_inc(nextFreeContinuationSlot);
        global volatile Task * cond = &continuations[freeCont];
        *cond = makeAdd(t.result_, t.continuation_);
        pushBottom(lwsdeque, makeFib(t.params_.x_ - 1, ARG_OFFSET_FIRST, cond));
        pushBottom(lwsdeque, makeFib(t.params_.x_ - 2, ARG_OFFSET_SECOND, cond));
      }
      break;
    }
    t = popBottom(lwsdeque);
  }
}

57 KITE COMPUTE UNIT PERSISTENT THREAD MODEL
(Diagram: ACEs/CPs feed CS pipes; each compute unit CU 0..CU m runs an uber-scheduler wavefront over work-items wi0..win with private memory and an LDS-backed queue; L2-backed queues connect through the L2 cache to global memory and to x86 threads.)

58 UBER-SCHEDULER
Given N tasks (i.e. kernels) t1, ..., tn, the program is compiled to a single kernel of the form:
kernel void fib(
    global CircularWSDeque * wsdeque,
    global volatile Task * continuations,
    global volatile int * nextFreeContinuationSlot,
    global int * results)
{
  global CircularWSDeque * lwsdeque = wsdeque + get_global_id(0);
  Task t = popBottom(lwsdeque);
  while (!isTaskEmpty(t)) {
    switch (t.id_) {
    case t1_id: ...
    case tn_id: ...
    }
  }
}
Limitation: as kernels are resource-allocated statically, the uber-scheduler is restricted to the maximum resource allocation of its work ti! No resource amplification!

59 GPGPU PROGRAMMING MOVING FORWARD

60 CHANNELS - PERSISTENT CONTROL PROCESSOR THREADING MODEL
Add data-flow support to GPGPU. We are not primarily promoting producer/consumer kernel bodies — that is, we are not promoting a method where one kernel loops producing values and another loops to consume them. That has the negative behavior of promoting long-running kernels; we've tried to avoid this elsewhere by basing in-kernel launches around continuations rather than waiting on children.
Instead we assume that kernel entities produce/consume, but consumer work-items are launched on demand. An alternative to the point-to-point data flow of persistent threads, avoiding the uber-kernel.

61-75 OPERATIONAL FLOW
(Animation over fifteen slides: a kernel's work-items write values into a channel backed by Queue A, with a second Queue B also shown feeding the channel; when enough values accumulate, the channel triggers the control processor's uber-scheduler to dispatch consumer work-items, which read the buffered values back out of the channel; the write / trigger dispatch / read cycle repeats until the queues drain and the flow returns to the idle state.)

76 OPP - PERSISTENT CONTROL PROCESSOR THREAD MODEL
(Diagram: ACEs/CPs now run schedulers 0 and 1; each compute unit CU 0..CU m runs kernel wavefronts over work-items wi0..win with private memory and LDS-backed channels; channels 0..n connect through the L2 cache to global memory and to x86 threads.)

77 OPP CHANNEL IMPLEMENTATION OF FIB
class FibChannel {
private:
  atomic_int& result_;
  opp::Channel<int> channel_;
  int n_;
  std::function<bool (opp::Channel<int>*)> pred_;

public:
  FibChannel(int n, atomic_int& result) :
    result_(result),
    channel_(opp::Channel<int>(1)),
    n_(n),
    pred_([](opp::Channel<int> * c) -> bool restrict(fql) { return true; })
  {
    channel_.executeWith(
      pred_, Range<1>(1),
      [&] (Index<1>, const int v) { result_ += v; });
  }

  void operator() () restrict(hsa) {
    if (n_ <= 2) {
      channel_.write(1);
      return;
    }
    FibChannel x(n_ - 1, result_);
    FibChannel y(n_ - 2, result_);
    Task<void>(x).enqueue();
    Task<void>(y).enqueue();
  }
};

int parallelFibWithChannels(int n) {
  atomic_int result = 0;
  FibChannel(n, result)();
  return result;
}

78 LIBERATING THE GPGPU PROGRAMMING MODEL
So far we have addressed:
Tasks and dynamic work generation from anywhere in the system
Part of the function composition problem, i.e. DistArray addresses the shared and global memory problem
But we have not addressed the issue of nesting barriers within control flow

79 MAKING BARRIERS FIRST CLASS
We looked at two approaches to solving the barrier composability problem:
1. Implicit barriers - simply extend the current barrier. The problem with this approach is that it has surprisingly limited application: for example, think of a set of waves producing data for another set of waves within the same wavefront. It is not possible to express this relationship in a way that allows the producer to progress for multiple clients and multiple data sets.
2. Barrier objects - introduce barriers as first-class values with a set of well-defined operations:
Construction - initialize a barrier for some subset of work-items within a work-group or across work-groups
Arrive - the work-item marks the barrier as satisfied
Skip - the work-item marks the barrier as satisfied and notes that it will no longer take part in barrier operations. Allows early exit from a loop, or divergent control where a work-item never intends to take part in the barrier
Wait - the work-item marks the barrier as satisfied and waits for all other work-items to arrive, wait, or skip

80 BARRIER OBJECT EXAMPLE - SIMPLE DIVERGENT CONTROL FLOW
barrier b(8);
parallelFor(Range<1>(8), [&b] (Index<1> i) {
  int val = i.getX();
  scratch[val] = val;
  if( val < 4 ) {
    b.wait();
    x[val] = scratch[val+1];
  } else {
    b.skip();
    x[val] = 17;
  }});

81 BARRIER OBJECT EXAMPLE - LOOP FLOW
barrier b(8);
parallelFor(Range<1>(8), [&b] (Index<1> i) {
  int val = i.getX();
  scratch[val] = val;
  if( val < 4 ) {
    for( int j = 0; j < val; ++j ) {
      b.wait();
      x[val] += scratch[val+1];
    }
    b.skip();
  } else {
    b.skip();
    x[val] = 17;
  }});

82 BARRIER OBJECT EXAMPLE - CALLING A FUNCTION FROM WITHIN CONTROL FLOW
barrier b(8);
parallelFor(Range<1>(8), [&b] (Index<1> i) {
  scratch[i.getX()] = i.getX();
  if( i.getX() < 4 ) {
    someOpaqueLibraryFunction(i.getX(), b);
  } else {
    b.skip();
    x[i.getX()] = 17;
  }});

void someOpaqueLibraryFunction(const int i, barrier &b) {
  for( int j = 0; j < i; ++j ) {
    b.wait();
    x[i] += scratch[i+1];
  }
  b.skip();
}

83 BARRIER OBJECT EXAMPLE - CALLING A FUNCTION FROM WITHIN CONTROL FLOW
barrier b(8);
parallelFor(Range<1>(8), [&b] (Index<1> i) {
  scratch[i.getX()] = i.getX();
  if( i.getX() < 4 ) {
    someOpaqueLibraryFunction(i.getX(), b);
  } else {
    b.wait(); // This is INVALID
    x[i.getX()] = 17;
  }});
This is INVALID: the caller has no way to know how many times the library function will use the barrier; therefore skip is the only option.

84 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
Advanced Micro Devices, Inc.


More information

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011 MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to

More information

Sequential Consistency for Heterogeneous-Race-Free

Sequential Consistency for Heterogeneous-Race-Free Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD JUNE 12, 2013 EXECUTIVE

More information

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011 AMD IOMMU VERSION 2 How KVM will use it Jörg Rödel August 16th, 2011 AMD IOMMU VERSION 2 WHAT S NEW? 2 AMD IOMMU Version 2 Support in KVM August 16th, 2011 Public NEW FEATURES - OVERVIEW Two-level page

More information

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018 clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018 ANECDOTE DISCOVERING A BUFFER OVERFLOW CPU GPU MEMORY MEMORY Data Data Data Data Data 2 clarmor: A

More information

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to the features, functionality, availability, timing,

More information

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time

More information

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015 The Rise of Open Programming Frameworks JC BARATAULT IWOCL May 2015 1,000+ OpenCL projects SourceForge GitHub Google Code BitBucket 2 TUM.3D Virtual Wind Tunnel 10K C++ lines of code, 30 GPU kernels CUDA

More information

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016 AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING BILL.BRANTLEY@AMD.COM, FELLOW 3 OCTOBER 2016 AMD S VISION FOR EXASCALE COMPUTING EMBRACING HETEROGENEITY CHAMPIONING OPEN SOLUTIONS ENABLING LEADERSHIP

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS ARKAPRAVA BASU, JOSEPH L. GREATHOUSE, GURU VENKATARAMANI, JÁN VESELÝ AMD RESEARCH, ADVANCED MICRO DEVICES, INC. MODERN SYSTEMS ARE POWERED BY HETEROGENEITY

More information

Cilk Plus: Multicore extensions for C and C++

Cilk Plus: Multicore extensions for C and C++ Cilk Plus: Multicore extensions for C and C++ Matteo Frigo 1 June 6, 2011 1 Some slides courtesy of Prof. Charles E. Leiserson of MIT. Intel R Cilk TM Plus What is it? C/C++ language extensions supporting

More information

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration Accelerating Applications the art of maximum performance computing James Spooner Maxeler VP of Acceleration Introduction The Process The Tools Case Studies Summary What do we mean by acceleration? How

More information

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD D3D12 & Vulkan: Lessons learned Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD D3D12 What s new? DXIL DXGI & UWP updates Root Signature 1.1 Shader cache GPU validation PIX D3D12 / DXIL DXBC

More information

Fusion Enabled Image Processing

Fusion Enabled Image Processing Fusion Enabled Image Processing I Jui (Ray) Sung, Mattieu Delahaye, Isaac Gelado, Curtis Davis MCW Strengths Complete Tools Port, Explore, Analyze, Tune Training World class R&D team Leading Algorithms

More information

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD AMD APU and Processor Comparisons AMD Client Desktop Feb 2013 AMD SUMMARY 3DMark released Feb 4, 2013 Contains DirectX 9, DirectX 10, and DirectX 11 tests AMD s current product stack features DirectX 11

More information

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE VIGNESH ADHINARAYANAN, INDRANI PAUL, JOSEPH L. GREATHOUSE, WEI HUANG, ASHUTOSH PATTNAIK, WU-CHUN FENG POWER AND ENERGY ARE FIRST-CLASS

More information

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling AMD Radeon Open Compute Platform Felix Kuehling ROCM PLATFORM ON LINUX Compiler Front End AMDGPU Driver Enabled with ROCm GCN Assembly Device LLVM Compiler (GCN) LLVM Opt Passes GCN Target Host LLVM Compiler

More information

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT Bootstrapping the industry for better VR experience Complimentary to HMD SDKs It s all about giving developers the tools they want! AMD LIQUIDVR

More information

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015 SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015 PROBLEM Shaders are compiled in draw calls Emulating certain features in shaders Drivers keep shaders in some intermediate representation And insert

More information

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations

More information

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer STREAMING VIDEO DATA INTO 3D APPLICATIONS Session 2116 Christopher Mayer AMD Sr. Software Engineer CONTENT Introduction Pinned Memory Streaming Video Data How does the APU change the game 3 Streaming Video

More information

viewdle! - machine vision experts

viewdle! - machine vision experts viewdle! - machine vision experts topic using algorithmic metadata creation and heterogeneous computing to build the personal content management system of the future Page 2 Page 3 video of basic recognition

More information

KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015

KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015 KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015 AGENDA Background & Motivation Challenges Native Page Tables Emulating the OS Kernel 2 KVM CPU MODEL IN SYSCALL EMULATION

More information

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

Multi-core processors are here, but how do you resolve data bottlenecks in native code? Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session

More information

Handout 3. HSAIL and A SIMT GPU Simulator

Handout 3. HSAIL and A SIMT GPU Simulator Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants

More information

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008 Run Anywhere The Hardware Platform Perspective Ben Pollan, AMD Java Labs October 28, 2008 Agenda Java Labs Introduction Community Collaboration Performance Optimization Recommendations Leveraging the Latest

More information

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION 1 3DMark Overview February 2013 Approved for public distribution 2 3DMark Overview February 2013 Approved for public distribution

More information

Vulkan (including Vulkan Fast Paths)

Vulkan (including Vulkan Fast Paths) Vulkan (including Vulkan Fast Paths) Łukasz Migas Software Development Engineer WS Graphics Let s talk about OpenGL (a bit) History 1.0-1992 1.3-2001 multitexturing 1.5-2003 vertex buffer object 2.0-2004

More information

RegMutex: Inter-Warp GPU Register Time-Sharing

RegMutex: Inter-Warp GPU Register Time-Sharing RegMutex: Inter-Warp GPU Register Time-Sharing Farzad Khorasani* Hodjat Asghari Esfeden Amin Farmahini-Farahani Nuwan Jayasena Vivek Sarkar *farkhor@gatech.edu The 45 th International Symposium on Computer

More information

Gestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO

Gestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO Gestural and Cinematic Interfaces - DX11 David Brebner Unlimited Realities CTO Gestural and Cinematic Interfaces DX11 Making an emotional connection with users 3 Unlimited Realities / Fingertapps About

More information

NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer

NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer SESSION AGENDA Quick Keywords Abstract and Scope Introduction Current User Interface [ UI

More information

ROCm: An open platform for GPU computing exploration

ROCm: An open platform for GPU computing exploration UCX-ROCm: ROCm Integration into UCX {Khaled Hamidouche, Brad Benton}@AMD Research ROCm: An open platform for GPU computing exploration 1 JUNE, 2018 ISC ROCm Software Platform An Open Source foundation

More information

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected

More information

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager INTRODUCTION WHO WE ARE 3 Highly Parallel Computing in Physics-based

More information

MICHAL MROZEK ZBIGNIEW ZDANOWICZ

MICHAL MROZEK ZBIGNIEW ZDANOWICZ MICHAL MROZEK ZBIGNIEW ZDANOWICZ Legal Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY

More information

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES VERSION 1 - FEBRUARY 2018 CONTACT Address Advanced Micro Devices, Inc 7171 Southwest Pkwy Austin, Texas 78735 United States Phone

More information

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018 CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected

More information

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL Matthias Bach and David Rohr Frankfurt Institute for Advanced Studies Goethe University of Frankfurt I: INTRODUCTION 3 Scaling

More information

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data Andrew Miller Computer Vision Group Research Developer 3-D TERRAIN RECONSTRUCTION

More information

Designing Natural Interfaces

Designing Natural Interfaces Designing Natural Interfaces So what? Computers are everywhere C.T.D.L.L.C. Computers that don t look like computers. Computers that don t look like Computers Computers that don t look like Computers

More information

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO Desktop Telepresence Arrived! Sudha Valluru ViVu CEO 3 Desktop Telepresence Arrived! Video Collaboration market Telepresence Telepresence Cost Expensive Expensive HW HW Legacy Apps Interactivity ViVu CONFIDENTIAL

More information

Graphics Hardware 2008

Graphics Hardware 2008 AMD Smarter Choice Graphics Hardware 2008 Mike Mantor AMD Fellow Architect michael.mantor@amd.com GPUs vs. Multi-core CPUs On a Converging Course or Fundamentally Different? Many Cores Disruptive Change

More information

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide AMD HD3D Technology Setup Guide 1 AMD HD3D TECHNOLOGY: Setup Guide Contents AMD HD3D Technology... 3 Frame Sequential Displays... 4 Supported 3D Display Hardware... 5 AMD Display Drivers... 5 Configuration

More information

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION 1 3DMark Overview April 2013 Approved for public distribution 2 3DMark Overview April 2013 Approved for public distribution

More information

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions FLASH MEMORY SUMMIT 2011 Adoption of Caching & Hybrid Solutions Market Overview 2009 Flash production reached parity with all other existing solid state memories in terms of bites. 2010 Overall flash production

More information

HyperTransport Technology

HyperTransport Technology HyperTransport Technology in 2009 and Beyond Mike Uhler VP, Accelerated Computing, AMD President, HyperTransport Consortium February 11, 2009 Agenda AMD Roadmap Update Torrenza, Fusion, Stream Computing

More information

1 Presentation Title Month ##, 2012

1 Presentation Title Month ##, 2012 1 Presentation Title Month ##, 2012 Malloc in OpenCL kernels Why and how? Roy Spliet Bsc. (r.spliet@student.tudelft.nl) Delft University of Technology Student Msc. Dr. A.L. Varbanescu Prof. Dr. Ir. H.J.

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee. Tim Foley Intel Corp Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

More information

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 TOPICS Data transfer Parallelism Coalesced memory access Best work group size Occupancy branching All the performance numbers come from a W8100 running

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture HPCA 18 Reliability-aware Data Placement for Heterogeneous memory Architecture Manish Gupta Ψ, Vilas Sridharan*, David Roberts*, Andreas Prodromou Ψ, Ashish Venkat Ψ, Dean Tullsen Ψ, Rajesh Gupta Ψ Ψ *

More information

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

Maximizing Six-Core AMD Opteron Processor Performance with RHEL Maximizing Six-Core AMD Opteron Processor Performance with RHEL Bhavna Sarathy Red Hat Technical Lead, AMD Sanjay Rao Senior Software Engineer, Red Hat Sept 4, 2009 1 Agenda Six-Core AMD Opteron processor

More information

AMD SEV Update Linux Security Summit David Kaplan, Security Architect

AMD SEV Update Linux Security Summit David Kaplan, Security Architect AMD SEV Update Linux Security Summit 2018 David Kaplan, Security Architect WHY NOT TRUST THE HYPERVISOR? Guest Perspective o Hypervisor is code I don t control o I can t tell if the hypervisor is compromised

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

High Performance Graphics 2010

High Performance Graphics 2010 High Performance Graphics 2010 1 Agenda Radeon 5xxx Product Family Highlights Radeon 5870 vs. 4870 Radeon 5870 Top-Level Radeon 5870 Shader Core References / Links / Screenshots Questions? 2 ATI Radeon

More information

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s strategy and focus, expected datacenter total

More information

AMD S X86 OPEN64 COMPILER. Michael Lai AMD

AMD S X86 OPEN64 COMPILER. Michael Lai AMD AMD S X86 OPEN64 COMPILER Michael Lai AMD CONTENTS Brief History AMD and Open64 Compiler Overview Major Components of Compiler Important Optimizations Recent Releases Performance Applications and Libraries

More information

Introduction to Parallel Programming Models

Introduction to Parallel Programming Models Introduction to Parallel Programming Models Tim Foley Stanford University Beyond Programmable Shading 1 Overview Introduce three kinds of parallelism Used in visual computing Targeting throughput architectures

More information

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc. Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance

More information

3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components

3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components 3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components Xudong An, Manish Arora, Wei Huang, William C. Brantley, Joseph L. Greathouse AMD Research Advanced Micro Devices, Inc. MOTIVATION

More information

Pattern-based analytics to estimate and track yield risk of designs down to 7nm

Pattern-based analytics to estimate and track yield risk of designs down to 7nm DAC 2017 Pattern-based analytics to estimate and track yield risk of designs down to 7nm JASON CAIN, MOUTAZ FAKHRY (AMD) PIYUSH PATHAK, JASON SWEIS, PHILIPPE HURAT, YA-CHIEH LAI (CADENCE) INTRODUCTION

More information

ASYNCHRONOUS SHADERS WHITE PAPER 0

ASYNCHRONOUS SHADERS WHITE PAPER 0 ASYNCHRONOUS SHADERS WHITE PAPER 0 INTRODUCTION GPU technology is constantly evolving to deliver more performance with lower cost and lower power consumption. Transistor scaling and Moore s Law have helped

More information

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015 The Road to the AMD Fiji GPU Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015 Fiji Chip DETAILED LOOK 4GB High-Bandwidth Memory 4096-bit wide interface 512 GB/s

More information

Anatomy of AMD s TeraScale Graphics Engine

Anatomy of AMD s TeraScale Graphics Engine Anatomy of AMD s TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream

More information

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management Next-Generation Mobile Computing: Balancing Performance and Power Efficiency HOT CHIPS 19 Jonathan Owen, AMD Agenda The mobile computing evolution The Griffin architecture Memory enhancements Power management

More information

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide AMD Radeon ProRender plug-in for Unreal Engine Installation Guide This document is a guide on how to install and configure AMD Radeon ProRender plug-in for Unreal Engine. DISCLAIMER The information contained

More information

CLICK TO EDIT MASTER TITLE STYLE. Click to edit Master text styles. Second level Third level Fourth level Fifth level

CLICK TO EDIT MASTER TITLE STYLE. Click to edit Master text styles. Second level Third level Fourth level Fifth level CLICK TO EDIT MASTER TITLE STYLE Second level THE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU PAUL BLINZER, FELLOW, HSA SYSTEM SOFTWARE, AMD SYSTEM ARCHITECTURE WORKGROUP CHAIR, HSA FOUNDATION

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016 Copyright Khronos Group, 2014 - Page 1 SYCL SG14, February 2016 BOARD OF PROMOTERS Over 100 members worldwide any company is welcome to join Copyright Khronos Group 2014 SYCL 1. What is SYCL for and what

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

AMD AIB Partner Guidelines. Version February, 2015

AMD AIB Partner Guidelines. Version February, 2015 AMD AIB Partner Guidelines Version 1.0 - February, 2015 The Purpose of This Document These guidelines provide direction for our Add-in-Board (AIB) partners and customers to market the benefits of AMD products

More information

PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017

PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017 PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017 BACKGROUND-- HARDWARE MEMORY ENCRYPTION AMD Secure Memory Encryption (SME) / AMD Secure Encrypted Virtualization (SEV) Hardware AES engine

More information

Programmer's View of Execution Teminology Summary

Programmer's View of Execution Teminology Summary CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 28: GP-GPU Programming GPUs Hardware specialized for graphics calculations Originally developed to facilitate the use of CAD programs

More information

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome! /INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU

More information

MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK

MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK Wei-Lien Hsu AMD SMTS Gongyuan Zhuang AMD MTS OUTLINE Motivation OpenDecode Video deblurring algorithms Acceleration by clamdfft

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU)

Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU) Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU) Aaron Lefohn, Intel / University of Washington Mike Houston, AMD / Stanford 1 What s In This Talk? Overview of parallel programming

More information

Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs

Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs , Inc. Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs Doug Finke Director of Product Marketing September 2016

More information