A LOOK INTO THE FUTURE. 1 HiPEAC January, 2012 Public



2 THIS PART OF THE TUTORIAL
GPGPU programming has been around for some time now. Adoption has been reasonable, but it is often quoted as being too hard, too few applications fit the model, and so on. There is a lack of composability: function composition is the true blessing of sequential programming. So:
New and future hardware will address these limitations from a silicon point of view
Programming models are emerging to address the restrictions from a software point of view

3 GPGPU PROGRAMMING UNTIL NOW

4 TODAY'S GPGPU PROGRAMMING EXECUTION MODEL
On GPU cores:
We need a lot of data parallelism
Algorithms that can be mapped to multiple cores and multiple threads per core
Approaches that map efficiently to wide SIMD units
So a nice simple functional map operation is great!
(Diagram: Array -> OP applied per element -> Array.map(Op))
This is basically the OpenCL(TM) 1.x model

5 CPUS FARE A BIT BETTER
On CPU cores:
Some data-parallelism for multiple cores
Narrow SIMD units simplify the problem: individual pixels work fine rather than data-parallel pixel clusters (does AVX change this?)
High clock rates and caches make serial execution efficient
So in addition to the simple map (which boils down to a for loop on the CPU), we can run complex task graphs
(Diagram: a task graph of OP nodes with dependencies)

6 HOW DO THE CPU AND GPU MEET TODAY?
CPU programs can call CPU programs:
  vecadd(arg0, ..., argn-1); // C++11
CPU host programs can create data-parallel work on the GPU:
  vecadd<<<NumThreads, BlockSize>>>(arg0, ..., argn-1); // CUDA

  setKernelArg(vecadd, 0, arg0); // OpenCL 1.x
  ...
  setKernelArg(vecadd, n-1, argn-1);
  clEnqueueNDRangeKernel(vecadd, kernel, NumThreads, BlockSize, ...);

7 HOW DO THE CPU AND GPU MEET TODAY?
GPU programs can call GPU functions:
  vecadd(arg0, ..., argn-1); // OpenCL C, CUDA, C++ AMP
GPU device programs calling HOST code:
  Some research work in this area, cf. Stuart et al. [2], but limited
GPU device programs creating data-parallel work on the GPU:
  Not currently possible. (There is some limited application through persistent threads, which we will come to, but no general solution exists today.)

8 BUT SO MANY APPLICATIONS
Are there really that many interesting data-parallel applications? The answer is probably yes, but they are not necessarily embarrassingly data-parallel:
Nodes in the application execution graph often contain data-parallel work, but with serialization (i.e. communication) between nodes
Nodes in the application execution graph are data-dependent and must be generated dynamically

9 SOME EXAMPLE APPLICATIONS INCLUDE
Irregular data-parallelism in applications:
Poster child: ray tracing
Alternative rendering techniques: rasterization, REYES, etc.
Some workloads in medical imaging
Gesture recognition (think Kinect)
Others?

10 BRAIDED PARALLELISM*
Data-parallelism first class
Task-parallelism first class
(Diagram: a flight simulator task graph — user interface, physics simulation, AI, sound, keyboard, mouse, network, particle systems, graphics, server, player 1, player 2 — mixing task-parallel and data-parallel nodes)
* To our knowledge this term was first used by Lefohn [4]

11 THE IDEAL PROGRAMMING MODEL
Fine-grain data-parallel code: loops 16 times for 16 pieces of data; maps very well to integrated SIMD dataflow (i.e. SSE)
Coarse-grain data-parallel code: loops 1M times for 1M pieces of data (e.g. a 2D array representing a very large dataset); maps very well to throughput-oriented data-parallel engines
Nested data-parallel code: lots of conditional data parallelism; benefits from closer coupling between CPU and GPU
Discrete GPU configurations suffer from communication latency. Nested data-parallel/braided-parallel code benefits from close coupling, which discrete GPUs don't provide well. But each individual core isn't great at certain types of algorithm.

12 FUNCTION COMPOSITION
algorithm = f ∘ g
The ability to decompose a larger problem into sub-components
The ability to replace one sub-component with another sub-component with the same external semantics

13 ARE CURRENT GPGPU PROGRAMMING MODELS COMPOSABLE? No!

14 MEMORY ADDRESS SPACES ARE NOT COMPOSABLE
void foo(global int *) { }

void bar(global int * x)
{
  foo(x); // works fine

  local int a[1];
  a[0] = *x;
  foo(a); // will now not work
}

15 BARRIER ELISION IS NOT COMPOSABLE
Parallel prefix sum (in this case taken from a RadixSort):
for level = 0 to n
  foreach work-item i in buffer range
    if( i >= 2^level )
      temp = buffer[i - 2^level] + buffer[i];
    barrier(); // barrier is not needed within a wavefront or warp;
               // it adds overhead, so it is often dropped for optimization reasons
    buffer[i] = temp;

16 HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)

17 HSA FEATURE ROADMAP
Physical Integration:
  Integrate CPU & GPU in silicon
  Unified Memory Controller
  Common Manufacturing Technology
Optimized Platforms:
  GPU Compute C++ support
  User mode scheduling
  Bi-Directional Power Mgmt between CPU and GPU
Architectural Integration:
  Unified Address Space for CPU and GPU
  GPU uses pageable system memory via CPU pointers
  Fully coherent memory between CPU & GPU
System Integration:
  GPU compute context switch
  GPU graphics pre-emption
  Quality of Service
  Extend to Discrete GPU

18 TASK QUEUING RUNTIMES
A popular pattern for task- and data-parallel programming on SMP systems today. Characterized by:
A work queue per core
A runtime library that divides large loops into tasks and distributes them to queues
A work-stealing runtime that keeps the system balanced
HSA is designed to extend this pattern to run on heterogeneous systems

19 TASK QUEUING RUNTIME ON CPUS
(Diagram: a work-stealing runtime with one queue per worker across four x86 CPUs; CPU threads and shared memory shown)

20 TASK QUEUING RUNTIME ON THE FSA PLATFORM
(Diagram: the same work-stealing runtime with an additional queue served by a GPU manager alongside the four x86 CPU workers, feeding a Radeon GPU; CPU threads and GPU threads share memory)

21 TASK QUEUING RUNTIME ON THE FSA PLATFORM
(Diagram: as before, with the GPU manager's fetch-and-dispatch unit feeding the GPU's SIMD units over shared memory)

22 HSA SOFTWARE EXAMPLE REDUCTION
float foo(float);
float myArray[ ];

Task<float, ReductionBin> sum([myArray](IndexRange<1> index) [[device]] {
  float sum = 0.;
  for (size_t i = index.begin(); i != index.end(); i++) {
    sum += foo(myArray[i]);
  }
  return sum;
});

float sum = task.enqueueWithReduce(
  Partition<1, Auto>(1920),
  [] (int x, int y) restrict(opp) { return x + y; },
  0.);

23 HETEROGENEOUS COMPUTE DISPATCH
How compute dispatch operates today in the driver model
How compute dispatch improves tomorrow under HSA

24-27 TODAY'S COMMAND AND DISPATCH FLOW
(Diagram, built up over four slides: each application — A, B, C — has its own command flow through Direct3D and a user-mode driver into a soft queue, and its own data flow through a kernel-mode driver into a command buffer and DMA buffer. All applications are then serialized into a single hardware queue on the GPU, so work from A, B and C must interleave through one shared queue.)

28 FUTURE COMMAND AND DISPATCH FLOW
Application codes to the hardware
User mode queuing
Hardware scheduling
Low dispatch times
Optional dispatch buffer
No APIs
No soft queues
No user mode drivers
No kernel mode transitions
No overhead!
(Diagram: applications A, B and C each own a hardware queue on the GPU)

29-32 FUTURE COMMAND AND DISPATCH CPU <-> GPU
(Animation over four slides: the application/runtime dispatches work directly between CPU1, CPU2 and the GPU, with no driver in the path.)

33 HETEROGENEOUS SYSTEM ARCHITECTURE PROGRAMMING MODEL (HSAPM)

34 HSAPM - FOUNDATIONS
Leverage developers' existing C++ expertise:
Type checking
C++ abstraction features
C++ compilation model (separate compilation units supported)
Small set of extensions to C++:
Task-parallel algorithms
Data-parallel algorithms

35 PROGRAMMING MODEL CONCEPTS
Kernels are in the C++0x source language; all features are supported, including exceptions
Kernels are marked with a single attribute: restrict(hsa)*
Function calls are to kernel code running on the same device; function calls to the CPU are indirect (i.e. via OPP constructs)
restrict(opencl) - OpenCL C kernel (must follow OpenCL C rules)
restrict(cpu) - CPU target, always generated
OpenCL C kernels are fully supported but not required
* Based on AMD's C++ AMP compiler technology

36 PROGRAMMING MODEL CONCEPTS
Smart pointers are used for the shared virtual address space (easily emulated):
template<class T, ...> class Pointer;
Pointer<T> malloc<T>(size_t);
void free(Pointer<T>);
Example:
Pointer<float> array = malloc<float>(10);

37 PROGRAMMING MODEL CONCEPTS
Range - an N-dimensional index space, where N is one, two or three.
Examples:
Range<1> range(100); // 1D range with extent 0..99
Range<2> range(4,8); // 2D range with extents x = 0..3, y = 0..7
Sub-index spaces are supported (i.e. groups). Example:
Range<1,1> range(100, Range<1>(10)); // 1D range with extent 0..99 and groups of 10

38 PROGRAMMING MODEL CONCEPTS
Index - a single point in the N-dimensional index space of a Range.
Examples:
Index<1> index(10); // 11th element of Range<1>(100), for example
Index<2> index(4,1); // column x = 4 and row y = 1
Sub-index spaces are supported (i.e. local index). Example:
Index<1,1> index(2, Index<1>(3));

39 PROGRAMMING MODEL CONCEPTS
DistArray - a distributed array providing:
A memory hierarchy abstraction
Memory coherency properties
Can be typed
Addresses part 1 of the function composition problem
Example:
DistArray<float> distArray; // no practical use

40 PROGRAMMING MODEL CONCEPTS
Regions - typed arrays allocated against a DistArray.
Example:
DistArray<float> distArray;
distArray.allocRegion<float>(10);
Memory access descriptors are supported (not described in this deck)

41 PROGRAMMING MODEL CONCEPTS
Bound distributed arrays - when bound at dispatch, bound arrays represent a PGAS address space.
Example:
DistArray<float> distArray;
distArray.allocRegion<float>(10);
parallelFor(Range<1>(10), distArray,
  [] (Index<1> index, BoundDistArray<float>) { });

42 PROGRAMMING MODEL CONCEPTS
parallelFor - execute a kernel across a Range; optional DistArray allocation (bound to shared memory).
Example with C++ kernel:
Pointer<float> input = malloc<float>(1024*1024);
Pointer<float> output = malloc<float>(1024*1024);
// initialize input
parallelFor(Range<1>(100), [] (Index<1> index) restrict(hsa) {
  output[index.getX()] = input[index.getX()] * 2; // double input
});
// do something with output

43 PROGRAMMING MODEL CONCEPTS
Example with OpenCL C kernel:
void square(global float * input, global float * output) restrict(opencl)
{
  output[get_global_id(0)] = input[get_global_id(0)] * input[get_global_id(0)];
}

Pointer<float> input = Pointer<float>(1024*1024);
Pointer<float> output = Pointer<float>(1024*1024);
// initialize input
parallelFor(Range<1>(100), square, input, output);
// do something with output

44 PROGRAMMING MODEL CONCEPTS
Task - execute a kernel across a range; optional DistArray allocation. Hmm... similar to parallelFor!
Can be nested on device: addresses part 2 of the function composition problem.
Example (simple):
Task<void> task([] (Index<1> index) restrict(hsa) {
  output[index.getX()] = input[index.getX()] * 2; // double input
});
Future<void> f;
task.enqueue(Range<1>(20), f);
f = task.enqueue(Range<1>(100)); // future is single assignment
task.enqueue(Range<1>(30), f);

45 DATA AFFINITY
We need the ability to describe the affinity of tasks. The default is run anywhere (CPU or GPU), but what if a task depends on a local resource, e.g. local memory? Allow the user to specify the locality of task(s).
Example:
DistArray<float> distArray;
Region<float> region = distArray.allocRegion<float>(10);
parallelFor(Range<1,1>(100, Range<1>(10)), distArray,
  [region] (Index<1,1> i, BoundDistArray<float> da) restrict(hsa) {
    Task<void> task([] (Index<1> index, Region<float> r) { });
    task.enqueue(Range<1>(100), da(region));
  });

46 DEVICE AFFINITY
We need the ability to describe the affinity of tasks with respect to a device: use affinity rather than queuing.
Example:
Task<void> task([] (Index<1> index) restrict(hsa) {
  output[index.getX()] = input[index.getX()] * 2; // double input
});
Future<void> f;
task.enqueue(Range<1>(20), f, CPU0);
f = task.enqueue(Range<1>(100));
task.enqueue<Device0>(Range<1>(30), f);

47 THERE IS A LOT MORE
This is just a short introduction to the concepts that we have been developing for AMD's future programming models. We are currently working on a paper that will (all things being well) be published later this year.
The rest of this part of the tutorial will focus in detail on two interesting (and new) aspects of this system:
Channels, a data-driven model for dynamic work generation
Barrier objects, a solution to the remaining composability problem

48 DYNAMIC TASK EXECUTION AND CHANNELS

49 MOTIVATING EXAMPLE: RAY COMPACTION

50 PRIMARY RAYS - COHERENT
(Diagram: primary rays)

51 SECONDARY RAYS - NOT COHERENT
(Diagram: primary rays spawning reflection, refraction and probe rays)

52 SHADOW RAYS - NOT COHERENT
(Diagram: a primary ray spawning shadow rays toward an area light; the in-shadow region is shown)

53 SIMD AND MEMORY DIVERGENCE
Secondary and shadow rays, for example, introduce both:
SIMD divergence (different kernels run for different types of ray)
Memory divergence (different rays have random memory access patterns)
The persistent thread model provides a solution to SIMD divergence:
Originally developed by the ray tracing group at Nvidia
We look at an example of this approach and its drawbacks
Channels (a data-flow model) provide a solution to both problems:
An alternative approach, addressing limitations of persistent thread models
Developed at AMD
We look at the underlying idea and an example of this approach

54 BUT OF COURSE
All real analyses of Task Parallel Runtimes (TPRs) need to spend a large amount of time and effort on Fib! In fairness, it demonstrates many of the important properties that a TPR must have, while of course fitting on a single slide! We will now continue in this vein!

55 PARALLEL FIB
An excellent example of how a recursive stack-based algorithm might be implemented. The OPP example fib() implementation, compiled to OpenCL with persistent threads using our Kite runtime, produces 3 tasks: TASK_RESULT, TASK_ADD and TASK_FIB. The tasks and scheduler are compiled into an uber-scheduler that generates work into (work-stealing) task queues.
int parallelFib(int n) {
  if (n <= 2) {
    return 1;
  }
  opp::future<int> x = Task<int>(
    [n] () -> int restrict(hsa) { return parallelFib(n - 1); }).enqueue();
  return x.getData() + parallelFib(n - 2);
}

56 OPENCL UBER-SCHEDULER FOR KITE FIB IMPLEMENTATION
kernel void fib(
    global CircularWSDeque * wsdeque,
    global volatile Task * continuations,
    global volatile int * nextFreeContinuationSlot,
    global int * results)
{
  global CircularWSDeque * lwsdeque = wsdeque + get_global_id(0);
  Task t = popBottom(lwsdeque);
  while (!isTaskEmpty(t)) {
    switch (t.id_) {
    case TASK_RESULT:
      results[get_global_id(0)] = t.params_.x_;
      break;
    case TASK_ADD:
      sendArgument(t.continuation_, t.result_, t.params_.x_ + t.params_.y_);
      break;
    case TASK_FIB:
      if (t.params_.x_ < 2) {
        sendArgument(lwsdeque, t.continuation_, t.result_, t.params_.x_);
      } else {
        int freeCont = atomic_inc(nextFreeContinuationSlot);
        global volatile Task * cond = &continuations[freeCont];
        *cond = makeAdd(t.result_, t.continuation_);
        pushBottom(lwsdeque, makeFib(t.params_.x_ - 1, ARG_OFFSET_FIRST, cond));
        pushBottom(lwsdeque, makeFib(t.params_.x_ - 2, ARG_OFFSET_SECOND, cond));
      }
      break;
    }
    t = popBottom(lwsdeque);
  }
}

57 KITE COMPUTE UNIT PERSISTENT THREAD MODEL
(Diagram: ACEs/CPs feed CS pipes; each compute unit CU 0..CU m runs an uber-scheduler wavefront over work-items wi0..win with private memory and an LDS-backed queue; L2-backed queues connect through the L2 cache to global memory and to x86 threads.)

58 UBER-SCHEDULER
Given N tasks (i.e. kernels) t1, ..., tn, the program is compiled to a single kernel of the form:
kernel void fib(
    global CircularWSDeque * wsdeque,
    global volatile Task * continuations,
    global volatile int * nextFreeContinuationSlot,
    global int * results)
{
  global CircularWSDeque * lwsdeque = wsdeque + get_global_id(0);
  Task t = popBottom(lwsdeque);
  while (!isTaskEmpty(t)) {
    switch (t.id_) {
    case t1_id: ...
    case tn_id: ...
    }
  }
}
Limitation: as kernels are resource-allocated statically, the uber-scheduler is restricted to the maximum resource allocation of its work ti! No resource amplification!

59 GPGPU PROGRAMMING MOVING FORWARD

60 CHANNELS - PERSISTENT CONTROL PROCESSOR THREADING MODEL
Add data-flow support to GPGPU. We are not primarily promoting producer/consumer kernel bodies — that is, we are not promoting a method where one kernel loops producing values and another loops to consume them. That has the negative behavior of promoting long-running kernels; we've tried to avoid this elsewhere by basing in-kernel launches around continuations rather than waiting on children.
Instead we assume that kernel entities produce/consume, but consumer work-items are launched on demand. An alternative to the point-to-point data flow of persistent threads, avoiding the uber-kernel.

61-75 OPERATIONAL FLOW
(Animation over fifteen slides: a kernel's work-items write values into a channel backed by Queue A, with a second Queue B also shown feeding the channel; when enough values accumulate, the channel triggers the control processor's uber-scheduler to dispatch consumer work-items, which read the buffered values back out of the channel; the write / trigger dispatch / read cycle repeats until the queues drain and the flow returns to the idle state.)

76 OPP - PERSISTENT CONTROL PROCESSOR THREAD MODEL
(Diagram: ACEs/CPs now run schedulers 0 and 1; each compute unit CU 0..CU m runs kernel wavefronts over work-items wi0..win with private memory and LDS-backed channels; channels 0..n connect through the L2 cache to global memory and to x86 threads.)

77 OPP CHANNEL IMPLEMENTATION OF FIB
class FibChannel {
private:
  atomic_int& result_;
  opp::Channel<int> channel_;
  int n_;
  std::function<bool (opp::Channel<int>*)> pred_;

public:
  FibChannel(int n, atomic_int& result) :
    result_(result),
    channel_(opp::Channel<int>(1)),
    n_(n),
    pred_([](opp::Channel<int> * c) -> bool restrict(fql) { return true; })
  {
    channel_.executeWith(
      pred_, Range<1>(1),
      [&] (Index<1>, const int v) { result_ += v; });
  }

  void operator() () restrict(hsa) {
    if (n_ <= 2) {
      channel_.write(1);
      return;
    }
    FibChannel x(n_ - 1, result_);
    FibChannel y(n_ - 2, result_);
    Task<void>(x).enqueue();
    Task<void>(y).enqueue();
  }
};

int parallelFibWithChannels(int n) {
  atomic_int result = 0;
  FibChannel(n, result)();
  return result;
}

78 LIBERATING THE GPGPU PROGRAMMING MODEL
So far we have addressed:
Tasks and dynamic work generation from anywhere in the system
Part of the function composition problem, i.e. DistArray addresses the shared and global memory problem
But we have not addressed the issue of nesting barriers within control flow

79 MAKING BARRIERS FIRST CLASS
We looked at two approaches to solving the barrier composability problem:
1. Implicit barriers - simply extend the current barrier. The problem with this approach is that it has surprisingly limited application: for example, think of a set of waves producing data for another set of waves within the same wavefront. It is not possible to express this relationship in a way that allows the producer to progress for multiple clients and multiple data sets.
2. Barrier objects - introduce barriers as first-class values with a set of well-defined operations:
Construction - initialize a barrier for some subset of work-items within a work-group or across work-groups
Arrive - the work-item marks the barrier as satisfied
Skip - the work-item marks the barrier as satisfied and notes that it will no longer take part in barrier operations. Allows early exit from a loop, or divergent control where a work-item never intends to take part in the barrier
Wait - the work-item marks the barrier as satisfied and waits for all other work-items to arrive, wait, or skip

80 BARRIER OBJECT EXAMPLE - SIMPLE DIVERGENT CONTROL FLOW
barrier b(8);
parallelFor(Range<1>(8), [&b] (Index<1> i) {
  int val = i.getX();
  scratch[val] = val;
  if( val < 4 ) {
    b.wait();
    x[val] = scratch[val+1];
  } else {
    b.skip();
    x[val] = 17;
  }});

81 BARRIER OBJECT EXAMPLE - LOOP FLOW
barrier b(8);
parallelFor(Range<1>(8), [&b] (Index<1> i) {
  int val = i.getX();
  scratch[val] = val;
  if( val < 4 ) {
    for( int j = 0; j < val; ++j ) {
      b.wait();
      x[val] += scratch[val+1];
    }
    b.skip();
  } else {
    b.skip();
    x[val] = 17;
  }});

82 BARRIER OBJECT EXAMPLE - CALLING A FUNCTION FROM WITHIN CONTROL FLOW
barrier b(8);
parallelFor(Range<1>(8), [&b] (Index<1> i) {
  scratch[i.getX()] = i.getX();
  if( i.getX() < 4 ) {
    someOpaqueLibraryFunction(i.getX(), b);
  } else {
    b.skip();
    x[i.getX()] = 17;
  }});

void someOpaqueLibraryFunction(const int i, barrier &b) {
  for( int j = 0; j < i; ++j ) {
    b.wait();
    x[i] += scratch[i+1];
  }
  b.skip();
}

83 BARRIER OBJECT EXAMPLE - CALLING A FUNCTION FROM WITHIN CONTROL FLOW
barrier b(8);
parallelFor(Range<1>(8), [&b] (Index<1> i) {
  scratch[i.getX()] = i.getX();
  if( i.getX() < 4 ) {
    someOpaqueLibraryFunction(i.getX(), b);
  } else {
    b.wait(); // This is INVALID
    x[i.getX()] = 17;
  }});
This is INVALID: the caller has no way to know how many times the library function will use the barrier; therefore skip is the only option.

84 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
Advanced Micro Devices, Inc.


More information

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011 MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to

More information

Sequential Consistency for Heterogeneous-Race-Free

Sequential Consistency for Heterogeneous-Race-Free Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD JUNE 12, 2013 EXECUTIVE

More information

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011 AMD IOMMU VERSION 2 How KVM will use it Jörg Rödel August 16th, 2011 AMD IOMMU VERSION 2 WHAT S NEW? 2 AMD IOMMU Version 2 Support in KVM August 16th, 2011 Public NEW FEATURES - OVERVIEW Two-level page

More information

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018 clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018 ANECDOTE DISCOVERING A BUFFER OVERFLOW CPU GPU MEMORY MEMORY Data Data Data Data Data 2 clarmor: A

More information

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to the features, functionality, availability, timing,

More information

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time

More information

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015 The Rise of Open Programming Frameworks JC BARATAULT IWOCL May 2015 1,000+ OpenCL projects SourceForge GitHub Google Code BitBucket 2 TUM.3D Virtual Wind Tunnel 10K C++ lines of code, 30 GPU kernels CUDA

More information

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016 AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING BILL.BRANTLEY@AMD.COM, FELLOW 3 OCTOBER 2016 AMD S VISION FOR EXASCALE COMPUTING EMBRACING HETEROGENEITY CHAMPIONING OPEN SOLUTIONS ENABLING LEADERSHIP

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS ARKAPRAVA BASU, JOSEPH L. GREATHOUSE, GURU VENKATARAMANI, JÁN VESELÝ AMD RESEARCH, ADVANCED MICRO DEVICES, INC. MODERN SYSTEMS ARE POWERED BY HETEROGENEITY

More information

Cilk Plus: Multicore extensions for C and C++

Cilk Plus: Multicore extensions for C and C++ Cilk Plus: Multicore extensions for C and C++ Matteo Frigo 1 June 6, 2011 1 Some slides courtesy of Prof. Charles E. Leiserson of MIT. Intel R Cilk TM Plus What is it? C/C++ language extensions supporting

More information

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration Accelerating Applications the art of maximum performance computing James Spooner Maxeler VP of Acceleration Introduction The Process The Tools Case Studies Summary What do we mean by acceleration? How

More information

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD D3D12 & Vulkan: Lessons learned Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD D3D12 What s new? DXIL DXGI & UWP updates Root Signature 1.1 Shader cache GPU validation PIX D3D12 / DXIL DXBC

More information

Fusion Enabled Image Processing

Fusion Enabled Image Processing Fusion Enabled Image Processing I Jui (Ray) Sung, Mattieu Delahaye, Isaac Gelado, Curtis Davis MCW Strengths Complete Tools Port, Explore, Analyze, Tune Training World class R&D team Leading Algorithms

More information

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD AMD APU and Processor Comparisons AMD Client Desktop Feb 2013 AMD SUMMARY 3DMark released Feb 4, 2013 Contains DirectX 9, DirectX 10, and DirectX 11 tests AMD s current product stack features DirectX 11

More information

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE VIGNESH ADHINARAYANAN, INDRANI PAUL, JOSEPH L. GREATHOUSE, WEI HUANG, ASHUTOSH PATTNAIK, WU-CHUN FENG POWER AND ENERGY ARE FIRST-CLASS

More information

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling AMD Radeon Open Compute Platform Felix Kuehling ROCM PLATFORM ON LINUX Compiler Front End AMDGPU Driver Enabled with ROCm GCN Assembly Device LLVM Compiler (GCN) LLVM Opt Passes GCN Target Host LLVM Compiler

More information

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT Bootstrapping the industry for better VR experience Complimentary to HMD SDKs It s all about giving developers the tools they want! AMD LIQUIDVR

More information

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015 SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015 PROBLEM Shaders are compiled in draw calls Emulating certain features in shaders Drivers keep shaders in some intermediate representation And insert

More information

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations

More information

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer STREAMING VIDEO DATA INTO 3D APPLICATIONS Session 2116 Christopher Mayer AMD Sr. Software Engineer CONTENT Introduction Pinned Memory Streaming Video Data How does the APU change the game 3 Streaming Video

More information

viewdle! - machine vision experts

viewdle! - machine vision experts viewdle! - machine vision experts topic using algorithmic metadata creation and heterogeneous computing to build the personal content management system of the future Page 2 Page 3 video of basic recognition

More information

KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015

KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015 KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015 AGENDA Background & Motivation Challenges Native Page Tables Emulating the OS Kernel 2 KVM CPU MODEL IN SYSCALL EMULATION

More information

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

Multi-core processors are here, but how do you resolve data bottlenecks in native code? Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session

More information

Handout 3. HSAIL and A SIMT GPU Simulator

Handout 3. HSAIL and A SIMT GPU Simulator Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants

More information

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008 Run Anywhere The Hardware Platform Perspective Ben Pollan, AMD Java Labs October 28, 2008 Agenda Java Labs Introduction Community Collaboration Performance Optimization Recommendations Leveraging the Latest

More information

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION 1 3DMark Overview February 2013 Approved for public distribution 2 3DMark Overview February 2013 Approved for public distribution

More information

Vulkan (including Vulkan Fast Paths)

Vulkan (including Vulkan Fast Paths) Vulkan (including Vulkan Fast Paths) Łukasz Migas Software Development Engineer WS Graphics Let s talk about OpenGL (a bit) History 1.0-1992 1.3-2001 multitexturing 1.5-2003 vertex buffer object 2.0-2004

More information

RegMutex: Inter-Warp GPU Register Time-Sharing

RegMutex: Inter-Warp GPU Register Time-Sharing RegMutex: Inter-Warp GPU Register Time-Sharing Farzad Khorasani* Hodjat Asghari Esfeden Amin Farmahini-Farahani Nuwan Jayasena Vivek Sarkar *farkhor@gatech.edu The 45 th International Symposium on Computer

More information

Gestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO

Gestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO Gestural and Cinematic Interfaces - DX11 David Brebner Unlimited Realities CTO Gestural and Cinematic Interfaces DX11 Making an emotional connection with users 3 Unlimited Realities / Fingertapps About

More information

NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer

NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer SESSION AGENDA Quick Keywords Abstract and Scope Introduction Current User Interface [ UI

More information

ROCm: An open platform for GPU computing exploration

ROCm: An open platform for GPU computing exploration UCX-ROCm: ROCm Integration into UCX {Khaled Hamidouche, Brad Benton}@AMD Research ROCm: An open platform for GPU computing exploration 1 JUNE, 2018 ISC ROCm Software Platform An Open Source foundation

More information

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected

More information

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager INTRODUCTION WHO WE ARE 3 Highly Parallel Computing in Physics-based

More information

MICHAL MROZEK ZBIGNIEW ZDANOWICZ

MICHAL MROZEK ZBIGNIEW ZDANOWICZ MICHAL MROZEK ZBIGNIEW ZDANOWICZ Legal Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY

More information

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES VERSION 1 - FEBRUARY 2018 CONTACT Address Advanced Micro Devices, Inc 7171 Southwest Pkwy Austin, Texas 78735 United States Phone

More information

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018 CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected

More information

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL Matthias Bach and David Rohr Frankfurt Institute for Advanced Studies Goethe University of Frankfurt I: INTRODUCTION 3 Scaling

More information

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data Andrew Miller Computer Vision Group Research Developer 3-D TERRAIN RECONSTRUCTION

More information

Designing Natural Interfaces

Designing Natural Interfaces Designing Natural Interfaces So what? Computers are everywhere C.T.D.L.L.C. Computers that don t look like computers. Computers that don t look like Computers Computers that don t look like Computers

More information

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO Desktop Telepresence Arrived! Sudha Valluru ViVu CEO 3 Desktop Telepresence Arrived! Video Collaboration market Telepresence Telepresence Cost Expensive Expensive HW HW Legacy Apps Interactivity ViVu CONFIDENTIAL

More information

Graphics Hardware 2008

Graphics Hardware 2008 AMD Smarter Choice Graphics Hardware 2008 Mike Mantor AMD Fellow Architect michael.mantor@amd.com GPUs vs. Multi-core CPUs On a Converging Course or Fundamentally Different? Many Cores Disruptive Change

More information

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide AMD HD3D Technology Setup Guide 1 AMD HD3D TECHNOLOGY: Setup Guide Contents AMD HD3D Technology... 3 Frame Sequential Displays... 4 Supported 3D Display Hardware... 5 AMD Display Drivers... 5 Configuration

More information

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION 1 3DMark Overview April 2013 Approved for public distribution 2 3DMark Overview April 2013 Approved for public distribution

More information

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions FLASH MEMORY SUMMIT 2011 Adoption of Caching & Hybrid Solutions Market Overview 2009 Flash production reached parity with all other existing solid state memories in terms of bites. 2010 Overall flash production

More information

HyperTransport Technology

HyperTransport Technology HyperTransport Technology in 2009 and Beyond Mike Uhler VP, Accelerated Computing, AMD President, HyperTransport Consortium February 11, 2009 Agenda AMD Roadmap Update Torrenza, Fusion, Stream Computing

More information

1 Presentation Title Month ##, 2012

1 Presentation Title Month ##, 2012 1 Presentation Title Month ##, 2012 Malloc in OpenCL kernels Why and how? Roy Spliet Bsc. (r.spliet@student.tudelft.nl) Delft University of Technology Student Msc. Dr. A.L. Varbanescu Prof. Dr. Ir. H.J.

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee. Tim Foley Intel Corp Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

More information

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 TOPICS Data transfer Parallelism Coalesced memory access Best work group size Occupancy branching All the performance numbers come from a W8100 running

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture HPCA 18 Reliability-aware Data Placement for Heterogeneous memory Architecture Manish Gupta Ψ, Vilas Sridharan*, David Roberts*, Andreas Prodromou Ψ, Ashish Venkat Ψ, Dean Tullsen Ψ, Rajesh Gupta Ψ Ψ *

More information

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

Maximizing Six-Core AMD Opteron Processor Performance with RHEL Maximizing Six-Core AMD Opteron Processor Performance with RHEL Bhavna Sarathy Red Hat Technical Lead, AMD Sanjay Rao Senior Software Engineer, Red Hat Sept 4, 2009 1 Agenda Six-Core AMD Opteron processor

More information

AMD SEV Update Linux Security Summit David Kaplan, Security Architect

AMD SEV Update Linux Security Summit David Kaplan, Security Architect AMD SEV Update Linux Security Summit 2018 David Kaplan, Security Architect WHY NOT TRUST THE HYPERVISOR? Guest Perspective o Hypervisor is code I don t control o I can t tell if the hypervisor is compromised

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

High Performance Graphics 2010

High Performance Graphics 2010 High Performance Graphics 2010 1 Agenda Radeon 5xxx Product Family Highlights Radeon 5870 vs. 4870 Radeon 5870 Top-Level Radeon 5870 Shader Core References / Links / Screenshots Questions? 2 ATI Radeon

More information

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s strategy and focus, expected datacenter total

More information

AMD S X86 OPEN64 COMPILER. Michael Lai AMD

AMD S X86 OPEN64 COMPILER. Michael Lai AMD AMD S X86 OPEN64 COMPILER Michael Lai AMD CONTENTS Brief History AMD and Open64 Compiler Overview Major Components of Compiler Important Optimizations Recent Releases Performance Applications and Libraries

More information

Introduction to Parallel Programming Models

Introduction to Parallel Programming Models Introduction to Parallel Programming Models Tim Foley Stanford University Beyond Programmable Shading 1 Overview Introduce three kinds of parallelism Used in visual computing Targeting throughput architectures

More information

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc. Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance

More information

3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components

3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components 3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components Xudong An, Manish Arora, Wei Huang, William C. Brantley, Joseph L. Greathouse AMD Research Advanced Micro Devices, Inc. MOTIVATION

More information

Pattern-based analytics to estimate and track yield risk of designs down to 7nm

Pattern-based analytics to estimate and track yield risk of designs down to 7nm DAC 2017 Pattern-based analytics to estimate and track yield risk of designs down to 7nm JASON CAIN, MOUTAZ FAKHRY (AMD) PIYUSH PATHAK, JASON SWEIS, PHILIPPE HURAT, YA-CHIEH LAI (CADENCE) INTRODUCTION

More information

ASYNCHRONOUS SHADERS WHITE PAPER 0

ASYNCHRONOUS SHADERS WHITE PAPER 0 ASYNCHRONOUS SHADERS WHITE PAPER 0 INTRODUCTION GPU technology is constantly evolving to deliver more performance with lower cost and lower power consumption. Transistor scaling and Moore s Law have helped

More information

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015 The Road to the AMD Fiji GPU Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015 Fiji Chip DETAILED LOOK 4GB High-Bandwidth Memory 4096-bit wide interface 512 GB/s

More information

Anatomy of AMD s TeraScale Graphics Engine

Anatomy of AMD s TeraScale Graphics Engine Anatomy of AMD s TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream

More information

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management Next-Generation Mobile Computing: Balancing Performance and Power Efficiency HOT CHIPS 19 Jonathan Owen, AMD Agenda The mobile computing evolution The Griffin architecture Memory enhancements Power management

More information

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide AMD Radeon ProRender plug-in for Unreal Engine Installation Guide This document is a guide on how to install and configure AMD Radeon ProRender plug-in for Unreal Engine. DISCLAIMER The information contained

More information

CLICK TO EDIT MASTER TITLE STYLE. Click to edit Master text styles. Second level Third level Fourth level Fifth level

CLICK TO EDIT MASTER TITLE STYLE. Click to edit Master text styles. Second level Third level Fourth level Fifth level CLICK TO EDIT MASTER TITLE STYLE Second level THE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU PAUL BLINZER, FELLOW, HSA SYSTEM SOFTWARE, AMD SYSTEM ARCHITECTURE WORKGROUP CHAIR, HSA FOUNDATION

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016 Copyright Khronos Group, 2014 - Page 1 SYCL SG14, February 2016 BOARD OF PROMOTERS Over 100 members worldwide any company is welcome to join Copyright Khronos Group 2014 SYCL 1. What is SYCL for and what

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

AMD AIB Partner Guidelines. Version February, 2015

AMD AIB Partner Guidelines. Version February, 2015 AMD AIB Partner Guidelines Version 1.0 - February, 2015 The Purpose of This Document These guidelines provide direction for our Add-in-Board (AIB) partners and customers to market the benefits of AMD products

More information

PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017

PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017 PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017 BACKGROUND-- HARDWARE MEMORY ENCRYPTION AMD Secure Memory Encryption (SME) / AMD Secure Encrypted Virtualization (SEV) Hardware AES engine

More information

Programmer's View of Execution Teminology Summary

Programmer's View of Execution Teminology Summary CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 28: GP-GPU Programming GPUs Hardware specialized for graphics calculations Originally developed to facilitate the use of CAD programs

More information

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome! /INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU

More information

MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK

MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK Wei-Lien Hsu AMD SMTS Gongyuan Zhuang AMD MTS OUTLINE Motivation OpenDecode Video deblurring algorithms Acceleration by clamdfft

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU)

Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU) Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU) Aaron Lefohn, Intel / University of Washington Mike Houston, AMD / Stanford 1 What s In This Talk? Overview of parallel programming

More information

Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs

Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs , Inc. Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs Doug Finke Director of Product Marketing September 2016

More information