1 Advanced Topics on Heterogeneous System Architectures GPU! Politecnico di Milano! Seminar Room, Bld 20! 11 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano!

2 2 Introduction! First GPU released in 1999! Used for the purpose of graphics processing! GPU architectures rapidly evolved, providing higher computational power by means of parallelization! GPU architectures also evolved to support the programmability of their components!

3 3 Introduction! In 2006, NVIDIA introduced the GeForce 8800 GPU supporting a new programming language: CUDA! Compute Unified Device Architecture! Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas! Idea: take advantage of GPU computational performance and memory bandwidth to accelerate kernels for general-purpose computing! The host CPU issues data-parallel kernels to the GP-GPU for execution!
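
To make this concrete (this example is not from the original slides), here is a minimal CUDA sketch in which the host CPU allocates device memory and issues a data-parallel kernel; the kernel name `scale` and the launch parameters are illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A data-parallel kernel: each GPU thread scales one array element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));           // allocate device memory
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n); // the host issues the kernel
    cudaDeviceSynchronize();                     // wait for completion
    cudaFree(d);
    return 0;
}
```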

4 4 Introduction! CPU and GPU performance trends! FLOPS: FLoating-point OPerations per Second!

5 5 Graphics pipeline! At the beginning there was the graphics pipeline!

6 6 Graphics pipeline!

7 7 Vertex generation! The host interface is the communication bridge between the CPU and the GPU! It receives commands from the CPU and also pulls geometry information from system memory! It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.)!

8 8 Vertex processing! The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space! Transformations are based on matrix multiplications! No new vertices are created in this stage, and no vertices are discarded (input/output has 1:1 mapping)!

9 9 Vertex processing! 1. Model to world coordinates! 2. World to eye coordinates! 3. Eye to clip coordinates! Textures may also be used for advanced transformations (they provide height maps for displacement mapping)!
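
As a rough illustration of vertex processing as matrix arithmetic (a hypothetical CUDA kernel, not the actual fixed-function hardware): each thread transforms one vertex by a 4x4 matrix, preserving the 1:1 input/output mapping noted above:

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: one thread per vertex, 1:1 input/output mapping.
// m is a 4x4 transform in row-major order; in/out hold xyzw positions.
__global__ void transformVertices(const float *m, const float4 *in,
                                  float4 *out, int nVerts) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVerts) return;
    float4 p = in[v];
    out[v] = make_float4(
        m[0]*p.x  + m[1]*p.y  + m[2]*p.z  + m[3]*p.w,
        m[4]*p.x  + m[5]*p.y  + m[6]*p.z  + m[7]*p.w,
        m[8]*p.x  + m[9]*p.y  + m[10]*p.z + m[11]*p.w,
        m[12]*p.x + m[13]*p.y + m[14]*p.z + m[15]*p.w);
}
```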

10 10 Primitive generation! The primitive assembler groups vertices to form one primitive (e.g., a triangle)!

11 11 Primitive processing! Various elaborations are performed! Perspective division and viewport transformation! Clipping!

12 12 Fragment generation! Geometry information is transformed into raster information (pixels in output)! Determine which pixels a primitive overlaps! Aliasing and other issues!

13 13 Fragment processing! Assign colors to pixels! Shades the fragment by simulating the interaction of light and material!

14 14 Fragment processing! Effects of tessellation! Texture mapping! Lighting and texture!

15 15 Pixel operations! Before the final write occurs, some fragments are rejected by the z-buffer, stencil and alpha tests! Finally, pixels are copied into the framebuffer (the memory space connected to the screen controller)!
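
A hedged, C-like sketch of the z-buffer test described above (names and data layout are assumptions; real hardware implements this as a fixed-function unit):

```cuda
// Hypothetical sketch of the per-fragment z-buffer test: the fragment is
// rejected unless it is closer than the depth already stored.
struct Fragment { int x, y; float z; float r, g, b, a; };

__device__ void depthTestAndWrite(Fragment f, float *zbuf, uchar4 *fb, int w) {
    int idx = f.y * w + f.x;
    if (f.z < zbuf[idx]) {               // depth test: smaller z = closer
        zbuf[idx] = f.z;                 // update the depth buffer
        fb[idx] = make_uchar4(           // write the color to the framebuffer
            (unsigned char)(f.r * 255.0f), (unsigned char)(f.g * 255.0f),
            (unsigned char)(f.b * 255.0f), (unsigned char)(f.a * 255.0f));
    }                                     // otherwise the fragment is discarded
}
```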

16 16 Graphics pipeline! Single-stage elaborations can be performed in parallel on each chunk of data (vertex, fragment, pixel, etc.)! Single-stage elaborations are mainly data-intensive, based on matrix elaborations and other arithmetic operations! Few branch instructions! Few data dependencies! PARALLELISM!

17 17 Evolution of the graphics pipeline! Pre GPU! Fixed function GPU! Programmable GPU! Unified shader processors!

18 18 Early 90s pre GPU!

19 19 Early 90s pre GPU! SGI RealityEngine (1993)! Graphics supercomputer! Various boards! Programmable, just not by the application! The geometry engine contained an Intel i860 XP processor!

20 20 Exploit parallelism! Goals of GPUs?! Pipeline parallelism! Data parallelism! CPU and GPU executing in parallel! Specific hardware accelerators! Texture filtering, rasterization, MAC, sqrt, ...!

21 21 3dfx Voodoo (1996)! Fixed-function rasterization, texture mapping, depth testing, etc.! Required a separate VGA card for 2D!

22 22 NVIDIA GeForce 256 (1999)! All stages implemented in hardware! Fixed-function rasterization, texture mapping, depth testing, etc.!

23 23 NVIDIA GeForce 3 (2001)! Optionally bypass fixed-function logic with a programmable vertex shader! Shader: a mini-program defining the logic of a pipeline stage! A specific shading language has to be used (e.g., GLSL, the OpenGL shading language)!

24 24 NVIDIA GeForce 6 (2004)! Improved programmability in the fragment shader! The vertex shader can read textures! Dynamic branches!

25 25 NVIDIA GeForce 6 (2004)! Pipelined architecture! Multiple cores for each stage! Programmable stages! The introduction of programmable stages requires fetch and decode units!

26 26 NVIDIA GeForce 7800 (2005)! (Figure: pipeline with fixed and programmable stages for vertex, fragment and composite processing)! The introduction of programmable stages requires fetch and decode units!

27 27 NVIDIA GeForce 8 (2006)! Ground-up architecture redesign! New geometry shader after the vertex shader! Introduction of the unified shader processor! Introduction of CUDA! Employment of the GPU for general-purpose computing: GP-GPU!

28 28 NVIDIA GeForce Tesla (2006)! Introduction of issue units for managing thread generation and scheduling!

29 29 Why a single shader processor?! Non-unified shader processors! Problems in balancing the workload across pipeline stages! With a heavy pixel workload, the pixel shader is the bottleneck! With a heavy geometry workload, the vertex shader is the bottleneck!

30 30 Why a single shader processor?! Unified shader processors! The same unified shader pool handles both heavy pixel workloads and heavy geometry workloads! Optimal usage of processing resources!

31 31 Unified shader processor! How the unified shader processor works! Three key ideas:! Instantiate many shader processors! Replicate ALUs inside the shader processor to enable SIMD processing! Interleave the execution of many groups of SIMD threads! SIMD: single instruction, multiple data!

33 33 Example: a diffuse reflectance shader! Shader programming model: fragments (or, more in general, work items) are processed independently! The function has to be executed for each fragment! One instruction stream per fragment!
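
A minimal sketch of such a diffuse (Lambertian) reflectance shader, written here as a per-work-item CUDA kernel rather than in a shading language; all names are illustrative:

```cuda
#include <cuda_runtime.h>

// Each thread shades one fragment independently: out = albedo * max(0, N.L).
__global__ void diffuseShade(const float4 *normals, const float4 *albedo,
                             float4 *out, float4 lightDir, int nFrags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nFrags) return;
    float4 n = normals[i];
    // N . L clamped to zero: surfaces facing away receive no light.
    float ndotl = fmaxf(0.0f, n.x*lightDir.x + n.y*lightDir.y + n.z*lightDir.z);
    out[i] = make_float4(albedo[i].x * ndotl, albedo[i].y * ndotl,
                         albedo[i].z * ndotl, 1.0f);
}
```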

34 Basic architecture of a modern CPU! 34

35 35 Basic architecture of a modern GPU! Remove components that help a single instruction stream run faster!

36 36 Replicate cores! Replicate cores to run several threads in parallel! 2 cores process 2 instruction streams in parallel!

37 37 Replicate cores! Replicate cores to run several threads in parallel! 4 cores process 4 instruction streams in parallel!

38 38 Replicate cores! Replicate cores to run several threads in parallel! 16 cores process 16 instruction streams in parallel!

39 39 Replicate cores! Replicate cores to run several threads in parallel! 16 cores process 16 instruction streams in parallel! PROBLEM: many cores would share the same instruction stream! Since each core has its own fetch and decode unit, we would rather run different instruction streams!

40 40 Replicate ALUs within the core! SIMD processing!

41 41 Replicate ALUs within the core! SIMD processing! Original compiled shader: processing one item using scalar operations on scalar registers!

42 42 Replicate ALUs within the core! SIMD processing! New compiled shader: processing 8 items using vector operations on vector registers!

43 43 Replicate ALUs within the core! SIMD processing!

44 44 Replicate ALUs within the core! SIMD processing does not imply SIMD instructions! Option 1: explicit vector instructions! Intel/AMD x86 SSE and AVX, IBM AltiVec (explicit vector length)! Option 2: scalar instructions with implicit HW vectorization! HW determines instruction stream sharing across ALUs! NVIDIA GeForce (SIMT warps), ATI Radeon architectures! SIMT: single instruction, multiple threads! Split identical independent work items over multiple threads executed in lockstep! An instruction stream of scalar instructions is shared among the various threads!
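
A short CUDA sketch of Option 2: the programmer writes a scalar instruction stream per work item, and the hardware executes groups of 32 such threads (warps) in lockstep (the kernel is illustrative):

```cuda
__global__ void saxpy(float a, const float *x, float *y, int n) {
    // Each thread runs this scalar instruction stream on its own item.
    // The hardware implicitly executes the 32 threads of a warp in
    // lockstep, turning each scalar instruction into a 32-wide SIMD one.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```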

45 45 Replicate ALUs within the core! SIMT processing!

46 46 Merging the two levels of replication! Result: a multicore architecture where each core is a SIMD architecture! 16 cores, each one having 8 ALUs = 128 simultaneous threads!

47 47 Branches! Branches have to be accurately handled!
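
An illustrative CUDA kernel showing why branches need careful handling: if lanes of a warp diverge, the warp executes both paths with inactive lanes masked off (the kernel itself is hypothetical):

```cuda
__global__ void divergent(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // If threads of the same warp take different sides of this branch,
    // the warp executes BOTH paths, masking off the inactive lanes,
    // so throughput drops until the paths reconverge.
    if (in[i] > 0.0f)
        out[i] = sqrtf(in[i]);   // path A: taken by some lanes
    else
        out[i] = 0.0f;           // path B: taken by the others
}
```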

50 50 Stalls! The execution of an instruction may have a data dependency on a previous one (still running) -> stall!! Example: access to texture memory (100x slower than ALU instructions)!

51 51 Stalls! Stalls due to data dependencies have to be managed as well! Memory accesses cause many stalls due to their considerably higher execution time with respect to ALU instructions (100x-1000x)! The fancy caches and logic that avoid stalls in CPUs have been removed! However, on a GPU we can run MANY independent instruction streams concurrently!

52 52 Stalls! Interleave processing of many streams on a single core to hide stalls caused by latency operations!

56 56 Stalls! Interleaving between contexts can be managed by HW or SW or both! The NVIDIA and AMD Radeon GPU approach:! HW schedules and manages all contexts! Special on-chip storage holds work-item state!
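
A common CUDA idiom consistent with this model (a sketch, not from the slides): launch far more threads than there are ALUs, so the hardware scheduler always has ready warps to swap in when one stalls on memory:

```cuda
__global__ void copyStride(const float *in, float *out, size_t n) {
    // Grid-stride loop: the grid is sized to oversubscribe the SMs, so
    // while one warp waits on a memory load the scheduler issues another.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        out[i] = in[i];
}
// Launch example: many more threads than ALUs, to give the scheduler
// enough ready warps to hide DRAM latency, e.g.
//   copyStride<<<1024, 256>>>(d_in, d_out, n);
```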

57 57 How to dimension the context! (Figure: many small contexts give maximal latency-hiding ability; few large contexts give low latency-hiding ability)!

58 58 Basic architecture of a modern GPU! Summary:! Use many slimmed-down cores to run in parallel! Pack cores full of ALUs (by sharing instruction streams across groups of work items)! Option 1: explicit SIMD vector instructions! Option 2: implicit sharing managed by HW! Avoid latency stalls by interleaving the execution of many groups of work items/threads! When a group stalls, work on another group!

59 59 NVIDIA Fermi (2010)! 16 streaming multiprocessors! Each multiprocessor has 32 streaming cores! SIMT: single instruction, multiple threads! 6 64-bit memory partitions! 1 global scheduler!

60 60 Streaming Multiprocessor (SM)! 32 streaming cores (SC)! 32-bit pipelined integer arithmetic unit with support for 64-bit operations (1 cycle)! IEEE single/double-precision floating-point unit providing multiply-add instructions (1/2 cycles)! 16 load/store units! Concurrent access to data at each address of the cache or DRAM (1 cycle on a cache hit)! Memory seen as a 2D array! Int-to-float (and vice versa) casts performed while copying data from memory to the register file!

61 61 Streaming Multiprocessor (SM)! 4 special function units (SFUs)! For transcendental functions (sine, cosine, square root, etc.)! Slower than the other units (8 cycles)! Decoupled from the dispatching units to improve performance! Shader clock = 2x GPU clock to optimize the area/performance trade-off!

62 62 Streaming Multiprocessor (SM)! PTX (Parallel Thread Execution) is a scalar instruction set architecture! Arithmetic/logic operations! Special functions! Memory load/store! Texture fetch! Memory atomic instructions! Branch, call, return! Barrier synchronization!

63 63 Streaming Multiprocessor (SM)! Threads are grouped in sets of 32 threads sharing an instruction stream, called warps! The SM has 2 scheduling and dispatching units! Two warps are selected each clock cycle (fetch, decode and execute two warps in parallel)! The register file may host up to 48 interleaved warps! 1536 threads per SM! Globally, 24,576 threads (16 SMs x 1536)!

64 64 Streaming Multiprocessor (SM)! Each scheduler may issue an instruction to 16 ALU cores, 16 load/store units, or 4 SFUs! Each double-precision FPU instruction requires 2 ALU cores! Each clock cycle the scheduler selects a warp that is ready to be executed! Warps are independent -> no dependency check is required! Double-precision instructions do not support dual dispatch!

65 65 Streaming Multiprocessor (SM)! An example of scheduling of 32 instructions! A scoreboard is used to keep track of the status of the thread warps!

66 66 Other features of Fermi! 2-level distributed scheduler! At chip level, a global workload distribution engine dispatches thread blocks to the various SMs! As already discussed, at SM level each warp scheduler distributes warps! Support for fast context switching among different applications (around 25us)!

67 67 Other features of Fermi! Support for concurrent execution of multiple compute kernels from the same application!

68 68 NVIDIA Kepler (2012)! Architecture similar to Fermi's, with performance and power-efficiency improvements!

69 69 Streaming Multiprocessor (SMX)! New SM architecture (now called SMX)! 192 single-precision streaming cores (floating-point/integer instructions)! 64 double-precision streaming cores! 32 special function units! 32 load/store units! 8 texture filtering units! Shader clock = GPU clock!

70 70 Streaming Multiprocessor (SMX)! New SM architecture (now called SMX)! 4 warp schedulers! 2 dispatchers per warp scheduler (each cycle, 2 independent instructions per warp can be issued)! The warp scheduler has been enhanced! Register scoreboarding for long-latency operations (texture and load)! Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)! Thread-block-level scheduling!

72 72 Dynamic parallelism! A kernel can launch another kernel! Kernel launch, synchronization on results and work scheduling are managed directly by the GPU device! This allows the optimization of recursive and data-dependent execution patterns!
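
A minimal sketch of dynamic parallelism (hypothetical kernels; requires compiling with relocatable device code, -rdc=true, for compute capability 3.5 or newer):

```cuda
#include <cuda_runtime.h>

__global__ void child(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

__global__ void parent(int *data, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // Launch and synchronization are handled by the GPU, not the host.
        child<<<(n + 127) / 128, 128>>>(data, n);
        cudaDeviceSynchronize();   // device-side sync on the child grid
                                   // (legacy CDP API, valid in this era)
    }
}
```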

73 73 Several work queues! The Hyper-Q mechanism offers 32 HW-managed work queues! Hyper-Q enables multiple CPU cores (and applications) to launch work on a single GPU simultaneously! Advantages:! Avoidance of false intra-stream dependencies! Dramatic increase of GPU utilization and reduction of CPU idle times!

74 74 NVIDIA Maxwell (2014)! Architecture similar to Kepler's! New streaming multiprocessor (SMM)! SCs are partitioned in 4 groups, each one assigned to a single warp scheduler! Performance/power improvements!

75 75 NVIDIA Pascal (2017)! Architecture similar to Maxwell's! Further performance/power improvements! Introduction of half-precision floating-point instructions (2x throughput w.r.t. single precision)! Preemption at single-instruction granularity!

76 76 Cache organization! CPU cores run efficiently when data is resident in cache! Caches reduce latency and provide high bandwidth!

77 77 Cache organization! Initially the GPU core was not provided with caches! The GPU core required a high-bandwidth connection to memory!

78 78 Limited bandwidth! A high-end GPU (e.g. a Radeon HD) has...! Over twenty times (2.7 TFLOPS) the compute performance of a quad-core CPU! No large cache hierarchy to absorb memory requests! The GPU memory system is designed for throughput! Wide bus (150 GB/sec)! Repack/reorder/interleave memory requests to optimize use of the memory bus! Still, this is only 5 times the bandwidth available to a CPU!

79 79 Limited bandwidth! A more recent system configuration (Intel Core i CPU + NVIDIA GeForce GTX GPU)!

80 80 Limited bandwidth! If processors request data at too high a rate, the memory system cannot keep up! Overcoming bandwidth limits is a common challenge for GPU-compute application developers!! Request data less often (instead, do more math)! Arithmetic intensity! It might be quicker to calculate something from scratch on the device instead of copying it from the host! Fetch data from memory less often (share/reuse data across fragments)! On-chip communication or storage! Graphics elaborations fit well with these constraints! More ALU operations than memory accesses!

81 81 Modern GPU memory hierarchy! Modern GPUs are provided with! Local memories (not synched with main memory)! Texture caches (read-only)! Moreover, L1-L2 caches have been added in later architectures! Do consider that GPU applications present high spatial locality but very low temporal locality! Cache coherency only on the L2 cache!

82 82 Transmission cost! Another relevant aspect is the CPU/GPU transmission bandwidth! PCIe bandwidth: 16 GB/s in each direction! Attempt to pipeline/multi-buffer uploads and downloads!
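
A hedged sketch of the pipelining/multi-buffering suggestion above, using two CUDA streams and asynchronous copies; names are illustrative, n is assumed to be a multiple of the chunk count, and the host buffer is assumed to be pinned (cudaHostAlloc) for the copies to truly overlap:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) {       // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// Split the work into chunks and alternate two streams/buffers so PCIe
// uploads, kernel execution and downloads overlap (double buffering).
void pipelined(float *h_data, int n, int chunks) {
    int chunk = n / chunks;
    float *d[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;                              // alternate the buffers
        float *h = h_data + (size_t)c * chunk;
        cudaMemcpyAsync(d[b], h, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);    // upload
        process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d[b], chunk);
        cudaMemcpyAsync(h, d[b], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);    // download
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(s[b]);
        cudaStreamDestroy(s[b]);
        cudaFree(d[b]);
    }
}
```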

83 83 NVIDIA GeForce 8 (2006)! Each SM is provided with! 16KB shared memory! 64KB constant cache! 8KB texture cache! Each processor can access all memory locations at 86 GB/s, with different latencies:! Shared: 2 cycles! Device: 300 cycles!

84 84 NVIDIA Fermi (2010)! Each SM is provided with 64KB of local shared memory used by thread blocks to cooperate! Reduction of off-chip traffic! The shared memory is a user-managed scratchpad memory! The shared memory can be configured by the programmer to also obtain an L1 cache! Introduced a chip-level L2 cache! Unified address space over private, shared and global memory!
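
An illustrative use of the user-managed scratchpad (a hypothetical 1D blur kernel, assuming a block size of 256 and n a multiple of 256): each value is loaded from DRAM once and then reused by the whole thread block:

```cuda
// Shared memory as a scratchpad: one global load per thread, three reuses.
__global__ void blurShared(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];                    // block data plus halo
    int i = blockIdx.x * 256 + threadIdx.x;            // assumes grid*256 == n
    int t = threadIdx.x + 1;
    tile[t] = in[i];                                   // one DRAM load per thread
    if (threadIdx.x == 0)   tile[0]   = (i > 0)     ? in[i - 1] : 0.0f;
    if (threadIdx.x == 255) tile[257] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();                                   // block-level cooperation
    out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}
// On Fermi the 64KB can be split between shared memory and L1, e.g.
//   cudaFuncSetCacheConfig(blurShared, cudaFuncCachePreferShared);
```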

85 85 NVIDIA Fermi (2010)! The texture cache has been removed from L1 since it is not efficient for general-purpose computing! Register spilling on L1! Fast atomic memory operations on the L2 memory! Read-modify-write, compare-and-swap! Efficient sorting and building of data structures!
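
A sketch of the kinds of atomics this accelerates (illustrative kernels, not from the slides): a histogram via read-modify-write atomicAdd, and a lock-free floating-point max built on atomicCAS:

```cuda
// Histogram: many threads increment shared bins with read-modify-write.
__global__ void histogram(const unsigned char *keys, unsigned *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[keys[i]], 1u);   // atomic read-modify-write
}

// Lock-free max via compare-and-swap. NB: the int reinterpretation trick
// assumes non-negative floats (their bit patterns order like integers).
__device__ void atomicMaxFloat(float *addr, float val) {
    int *a = (int *)addr;
    int old = *a;
    // Retry until our value is no longer greater than the stored one.
    while (val > __int_as_float(old))
        old = atomicCAS(a, old, __float_as_int(val));
}
```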

86 86 NVIDIA Kepler (2012)! Doubled Fermi cache sizes: 128KB L1, 1536KB L2! Introduced a read-only cache (similar to a texture cache)! Added shuffle instructions to exchange data among threads without using shared memory! Optimizes synchronization time!
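
A minimal example of the shuffle idea (using the __shfl_down_sync form introduced with CUDA 9; earlier toolkits used __shfl_down): a warp-wide sum with no shared memory traffic:

```cuda
// Threads of a warp exchange registers directly, summing 32 values
// without touching shared memory or synchronizing the whole block.
__device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);  // lane i += lane i+offset
    return v;   // lane 0 holds the sum of the whole warp
}
```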

87 87 NVIDIA Pascal (2017)! Several improvements in caches! Dedicated 64KB per-SM shared memory! Unified virtual addressing between CPU and GPU! Memory pages are transparently transmitted between CPU and GPU memories! Automatic handling of page faults and global data coherence!
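
A hedged sketch of this programming model using CUDA managed memory (the cudaMallocManaged API; on Pascal, pages migrate on demand through GPU page faults):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void inc(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main() {
    const int n = 1024;
    int *a;
    cudaMallocManaged(&a, n * sizeof(int));  // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) a[i] = i;    // touched by the CPU...
    inc<<<(n + 255) / 256, 256>>>(a, n);     // ...then by the GPU: pages migrate
    cudaDeviceSynchronize();
    printf("a[0] = %d\n", a[0]);             // pages migrate back on CPU access
    cudaFree(a);
    return 0;
}
```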

88 88 NVIDIA Pascal (2017)! NVLink: new high-speed interface (160GB/s bidirectional) to enable multi-GPU architectures for high-performance computing!

89 89 Other vendors! We have analyzed NVIDIA GPUs so far! There are many other GPU vendors! E.g.: AMD, ARM, etc.! The overall GPU architecture is quite similar to the NVIDIA one!

90 90 AMD Radeon HD 6970 Cayman (2010)! The streaming multiprocessor is here called a compute unit! SIMD function unit, with control shared across 16 functional units (up to 4 MUL-ADDs per clock cycle): VLIW processing! Groups of 64 [fragments/vertices/etc.] share an instruction stream (called a wavefront)! Four clocks to execute an instruction for all fragments in a wavefront!

91 91 AMD Radeon HD 6970 Cayman (2010)! There are 24 of these cores (compute units) on the 6970, able to handle up to ~32,000 fragments!

92 92 AMD Radeon HD 6970 Cayman (2010)! Wavefront size is 64 work-items! Vector instructions are performed with one lane per work-item! Scalar instructions are performed once for the entire wavefront! Vector load/store instructions supply one address per work-item! A SIMD lane (SL) executes one vector instruction! 16 stream cores execute 16 vector instructions on each cycle! A quarter of the wavefront (16 work-items) is issued on each cycle to the SIMD unit! The entire wavefront is issued over four consecutive cycles! The entire wavefront issues to the other units in a single cycle! (Figure: compute unit, wavefront view: wavefront scheduler, instruction issue, branch unit, local data share, scalar unit, load/store unit, SIMD unit with lanes SL0-SL15)!

93 93 AMD Radeon HD 6970 Cayman (2010)! The single SIMD lane has a VLIW architecture! VLIW parallelism is extracted by the compiler!

94 94 AMD Radeon HD 6970 Cayman (2010)! Memory bandwidths:! Higher L2-L1 throughput helps reduce the overhead due to more frequent L1 misses w.r.t. CPUs (lower temporal locality)! A larger register file helps reduce register spilling! The same considerations hold for NVIDIA GPUs!

95 95 AMD Radeon HD 6970 Cayman (2010)! Local memory and registers are persistent within the compute unit once a work-group is scheduled! Traditional context switching is not used! Allows for no-overhead wavefront interleaving! The number of active wavefronts supported per compute unit is limited (496), decided by:! Local memory required per work-group! Register usage per work-item!

96 96 AMD Radeon R9 290X (2013)! 44 compute units! 4 SIMD cores per compute unit! Multi-bank coherent L2 cache (100s of GB/s)! Each unit contains an L1 cache and a user-managed scratchpad (~TB/s)! NO VLIW architecture!

97 97 AMD Radeon R9 290X (2013)! 4 SIMDs per compute unit! 16 SIMD lanes (SL) per SIMD unit! 1 scalar unit to handle instructions common to the wavefront! Loop iterators, constant variable accesses, branches! Has a single, integer-only ALU! A separate branch unit is used for some conditional instructions!

98 98 AMD Radeon R9 290X (2013)! Wavefronts are associated with a SIMD unit and a subset of the vector registers! Up to 10 wavefronts for each SIMD (40 wavefronts in total)! All hardware units except the SIMDs are shared by all wavefronts on a compute unit!

99 99 AMD Radeon R9 290X (2013)! Each cycle, wavefronts targeting one of the SIMDs are allowed to issue instructions! Every 4th cycle a given wavefront will be active! An instruction takes 4 cycles to enter the SIMD pipeline (4 sub-wavefronts per wavefront)! The scalar unit and branch unit can take 1 instruction per cycle! All hardware units can remain fully utilized with a simplified front-end using this round-robin technique!

100 100 AMD Radeon R9 290X (2013)! Up to 5 instructions can be issued per cycle! Only 1 per wavefront! Only 1 per instruction type (i.e., per hardware unit)! Need multiple instructions types present to fully utilize hardware units! Instruction types! Vector Arithmetic Logic Unit (ALU)! Scalar ALU or Scalar Memory Read! Vector memory access (Read/Write/Atomic)! Branch/Message! Local Data Share (LDS)! Export or Global Data Share (GDS)! Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio)!

101 101 AMD Radeon R9 290X (2013)! R/W L1 caches! 16 KB per compute unit! Write-through (dirty-byte mask)! 64B lines! R/W L2 caches! 16 partitions with 64 KB/partition! Write-back (dirty-byte mask)! 64B lines! Local Data Share (LDS)! 64 KB per compute unit! 32 banks! Contains integer atomic units!

102 102 AMD Radeon R9 290X (2013)! Memory bandwidths:! Same considerations as for the previous architectures! AMD FX-8350 CPU! AMD R9 290X GPU!

103 103 ARM Mali-T628 (2014)! Targeted at embedded computing!

104 104 CPU and GPU within the same chip! The trend in recent years has been to integrate the GPU on the same chip as the CPU! Opportunities:! Reduce offload cost! Reduce memory copies/transfers! Power management! Steps:! Remove the external communication link! Define a unified memory architecture!

105 105 AMD Llano (2011)! Targeted at mobile and desktop computing! Subsequent solutions were integrated in the Microsoft Xbox One and Sony PlayStation 4! Architecture:! CPU: AMD K10 quad-core! GPU: AMD Radeon HD 6000!

106 106 Intel Sandy Bridge (2011)! First Intel generation (after Intel Westmere) with integrated GPU!

107 107 Intel Skylake (2015)! Solutions for the mobile, desktop and server markets! Architecture:! CPU: Intel multi-core (from m3 dual-core to i7 octa-core to Xeon E3 octa-core)! GPU: Intel HD Graphics (up to 24 execution units) or Iris Graphics (up to 72 execution units)!

108 108 Samsung Exynos 5422 (2014)! Targeted at mobile computing! Architecture:! CPU: ARM big.LITTLE octa-core! GPU: ARM Mali-T628 MP6!

109 109 NVIDIA Tegra X1 (2015)! Targeted at mobile computing! Architecture:! CPU: ARM big.LITTLE octa-core! GPU: NVIDIA Maxwell with 256 CUDA cores!

110 110 Final notes! Generic many-core GPU! Less space devoted to control logic and caches! Large register files to support multiple thread contexts! Low-latency, hardware-managed thread switching! Large number of ALUs per core, with a small user-managed cache per core! Memory bus optimized for bandwidth! ~150 GB/s bandwidth allows us to service a large number of ALUs simultaneously! Support for general-purpose computing! (Figure: simple ALUs and cache on chip, high-bandwidth bus to on-board system memory)!

111 111 Final notes! An efficient GPU workload! Has thousands of independent pieces of work! Uses many ALUs on many cores! Supports massive interleaving for latency hiding! Is amenable to instruction stream sharing! Maps to SIMD execution well! Is compute-heavy: the ratio of math operations to memory accesses is high! Not limited by bandwidth!

112 112 Final notes! Original role: implementing the graphics pipeline for 3D rendering! Nevertheless, GPUs can also be used for accelerating general-purpose computations (GP-GPU)! Several languages have been developed (CUDA, OpenCL, C++ AMP)!

113 113 References! Material taken from other university courses on computer architectures, computer graphics and parallel computing, in particular:! The NVIDIA website! Benedict R. Gaster, Lee Howes, David Kaeli, Perhaad Mistry, Dana Schaa, Heterogeneous Computing with OpenCL, Morgan Kaufmann, 2012! David Kaeli, Perhaad Mistry, Dana Schaa, Dong Ping Zhang, Heterogeneous Computing with OpenCL 2.0, Morgan Kaufmann, 2015!
