Antonio R. Miele Marco D. Santambrogio

1 Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

2 2 Introduction The first GPU was released in 1999 and was used for graphics processing. GPU architecture then evolved rapidly, providing higher computational power by means of parallelization. It also evolved to support programmability of its components.

3 3 Introduction In 2006, NVIDIA introduced the GeForce 8800 GPU, supporting a new programming language: CUDA (Compute Unified Device Architecture). Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas. Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing. The host CPU issues data-parallel kernels to the GP-GPU for execution.

4 4 Introduction CPU and GPU performance trends (FLOPS: FLoating-point OPerations per Second)

5 5 Graphics pipeline At the beginning there was the graphics pipeline

6 Graphics pipeline 6

7 7 Vertex generation The host interface is the communication bridge between the CPU and the GPU. It receives commands from the CPU and also pulls geometry information from system memory. It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.).

8 8 Vertex processing The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space. Transformations are based on matrix multiplications. No new vertices are created in this stage, and no vertices are discarded (input and output have a 1:1 mapping).

9 9 Vertex processing 1. Model to world coordinates. 2. World to eye coordinates. 3. Eye to clip coordinates. Textures may also be used for advanced transformations (e.g. they can provide height maps for displacement mapping).
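
The three coordinate changes listed above can be sketched as a chain of 4x4 matrix multiplications. This is a minimal pure-Python illustration, not GPU code: the helper names (`mat_vec`, `translation`, `transform_vertex`) and all matrix values are assumptions made for the example, and the projection is left as the identity to keep the numbers easy to follow.

```python
# Minimal sketch of the vertex-processing transform chain (pure Python, no GPU).
# All names and matrix values are illustrative assumptions, not from the slides.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    """Homogeneous 4x4 translation matrix."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def transform_vertex(vertex, model, view, projection):
    """Apply model -> world -> eye -> clip transforms to one vertex."""
    world = mat_vec(model, vertex)
    eye = mat_vec(view, world)
    return mat_vec(projection, eye)

IDENTITY = translation(0, 0, 0)
model = translation(1, 0, 0)     # place the object at x = 1 in the world
view = translation(0, 0, -5)     # move the scene 5 units away from the camera
vertices = [[0, 0, 0, 1], [1, 1, 0, 1]]

# One output vertex per input vertex (1:1 mapping); each vertex is independent,
# which is what makes this stage easy to parallelize.
clip_vertices = [transform_vertex(v, model, view, IDENTITY) for v in vertices]
```

Because no vertex depends on any other, the list comprehension at the end could be split across any number of processing units without changing the result.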

10 10 Primitive generation The primitive assembler groups the vertices that form one primitive (e.g. a triangle)

11 11 Primitive processing Various elaborations are performed: perspective division and viewport transformation, and clipping
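
As a hedged illustration of two of these elaborations, the sketch below divides clip coordinates by `w` and then maps the result to pixel coordinates. The function names, the [-1, 1] normalized range, and the choice to flip the y axis are assumptions for the example; real APIs differ in these conventions, and clipping is omitted.

```python
# Sketch of perspective division followed by a viewport mapping.
# Conventions (NDC range [-1, 1], y flipped on screen) are assumptions.

def perspective_divide(clip):
    """Clip coordinates (x, y, z, w) -> normalized device coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)

def viewport_transform(ndc, width, height):
    """Map NDC in [-1, 1] to pixel coordinates of a width x height target."""
    x, y, z = ndc
    sx = (x + 1.0) * 0.5 * width
    sy = (1.0 - y) * 0.5 * height   # flip y so that +y in NDC points up on screen
    return (sx, sy, z)

center = viewport_transform(perspective_divide((0.0, 0.0, 1.0, 2.0)), 640, 480)
corner = viewport_transform(perspective_divide((2.0, -2.0, 1.0, 2.0)), 640, 480)
```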

12 12 Fragment generation Geometry information is transformed into raster information (pixels in output). Determine which pixels a primitive overlaps. Aliasing and other issues arise at this stage.
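
One common way to determine which pixels a triangle overlaps is to evaluate three edge functions at each pixel center. The slides do not say which method the hardware uses, so the sketch below is only an assumed, brute-force illustration; `edge` and `covered_pixels` are hypothetical names.

```python
# Brute-force coverage test with edge functions (illustrative assumption,
# not necessarily what any given GPU rasterizer does).

def edge(a, b, p):
    """Signed area term: its sign tells on which side of edge a->b point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def covered_pixels(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centers fall inside triangle v0, v1, v2."""
    pixels = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)          # sample at the pixel center
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            same_sign = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                        (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if same_sign:                   # accept either winding order
                pixels.append((x, y))
    return pixels

fragments = covered_pixels((0, 0), (4, 0), (0, 4), 4, 4)
```

Sampling only at pixel centers is exactly what produces the aliasing mentioned above: a pixel is either fully in or fully out, regardless of how much of its area the triangle covers.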

13 13 Fragment processing Assign colors to pixels. Shade each fragment by simulating the interaction of light and material.

14 14 Fragment processing Effects of tessellation; texture mapping; lighting and texture

15 15 Pixel operations Before the final write occurs, some fragments are rejected by the z-buffer, stencil and alpha tests. Finally, pixels are copied into the framebuffer (the memory space connected to the screen controller).
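
The z-buffer test can be sketched as follows. This is a simplified model that assumes a smaller `z` means closer to the viewer and leaves out the stencil and alpha tests; `write_fragments` and the fragment tuples are hypothetical.

```python
# Simplified depth (z-buffer) test. Assumption: smaller z is closer to the viewer.

def write_fragments(fragments, width, height):
    """Keep, for every pixel, the color of the closest fragment seen so far."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:        # depth test: reject fragments that are hidden
            depth[y][x] = z
            color[y][x] = c
    return color

framebuffer = write_fragments(
    [(0, 0, 0.8, "red"), (0, 0, 0.3, "blue"), (0, 0, 0.5, "green"), (1, 0, 0.9, "red")],
    width=2, height=2)
```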

16 16 Graphics pipeline In each stage, elaborations can be performed in parallel on each chunk of data (vertex, fragment, pixel, etc.): PARALLELISM!

17 17 Evolution of the graphics pipeline Pre GPU Fixed function GPU Programmable GPU Unified shader processors

18 Early 90s pre GPU 18

19 19 Goals of GPUs? Exploit parallelism: pipeline parallel, data-parallel, CPU and GPU executing in parallel. Specific hardware accelerators: texture filtering, rasterization, MAD, sqrt,...

20 20 3dfx Voodoo (1996) Fixed-function rasterization, texture mapping, depth testing, etc. Required a separate VGA card for 2D.

21 21 NVIDIA GeForce 256 (1999) All stages implemented in hardware Fixed function rasterization, texture mapping, depth testing, etc.

22 22 NVIDIA GeForce 3 (2001) Optionally bypass fixed-function stages with a programmable vertex shader. Shader: a mini-program defining the logic of a pipeline stage. A specific shading language has to be used (e.g. GLSL, the OpenGL Shading Language).

23 23 NVIDIA GeForce 6 (2004) Improved programmability in the fragment shader. Vertex shader can read textures. Dynamic branches.

24 24 NVIDIA GeForce 6 (2004) Pipelined architecture with multiple cores for each stage. The introduction of programmable stages requires fetch and decode units.

25 25 NVIDIA GeForce 7800 (2005) Vertex, fragment and composite stages, mixing fixed and programmable stages. The introduction of programmable stages requires fetch and decode units.

26 26 NVIDIA GeForce 8 (2006) Ground-up architecture redesign. New geometry shader after the vertex shader. Introduction of the unified shader processor. Introduction of CUDA. Employment of the GPU for general-purpose computing: GP-GPU.

27 27 NVIDIA GeForce 8800 (2006) Introduction of issue units for managing thread generation and scheduling

28 28 Why a single shader processor? With non-unified shader processors it is hard to balance the workload across pipeline stages: under a heavy geometry workload the vertex shaders become the bottleneck while the pixel shaders are underused, and under a heavy pixel workload the opposite happens.

29 29 Why a single shader processor? With unified shader processors, the same units serve both heavy pixel workloads and heavy geometry workloads, giving optimal usage of processing resources.
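
A toy model can make the balancing argument concrete. It assumes that work is perfectly divisible, that vertex and pixel stages overlap in time, and that every unit has the same throughput; the function names and the workload numbers are invented for the illustration.

```python
# Toy model of shader utilization. All numbers are illustrative assumptions.

def frame_time_split(vertex_work, pixel_work, vertex_units, pixel_units):
    """Non-unified: each kind of work can only run on its own kind of unit."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)

def frame_time_unified(vertex_work, pixel_work, total_units):
    """Unified: any unit can take any work, so the load spreads evenly."""
    return (vertex_work + pixel_work) / total_units

# Heavy pixel workload on 8 units, split 4 + 4 versus fully unified.
split_time = frame_time_split(10, 90, vertex_units=4, pixel_units=4)
unified_time = frame_time_unified(10, 90, total_units=8)
```

In the split configuration the four vertex units finish early and sit idle while the pixel units are still busy, which is the imbalance the unified design removes.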

30 30 Unified shader processor How the unified shader processor works. Three key ideas: (1) instantiate many shader processors; (2) replicate ALUs inside the shader processor to enable SIMD processing; (3) interleave the execution of many groups of SIMD threads.

32 32 Example: a diffuse reflectance shader Shader programming model: fragments (or more in general work items) are processed independently The function has to be executed for each fragment One instruction stream per fragment

33 Basic architecture of a modern CPU 33

34 34 Basic architecture of a modern GPU Remove components that help a single instruction stream run faster

35 35 Replicate cores Replicate cores to run several threads in parallel 2 cores process 2 instruction streams in parallel

36 36 Replicate cores Replicate cores to run several threads in parallel 4 cores process 4 instruction streams in parallel

38 38 Replicate cores Replicate cores to run several threads in parallel: 16 cores process 16 instruction streams in parallel. PROBLEM: many work items should share the same instruction stream, yet each core has its own fetch and decode unit, which only pays off when the cores run different instruction streams.

39 39 Replicate ALUs within the core SIMD processing

40 40 Replicate ALUs within the core SIMD processing Original compiled shader: Processing one item using scalar operations on scalar registers

41 41 Replicate ALUs within the core SIMD processing New compiled shader: Processing 8 items using vector operations on vector registers

43 43 Replicate ALUs within the core SIMD processing does not imply SIMD instructions. Option 1: explicit vector instructions, as in Cray, Intel/AMD x86 SSE and IBM AltiVec (explicit vector length). Option 2: scalar instructions with implicit HW vectorization, where HW determines instruction stream sharing across ALUs, as in NVIDIA GeForce (SIMT warps) and ATI Radeon architectures. SIMT (single instruction, multiple threads): identical independent work items are split over multiple threads executed in lockstep, and a stream of scalar instructions is shared among the threads.
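
The second option can be illustrated with a small software model: the program is written with scalar operations only, and a hypothetical `run_simt` helper plays the role of the hardware by applying each instruction stream to a whole group of work items at once. The group width of 8 and the counting of one "issue" per group are simplifying assumptions.

```python
# Software model of implicit vectorization (SIMT). The hardware's role is played
# by run_simt; the program itself contains no vector instructions.

def scalar_shader(x):
    """Per-item program written with scalar operations only."""
    return 2.0 * x + 1.0

def run_simt(program, items, lanes=8):
    """Run a scalar program over items in lockstep groups of `lanes` items.

    Returns the results and how many times the instruction stream was issued.
    """
    results = []
    stream_issues = 0
    for start in range(0, len(items), lanes):
        group = items[start:start + lanes]
        stream_issues += 1                      # one fetch/decode serves the group
        results.extend(program(x) for x in group)
    return results, stream_issues

outputs, issues = run_simt(scalar_shader, list(range(16)))
```

With 16 items and 8 lanes the instruction stream is issued twice instead of 16 times, which is the saving in fetch and decode work that motivates sharing a stream across ALUs.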

44 44 Merging two-level replications Result: multicore architecture where each core is a SIMD architecture 16 cores, each one having 8 ALUs = 128 simultaneous threads

45 45 Branches Branches have to be accurately handled
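
The slides do not detail the mechanism here. A common approach, assumed in the sketch below, is to run both sides of the branch for the whole group and use a per-lane mask to decide which lanes commit results, so a divergent group pays for both paths. The function name and the branch body are invented for the example.

```python
# Masked execution of `y = x * 2 if x > 0 else -x` for a group of lanes that
# share one instruction stream. The masking scheme is an assumption.

def simd_branch(xs):
    """Return (results, passes), where passes counts the paths actually executed."""
    mask = [x > 0 for x in xs]
    ys = [None] * len(xs)
    passes = 0
    if any(mask):                      # 'then' path: only masked-on lanes commit
        passes += 1
        for i, x in enumerate(xs):
            if mask[i]:
                ys[i] = x * 2
    if not all(mask):                  # 'else' path: the remaining lanes commit
        passes += 1
        for i, x in enumerate(xs):
            if not mask[i]:
                ys[i] = -x
    return ys, passes

divergent = simd_branch([1, -2, 3, -4])
uniform = simd_branch([1, 2, 3, 4])
```

When all lanes agree on the condition only one path runs; when they diverge, part of the ALUs are idle during each pass.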

48 48 Stalls The execution of an instruction may have a data dependency on a previous one that is still running: stall! For example, an access to texture memory is about 100x slower than an ALU instruction.

49 49 Stalls Stalls due to data dependencies have to be managed as well. Memory accesses cause many stalls because their execution time is considerably higher than that of ALU instructions (100x to 1000x). The fancy caches and logic that avoid stalls in CPUs have been removed. However, on a GPU we can concurrently run MANY independent instruction streams.

50 50 Stalls Interleave processing of many streams on a single core to hide stalls caused by high-latency operations
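
The effect of interleaving can be estimated with a small simulation. It is a sketch under strong assumptions: every stream alternates a fixed ALU phase with a fixed stall, switching between streams is free, and the core simply picks the first stream that is ready. The name `simulate` and the cycle counts are hypothetical.

```python
# Toy simulation of latency hiding on one core. All parameters are assumptions.

def simulate(num_contexts, compute_cycles, stall_cycles, rounds):
    """Return (total_cycles, busy_cycles) for one core interleaving contexts.

    Each context repeats `rounds` times: `compute_cycles` of ALU work followed
    by a stall of `stall_cycles` (e.g. waiting for a memory access).
    """
    ready_at = [0] * num_contexts          # cycle at which each context may run
    remaining = [rounds] * num_contexts
    cycle = busy = 0
    while any(remaining):
        runnable = [i for i in range(num_contexts)
                    if remaining[i] and ready_at[i] <= cycle]
        if runnable:
            i = runnable[0]
            cycle += compute_cycles        # run this context's ALU phase
            busy += compute_cycles
            remaining[i] -= 1
            ready_at[i] = cycle + stall_cycles
        else:
            cycle += 1                     # every context is stalled: core idles
    return cycle, busy

one_stream = simulate(1, compute_cycles=4, stall_cycles=12, rounds=3)
four_streams = simulate(4, compute_cycles=4, stall_cycles=12, rounds=3)
```

With a single stream the core is busy for only 12 of 36 cycles; with four streams it does four times the work in 48 cycles and is never idle, because another stream is always ready when one stalls.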

54 54 Stalls Interleaving between contexts can be managed by HW, by SW, or by both. Approach of NVIDIA and AMD Radeon GPUs: HW schedules and manages all contexts, and special on-chip storage holds work item state.

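55 55 How to dimension the context With a fixed amount of on-chip context storage, many small contexts give maximal latency hiding ability, while a few large contexts give low latency hiding ability.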

56 56 Basic architecture of a modern GPU Summary: Use many slimmed-down cores running in parallel. Pack cores full of ALUs by sharing instruction streams across groups of work items (option 1: explicit SIMD vector instructions; option 2: implicit sharing managed by HW). Avoid latency stalls by interleaving the execution of many groups of work items/threads: when a group stalls, work on another group.

57 57 NVIDIA Fermi (2009) 16 streaming multiprocessors. Each multiprocessor has 32 streaming cores. SIMT (single instruction, multiple threads). 6 memory ports. 1 global scheduler.

58 58 Streaming Multiprocessor (SM) 32 streaming cores, each with a 32-bit pipelined integer arithmetic unit (with support for 64-bit operations; 1 cycle) and an IEEE single/double-precision floating-point unit providing multiply-add instructions (1 cycle). 16 load/store units: concurrent access to data at each address of the cache or DRAM (1 cycle). 4 special function units (SFUs) for transcendental functions (sine, cosine, square root, etc.): slower than the other units (4 cycles) and decoupled from the dispatching units to improve performance.

59 59 Streaming Multiprocessor (SM) Threads are grouped into sets of 32 threads sharing an instruction stream, called warps. The SM has 2 scheduling and dispatching units: two warps are selected each clock cycle (fetch, decode and execute two warps in parallel). The register file may host up to 48 interleaved warps: 1536 threads per SM! Globally, with 16 SMs, 24,576 threads!
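
The thread counts follow directly from the figures given for Fermi. The short calculation below uses the warp size and warps per SM from this slide and assumes the 16 SMs quoted on the Fermi overview slide; the constant names are made up for the example.

```python
# Thread-count arithmetic for Fermi, using figures quoted in the slides.

THREADS_PER_WARP = 32    # threads sharing one instruction stream
WARPS_PER_SM = 48        # interleaved warps hosted by the register file
SMS_PER_CHIP = 16        # from the Fermi overview slide

threads_per_sm = THREADS_PER_WARP * WARPS_PER_SM
threads_per_chip = threads_per_sm * SMS_PER_CHIP
```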

60 60 Streaming Multiprocessor (SM) Each scheduler may execute an instruction on 16 ALU cores, 16 load/store units, or 4 SFUs. Each double-precision FPU instruction requires 2 ALU cores. Each clock cycle the scheduler selects a warp that is ready to be executed. Warps are independent -> no dependency check is required

61 61 Other features of Fermi 2-level distributed scheduler: at chip level a global workload distribution engine dispatches thread blocks to the various SMs; at SM level each warp scheduler distributes warps. Support for fast context switching (around 25 us).

62 62 Other features of Fermi Support for concurrent kernel execution

63 63 NVIDIA Kepler (2012) Same architecture as Fermi, with performance and power efficiency improvements. Increased to 192 streaming cores per SM. 32 special function units. Improved warp scheduling (4 schedulers per SM). Other improvements. The Maxwell architecture (2014) presents further improvements.

64 64 Cache organization CPU cores run efficiently when data is resident in cache. Caches reduce latency and provide high bandwidth.

65 65 Cache organization Initially, GPU cores were not provided with caches, so they required a high-bandwidth connection to memory

66 66 Limited bandwidth A high-end GPU (e.g. Radeon HD 6970) has over twenty times (2.7 TFLOPS) the compute performance of a quad-core CPU and no large cache hierarchy to absorb memory requests. The GPU memory system is designed for throughput: a wide bus (150 GB/s) and logic to repack, reorder and interleave memory requests to maximize use of the memory bus. Still, this is only 5 times the bandwidth available to a CPU.

67 67 Limited bandwidth If processors request data at too high a rate, the memory system cannot keep up. Overcoming bandwidth limits is a common challenge for GPU-compute application developers. Request data less often and do more math instead (arithmetic intensity): it might be quicker to calculate something from scratch on the device than to copy it from the host. Fetch data from memory less often by sharing/reusing data across fragments (on-chip communication or storage). Graphics elaborations fit well with these constraints: more ALU operations than memory accesses.
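
Arithmetic intensity can be quantified with a simple bound: attainable throughput is the smaller of the compute peak and the memory bandwidth multiplied by the number of operations performed per byte fetched. The sketch below plugs in the 2.7 TFLOPS and 150 GB/s quoted for the Radeon HD 6970; treating these as hard peaks, and the function name itself, are assumptions.

```python
# Simple compute-versus-bandwidth bound, using the Radeon HD 6970 figures
# quoted in the slides. Treating them as exact peaks is an assumption.

PEAK_FLOPS = 2.7e12          # 2.7 TFLOPS
PEAK_BANDWIDTH = 150e9       # 150 GB/s

def attainable_flops(flops_per_byte):
    """Throughput limit for a kernel with the given arithmetic intensity."""
    return min(PEAK_FLOPS, flops_per_byte * PEAK_BANDWIDTH)

# Intensity needed before the ALUs, not the memory bus, become the limit.
balance_point = PEAK_FLOPS / PEAK_BANDWIDTH

# One operation per 4-byte float loaded: far below the balance point.
streaming_kernel = attainable_flops(0.25)
```

Under these assumptions a kernel needs roughly 18 floating-point operations per byte fetched to keep the ALUs busy; a kernel doing one operation per loaded float reaches only a small fraction of the compute peak.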

68 68 Modern GPU memory hierarchy Modern GPUs are provided with local memories (not synched with main memory) and texture caches (read-only). Moreover, L1 and L2 caches have been added. In NVIDIA architectures only L2 is coherent!

69 69 Transmission cost Another relevant aspect is the CPU/GPU transmission bandwidth. PCIe bandwidth: 8 GB/s in each direction. Attempt to pipeline/multi-buffer uploads and downloads.
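
The benefit of pipelining transfers can be estimated with a back-of-the-envelope model. It assumes the 8 GB/s figure is sustained, that upload, compute and download of different chunks can overlap perfectly, and that there is at least one chunk; the function names and the per-chunk timings are illustrative.

```python
# Back-of-the-envelope model of serial versus pipelined CPU/GPU transfers.

PCIE_BYTES_PER_SEC = 8e9     # 8 GB/s per direction, from the slide

def transfer_seconds(num_bytes):
    """Time to move num_bytes in one direction at the quoted bandwidth."""
    return num_bytes / PCIE_BYTES_PER_SEC

def serial_time(chunks, upload, compute, download):
    """Each chunk is uploaded, processed and downloaded before the next starts."""
    return chunks * (upload + compute + download)

def pipelined_time(chunks, upload, compute, download):
    """Ideal 3-stage pipeline: fill it once, then the slowest stage sets the pace."""
    return upload + compute + download + (chunks - 1) * max(upload, compute, download)

serial = serial_time(4, upload=1, compute=2, download=1)
pipelined = pipelined_time(4, upload=1, compute=2, download=1)
```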

70 70 NVIDIA GeForce 8 (2006) Each SM is provided with 16K shared memory, 64K constant cache and 8K texture cache. Each processor can access all memory locations at 86 GB/s with different latencies: shared, 2 cycles; device, 300 cycles.

71 71 NVIDIA Fermi (2009) Each SM is provided with 64K of local shared memory, used by thread blocks to cooperate and to reduce off-chip traffic. The shared memory can be configured by the programmer to also obtain an L1 cache. A chip-level L2 cache has been introduced.

72 72 NVIDIA Fermi (2009) The texture cache has been removed from L1 since it is not efficient for general-purpose computing. Fast atomic memory operations (read-modify-write, compare-and-swap) enable efficient sorting and building of data structures.

73 73 NVIDIA Kepler (2012) Doubled Fermi cache size: 128K L1, 1536KB L2 Introduced a Read-only cache (similar to a texture cache) Added shuffle instructions

74 74 CPU/GPU interaction The CPU and GPU inside the PC work in parallel with each other. There are two threads going on, one for the CPU and one for the GPU, which communicate through a command buffer: the CPU writes commands at one end, and the GPU reads the pending commands from the other.

75 75 CPU/GPU interaction Communications between CPU and GPU are non-blocking (asynchronous). In the CPU program below, the object is not drawn after statement A and before statement B: Statement A; API call to draw object; Statement B. Instead, all the API call does is add the command to draw the object to the GPU command buffer.
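
The non-blocking behaviour can be mimicked with a small queue. This is a single-threaded toy model, not a real graphics API: `CommandBuffer`, `submit`, `gpu_step` and `drain` are hypothetical names, and the "GPU" only runs when it is stepped explicitly.

```python
# Toy model of an asynchronous command buffer (single-threaded illustration).
from collections import deque

class CommandBuffer:
    """FIFO between a CPU producer and a GPU consumer."""

    def __init__(self):
        self.pending = deque()
        self.executed = []

    def submit(self, command):
        """CPU side: enqueue the command and return immediately."""
        self.pending.append(command)

    def gpu_step(self):
        """GPU side: execute the oldest pending command, if any."""
        if self.pending:
            self.executed.append(self.pending.popleft())

    def drain(self):
        """Synchronization point: run until no commands are pending."""
        while self.pending:
            self.gpu_step()

buf = CommandBuffer()
log = ["statement A"]
buf.submit("draw object")              # only enqueued; nothing is drawn yet
log.append("statement B")
drawn_before_b = list(buf.executed)    # still empty when statement B runs
buf.drain()                            # e.g. forced by a synchronization point
```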

76 76 CPU/GPU interaction This leads to a number of synchronization considerations. When a pending command in the buffer references a block of data, the CPU must not overwrite that data until the GPU is done with the command.

77 77 CPU/GPU interaction Modern graphics APIs implement semaphore-style operations to keep this from causing problems. If the CPU attempts to modify a piece of data that is being referenced by a pending GPU command, it will have to spin, waiting until the GPU is finished with that command. While this ensures correct operation, it is not good for performance, since there are a million other things we would rather do with the CPU instead of spinning. The GPU will also drain a big part of the command buffer, thereby reducing its ability to run in parallel with the CPU.

78 Inlining data One way to avoid these problems is to inline all data into the command buffer and avoid references to separate data. However, this is also bad for performance, since we may need to copy several Mbytes of data instead of merely passing around a pointer.

79 GPU readbacks The output of a GPU is a rendered image on the screen: what happens if the CPU tries to read it? The GPU must be synchronized with the CPU, i.e. it must drain its entire command buffer, and the CPU must wait while this happens. When the GPU is used for general-purpose computing, the programmer has to explicitly manage memory transfers and synchronization.

80 81 Other vendors We have analyzed NVIDIA GPUs so far. There are many other GPU vendors, e.g. AMD, ARM, etc. The overall GPU architecture is quite similar to the NVIDIA one.

81 82 AMD Radeon HD 6970 (Cayman) 2010 SIMD function unit, control shared across 16 units (Up to 4 MUL-ADDs per clock) VLIW processing! Groups of 64 [fragments/vertices/etc.] share instruction stream Four clocks to execute an instruction for all fragments in a group
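
The "four clocks" figure follows from the group size and the SIMD width: 64 work items issued 16 at a time. The helper below is a trivial sketch of that arithmetic, assuming one work item per SIMD lane per clock; its name is hypothetical.

```python
# Clocks needed to push one instruction through a whole group of work items,
# assuming one work item per SIMD lane per clock.

def clocks_per_instruction(group_size, simd_width):
    """Ceiling of group_size / simd_width."""
    return -(-group_size // simd_width)

cayman_clocks = clocks_per_instruction(group_size=64, simd_width=16)
```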

82 83 AMD Radeon HD 6970 (Cayman) 2010 There are 24 of these cores on the 6970: that's about 32,000 fragments!

83 84 ARM Mali-T628 (2014) Targeted for embedded computing

84 85 CPU and GPU within the same chip The trend in recent years has been to integrate the GPU on the same chip as the CPU. Opportunities: reduce offload cost; reduce memory copies/transfers; power management. Steps: remove the external communication link; define a unified memory architecture.

85 86 AMD Llano (2011) Targeted for mobile and desktop computing. Architecture: CPU: AMD K10 quad-core; GPU: AMD Radeon HD 6000. Subsequent solutions were integrated in Microsoft Xbox and Sony PlayStation 4.

86 87 Intel Sandy Bridge (2011) First Intel generation (after Intel Westmere) with integrated GPU

87 88 Intel Skylake (2015) Solutions for the mobile, desktop and server markets. Architecture: CPU: Intel multi-core (from m3 dual-core to i7 octa-core to Xeon E3 octa-core); GPU: Intel HD Graphics (up to 24 execution units) or Iris Graphics (up to 72 execution units).

88 89 Samsung Exynos 5422 (2014) Targeted for mobile computing. Architecture: CPU: ARM big.LITTLE octa-core; GPU: ARM Mali-T628 MP6

89 90 NVIDIA Tegra X1 (2015) Targeted for mobile computing. Architecture: CPU: ARM big.LITTLE octa-core; GPU: NVIDIA Maxwell with 256 CUDA cores

90 91 Final notes Generic many-core GPU: less space devoted to control logic and caches; large register files to support multiple thread contexts; low-latency hardware-managed thread switching; a large number of ALUs per core with a small user-managed cache per core; a memory bus optimized for bandwidth (~150 GB/s allows servicing a large number of ALUs simultaneously). Support for general-purpose computing!

91 92 Final notes GPUs are massively parallel devices originally used for implementing the graphics pipeline. GPUs can also be used for accelerating general-purpose computations (GP-GPU!). Some languages have been developed for this purpose (CUDA, OpenCL, C++ AMP).

92 93 Final notes An efficient GPU workload Has thousands of independent pieces of work Uses many ALUs on many cores Supports massive interleaving for latency hiding Is amenable to instruction stream sharing Maps to SIMD execution well Is compute-heavy: the ratio of math operations to memory access is high Not limited by bandwidth

93 94 References Material taken from other university courses on computer architectures, computer graphics and parallel computing: GPUs.pdf f11/www/ NVIDIA website:
