1 Advanced Topics on Heterogeneous System Architectures GPU! Politecnico di Milano! Seminar Room, Bld 20! 11 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano!

2 2 Introduction! First GPU released in 1999! Used for the purpose of graphics processing! GPU architectures rapidly evolved, providing higher computational power by means of parallelization! GPU architectures also evolved to support the programmability of their components!

3 3 Introduction! In 2006, NVIDIA introduced the GeForce 8800 GPU supporting a new programming language: CUDA! Compute Unified Device Architecture! Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas! Idea: take advantage of GPU computational performance and memory bandwidth to accelerate kernels for general-purpose computing! The host CPU issues data-parallel kernels to the GP-GPU for execution!
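
To make this concrete (this example is not from the original slides), here is a minimal CUDA sketch in which the host CPU allocates device memory and issues a data-parallel kernel; the kernel name `scale` and the launch parameters are illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A data-parallel kernel: each GPU thread scales one array element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));           // allocate device memory
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n); // the host issues the kernel
    cudaDeviceSynchronize();                     // wait for completion
    cudaFree(d);
    return 0;
}
```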

4 4 Introduction! CPU and GPU performance trends! FLOPS: FLoating-point OPerations per Second!

5 5 Graphics pipeline! At the beginning there was the graphics pipeline!

6 6 Graphics pipeline!

7 7 Vertex generation! The host interface is the communication bridge between the CPU and the GPU! It receives commands from the CPU and also pulls geometry information from system memory! It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.)!

8 8 Vertex processing! The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space! Transformations are based on matrix multiplications! No new vertices are created in this stage, and no vertices are discarded (input/output has 1:1 mapping)!

9 9 Vertex processing! 1. Model to world coordinates! 2. World to eye coordinates! 3. Eye to clip coordinates! Textures may also be used for advanced transformations (they provide height maps for displacement mapping)!
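
As a rough illustration of vertex processing as matrix arithmetic (a hypothetical CUDA kernel, not the actual fixed-function hardware): each thread transforms one vertex by a 4x4 matrix, preserving the 1:1 input/output mapping noted above:

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: one thread per vertex, 1:1 input/output mapping.
// m is a 4x4 transform in row-major order; in/out hold xyzw positions.
__global__ void transformVertices(const float *m, const float4 *in,
                                  float4 *out, int nVerts) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVerts) return;
    float4 p = in[v];
    out[v] = make_float4(
        m[0]*p.x  + m[1]*p.y  + m[2]*p.z  + m[3]*p.w,
        m[4]*p.x  + m[5]*p.y  + m[6]*p.z  + m[7]*p.w,
        m[8]*p.x  + m[9]*p.y  + m[10]*p.z + m[11]*p.w,
        m[12]*p.x + m[13]*p.y + m[14]*p.z + m[15]*p.w);
}
```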

10 10 Primitive generation! The primitive assembler groups vertices to form one primitive (e.g., a triangle)!

11 11 Primitive processing! Various elaborations are performed! Perspective division and viewport transformation! Clipping!

12 12 Fragment generation! Geometry information is transformed into raster information (pixels in output)! Determine which pixels a primitive overlaps! Aliasing and other issues!

13 13 Fragment processing! Assign colors to pixels! Shades the fragment by simulating the interaction of light and material!

14 14 Fragment processing! Effects of tessellation! Texture mapping! Lighting and texture!

15 15 Pixel operations! Before the final write occurs, some fragments are rejected by the z-buffer, stencil and alpha tests! Finally, pixels are copied into the framebuffer (the memory space connected to the screen controller)!
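
A hedged, C-like sketch of the z-buffer test described above (names and data layout are assumptions; real hardware implements this as a fixed-function unit):

```cuda
// Hypothetical sketch of the per-fragment z-buffer test: the fragment is
// rejected unless it is closer than the depth already stored.
struct Fragment { int x, y; float z; float r, g, b, a; };

__device__ void depthTestAndWrite(Fragment f, float *zbuf, uchar4 *fb, int w) {
    int idx = f.y * w + f.x;
    if (f.z < zbuf[idx]) {               // depth test: smaller z = closer
        zbuf[idx] = f.z;                 // update the depth buffer
        fb[idx] = make_uchar4(           // write the color to the framebuffer
            (unsigned char)(f.r * 255.0f), (unsigned char)(f.g * 255.0f),
            (unsigned char)(f.b * 255.0f), (unsigned char)(f.a * 255.0f));
    }                                     // otherwise the fragment is discarded
}
```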

16 16 Graphics pipeline! Single-stage elaborations can be performed in parallel on each chunk of data (vertex, fragment, pixel, etc.)! Single-stage elaborations are mainly data-intensive, based on matrix elaborations and other arithmetic operations! Few branch instructions! Few data dependencies! PARALLELISM!

17 17 Evolution of the graphics pipeline! Pre GPU! Fixed function GPU! Programmable GPU! Unified shader processors!

18 18 Early 90s pre GPU!

19 19 Early 90s pre GPU! SGI RealityEngine (1993)! Graphics supercomputer! Various boards! Programmable, just not by the application! The geometry engine contained an Intel i860 XP processor!

20 20 Exploit parallelism! Goals of GPUs?! Pipeline parallelism! Data parallelism! CPU and GPU executing in parallel! Specific hardware accelerators! Texture filtering, rasterization, MAC, sqrt, ...!

21 21 3dfx Voodoo (1996)! Fixed-function rasterization, texture mapping, depth testing, etc.! Required a separate VGA card for 2D!

22 22 NVIDIA GeForce 256 (1999)! All stages implemented in hardware! Fixed-function rasterization, texture mapping, depth testing, etc.!

23 23 NVIDIA GeForce 3 (2001)! Optionally bypass fixed-function logic with a programmable vertex shader! Shader: a mini-program defining the logic of a pipeline stage! A specific shading language has to be used (e.g., GLSL, the OpenGL shading language)!

24 24 NVIDIA GeForce 6 (2004)! Improved programmability in the fragment shader! The vertex shader can read textures! Dynamic branches!

25 25 NVIDIA GeForce 6 (2004)! Pipelined architecture! Multiple cores for each stage! Programmable stages! The introduction of programmable stages requires fetch and decode units!

26 26 NVIDIA GeForce 7800 (2005)! (Figure: pipeline with fixed and programmable stages for vertex, fragment and composite processing)! The introduction of programmable stages requires fetch and decode units!

27 27 NVIDIA GeForce 8 (2006)! Ground-up architecture redesign! New geometry shader after the vertex shader! Introduction of the unified shader processor! Introduction of CUDA! Employment of the GPU for general-purpose computing: GP-GPU!

28 28 NVIDIA GeForce Tesla (2006)! Introduction of issue units for managing thread generation and scheduling!

29 29 Why a single shader processor?! Non-unified shader processors! Problems in balancing the workload across pipeline stages! With a heavy pixel workload, the pixel shader is the bottleneck! With a heavy geometry workload, the vertex shader is the bottleneck!

30 30 Why a single shader processor?! Unified shader processors! The same unified shader pool handles both heavy pixel workloads and heavy geometry workloads! Optimal usage of processing resources!

31 31 Unified shader processor! How the unified shader processor works! Three key ideas:! Instantiate many shader processors! Replicate ALUs inside the shader processor to enable SIMD processing! Interleave the execution of many groups of SIMD threads! SIMD: single instruction, multiple data!

33 33 Example: a diffuse reflectance shader! Shader programming model: fragments (or, more in general, work items) are processed independently! The function has to be executed for each fragment! One instruction stream per fragment!
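
A minimal sketch of such a diffuse (Lambertian) reflectance shader, written here as a per-work-item CUDA kernel rather than in a shading language; all names are illustrative:

```cuda
#include <cuda_runtime.h>

// Each thread shades one fragment independently: out = albedo * max(0, N.L).
__global__ void diffuseShade(const float4 *normals, const float4 *albedo,
                             float4 *out, float4 lightDir, int nFrags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nFrags) return;
    float4 n = normals[i];
    // N . L clamped to zero: surfaces facing away receive no light.
    float ndotl = fmaxf(0.0f, n.x*lightDir.x + n.y*lightDir.y + n.z*lightDir.z);
    out[i] = make_float4(albedo[i].x * ndotl, albedo[i].y * ndotl,
                         albedo[i].z * ndotl, 1.0f);
}
```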

34 Basic architecture of a modern CPU! 34

35 35 Basic architecture of a modern GPU! Remove components that help a single instruction stream run faster!

36 36 Replicate cores! Replicate cores to run several threads in parallel! 2 cores process 2 instruction streams in parallel!

37 37 Replicate cores! Replicate cores to run several threads in parallel! 4 cores process 4 instruction streams in parallel!

38 38 Replicate cores! Replicate cores to run several threads in parallel! 16 cores process 16 instruction streams in parallel!

39 39 Replicate cores! Replicate cores to run several threads in parallel! 16 cores process 16 instruction streams in parallel! PROBLEM: many cores would share the same instruction stream! Since each core has its own fetch and decode unit, we would rather run different instruction streams!

40 40 Replicate ALUs within the core! SIMD processing!

41 41 Replicate ALUs within the core! SIMD processing! Original compiled shader: processing one item using scalar operations on scalar registers!

42 42 Replicate ALUs within the core! SIMD processing! New compiled shader: processing 8 items using vector operations on vector registers!

43 43 Replicate ALUs within the core! SIMD processing!

44 44 Replicate ALUs within the core! SIMD processing does not imply SIMD instructions! Option 1: explicit vector instructions! Intel/AMD x86 SSE and AVX, IBM AltiVec (explicit vector length)! Option 2: scalar instructions with implicit HW vectorization! HW determines instruction stream sharing across ALUs! NVIDIA GeForce (SIMT warps), ATI Radeon architectures! SIMT: single instruction, multiple threads! Split identical independent work items over multiple threads executed in lockstep! An instruction stream of scalar instructions is shared among the various threads!
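
A short CUDA sketch of Option 2: the programmer writes a scalar instruction stream per work item, and the hardware executes groups of 32 such threads (warps) in lockstep (the kernel is illustrative):

```cuda
__global__ void saxpy(float a, const float *x, float *y, int n) {
    // Each thread runs this scalar instruction stream on its own item.
    // The hardware implicitly executes the 32 threads of a warp in
    // lockstep, turning each scalar instruction into a 32-wide SIMD one.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```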

45 45 Replicate ALUs within the core! SIMT processing!

46 46 Merging the two levels of replication! Result: a multicore architecture where each core is a SIMD architecture! 16 cores, each one having 8 ALUs = 128 simultaneous threads!

47 47 Branches! Branches have to be accurately handled!
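
An illustrative CUDA kernel showing why branches need careful handling: if lanes of a warp diverge, the warp executes both paths with inactive lanes masked off (the kernel itself is hypothetical):

```cuda
__global__ void divergent(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // If threads of the same warp take different sides of this branch,
    // the warp executes BOTH paths, masking off the inactive lanes,
    // so throughput drops until the paths reconverge.
    if (in[i] > 0.0f)
        out[i] = sqrtf(in[i]);   // path A: taken by some lanes
    else
        out[i] = 0.0f;           // path B: taken by the others
}
```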

50 50 Stalls! The execution of an instruction may have a data dependency on a previous one (still running) -> stall!! Example: access to texture memory (100x slower than ALU instructions)!

51 51 Stalls! Stalls due to data dependencies have to be managed as well! Memory accesses cause many stalls due to their considerably higher execution time with respect to ALU instructions (100x-1000x)! The fancy caches and logic that avoid stalls in CPUs have been removed! However, on a GPU we can run MANY independent instruction streams concurrently!

52 52 Stalls! Interleave processing of many streams on a single core to hide stalls caused by latency operations!

56 56 Stalls! Interleaving between contexts can be managed by HW or SW or both! The NVIDIA and AMD Radeon GPU approach:! HW schedules and manages all contexts! Special on-chip storage holds work-item state!
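
A common CUDA idiom consistent with this model (a sketch, not from the slides): launch far more threads than there are ALUs, so the hardware scheduler always has ready warps to swap in when one stalls on memory:

```cuda
__global__ void copyStride(const float *in, float *out, size_t n) {
    // Grid-stride loop: the grid is sized to oversubscribe the SMs, so
    // while one warp waits on a memory load the scheduler issues another.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        out[i] = in[i];
}
// Launch example: many more threads than ALUs, to give the scheduler
// enough ready warps to hide DRAM latency, e.g.
//   copyStride<<<1024, 256>>>(d_in, d_out, n);
```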

57 57 How to dimension the context! (Figure: many small contexts give maximal latency-hiding ability; few large contexts give low latency-hiding ability)!

58 58 Basic architecture of a modern GPU! Summary:! Use many slimmed-down cores to run in parallel! Pack cores full of ALUs (by sharing instruction streams across groups of work items)! Option 1: explicit SIMD vector instructions! Option 2: implicit sharing managed by HW! Avoid latency stalls by interleaving the execution of many groups of work items/threads! When a group stalls, work on another group!

59 59 NVIDIA Fermi (2010)! 16 streaming multiprocessors! Each multiprocessor has 32 streaming cores! SIMT: single instruction, multiple threads! 6 64-bit memory partitions! 1 global scheduler!

60 60 Streaming Multiprocessor (SM)! 32 streaming cores (SC)! 32-bit pipelined integer arithmetic unit with support for 64-bit operations (1 cycle)! IEEE single/double-precision floating-point unit providing multiply-add instructions (1/2 cycles)! 16 load/store units! Concurrent access to data at each address of the cache or DRAM (1 cycle on a cache hit)! Memory seen as a 2D array! Int-to-float (and vice versa) casts performed while copying data from memory to the register file!

61 61 Streaming Multiprocessor (SM)! 4 special function units (SFUs)! For transcendental functions (sine, cosine, square root, etc.)! Slower than the other units (8 cycles)! Decoupled from the dispatching units to improve performance! Shader clock = 2x GPU clock to optimize the area/performance trade-off!

62 62 Streaming Multiprocessor (SM)! PTX (Parallel Thread Execution) is a scalar instruction set architecture! Arithmetic/logic operations! Special functions! Memory load/store! Texture fetch! Memory atomic instructions! Branch, call, return! Barrier synchronization!

63 63 Streaming Multiprocessor (SM)! Threads are grouped in sets of 32 threads sharing an instruction stream, called warps! The SM has 2 scheduling and dispatching units! Two warps are selected each clock cycle (fetch, decode and execute two warps in parallel)! The register file may host up to 48 interleaved warps! 1536 threads per SM! Globally, 24,576 threads (16 SMs x 1536)!

64 64 Streaming Multiprocessor (SM)! Each scheduler may issue an instruction to 16 ALU cores, 16 load/store units, or 4 SFUs! Each double-precision FPU instruction requires 2 ALU cores! Each clock cycle the scheduler selects a warp that is ready to be executed! Warps are independent -> no dependency check is required! Double-precision instructions do not support dual dispatch!

65 65 Streaming Multiprocessor (SM)! An example of scheduling of 32 instructions! A scoreboard is used to keep track of the status of the thread warps!

66 66 Other features of Fermi! 2-level distributed scheduler! At chip level, a global workload distribution engine dispatches thread blocks to the various SMs! As already discussed, at SM level each warp scheduler distributes warps! Support for fast context switching among different applications (around 25us)!

67 67 Other features of Fermi! Support for concurrent execution of multiple compute kernels from the same application!

68 68 NVIDIA Kepler (2012)! Architecture similar to Fermi's, with performance and power-efficiency improvements!

69 69 Streaming Multiprocessor (SMX)! New SM architecture (now called SMX)! 192 single-precision streaming cores (floating-point/integer instructions)! 64 double-precision streaming cores! 32 special function units! 32 load/store units! 8 texture filtering units! Shader clock = GPU clock!

70 70 Streaming Multiprocessor (SMX)! New SM architecture (now called SMX)! 4 warp schedulers! 2 dispatchers per warp scheduler (each cycle, 2 independent instructions per warp can be issued)! The warp scheduler has been enhanced! Register scoreboarding for long-latency operations (texture and load)! Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)! Thread-block-level scheduling!

72 72 Dynamic parallelism! A kernel can launch another kernel! Kernel launch, synchronization on results and work scheduling are managed directly by the GPU device! This allows the optimization of recursive and data-dependent execution patterns!
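
A minimal sketch of dynamic parallelism (hypothetical kernels; requires compiling with relocatable device code, -rdc=true, for compute capability 3.5 or newer):

```cuda
#include <cuda_runtime.h>

__global__ void child(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

__global__ void parent(int *data, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // Launch and synchronization are handled by the GPU, not the host.
        child<<<(n + 127) / 128, 128>>>(data, n);
        cudaDeviceSynchronize();   // device-side sync on the child grid
                                   // (legacy CDP API, valid in this era)
    }
}
```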

73 73 Several work queues! The Hyper-Q mechanism offers 32 HW-managed work queues! Hyper-Q enables multiple CPU cores (and applications) to launch work on a single GPU simultaneously! Advantages:! Avoidance of false intra-stream dependencies! Dramatic increase of GPU utilization and reduction of CPU idle times!

74 74 NVIDIA Maxwell (2014)! Architecture similar to Kepler's! New streaming multiprocessor (SMM)! SCs are partitioned in 4 groups, each one assigned to a single warp scheduler! Performance/power improvements!

75 75 NVIDIA Pascal (2017)! Architecture similar to Maxwell's! Further performance/power improvements! Introduction of half-precision floating-point instructions (2x throughput w.r.t. single precision)! Preemption at single-instruction granularity!

76 76 Cache organization! CPU cores run efficiently when data is resident in cache! Caches reduce latency and provide high bandwidth!

77 77 Cache organization! Initially the GPU core was not provided with caches! The GPU core required a high-bandwidth connection to memory!

78 78 Limited bandwidth! A high-end GPU (e.g. a Radeon HD) has...! Over twenty times (2.7 TFLOPS) the compute performance of a quad-core CPU! No large cache hierarchy to absorb memory requests! The GPU memory system is designed for throughput! Wide bus (150 GB/sec)! Repack/reorder/interleave memory requests to optimize use of the memory bus! Still, this is only 5 times the bandwidth available to a CPU!

79 79 Limited bandwidth! A more recent system configuration (Intel Core i CPU + NVIDIA GeForce GTX GPU)!

80 80 Limited bandwidth! If processors request data at too high a rate, the memory system cannot keep up! Overcoming bandwidth limits is a common challenge for GPU-compute application developers!! Request data less often (instead, do more math)! Arithmetic intensity! It might be quicker to calculate something from scratch on the device instead of copying it from the host! Fetch data from memory less often (share/reuse data across fragments)! On-chip communication or storage! Graphics elaborations fit well with these constraints! More ALU operations than memory accesses!

81 81 Modern GPU memory hierarchy! Modern GPUs are provided with! Local memories (not synched with main memory)! Texture caches (read-only)! Moreover, L1-L2 caches have been added in later architectures! Do consider that GPU applications present high spatial locality but very low temporal locality! Cache coherency only on the L2 cache!

82 82 Transmission cost! Another relevant aspect is the CPU/GPU transmission bandwidth! PCIe bandwidth: 16 GB/s in each direction! Attempt to pipeline/multi-buffer uploads and downloads!
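
A hedged sketch of the pipelining/multi-buffering suggestion above, using two CUDA streams and asynchronous copies; names are illustrative, n is assumed to be a multiple of the chunk count, and the host buffer is assumed to be pinned (cudaHostAlloc) for the copies to truly overlap:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) {       // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// Split the work into chunks and alternate two streams/buffers so PCIe
// uploads, kernel execution and downloads overlap (double buffering).
void pipelined(float *h_data, int n, int chunks) {
    int chunk = n / chunks;
    float *d[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;                              // alternate the buffers
        float *h = h_data + (size_t)c * chunk;
        cudaMemcpyAsync(d[b], h, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);    // upload
        process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d[b], chunk);
        cudaMemcpyAsync(h, d[b], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);    // download
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(s[b]);
        cudaStreamDestroy(s[b]);
        cudaFree(d[b]);
    }
}
```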

83 83 NVIDIA GeForce 8 (2006)! Each SM is provided with! 16KB shared memory! 64KB constant cache! 8KB texture cache! Each processor can access all memory locations at 86 GB/s, with different latencies:! Shared: 2 cycles! Device: 300 cycles!

84 84 NVIDIA Fermi (2010)! Each SM is provided with 64KB of local shared memory used by thread blocks to cooperate! Reduction of off-chip traffic! The shared memory is a user-managed scratchpad memory! The shared memory can be configured by the programmer to also obtain an L1 cache! Introduced a chip-level L2 cache! Unified address space over private, shared and global memory!
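
An illustrative use of the user-managed scratchpad (a hypothetical 1D blur kernel, assuming a block size of 256 and n a multiple of 256): each value is loaded from DRAM once and then reused by the whole thread block:

```cuda
// Shared memory as a scratchpad: one global load per thread, three reuses.
__global__ void blurShared(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];                    // block data plus halo
    int i = blockIdx.x * 256 + threadIdx.x;            // assumes grid*256 == n
    int t = threadIdx.x + 1;
    tile[t] = in[i];                                   // one DRAM load per thread
    if (threadIdx.x == 0)   tile[0]   = (i > 0)     ? in[i - 1] : 0.0f;
    if (threadIdx.x == 255) tile[257] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();                                   // block-level cooperation
    out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}
// On Fermi the 64KB can be split between shared memory and L1, e.g.
//   cudaFuncSetCacheConfig(blurShared, cudaFuncCachePreferShared);
```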

85 85 NVIDIA Fermi (2010)! The texture cache has been removed from L1 since it is not efficient for general-purpose computing! Register spilling on L1! Fast atomic memory operations on the L2 memory! Read-modify-write, compare-and-swap! Efficient sorting and building of data structures!
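
A sketch of the kinds of atomics this accelerates (illustrative kernels, not from the slides): a histogram via read-modify-write atomicAdd, and a lock-free floating-point max built on atomicCAS:

```cuda
// Histogram: many threads increment shared bins with read-modify-write.
__global__ void histogram(const unsigned char *keys, unsigned *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[keys[i]], 1u);   // atomic read-modify-write
}

// Lock-free max via compare-and-swap. NB: the int reinterpretation trick
// assumes non-negative floats (their bit patterns order like integers).
__device__ void atomicMaxFloat(float *addr, float val) {
    int *a = (int *)addr;
    int old = *a;
    // Retry until our value is no longer greater than the stored one.
    while (val > __int_as_float(old))
        old = atomicCAS(a, old, __float_as_int(val));
}
```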

86 86 NVIDIA Kepler (2012)! Doubled Fermi cache sizes: 128KB L1, 1536KB L2! Introduced a read-only cache (similar to a texture cache)! Added shuffle instructions to exchange data among threads without using shared memory! Optimizes synchronization time!
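
A minimal example of the shuffle idea (using the __shfl_down_sync form introduced with CUDA 9; earlier toolkits used __shfl_down): a warp-wide sum with no shared memory traffic:

```cuda
// Threads of a warp exchange registers directly, summing 32 values
// without touching shared memory or synchronizing the whole block.
__device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);  // lane i += lane i+offset
    return v;   // lane 0 holds the sum of the whole warp
}
```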

87 87 NVIDIA Pascal (2017)! Several improvements in caches! Dedicated 64KB per-SM shared memory! Unified virtual addressing between CPU and GPU! Memory pages are transparently transmitted between CPU and GPU memories! Automatic handling of page faults and global data coherence!
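
A hedged sketch of this programming model using CUDA managed memory (the cudaMallocManaged API; on Pascal, pages migrate on demand through GPU page faults):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void inc(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main() {
    const int n = 1024;
    int *a;
    cudaMallocManaged(&a, n * sizeof(int));  // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) a[i] = i;    // touched by the CPU...
    inc<<<(n + 255) / 256, 256>>>(a, n);     // ...then by the GPU: pages migrate
    cudaDeviceSynchronize();
    printf("a[0] = %d\n", a[0]);             // pages migrate back on CPU access
    cudaFree(a);
    return 0;
}
```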

88 88 NVIDIA Pascal (2017)! NVLink: new high-speed interface (160GB/s bidirectional) to enable multi-GPU architectures for high-performance computing!

89 89 Other vendors! We have analyzed NVIDIA GPUs so far! There are many other GPU vendors! E.g.: AMD, ARM, etc.! The overall GPU architecture is quite similar to the NVIDIA one!

90 90 AMD Radeon HD 6970 Cayman (2010)! The streaming multiprocessor is here called a compute unit! SIMD function unit, with control shared across 16 functional units (up to 4 MUL-ADDs per clock cycle): VLIW processing! Groups of 64 [fragments/vertices/etc.] share an instruction stream (called a wavefront)! Four clocks to execute an instruction for all fragments in a wavefront!

91 91 AMD Radeon HD 6970 Cayman (2010)! There are 24 of these cores (compute units) on the 6970, able to handle up to ~32,000 fragments!

92 92 AMD Radeon HD 6970 Cayman (2010)! Wavefront size is 64 work-items! Vector instructions are performed with one lane per work-item! Scalar instructions are performed once for the entire wavefront! Vector load/store instructions supply one address per work-item! A SIMD lane (SL) executes one vector instruction! 16 stream cores execute 16 vector instructions on each cycle! A quarter of the wavefront (16 work-items) is issued on each cycle to the SIMD unit! The entire wavefront is issued over four consecutive cycles! The entire wavefront issues to the other units in a single cycle! (Figure: compute unit, wavefront view: wavefront scheduler, instruction issue, branch unit, local data share, scalar unit, load/store unit, SIMD unit with lanes SL0-SL15)!

93 93 AMD Radeon HD 6970 Cayman (2010)! The single SIMD lane has a VLIW architecture! VLIW parallelism is extracted by the compiler!

94 94 AMD Radeon HD 6970 Cayman (2010)! Memory bandwidths:! Higher L2-L1 throughput helps reduce the overhead due to more frequent L1 misses w.r.t. CPUs (lower temporal locality)! A larger register file helps reduce register spilling! The same considerations hold for NVIDIA GPUs!

95 95 AMD Radeon HD 6970 Cayman (2010)! Local memory and registers are persistent within the compute unit once a work-group is scheduled! Traditional context switching is not used! Allows for no-overhead wavefront interleaving! The number of active wavefronts supported per compute unit is limited (496), decided by:! Local memory required per work-group! Register usage per work-item!

96 96 AMD Radeon R9 290X (2013)! 44 compute units! 4 SIMD cores per compute unit! Multi-bank coherent L2 cache (100s of GB/s)! Each unit contains an L1 cache and a user-managed scratchpad (~TB/s)! NO VLIW architecture!

97 97 AMD Radeon R9 290X (2013)! 4 SIMDs per compute unit! 16 SIMD lanes (SL) per SIMD unit! 1 scalar unit to handle instructions common to the wavefront! Loop iterators, constant variable accesses, branches! Has a single, integer-only ALU! A separate branch unit is used for some conditional instructions!

98 98 AMD Radeon R9 290X (2013)! Wavefronts are associated with a SIMD unit and a subset of the vector registers! Up to 10 wavefronts for each SIMD (40 wavefronts in total)! All hardware units except the SIMDs are shared by all wavefronts on a compute unit!

99 99 AMD Radeon R9 290X (2013)! Each cycle, wavefronts targeting one of the SIMDs are allowed to issue instructions! Every 4th cycle a given wavefront will be active! An instruction takes 4 cycles to enter the SIMD pipeline (4 sub-wavefronts per wavefront)! The scalar unit and branch unit can take 1 instruction per cycle! All hardware units can remain fully utilized with a simplified front-end using this round-robin technique!

100 100 AMD Radeon R9 290X (2013)! Up to 5 instructions can be issued per cycle! Only 1 per wavefront! Only 1 per instruction type (i.e., per hardware unit)! Need multiple instructions types present to fully utilize hardware units! Instruction types! Vector Arithmetic Logic Unit (ALU)! Scalar ALU or Scalar Memory Read! Vector memory access (Read/Write/Atomic)! Branch/Message! Local Data Share (LDS)! Export or Global Data Share (GDS)! Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio)!

101 101 AMD Radeon R9 290X (2013)! R/W L1 caches! 16 KB per compute unit! Write-through (dirty-byte mask)! 64B lines! R/W L2 caches! 16 partitions with 64 KB/partition! Write-back (dirty-byte mask)! 64B lines! Local Data Share (LDS)! 64 KB per compute unit! 32 banks! Contains integer atomic units!

102 102 AMD Radeon R9 290X (2013)! Memory bandwidths:! Same considerations as for the previous architectures! AMD FX-8350 CPU! AMD R9 290X GPU!

103 103 ARM Mali-T628 (2014)! Targeted at embedded computing!

104 104 CPU and GPU within the same chip! The trend in recent years has been to integrate the GPU on the same chip as the CPU! Opportunities:! Reduce offload cost! Reduce memory copies/transfers! Power management! Steps:! Remove the external communication link! Define a unified memory architecture!

105 105 AMD Llano (2011)! Targeted at mobile and desktop computing! Subsequent solutions were integrated in the Microsoft Xbox One and Sony PlayStation 4! Architecture:! CPU: AMD K10 quad-core! GPU: AMD Radeon HD 6000!

106 106 Intel Sandy Bridge (2011)! First Intel generation (after Intel Westmere) with integrated GPU!

107 107 Intel Skylake (2015)! Solutions for the mobile, desktop and server markets! Architecture:! CPU: Intel multi-core (from m3 dual-core to i7 octa-core to Xeon E3 octa-core)! GPU: Intel HD Graphics (up to 24 execution units) or Iris Graphics (up to 72 execution units)!

108 108 Samsung Exynos 5422 (2014)! Targeted at mobile computing! Architecture:! CPU: ARM big.LITTLE octa-core! GPU: ARM Mali-T628 MP6!

109 109 NVIDIA Tegra X1 (2015)! Targeted at mobile computing! Architecture:! CPU: ARM big.LITTLE octa-core! GPU: NVIDIA Maxwell with 256 CUDA cores!

110 110 Final notes! Generic many-core GPU! Less space devoted to control logic and caches! Large register files to support multiple thread contexts! Low-latency, hardware-managed thread switching! Large number of ALUs per core, with a small user-managed cache per core! Memory bus optimized for bandwidth! ~150 GB/s bandwidth allows us to service a large number of ALUs simultaneously! Support for general-purpose computing! (Figure: simple ALUs and cache on chip, high-bandwidth bus to on-board system memory)!

111 111 Final notes! An efficient GPU workload! Has thousands of independent pieces of work! Uses many ALUs on many cores! Supports massive interleaving for latency hiding! Is amenable to instruction stream sharing! Maps to SIMD execution well! Is compute-heavy: the ratio of math operations to memory accesses is high! Not limited by bandwidth!

112 112 Final notes! Original role: implementing the graphics pipeline for 3D rendering! Nevertheless, GPUs can also be used for accelerating general-purpose computations (GP-GPU)! Several languages have been developed (CUDA, OpenCL, C++ AMP)!

113 113 References! Material taken from other university courses on computer architectures, computer graphics and parallel computing, in particular:! The NVIDIA website! Benedict R. Gaster, Lee Howes, David Kaeli, Perhaad Mistry, Dana Schaa, Heterogeneous Computing with OpenCL, Morgan Kaufmann, 2012! David Kaeli, Perhaad Mistry, Dana Schaa, Dong Ping Zhang, Heterogeneous Computing with OpenCL 2.0, Morgan Kaufmann, 2015!
