NVidia's GPU Microarchitectures. By Stephen Lucas and Gerald Kotas



2 Intro
Discussion Points
- Difference between CPU and GPU
- Uses of GPUs
- Brief History
- Tesla Architecture
- Fermi Architecture
- Kepler Architecture
- Maxwell Architecture
- Future Advances

3 CPU vs GPU Architectures
CPU
- Few cores
- Lots of cache
- Handful of threads
- Independent processes
GPU
- Hundreds of cores
- Thousands of threads
- Single process execution

4 Uses of GPUs
Gaming
- Graphics
- Focuses on high frames per second
- Low polygon count
- Predefined textures
Workstation
- Computation
- Focuses on floating point precision
- CAD graphics: billions of polygons

5 Brief Timeline of NVidia GPUs
- World's First GPU: GeForce
- First Programmable GPU: GeForce
- Scalable Link Interface
- CUDA Architecture Announced
- Launch of Tesla Computation GPUs with Tesla Microarchitecture
- Fermi Architecture Introduced
- Kepler Architecture Launched
- Pascal Architecture


6 Tesla
- First microarchitecture to implement the unified shader model, which uses the same hardware resources for all shader processing (vertex, geometry, and fragment)
- Consists of a number of stream processors (SPs), which are scalar and can only operate on one component at a time
- Increased clock speed in GPUs
- Round robin scheduling for warps
- Contains local, shared, and global memory
- Contains Special-Function Units, which are specialized for interpolating points
- Allows two instructions to execute per clock cycle per SP
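The round robin warp scheduling mentioned on this slide can be sketched as a toy simulation. This is only an illustration of the general idea, not a model of the actual Tesla hardware: the function name `round_robin_issue_order` is hypothetical, and it assumes every warp is always ready and that one instruction is issued per turn.

```python
from collections import deque


def round_robin_issue_order(instructions_per_warp):
    """Return the order in which warp IDs are issued, one instruction per turn.

    Toy assumption: every warp is always ready (no memory stalls), so the
    scheduler simply cycles through the warps that still have work left.
    """
    remaining = list(instructions_per_warp)
    # Only warps with at least one instruction enter the ready queue.
    ready = deque(w for w, count in enumerate(remaining) if count > 0)
    order = []
    while ready:
        warp = ready.popleft()      # take the warp at the head of the queue
        order.append(warp)          # issue one instruction from it
        remaining[warp] -= 1
        if remaining[warp] > 0:     # unfinished warps go to the back
            ready.append(warp)
    return order


# Three warps with 2, 1 and 3 instructions left.
print(round_robin_issue_order([2, 1, 3]))  # prints [0, 1, 2, 0, 2, 2]
```

Each warp gets one turn before any warp gets a second one, which is the fairness property that round robin scheduling provides.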

7 Fermi
Peak Performance Overview:
- Each SM has 32 CUDA cores
- PCI-Express v2 bus connecting CPU and GPU (8 GB/s peak transfer)
- Up to 6 GB GDDR5 DRAM (192 GB/s peak transfer)
- Estimated 1.5 GHz clock frequency
- 2 GHz global memory clock frequency
- Peak performance of 1.5 TFLOPS
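The peak figures on this slide can be related to each other with some back-of-the-envelope arithmetic. The sketch below makes two assumptions that are not on the slide: a full Fermi chip has 16 SMs, and each CUDA core completes one fused multiply-add (counted as 2 floating-point operations) per clock. The function names are hypothetical.

```python
CORES_PER_SM = 32        # from the slide
CLOCK_GHZ = 1.5          # from the slide (estimated)
SM_COUNT = 16            # assumption: SM count of a full Fermi chip
FLOPS_PER_CORE_CLOCK = 2  # assumption: one fused multiply-add = 2 FLOPs


def peak_gflops(sm_count, cores_per_sm, clock_ghz):
    """Peak single-precision GFLOPS = cores * clock * FLOPs per core per clock."""
    return sm_count * cores_per_sm * clock_ghz * FLOPS_PER_CORE_CLOCK


def transfer_seconds(gigabytes, gigabytes_per_second):
    """Time to move a buffer at a given peak transfer rate."""
    return gigabytes / gigabytes_per_second


print(peak_gflops(SM_COUNT, CORES_PER_SM, CLOCK_GHZ))  # 1536.0, about 1.5 TFLOPS
print(transfer_seconds(6, 8))    # 0.75 s to copy 6 GB over PCI-Express v2
print(transfer_seconds(6, 192))  # 0.03125 s to read 6 GB from GDDR5
```

Under these assumptions the PCI-Express link is 24 times slower than on-board memory, which is why keeping data resident on the GPU matters.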

8 Fermi Continued
- Desktop GPU transistor size: 40 nm
- Mobile GPU transistor size: 40 nm or 28 nm
- Fused Multiply-Add (A*B + C)
  - No loss of precision in the addition, while being faster than separate operations
- Two-level, distributed thread scheduling
  - 2 warps issued and executed concurrently
- Double-precision has half the performance of single-precision on workstations
  - Limited to ⅛ on consumer cards
- 32K 32-bit registers
- 64 KB on-chip memory
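The precision benefit of fused multiply-add comes from rounding once instead of twice. The sketch below emulates that on the CPU in double precision using exact rational arithmetic; it is an illustration of the rounding behaviour only, not GPU code, and the function names are hypothetical.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """a*b + c with two roundings: the product is rounded before the add."""
    product = a * b
    return product + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to a float."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 when the multiply is done separately.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(separate_multiply_add(a, b, c))  # 0.0: the small term was rounded away
print(fused_multiply_add(a, b, c))     # -2**-60, about -8.67e-19
```

With separate operations the result loses all significant digits, while the single rounding of the fused form keeps the exact answer.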

9 Kepler
- Implemented nested kernels
- Allowed multiple CPU cores to launch work on a single GPU simultaneously
- 64K 32-bit registers
- 32 threads/warp
- 1024 max threads per block
- 64 max warps/mp
- 16 max thread blocks/mp
- Each SM contains 192 single-precision CUDA cores
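The limits on this slide bound how many thread blocks can be resident on one multiprocessor at a time. The following is a simplified occupancy sketch under stated assumptions: the 64K registers, 64 warps, and 16 blocks are treated as per-multiprocessor limits, and shared memory and register allocation granularity are ignored. The function names are hypothetical.

```python
import math

THREADS_PER_WARP = 32
MAX_WARPS_PER_MP = 64        # assumption: per-multiprocessor warp limit
MAX_BLOCKS_PER_MP = 16       # assumption: per-multiprocessor block limit
REGISTERS_PER_MP = 64 * 1024  # 64K 32-bit registers


def resident_blocks(threads_per_block, registers_per_thread):
    """Blocks that fit on one multiprocessor (simplified model)."""
    warps_per_block = math.ceil(threads_per_block / THREADS_PER_WARP)
    by_warps = MAX_WARPS_PER_MP // warps_per_block
    registers_per_block = registers_per_thread * warps_per_block * THREADS_PER_WARP
    by_registers = REGISTERS_PER_MP // registers_per_block
    return min(MAX_BLOCKS_PER_MP, by_warps, by_registers)


def occupancy(threads_per_block, registers_per_thread):
    """Resident warps divided by the maximum number of warps."""
    warps_per_block = math.ceil(threads_per_block / THREADS_PER_WARP)
    blocks = resident_blocks(threads_per_block, registers_per_thread)
    return blocks * warps_per_block / MAX_WARPS_PER_MP


print(occupancy(256, 32))  # 1.0: 8 blocks of 8 warps fill all 64 warp slots
print(occupancy(256, 64))  # 0.5: registers allow only 4 blocks
print(occupancy(64, 16))   # 0.5: the 16-block limit caps it at 32 warps
```

The three calls show each limit becoming the bottleneck in turn: warp slots, registers, and the block count.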

10 Kepler Continued
- SMX uses the primary GPU clock rather than the 2x shader clock used in Fermi/Tesla
- Lower power draw, improving performance per watt
- Includes fused multiply-add like Fermi, allowing for high precision
- Each SMX features four warp schedulers, allowing four warps to be issued and executed concurrently
- Has twice as many instruction dispatch units as warp schedulers, allowing two independent instructions per warp to begin execution concurrently
- Allows double-precision instructions to be paired with other instructions, unlike Fermi
- Register scoreboarding for long-latency operations
- Dynamic inter-warp scheduling
- Ability for thread-block-level scheduling

11 Maxwell
- Focused on power efficiency rather than additional features
- L2 cache increased from 256 KiB to 2 MiB
  - Reduced the memory bus width needed from 192 bit to 128 bit
- Started using tile rendering
  - Reduces the amount of memory needed when rendering
- Double-precision performance is 1/32 of single-precision
  - Worse than previous versions
- 64K registers

12 Pascal
- 64 CUDA cores per streaming multiprocessor
- High Bandwidth Memory 2 with a 4096-bit bus and memory bandwidth of 720 GB/s
- Unified Memory: CPU and GPU can access the same memory with the help of a Page Migration Engine
- NVLink: provides a high-bandwidth bus between CPU and GPU, allowing higher transfer speeds than PCI-Express
- Twice the number of registers per CUDA core, more shared memory
- Dynamic load balancing, allowing for asynchronous computations
- Instruction-level and thread-level preemption
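The bus width and bandwidth quoted for High Bandwidth Memory 2 are linked by a simple formula: bandwidth equals bytes per transfer times transfers per second. The sketch below works backwards from the slide's two numbers to the per-pin data rate they imply; that rate is derived here, not stated on the slide, and the function names are hypothetical.

```python
def bandwidth_gb_per_s(bus_width_bits, gigatransfers_per_s):
    """Peak bandwidth = (bus width in bytes) * (transfers per second)."""
    return bus_width_bits / 8 * gigatransfers_per_s


def implied_rate_gt_per_s(bus_width_bits, bandwidth_gb_s):
    """Per-pin transfer rate implied by a bus width and a peak bandwidth."""
    return bandwidth_gb_s / (bus_width_bits / 8)


# Slide figures: 4096-bit bus, 720 GB/s.
rate = implied_rate_gt_per_s(4096, 720)
print(rate)                            # 1.40625 GT/s per pin
print(bandwidth_gb_per_s(4096, rate))  # 720.0 GB/s, recovering the slide figure
```

The very wide bus is what lets the memory reach this bandwidth at a modest per-pin rate.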

13 Future of GPUs
- Integrated GPU and CPU architecture into one chip
  - Reduces latency, increases bandwidth, and improves cache-coherent memory sharing
- Smaller transistor sizes and larger amounts of memory
- Specialized GPUs for tasks such as machine learning
- Improved interconnections between GPUs
  - NVidia working on better connections in Pascal

14 Questions?

