
GRAPHICS PROCESSING UNITS
Slides by: Pedro Tomás
Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
ADVANCED COMPUTER ARCHITECTURES / ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

Outline
Flynn's taxonomy of computer systems
Vector processing
Single Instruction, Multiple Data stream (SIMD) processing
Implementation
Intel SIMD extensions
Examples

Types of parallelism
Instruction-Level Parallelism
  Available when multiple instructions can be executed simultaneously
  Exploitable by pipelining execution and through superscalar and VLIW architectures
Data-Level Parallelism
  Available by applying the same operation over different data or, equivalently, by dividing the data among multiple processing elements
  Exploitable by:
    Using SIMD instructions (vector processing): fine-grained parallelism
    Relying on multiple cores within a single processor: coarse-grained parallelism
    Using multiple machines (cluster level): coarser-grained parallelism
Task-Level Parallelism
  Available by dividing the workload into multiple operations (functions), which are applied to the same or to different data
  Exploited by using multiple processing elements (e.g., different cores or machines)

Matrix multiplication example
Consider the product of matrices A and B, C = AB, with A (N x L), B (L x M) and C (N x M), where each element is computed as:
  c_ij = Σ_k a_ik · b_kj

// Scalar (baseline) matrix multiplication: C = A*B
for (i = 0; i < N; i++) {
  for (j = 0; j < M; j++) {
    for (k = 0, c[i][j] = 0; k < L; k++)
      c[i][j] += a[i][k] * b[k][j];
  }
}

Is there any parallelism in the computation? What type of parallelism? Data-level? Instruction-level? Task-level?

Matrix multiplication example: exploiting data-level parallelism
Consider the product of matrices A and B, C = AB, with A (N x L), B (L x M) and C (N x M), where each element is computed as:
  c_ij = Σ_k a_ik · b_kj

// SIMD matrix multiplication with AVX2/FMA intrinsics: each iteration of the
// j loop computes 8 consecutive elements of row i of C
// (assumes M is a multiple of 8 and that the rows of b and c are 32-byte aligned)
__m256 vect1, vect2, vectcum;
for (i = 0; i < N; i++) {
  for (j = 0; j < M; j += 8) {
    vectcum = _mm256_setzero_ps();
    for (k = 0; k < L; k++) {
      vect1 = _mm256_set1_ps(a[i][k]);                  // broadcast a[i][k] to all 8 lanes
      vect2 = _mm256_load_ps(&(b[k][j]));               // load 8 elements of row k of B
      vectcum = _mm256_fmadd_ps(vect1, vect2, vectcum); // vectcum += vect1 * vect2
    }
    _mm256_store_ps(&(c[i][j]), vectcum);
  }
}

Matrix multiplication example: exploiting task-level parallelism
Consider the product of matrices A and B, C = AB, with A (N x L), B (L x M) and C (N x M), where each element is computed as:
  c_ij = Σ_k a_ik · b_kj

// Pseudocode: one parallel task per output element (i, j)
#parallel task (i,j)
for (k = 0, c[i][j] = 0; k < L; k++)
  c[i][j] += a[i][k] * b[k][j];

Exploiting task-level parallelism:
Divide the computation into tasks (threads)
Each thread computes the value of one element (see the CUDA sketch below)
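To make the idea concrete, the following is a minimal CUDA sketch of this one-thread-per-element mapping (the kernel name matMulKernel and the launch configuration are illustrative, not from the original slides):

// One thread per output element of C = A*B; matrices stored row-major in flat arrays.
__global__ void matMulKernel(const float *a, const float *b, float *c,
                             int N, int L, int M)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C

    if (i < N && j < M) {
        float acc = 0.0f;
        for (int k = 0; k < L; k++)
            acc += a[i * L + k] * b[k * M + j];
        c[i * M + j] = acc;
    }
}

// Launch: a 2D grid of 16x16-thread blocks covering the N x M output matrix.
// dim3 block(16, 16);
// dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y);
// matMulKernel<<<grid, block>>>(d_a, d_b, d_c, N, L, M);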

Graphics Processing Pipeline
What are the computational requirements? Which parallelism levels can be exploited? How to design the architecture?

Graphics processing: micro-architecture design goals
Design goals:
  General Purpose Processors (GPPs): support for a wide range of applications; requires a flexible ISA
  Special Purpose Processors (SPPs): efficient support for a reduced set of applications; can use a specialized ISA
Design constraints: speed (delay/throughput), cost, performance, power and energy consumption
[Figure: applications running on top of an ISA, which is in turn implemented by a micro-architecture, for both the GPP and the SPP case]

Graphics processing: DirectX 11 graphics pipeline
Input Assembler stage: supplies data (triangles, lines and points) to the pipeline.
Vertex Shader stage: performs operations on single vertices, such as transformations, skinning, and lighting, producing a flow of output vertices.
Tessellation stages (Hull-Shader, Tessellator, Domain-Shader): convert higher-order surfaces to triangles for rendering.
Geometry Shader stage: processes primitives using multiple vertices (three for a triangle, two for a line, or a single vertex for a point); additional vertices can be used for edge-adjacent primitives. Can discard the primitive, or emit one or more new primitives.
Rasterizer stage: clips and prepares primitives for the pixel shader; also determines how to invoke pixel shaders.
Pixel-Shader stage: receives interpolated data for a primitive and generates per-pixel data such as color.
Output Merger stage: combines various types of output data (pixel shader values, depth and stencil information) with the contents of the render target and depth/stencil buffers to generate the final pipeline result.
All stages exchange data with memory; the pipeline takes 3D primitives as input and produces a 2D image as output.

Graphics processing: exploitable parallelism
Millions of parallel computations to perform
In general, the computation of a given pixel is independent of the adjacent pixels: massive data-level parallelism
The modern graphics pipeline is complex, having to support different operations over different display areas: requires support for task-level parallelism
Modern GPUs provide a uniform way of exploiting both parallelism levels by means of thread-level parallelism

Graphics processing: micro-architecture single-core design
Design goals:
  Exploit highly parallel and data-intensive computations
  Provide support for a large number of threads
  Requires high throughput for high-definition displays
What should be the main characteristics?
  Out-of-order execution? No!
  Branch prediction? No!
  Pipeline length? Long!
  Forwarding mechanisms? No! (resolve hazards by interleaving execution with other threads)
  Register file size? Very large! (simultaneous support for multiple running threads)
  SIMD instructions? Yes! (each thread corresponds to a different vector element)
  VLIW instructions? Yes! (can be used by allowing multiple threads per VLIW bundle)

Graphics processing: micro-architecture single-core design
SIMD-like: at each cycle, each core issues the same instruction for a group of threads
  A single instruction is performed on multiple data
  Reduces pressure on the IF stages; simpler control
  If there is at least one execution unit per vector element, there are no structural (execution) hazards
  If one thread stalls (e.g., on a memory access), all threads in the same vector stall
  Low functional unit utilization: e.g., while FP operations execute, the integer execution units sit idle

Graphics processing: micro-architecture single-core design
VLIW-like: at each cycle, the core issues a bundle of independent instructions from one or more threads
  Stalls on one thread have limited impact on the execution of other threads
  A reduced number of execution units lowers static power consumption
  Increases hardware resource usage
  Requires high instruction memory (cache) throughput
  Requires inter-thread parallelism to be extracted dynamically

Graphics processing: micro-architecture single-core design
MIMD-like: at each cycle, the core issues multiple instructions from multiple vectors (each vector is composed of a group of n threads)
  Increases functional unit utilization compared with a pure SIMD-like approach: if one thread stalls, a different set of threads is issued
  Threads are statically bound to a given group to ease thread control
    NVIDIA uses the term warp to designate a group of 32 threads
    AMD uses the term wavefront to designate a group of 64 threads
  Requires the co-existence of multiple hardware schedulers

Graphics processing: micro-architecture single-core design
Design goals:
  Exploit highly parallel and data-intensive computations
  Provide support for a large number of threads
  Requires high throughput for high-definition displays
How can we scale the performance?
  Use simple cores
  Allow the co-existence of multiple cores, each supporting a massive number of threads
Modern GPPs (e.g., Intel Core i7 6700K) are composed of a small number of cores (4), each supporting a small number of running threads (up to 2)
Modern GPUs (e.g., NVIDIA GeForce GTX 980) are composed of multiple cores (16 SMs), each supporting the parallel execution of hundreds of threads (2048), typically executed in groups (1 warp = 32 threads)

Graphics processing: micro-architecture single-core design
Design goals:
  Exploit highly parallel and data-intensive computations
  Provide support for a large number of threads
  Requires high throughput for high-definition displays
Instruction Set Architecture?
  Modern GPPs (e.g., Intel Core i7 6700K) use standard scalar ISAs, further extended with a set of vector instructions
  Modern GPUs (e.g., NVIDIA GeForce GTX 980) use special-purpose ISAs oriented toward parallel thread (warp) execution
    They generally support most general-purpose programming operations, but usually rely on predicated instructions to handle control operations (branches) that lead to warp-level divergence (see the sketch below)
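As an illustration of warp-level divergence (a hypothetical kernel, not from the slides): when threads of the same warp take different sides of a branch, the hardware executes both paths with per-thread predication/masking, so the two paths are serialized.

// Hypothetical kernel illustrating warp-level divergence: threads of the same
// warp take different branch directions, so the warp executes both paths one
// after the other, with inactive lanes masked off.
__global__ void divergentKernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    if (threadIdx.x % 2 == 0)          // even lanes of the warp
        out[idx] = in[idx] * 2.0f;     // path A (odd lanes are masked off)
    else                               // odd lanes of the warp
        out[idx] = in[idx] + 1.0f;     // path B (even lanes are masked off)

    // Branching at warp granularity (e.g., on the warp index) avoids this
    // serialization, since all 32 threads then follow the same path.
}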

Graphics processing: micro-architecture single-core design
Design goals:
  Exploit highly parallel and data-intensive computations
  Provide support for a large number of threads
  Requires high throughput for high-definition displays
Memory hierarchy?
  Modern GPPs (e.g., Intel Core i7 6700K) use multiple cache levels (L1, L2, L3) to minimize data access latency and avoid load-use stalls
  Modern GPUs (e.g., NVIDIA GeForce GTX 980) typically use only 2 cache levels (L1 and L2), together with a main (global) memory
    Discrete GPUs do not have direct access to the CPU main memory, hence requiring explicit data transfers between CPU and GPU (see the sketch below)
    Each GPU core usually provides a low-latency (reduced size) scratchpad memory, as well as a constant (texture) memory
    The GPU L2 cache is shared among all cores
    The GPU L1 cache is special: it can be used for register caching and, optionally, for data caching
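A minimal host-side sketch of this explicit CPU-GPU data movement (the buffer names and the saxpyKernel are illustrative, not from the slides):

#include <cuda_runtime.h>

__global__ void saxpyKernel(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void runSaxpy(float a, const float *h_x, float *h_y, int n)
{
    float *d_x, *d_y;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_x, bytes);                     // allocate GPU global memory
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // explicit CPU -> GPU copies
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    saxpyKernel<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);  // explicit GPU -> CPU copy
    cudaFree(d_x);
    cudaFree(d_y);
}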

Graphics processing: micro-architecture single-core design
Design goals:
  Exploit highly parallel and data-intensive computations
  Provide support for a large number of threads
  Requires high throughput for high-definition displays
Synchronization?
  Modern GPPs (e.g., Intel Core i7 6700K) provide mechanisms for explicit synchronization and cache coherence among cores
  Modern GPUs (e.g., NVIDIA GeForce GTX 980) typically do not provide such global synchronization mechanisms:
    Warps/wavefronts are grouped in blocks, and the execution of all threads within a block is assigned to a single core
    Synchronization is possible within a block of threads, but not among different blocks (since blocks may be scheduled to different cores); see the sketch below
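A minimal sketch of block-level synchronization with CUDA's __syncthreads() barrier (the reduction kernel below is illustrative, not from the slides); combining the per-block results still requires a second kernel launch or atomic operations, since there is no barrier across blocks:

// Each block (launched with 256 threads) reduces 256 elements in shared memory
// (the on-chip scratchpad), using __syncthreads() as an intra-block barrier,
// and writes one partial sum to global memory.
__global__ void blockSumKernel(const float *in, float *blockSums, int n)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    buf[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                       // wait until all threads stored their value

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                   // barrier after every reduction step
    }

    if (tid == 0)
        blockSums[blockIdx.x] = buf[0];    // one partial result per block
}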

Graphics processing: micro-architecture single-core design
Design goals:
  Exploit highly parallel and data-intensive computations
  Provide support for a large number of threads
  Requires high throughput for high-definition displays
Problems?
GPU micro-architectures are especially efficient for massively parallel applications, but they struggle when:
  The application requires periodic synchronization with the host processor
  The overheads of memory transfers between the host and the GPU cannot be efficiently hidden behind kernel execution (see the sketch below)
  Kernels have limited parallelism (remember that GPUs use parallelism to hide data hazards)
  Kernels have too much control flow (there is no branch prediction)
  Kernels are too small and have a very short execution time (the overhead of launching the kernel limits processing performance)
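One common way to hide transfer overheads is to split the work into chunks and overlap copies with kernel execution using page-locked (pinned) host memory and CUDA streams; a minimal sketch under those assumptions (the chunking scheme and processKernel are illustrative, not from the slides):

#include <cuda_runtime.h>
#include <math.h>

__global__ void processKernel(float *data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] = sqrtf(data[i]);
}

// Pipelines two chunks across two streams so the copy of one chunk can overlap
// with the kernel of the other. h_data is assumed to be allocated with
// cudaMallocHost (pinned memory), which is required for true async copies.
void processInChunks(float *h_data, float *d_data, int n)
{
    const int chunk = n / 2;
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int s = 0; s < 2; s++) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        processKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}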

NVIDIA GPU micro-architectures

NVIDIA G80 micro-architecture
In November 2006, the G80 micro-architecture introduced:
  Support for C programming
  Unification of the several processing pipelines, to support vertex, geometry, pixel and compute programs
  The single-instruction multiple-thread (SIMT) execution model
  Shared memory and barrier synchronization mechanisms for inter-thread communication
In June 2008, NVIDIA revised the G80 architecture (codenamed GT200):
  The number of streaming processor cores increased from 128 to 240
  Increased the size of the physical register file
  Introduced hardware memory access coalescing
  Introduced a double-precision floating-point format

Fermi micro-architecture: design goals (vs. GT200)
  Increase performance (especially double precision)
  New Streaming Multiprocessor (SM) architecture
  Dual warp scheduler, allowing instructions to be simultaneously fetched and dispatched from two independent warps
  Provide ECC support, to protect data against memory errors
  Introduce a true cache hierarchy
  Increase shared memory
  Faster context switching between programs
  Faster atomic operations

Fermi micro-architecture: architecture overview
Each core is named an SM (Streaming Multiprocessor)
A high-level GigaThread engine schedules the execution of the grid of thread blocks (the workload, i.e., the total number of threads) onto the SMs

Fermi micro-architecture: block scheduling
Each core is named an SM (Streaming Multiprocessor)
A high-level GigaThread engine schedules the execution of the grid of thread blocks (Block 0, Block 1, ..., Block 5) onto the SMs
Multiple blocks of threads may be allocated to the same SM (as long as it has enough hardware resources to support them)
In typical workloads, there are more blocks than what is physically supported by the SMs; in such cases, some blocks may have to wait for the execution of another block to finish before being assigned to a specific SM
Once a block is assigned to a given SM, it remains at that SM until completion

Fermi micro-architecture: block scheduling
Each core is named an SM (Streaming Multiprocessor)
A high-level GigaThread engine schedules the execution of the grid of thread blocks onto the SMs
Multiple kernels (up to 16) may execute concurrently, as long as they are from the same context (see the sketch below)
However, blocks are dispatched in order: all blocks from one kernel have to be dispatched (i.e., assigned to an SM) before blocks from the next kernel start being dispatched
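A minimal sketch of how concurrent kernel execution is exposed to the programmer: independent kernels launched into different non-default CUDA streams of the same context may overlap on the device (kernelA and kernelB are illustrative placeholders):

#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }
__global__ void kernelB(float *y, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) y[i] += 1.0f; }

void launchConcurrent(float *d_x, float *d_y, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels in different non-default streams of the same context
    // are allowed to execute concurrently (hardware resources permitting).
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(d_x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(d_y, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}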

Fermi micro-architecture: Streaming Multiprocessor (SM) architecture
Each SM is composed of:
  A register file with 32K 32-bit registers
  32 SPs (CUDA cores), each with pipelined FP and integer ALUs
    Supports up to 16 double-precision fused multiply-add (FMA) operations per clock
  16 load/store units
    A warp of 32 threads takes 2 cycles to execute
  4 special function units (SFUs)
    Support transcendental instructions such as sine, cosine, reciprocal, and square root
    Each executes one thread per clock, i.e., a warp of 32 threads takes 8 cycles to execute
  2 warp schedulers, each capable of issuing up to one instruction per clock cycle from a single warp of threads
    Because threads are assumed to be independent, no inter-thread dependency check is performed
    This may lead to conflicts if two threads access the same memory position

Fermi micro-architecture: warp scheduling
The warp scheduler relies on a multi-port register scoreboard:
  The scoreboard keeps track of any registers that are not yet ready
  A dependency checker block analyzes the scoreboard to determine which warps are eligible to be issued

Fermi micro-architecture: warp scheduling
Each cycle, each warp scheduler selects one warp to execute:
  If fewer than 2 warps are ready to execute, one (or both) of the warp schedulers stalls; these stalls may be overcome by increasing the number of threads in the same SM
  If there are 3 (or more) warps ready to execute, some ready warps must wait, since only two can be selected per cycle; a warp may also be stalled because the next assembly instruction has not yet been fetched, or because of an intra-block synchronization call (the CUDA __syncthreads() function)
  Whenever a double-precision operation is issued, the second warp scheduler must stall; this is likely due to bandwidth limitations when fetching data from the register file
  Structural hazards may also occur because a required hardware resource is busy, e.g.:
    The 4 SFUs are already operating on another warp
    The warp requires loading data from memory, but too many outstanding requests have already been issued

Fermi micro-architecture: instruction execution
Each warp (group of 32 threads) is executed on:
  16 integer/FP units: 2 cycles per warp
  16 LD/ST units: 2 cycles per warp
  4 SFUs: 8 cycles per warp
However, in Fermi the CUDA cores work at twice the base operating frequency
  This saves processing resources (area), but leads to higher energy consumption

Fermi micro-architecture: memory hierarchy
The SM memory hierarchy is composed of:
  An L1 texture cache
  A 64 KB L1 memory, which can be configured as:
    48 KB of shared memory and 16 KB of L1 cache, or
    16 KB of shared memory and 48 KB of L1 cache
  A 768 KB L2 cache, which serves all load, store and texture requests
  A global (device) RAM memory used to store all data

Kepler micro-architecture: design goals
Introduces dynamic parallelism
  Kernels (calls to GPU functions) can now generate new work, creating a new grid of blocks that is subsequently sent for execution (see the sketch below)
  This not only mitigates kernel call overheads, but also frees the CPU, allowing it to perform other tasks (task-level parallelism)
  It also requires the Grid Management Unit to be able to suspend the execution of ongoing work
Provides support for multiple work queues, allowing different CPU cores to issue commands to the GPU on different CUDA streams
Allows data transfers between different GPU devices without going through the CPU/system memory
Improves the performance per watt
  Design of a new streaming multiprocessor architecture (now named SMX)
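A minimal sketch of dynamic parallelism (the kernel names are illustrative; this requires compute capability 3.5+ and compilation with relocatable device code, e.g., nvcc -rdc=true): a parent kernel launches a child grid directly from the GPU, and the parent grid is only considered complete once all of its child grids have completed.

__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parentKernel(float *data, int n)
{
    // A single thread decides, at run time, how much new work to generate and
    // launches it as a child grid without returning control to the CPU.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, n);
        // The parent grid implicitly waits for its child grids to finish
        // before it is reported as complete to the host.
    }
}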

Kepler micro-architecture: redesigned Streaming Multiprocessor (SMX)
Large increase in the number of ALUs:
  192 single-precision (or integer) CUDA cores (from 32 in Fermi)
  64 double-precision FP units (from 16 in Fermi)
  32 SFUs (from 4 in Fermi)
  32 LD/ST units (from 16 in Fermi)
Each dispatch port feeds 16 functional units, so warps are executed in 2 groups of 16 threads (half-warps)
Execution units now operate at the standard SM frequency (no longer at twice the base clock, as in Fermi)
The increase in the number of ALUs results in a roughly linear increase in power consumption; however, halving the core operating frequency reduces power consumption by about 8x
  Power increases roughly linearly with area, but decreases cubically with operating frequency (see the note below)
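The 8x figure follows from the usual CMOS dynamic-power model, under the simplifying assumption that the supply voltage can be scaled roughly in proportion to the clock frequency (leakage and voltage-scaling limits are ignored here):

P_{\mathrm{dyn}} \approx \alpha\, C\, V^{2} f, \qquad V \propto f
\;\;\Rightarrow\;\;
\frac{P_{\mathrm{dyn}}(f/2)}{P_{\mathrm{dyn}}(f)}
  = \left(\frac{V/2}{V}\right)^{2} \cdot \frac{f/2}{f}
  = \frac{1}{8}

while doubling the number of execution units roughly doubles the switched capacitance C, and hence only doubles power.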

Kepler micro-architecture: redesigned Streaming Multiprocessor (SMX)
Four warp schedulers and 8 dispatch units
  Each warp scheduler now selects one warp per cycle and issues up to 2 instructions (in order) from that warp
  Double-precision instructions can now be paired with other instructions; however, they still require twice the register file bandwidth
    A double-precision instruction probably still uses 2 dispatch units
  Warp scheduling is aided by compiler information, which provides the warp scheduler with the latency for a result to be generated

Kepler micro-architecture: improved ISA
New ISA:
  Each thread can now use up to 255 registers
    Reduces the number of register spills to memory
  New shuffle instruction that allows exchanging thread values within a warp (see the sketch below)
  New and improved atomic operations: atomicMin, atomicMax, atomicAnd, atomicOr, atomicXor
  Improvements in texture memory to improve performance
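A minimal sketch of a warp-level sum using the shuffle instruction (in current CUDA the synchronized variant __shfl_down_sync is used; the kernel name and the use of atomicAdd to combine warp results are illustrative, not from the slides):

// Each warp reduces its 32 values without touching shared memory, using the
// shuffle instruction to read registers of other threads in the same warp.
__global__ void warpSumKernel(const float *in, float *sum, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (idx < n) ? in[idx] : 0.0f;

    // Tree reduction within the warp: after 5 steps, lane 0 holds the warp's sum.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    if ((threadIdx.x & 31) == 0)       // first lane of each warp
        atomicAdd(sum, v);             // combine warp results in global memory
}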

Kepler micro-architecture: memory hierarchy
Improvements in the memory hierarchy:
  Introduction of a new 48 KB cache for constant (read-only) data
    In Fermi, this data could only be cached through the texture path
    Supports unaligned memory accesses
  The shared/L1 memory now supports an additional configuration (see the sketch below):
    16 KB shared memory + 48 KB L1 cache
    32 KB shared memory + 32 KB L1 cache
    48 KB shared memory + 16 KB L1 cache
  Double the data access bandwidth:
    From 128 B/clock in Fermi (16 LD/ST units)
    To 256 B/clock in Kepler (to support 32 LD/ST units)
  2x increase in the L2 cache: from 768 KB (Fermi) to 1536 KB (Kepler)
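A brief sketch of how the shared-memory/L1 split is selected from the CUDA runtime (myKernel is a placeholder; on architectures without a configurable split these calls act only as hints):

#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void configureCaches()
{
    // Device-wide preference: favor the larger shared-memory partition (48 KB).
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    // Per-kernel preference: favor the larger L1 cache (48 KB) for this kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    // Kepler also allows the balanced 32 KB / 32 KB split:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);
}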

Kepler micro-architecture: comparison

NVIDIA Maxwell micro-architecture
The streaming multiprocessor was redesigned once again; it is named SM in Fermi, SMX in Kepler and SMM in Maxwell
The new SMM was redesigned to optimize resource usage; hence, the Maxwell SMM features a reduced number of 128 CUDA cores (vs. 192 in Kepler)
The L2 cache increased to 2 MB, distributed in 4 blocks of 512 KB (one per Graphics Processing Cluster)

Maxwell micro-architecture: Streaming Multiprocessor architecture (SMM)
The SM was redesigned to simplify control:
  While in previous designs the schedulers shared execution resources, each scheduler now has dedicated execution units
  This greatly simplifies control
  It requires additional hardware resources for the execution pipeline, which are now generally less utilized
  However, it likely shortens the critical path, allowing for an increase in operating frequency

Maxwell micro-architecture: warp scheduling
Several improvements related to the warp scheduling mechanisms:
  Stalls further down the pipeline are marked in the scoreboard (unlike in Fermi)
  This suggests a more centralized mechanism that is able to predict whether selecting a given warp (which has all operands ready) would result in a pipeline stall

NVIDIA Pascal

AMD GPU micro-architectures

AMD GPUs: recent history

AMD GPUs: VLIW4 SIMD vs. Quad SIMD

AMD GPUs: GCN Compute Unit

Architecture comparison