Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units


Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring 2018 17 May 2018

New Course: Bachelor's Seminar in Comp Arch Fall 2018 2 credit units Rigorous seminar on fundamental and cutting-edge topics in computer architecture Critical presentation, review, and discussion of seminal works in computer architecture We will cover many ideas & issues, analyze their tradeoffs, perform critical thinking and brainstorming Participation, presentation, report and review writing Stay tuned for more information 2

Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures Pipelining Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, Out-of-Order Execution Other Execution Paradigms 3

Readings for Today Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro 1996. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008. 4

Other Approaches to Concurrency (or Instruction Level Parallelism)

Approaches to (Instruction-Level) Concurrency Pipelining Out-of-order execution Dataflow (at the ISA level) Superscalar Execution VLIW Fine-Grained Multithreading SIMD Processing (Vector and array processors, GPUs) Decoupled Access Execute Systolic Arrays 6

SIMD Processing: Exploiting Regular (Data) Parallelism

Recall: Flynn's Taxonomy of Computers Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966 SISD: Single instruction operates on single data element SIMD: Single instruction operates on multiple data elements Array processor Vector processor MISD: Multiple instructions operate on single data element Closest form: systolic array processor, streaming processor MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) Multiprocessor Multithreaded processor 8

Recall: SIMD Processing Single instruction operates on multiple data elements In time or in space Multiple processing elements Time-space duality Array processor: Instruction operates on multiple data elements at the same time using different spaces Vector processor: Instruction operates on multiple data elements in consecutive time steps using the same space 9

Recall: Array vs. Vector Processors Instruction Stream: LD VR ← A[3:0]; ADD VR ← VR, 1; MUL VR ← VR, 2; ST A[3:0] ← VR (figure: Array processor: same op at the same time across different spaces; time step 1: LD0 LD1 LD2 LD3, step 2: AD0 AD1 AD2 AD3, step 3: MU0 MU1 MU2 MU3, step 4: ST0 ST1 ST2 ST3. Vector processor: same op in the same space across consecutive time steps, with different ops overlapping in time; step 1: LD0, step 2: LD1 AD0, step 3: LD2 AD1 MU0, step 4: LD3 AD2 MU1 ST0, ...) 10

Recall: Memory Banking Memory is divided into banks that can be accessed independently; banks share address and data buses (to minimize pin cost) Can start and complete one bank access per cycle Can sustain N parallel accesses if all N go to different banks (figure: Bank 0, Bank 1, Bank 2, ..., Bank 15, each with its own MDR and MAR, connected to the CPU over shared data and address buses) Picture credit: Derek Chiou 11

Some Issues Stride and banking As long as they are relatively prime to each other and there are enough banks to cover bank access latency, we can sustain 1 element/cycle throughput Storage of a matrix Row major: Consecutive elements in a row are laid out consecutively in memory Column major: Consecutive elements in a column are laid out consecutively in memory You need to change the stride when accessing a row versus column 12

Matrix Multiplication A and B, both in row-major order (figure: A is 4x6, B is 6x10, C is 4x10) Dot products of rows and columns of A and B A: Load A0 into vector register V1 Each time, increment address by one to access the next column Accesses have a stride of 1 B: Load B0 into vector register V2 Each time, increment address by 10 Accesses have a stride of 10 Different strides can lead to bank conflicts How do we minimize them? 13

Minimizing Bank Conflicts More banks Better data layout to match the access pattern Is this always possible? Better mapping of address to bank E.g., randomized mapping Rau, "Pseudo-randomly interleaved memory," ISCA 1991. 14

Recall: Questions (II) What if vector data is not stored in a strided fashion in memory? (irregular memory access to a vector) Idea: Use indirection to combine/pack elements into vector registers Called scatter/gather operations 15

Gather/Scatter Operations Want to vectorize loops with indirect accesses: for (i=0; i<N; i++) A[i] = B[i] + C[D[i]] Indexed instruction (Gather) LV vD, rD # Load indices in D vector LVI vC, rC, vD # Load indirect from rC base LV vB, rB # Load B vector ADDV.D vA,vB,vC # Do add SV vA, rA # Store result 16

Gather/Scatter Operations Gather/scatter operations often implemented in hardware to handle sparse vectors (matrices) Vector loads and stores use an index vector which is added to the base register to generate the addresses Scatter example:
Index Vector  Data Vector (to Store)  Stored Vector (in Memory)
0             3.14                    Base+0  3.14
2             6.5                     Base+1  X
6             71.2                    Base+2  6.5
7             2.71                    Base+3  X
                                      Base+4  X
                                      Base+5  X
                                      Base+6  71.2
                                      Base+7  2.71
17

Array vs. Vector Processors, Revisited Array vs. vector processor distinction is a purist's distinction Most modern SIMD processors are a combination of both They exploit data parallelism in both time and space GPUs are a prime example we will cover in a bit more detail 18

Recall: Array vs. Vector Processors (repeat of slide 10) Array processor: same op at the same time across different spaces. Vector processor: same op in the same space across consecutive time steps. 19

Vector Instruction Execution VADD A,B → C Execution using one pipelined functional unit vs. execution using four pipelined functional units (figure: with one unit, one element enters the pipeline per cycle, completing C[0], C[1], C[2], ... in time; with four units, four elements enter per cycle, completing C[0]–C[3], C[4]–C[7], C[8]–C[11], ... across space) Slide credit: Krste Asanovic 20

Vector Unit Structure (figure: vector registers are partitioned across four lanes; lane 0 holds elements 0, 4, 8, ..., lane 1 holds elements 1, 5, 9, ..., lane 2 holds elements 2, 6, 10, ..., lane 3 holds elements 3, 7, 11, ...; each lane has its own functional units and connection to the memory subsystem) Slide credit: Krste Asanovic 21

Vector Instruction Level Parallelism Can overlap execution of multiple vector instructions Example machine has 32 elements per vector register and 8 lanes Completes 24 operations/cycle while issuing 1 vector instruction/cycle (figure: load, multiply, and add instructions issue one per cycle and overlap across the Load Unit, Multiply Unit, and Add Unit) Slide credit: Krste Asanovic 22

Automatic Code Vectorization Scalar Sequential Code: for (i=0; i < N; i++) C[i] = A[i] + B[i]; Vectorized Code (figure: in the scalar version, iterations 1, 2, ... each perform load, load, add, store in sequence; in the vectorized version, the two loads, the add, and the store each become one vector instruction covering all iterations) Vectorization is a compile-time reordering of operation sequencing ⇒ requires extensive loop dependence analysis Slide credit: Krste Asanovic 23

Vector/SIMD Processing Summary Vector/SIMD machines are good at exploiting regular data-level parallelism Same operation performed on many data elements Improve performance, simplify design (no intra-vector dependencies) Performance improvement limited by vectorizability of code Scalar operations limit vector machine performance Remember Amdahl's Law CRAY-1 was the fastest SCALAR machine at its time! Many existing ISAs include (vector-like) SIMD operations Intel MMX/SSE/AVX, PowerPC AltiVec, ARM Advanced SIMD 24

SIMD Operations in Modern ISAs

SIMD ISA Extensions Single Instruction Multiple Data (SIMD) extension instructions Single instruction acts on multiple pieces of data at once Common application: graphics Perform short arithmetic operations (also called packed arithmetic) For example: add four 8-bit numbers Must modify ALU to eliminate carries between 8-bit values padd8 $s2, $s0, $s1 (figure: bit positions 31–24, 23–16, 15–8, 7–0; $s0 holds a3 a2 a1 a0 and $s1 holds b3 b2 b1 b0; $s2 receives a3+b3, a2+b2, a1+b1, a0+b0) 26

Intel Pentium MMX Operations Idea: One instruction operates on multiple data elements simultaneously À la array processing (yet much more limited) Designed with multimedia (graphics) operations in mind No VLEN register Opcode determines data type: 8 8-bit bytes 4 16-bit words 2 32-bit doublewords 1 64-bit quadword Stride is always equal to 1. Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996. 27

MMX Example: Image Overlaying (I) Goal: Overlay the human in image 1 on top of the background in image 2 for (i=0; i<image_size; i++) if (x[i] == Blue) new_image[i] = y[i]; else new_image[i] = x[i]; Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996. 28

MMX Example: Image Overlaying (II) Y = Blossom image X = Woman's image Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996. 29

GPUs (Graphics Processing Units)

GPUs are SIMD Engines Underneath The instruction pipeline operates like a SIMD pipeline (e.g., an array processor) However, the programming is done using threads, NOT SIMD instructions To understand this, let's go back to our parallelizable code example But, before that, let's distinguish between Programming Model (Software) vs. Execution Model (Hardware) 31

Programming Model vs. Hardware Execution Model Programming Model refers to how the programmer expresses the code E.g., Sequential (von Neumann), Data Parallel (SIMD), Dataflow, Multi-threaded (MIMD, SPMD), ... Execution Model refers to how the hardware executes the code underneath E.g., Out-of-order execution, Vector processor, Array processor, Dataflow processor, Multiprocessor, Multithreaded processor, ... Execution Model can be very different from the Programming Model E.g., von Neumann model implemented by an OoO processor E.g., SPMD model implemented by a SIMD processor (a GPU) 32

How Can You Exploit Parallelism Here? Scalar Sequential Code: for (i=0; i < N; i++) C[i] = A[i] + B[i]; (figure: each iteration is a load, load, add, store sequence) Let's examine three programming options to exploit instruction-level parallelism present in this sequential code: 1. Sequential (SISD) 2. Data-Parallel (SIMD) 3. Multithreaded (MIMD/SPMD) 33

Prog. Model 1: Sequential (SISD) for (i=0; i < N; i++) C[i] = A[i] + B[i]; Can be executed on a: Pipelined processor Out-of-order execution processor Independent instructions executed when ready Different iterations are present in the instruction window and can execute in parallel in multiple functional units In other words, the loop is dynamically unrolled by the hardware Superscalar or VLIW processor Can fetch and execute multiple instructions per cycle 34

Prog. Model 2: Data Parallel (SIMD) for (i=0; i < N; i++) C[i] = A[i] + B[i]; Vectorized Code: VLD A → V1; VLD B → V2; VADD V1 + V2 → V3; VST V3 → C Realization: Each iteration is independent Idea: Programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data Best executed by a SIMD processor (vector, array) 35

Prog. Model 3: Multithreaded for (i=0; i < N; i++) C[i] = A[i] + B[i]; (figure: iterations 1 and 2 run as parallel threads, each doing load, load, add, store) Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) Can be executed on a MIMD machine 36

Prog. Model 3: Multithreaded for (i=0; i < N; i++) C[i] = A[i] + B[i]; Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) This particular model is also called SPMD: Single Program Multiple Data Can be executed on a MIMD machine Can be executed on a SIMD machine Can be executed on a SIMT machine: Single Instruction Multiple Thread 37

A GPU is a SIMD (SIMT) Machine Except it is not programmed using SIMD instructions It is programmed using threads (SPMD programming model) Each thread executes the same code but operates on a different piece of data Each thread has its own context (i.e., can be treated/restarted/executed independently) A set of threads executing the same instruction are dynamically grouped into a warp (wavefront) by the hardware A warp is essentially a SIMD operation formed by hardware! 38

SPMD on SIMT Machine for (i=0; i < N; i++) C[i] = A[i] + B[i]; (figure: Warp 0 at PC X: load; Warp 0 at PC X+1: load; Warp 0 at PC X+2: add; Warp 0 at PC X+3: store) Warp: A set of threads that execute the same instruction (i.e., at the same PC) Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) This particular model is also called SPMD: Single Program Multiple Data A GPU executes it using the SIMT model: Single Instruction Multiple Thread 39

Graphics Processing Units SIMD not Exposed to Programmer (SIMT)

SIMD vs. SIMT Execution Model SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs [VLD, VLD, VADD, VST], VLEN SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps [LD, LD, ADD, ST], NumThreads Two Major SIMT Advantages: Can treat each thread separately → i.e., can execute each thread independently (on any type of scalar pipeline) → MIMD processing Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing 41

Multithreading of Warps for (i=0; i < N; i++) C[i] = A[i] + B[i]; Assume a warp consists of 32 threads If you have 32K iterations, and 1 iteration/thread → 1K warps Warps can be interleaved on the same pipeline → Fine-grained multithreading of warps (figure: e.g., Warp 10 at PC X and Warp 20 at PC X+2 are interleaved in the pipeline, each covering its own block of 32 iterations) 42

Warps and Warp-Level FGMT Warp: A set of threads that execute the same instruction (on different data elements) → SIMT (Nvidia-speak) All threads run the same code Warp: The threads that run lengthwise in a woven fabric (figure: scalar threads W, X, Y, Z share a common PC and form a thread warp; multiple warps, e.g., Thread Warps 3, 7, and 8, feed the SIMD pipeline) 43

High-Level View of a GPU 44

Latency Hiding via Warp-Level FGMT Warp: A set of threads that execute the same instruction (on different data elements) Fine-grained multithreading One instruction per thread in pipeline at a time (No interlocking) Interleave warp execution to hide latencies Register values of all threads stay in register file FGMT enables long latency tolerance Millions of pixels (figure: warps available for scheduling, e.g., Thread Warps 3, 7, and 8, enter the SIMD pipeline of I-Fetch, Decode, RF, ALUs, D-Cache, and Writeback; on a miss, warps accessing the memory hierarchy, e.g., Thread Warps 1, 2, and 6, wait) Slide credit: Tor Aamodt 45

Warp Execution (Recall the Slide) 32-thread warp executing ADD A[tid],B[tid] → C[tid] Execution using one pipelined functional unit vs. execution using four pipelined functional units (figure as on slide 20: one unit completes one element per cycle in time; four units complete C[0]–C[3], C[4]–C[7], C[8]–C[11], ... across space each cycle) Slide credit: Krste Asanovic 46

SIMD Execution Unit Structure (figure: registers are partitioned across four lanes by thread ID; lane 0 holds registers for thread IDs 0, 4, 8, ..., lane 1 for thread IDs 1, 5, 9, ..., lane 2 for thread IDs 2, 6, 10, ..., lane 3 for thread IDs 3, 7, 11, ...; each lane has its own functional unit and connection to the memory subsystem) Slide credit: Krste Asanovic 47

Warp Instruction Level Parallelism Can overlap execution of multiple instructions Example machine has 32 threads per warp and 8 lanes Completes 24 operations/cycle while issuing 1 warp/cycle (figure: warps W0–W5 issue to the Load Unit, Multiply Unit, and Add Unit, overlapping in time) Slide credit: Krste Asanovic 48

SIMT Memory Access Same instruction in different threads uses thread id to index and access different data elements Let's assume N=16, 4 threads per warp → 4 warps (figure: threads 0–15 map one-to-one onto data elements 0–15; Warp 0 adds elements 0–3, Warp 1 elements 4–7, Warp 2 elements 8–11, Warp 3 elements 12–15) Slide credit: Hyesoo Kim 49

Sample GPU SIMT Code (Simplified) CPU code: for (ii = 0; ii < 100000; ++ii) { C[ii] = A[ii] + B[ii]; } CUDA code: // there are 100000 threads __global__ void KernelFunction(...) { int tid = blockDim.x * blockIdx.x + threadIdx.x; int varA = aa[tid]; int varB = bb[tid]; C[tid] = varA + varB; } Slide credit: Hyesoo Kim 50

Sample GPU Program (Less Simplified) Slide credit: Hyesoo Kim 51

Warp-based SIMD vs. Traditional SIMD Traditional SIMD contains a single thread Sequential instruction execution; lock-step operations in a SIMD instruction Programming model is SIMD (no extra threads) → SW needs to know vector length ISA contains vector/SIMD instructions Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., same instruction executed by all threads) Does not have to be lock step Each thread can be treated individually (i.e., placed in a different warp) → programming model not SIMD SW does not need to know vector length Enables multithreading and flexible dynamic grouping of threads ISA is scalar → SIMD operations can be formed dynamically Essentially, it is SPMD programming model implemented on SIMD hardware 52

SPMD Single procedure/program, multiple data This is a programming model rather than computer organization Each processing element executes the same procedure, except on different data elements Procedures can synchronize at certain points in program, e.g. barriers Essentially, multiple instruction streams execute the same program Each program/procedure 1) works on different data, 2) can execute a different control-flow path, at run-time Many scientific applications are programmed this way and run on MIMD hardware (multiprocessors) Modern GPUs are programmed in a similar way on SIMD hardware 53

SIMD vs. SIMT Execution Model SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs [VLD, VLD, VADD, VST], VLEN SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps [LD, LD, ADD, ST], NumThreads Two Major SIMT Advantages: Can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing 54

Threads Can Take Different Paths in Warp-based SIMD Each thread can have conditional control flow instructions Threads can execute different control flow paths (figure: a control-flow graph with basic blocks A–G; threads 1–4 of a warp start at a common PC in A, diverge across blocks B/C and D/E/F, and reconverge at G) Slide credit: Tor Aamodt 55

Control Flow Problem in GPUs/SIMT A GPU uses a SIMD pipeline to save area on control logic Groups scalar threads into warps Branch divergence occurs when threads inside warps branch to different execution paths (figure: after a branch, some threads of the warp take Path A and the others take Path B) This is the same as conditional/predicated/masked execution. Recall the Vector Mask and Masked Vector Operations? Slide credit: Tor Aamodt 56

Remember: Each Thread Is Independent Two Major SIMT Advantages: Can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing If we have many threads We can find individual threads that are at the same PC And, group them together into a single warp dynamically This reduces divergence → improves SIMD utilization SIMD utilization: fraction of SIMD lanes executing a useful operation (i.e., executing an active thread) 57

Dynamic Warp Formation/Merging Idea: Dynamically merge threads executing the same instruction (after branch divergence) Form new warps from warps that are waiting Enough threads branching to each path enables the creation of full new warps (figure: threads from Warp X and Warp Y are combined into a new Warp Z) 58

Dynamic Warp Formation/Merging Idea: Dynamically merge threads executing the same instruction (after branch divergence) (figure: after a branch, threads taking Path A and threads taking Path B are regrouped into merged warps) Fung et al., "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," MICRO 2007. 59

Dynamic Warp Formation Example (figure: control-flow graph with basic blocks A–G and per-warp active masks: A x/1111 y/1111, B x/1110 y/0011, C x/1000 y/0010, D x/0110 y/0001, E x/1110 y/0011, F x/0001 y/1100, G x/1111 y/1111; the baseline executes warps x and y through each block separately, while dynamic warp formation creates a new warp from scalar threads of both warp x and warp y executing at basic block D, shortening total execution time) Slide credit: Tor Aamodt 60

Hardware Constraints Limit Flexibility of Warp Grouping (figure: the lane-partitioned SIMD execution unit again; registers for thread IDs 0, 4, 8, ... sit in lane 0, for thread IDs 1, 5, 9, ... in lane 1, and so on, each lane with its own functional unit and memory subsystem connection) Can you move any thread flexibly to any lane? Slide credit: Krste Asanovic 61

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring 2018 17 May 2018

We did not cover the following slides in lecture. These are for your preparation for the next lecture.

An Example GPU

NVIDIA GeForce GTX 285 NVIDIA-speak: 240 stream processors SIMT execution Generic speak: 30 cores 8 SIMD functional units per core Slide credit: Kayvon Fatahalian 65

NVIDIA GeForce GTX 285 core 64 KB of storage for thread contexts (registers) (figure legend identifies: SIMD functional units with control shared across 8 units, multiply-add units, multiply units, instruction stream decode, and execution context storage) Slide credit: Kayvon Fatahalian 66

NVIDIA GeForce GTX 285 core 64 KB of storage for thread contexts (registers) Groups of 32 threads share instruction stream (each group is a Warp) Up to 32 warps are simultaneously interleaved Up to 1024 thread contexts can be stored Slide credit: Kayvon Fatahalian 67

NVIDIA GeForce GTX 285 (figure: full chip with 30 cores and texture units) 30 cores on the GTX 285: 30,720 threads Slide credit: Kayvon Fatahalian 68

Evolution of NVIDIA GPUs (figure: chart of stream processor counts and GFLOPS for GTX 285 (2009), GTX 480 (2010), GTX 780 (2013), GTX 980 (2014), P100 (2016), and V100 (2017); both stream processor count and GFLOPS grow by more than an order of magnitude over this period) 69

NVIDIA V100 NVIDIA-speak: 5120 stream processors SIMT execution Generic speak: 80 cores 64 SIMD functional units per core Tensor cores for Machine Learning 70

NVIDIA V100 Block Diagram 80 cores on the V100 https://devblogs.nvidia.com/inside-volta/ 71

NVIDIA V100 Core 15.7 TFLOPS Single Precision 7.8 TFLOPS Double Precision 125 TFLOPS for Deep Learning (Tensor cores) https://devblogs.nvidia.com/inside-volta/ 72