EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)


1 EE382N (20): Computer Architecture - Parallelism and Locality, Fall 2011, Lecture 18: GPUs (III)
Mattan Erez, The University of Texas at Austin
EE382N: Principles of Computer Architecture, Fall 2011, Lecture 18 (c) Mattan Erez

2 Make the Compute Core the Focus of the Architecture
- The future of GPUs is programmable processing, so build the architecture around the processor.
- Processors execute computing threads.
- Alternative operating mode specifically for computing.
- Used to be only one kernel at a time.
- The Thread Execution Manager manages thread blocks.
[Figure: G80 block diagram. Host, Input Assembler, Setup / Rstr / ZCull, Vtx / Geom / Pixel Thread Issue, Thread Execution Manager, eight processor clusters each with a Parallel Data Cache, a Texture unit, and an L1, Load/store paths to six L2 and FB partitions, Global Memory.]
Slide credit: David Kirk/NVIDIA and the University of Illinois at Urbana-Champaign

3 Streaming Multiprocessor (SM)
- 8 Streaming Processors (SP)
- 2 Super Function Units (SFU)
- Multi-threaded instruction dispatch
  - Vectors of 32 threads (warps)
  - Up to 16 warps per thread block
  - HW masking of inactive threads in a warp
  - Threads cover latency of texture/memory loads
- 20+ GFLOPS
- 16 KB shared memory
- 32 KB in registers
- DRAM texture and memory access
[Figure: Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, and two SFUs.]

4 Thread Life Cycle in HW
- A kernel is launched on the SPA
  - Kernels are known as grids of thread blocks
- Thread Blocks are serially distributed to all the SMs
  - Potentially more than 1 Thread Block per SM
  - At least 96 threads per block
- Each SM launches Warps of Threads
  - 2 levels of parallelism
- The SM schedules and executes Warps that are ready to run
- As Warps and Thread Blocks complete, resources are freed
  - The SPA can distribute more Thread Blocks
[Figure: the Host launches Kernel 1 and Kernel 2 on the Device as Grid 1 and Grid 2. Grid 1 holds Blocks (0, 0) through (2, 1); Block (1, 1) expands into Threads (0, 0) through (4, 2).]

5 SM Executes Blocks
- Threads are assigned to SMs in Block granularity
  - Up to 8 Blocks to each SM as resources allow
  - An SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) * 3 blocks
    - Or 128 (threads/block) * 6 blocks, etc.
- Threads run concurrently
  - SM assigns/maintains thread IDs
  - SM manages/schedules thread execution
[Figure: Blocks of threads t0 t1 t2 ... tm assigned to SM 0 and SM 1, each with an MT IU, above the Texture L1 and the L2.]

6 Make the Compute Core the Focus of the Architecture
- 1 Grid (kernel) at a time
- 1 thread per SP (in warps of 32 across the SM)
- 1 to 8 Thread Blocks per SM
[Figure: G80 block diagram in compute mode. Host, Input Assembler, Thread Execution Manager, eight processor clusters each with a Parallel Data Cache and a Texture unit, Load/store paths, Global Memory.]

7 Thread Scheduling/Execution
- Each Thread Block is divided into 32-thread Warps
  - This is an implementation decision
- Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?
  - Each Block is divided into 256/32 = 8 Warps
  - There are 8 * 3 = 24 Warps
  - At any point in time, only one of the 24 Warps will be selected for instruction fetch and execution.
[Figure: Block 1 Warps and Block 2 Warps, each with threads t0 t1 t2 ... t31, feeding a Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, and two SFUs.]
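The warp count on this slide can be reproduced with a short calculation. The following is a minimal Python sketch, not anything from the lecture itself; the function names are made up for illustration, and the rounding-up of a partially filled last warp is an assumption that the slide does not state.

```python
WARP_SIZE = 32  # warp width given on the slide (an implementation decision)


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Number of warps a block occupies.

    Assumption (not on the slide): a partially filled last warp
    still takes a whole warp slot, hence the ceiling division.
    """
    return -(-threads_per_block // warp_size)


def warps_per_sm(blocks_per_sm, threads_per_block):
    """Total warps resident on one SM."""
    return blocks_per_sm * warps_per_block(threads_per_block)


if __name__ == "__main__":
    # Slide example: 3 blocks of 256 threads each.
    print(warps_per_block(256))   # 8
    print(warps_per_sm(3, 256))   # 24
```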

8 SM Warp Scheduling
- SM hardware implements zero-overhead Warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - All threads in a Warp execute the same instruction when selected
  - Scoreboard scheduler
- 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G80
  - If one global memory access is needed for every 4 instructions, each Warp supplies 4 * 4 = 16 cycles of work per access
  - A minimum of 13 Warps (200 / 16 = 12.5, rounded up) is needed to fully tolerate a 200-cycle memory latency
[Figure: SM multithreaded Warp scheduler timeline: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96.]
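The 13-warp figure follows from a simple ratio, sketched below in Python. The function name and parameter names are illustrative assumptions; the model is only the one the slide implies (every warp issues a fixed number of instructions between memory accesses, each instruction occupying the issue stage for 4 cycles).

```python
import math


def warps_to_hide_latency(mem_latency_cycles, instrs_per_mem_access,
                          issue_cycles_per_instr=4):
    """Warps needed so that issue slots stay busy during one memory access.

    Each warp contributes instrs_per_mem_access * issue_cycles_per_instr
    cycles of work before it has to wait on memory again.
    """
    work_cycles_per_warp = instrs_per_mem_access * issue_cycles_per_instr
    return math.ceil(mem_latency_cycles / work_cycles_per_warp)


if __name__ == "__main__":
    # Slide example: 200-cycle latency, one access every 4 instructions.
    print(warps_to_hide_latency(200, 4))  # 13
```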

9 SM Instruction Buffer: Warp Scheduling
- Fetch one warp instruction/cycle
  - from the instruction L1 cache
  - into any instruction buffer slot
- Issue one ready-to-go warp instruction/cycle
  - from any warp-instruction buffer slot
  - operand scoreboarding is used to prevent hazards
- Issue selection is based on round-robin/age of warp
- The SM broadcasts the same instruction to the 32 Threads of a Warp
[Figure: SM pipeline: I$ L1, Multithreaded Instruction Buffer, RF, C$ L1, Mem, Operand Select, MAD, SFU.]

10 Scoreboarding
- All register operands of all instructions in the Instruction Buffer are scoreboarded
  - Status becomes ready after the needed values are deposited
  - Prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled Memory/Processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - Allows Memory/Processor ops to proceed in the shadow of Memory/Processor ops
[Figure: issue timeline, where TB = Thread Block and W = Warp. Warps TB1 W1, TB2 W1, TB3 W1, TB3 W2, TB1 W2, and TB1 W3 are interleaved over time, with stalls marked for TB1 W1, TB2 W1, and TB3 W2.]
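One way to picture register scoreboarding is as a set of registers with results still in flight. The Python sketch below is a toy model under that assumption, not the G80 hardware design; the class, its methods, and the register names are all hypothetical.

```python
class Scoreboard:
    """Toy scoreboard: tracks registers that are waiting for a result."""

    def __init__(self):
        # (warp_id, register) pairs whose value has not been deposited yet.
        self.pending = set()

    def can_issue(self, warp_id, srcs, dst):
        """An instruction is cleared when none of its operands are pending."""
        regs = set(srcs) | {dst}
        return all((warp_id, r) not in self.pending for r in regs)

    def issue(self, warp_id, dst):
        """Mark the destination as pending until the result arrives."""
        self.pending.add((warp_id, dst))

    def complete(self, warp_id, dst):
        """The needed value was deposited, so the register is ready again."""
        self.pending.discard((warp_id, dst))


if __name__ == "__main__":
    sb = Scoreboard()
    sb.issue(0, "r1")                             # warp 0 starts a load into r1
    print(sb.can_issue(0, ["r1", "r2"], "r3"))    # False: depends on the load
    print(sb.can_issue(0, ["r4"], "r5"))          # True: independent instruction
    print(sb.can_issue(1, ["r1", "r2"], "r3"))    # True: another warp's registers
    sb.complete(0, "r1")
    print(sb.can_issue(0, ["r1", "r2"], "r3"))    # True: value deposited
```

The second and third checks correspond to the slide's point that issue continues in the shadow of an outstanding operation until a real dependence blocks it.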

11 Granularity and Resource Considerations
- For Matrix Multiplication, should I use 8X8, 16X16 or 32X32 tiles (1 thread per tile element)?
  - For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, each SM can only take up to 8 Blocks, so only 512 threads will go into each SM!
  - For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
  - For 32X32, we have 1024 threads per Block. Not even one can fit into an SM!
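The three tile sizes can be compared with a few lines of Python. This is a sketch using only the two limits quoted on the slide (768 threads and 8 Blocks per SM); the constant and function names are illustrative, and registers or shared memory are deliberately ignored.

```python
MAX_THREADS_PER_SM = 768  # G80 limit quoted on the slide
MAX_BLOCKS_PER_SM = 8     # G80 limit quoted on the slide


def resident_blocks(tile_width):
    """Blocks of tile_width x tile_width threads that fit on one SM."""
    threads_per_block = tile_width * tile_width
    by_threads = MAX_THREADS_PER_SM // threads_per_block  # 0 if a block is too big
    return min(MAX_BLOCKS_PER_SM, by_threads)


def resident_threads(tile_width):
    """Threads resident on one SM for a given tile width."""
    return resident_blocks(tile_width) * tile_width * tile_width


if __name__ == "__main__":
    for width in (8, 16, 32):
        print(width, resident_blocks(width), resident_threads(width))
    # 8 8 512
    # 16 3 768
    # 32 0 0
```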

12 SM Memory Architecture
- Registers in SP
  - 1K total per SP
  - Shared between threads (same per thread in a block)
- Shared memory in SM
  - 16 KB total per SM
  - Shared between blocks
- Global memory
  - Managed by Texture Units: read only
  - Managed by LD/ST ROP units: uncached read/write
[Figure: Blocks of threads t0 t1 t2 ... tm on SM 0 and SM 1, each with an MT IU, above the Texture L1 and the L2. Courtesy: John Nicols, NVIDIA.]

13 SM Register File
- Register File (RF)
  - 32 KB (1 Kword per SP)
  - Provides 4 operands/clock
- TEX pipe can also read/write RF
  - 2 SMs share 1 TEX
- Load/Store pipe can also read/write RF
[Figure: SM pipeline: I$ L1, Multithreaded Instruction Buffer, RF, C$ L1, Mem, Operand Select, MAD, SFU.]

14 Programmer View of Register File
- There are 8192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all Blocks assigned to the SM
  - Once assigned to a Block, the register is NOT accessible by threads in other Blocks
  - Each thread in the same Block only accesses registers assigned to itself
[Figure: the register file partitioned among 4 blocks versus 3 blocks.]

15 Matrix Multiplication Example
- If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM?
  - Each Block requires 10 * 256 = 2560 registers
  - 8192 = 3 * 2560 + change
  - So, three blocks can run on an SM as far as registers are concerned
- How about if each thread increases the use of registers by 1?
  - Each Block now requires 11 * 256 = 2816 registers
  - 8192 < 2816 * 3
  - Only two Blocks can run on an SM, a 1/3 reduction of parallelism!
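The register arithmetic above is an integer division, shown here as a small Python sketch. The function name is an assumption for illustration, and the calculation considers registers only, as the slide does.

```python
SM_REGISTERS = 8192  # registers per SM in G80, as stated in the lecture


def blocks_limited_by_registers(regs_per_thread, threads_per_block,
                                sm_registers=SM_REGISTERS):
    """Whole blocks that fit in the register file (registers only)."""
    regs_per_block = regs_per_thread * threads_per_block
    return sm_registers // regs_per_block


if __name__ == "__main__":
    threads = 16 * 16
    print(blocks_limited_by_registers(10, threads))  # 3 (uses 7680, leaves 512)
    print(blocks_limited_by_registers(11, threads))  # 2 (3 blocks would need 8448)
```

A single extra register per thread crosses the 8192 boundary for the third block, which is why the drop in resident blocks is so abrupt.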

16 More on Dynamic Partitioning
- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each
    - This allows for finer grain threading than traditional CPU threading models.
  - The compiler can trade off between instruction-level parallelism and thread-level parallelism

17 ILP vs. TLP Example
- Assume that a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, each thread uses 10 registers, and global loads take 200 cycles
  - 3 Blocks can run on each SM
- If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
  - Only two Blocks can run on each SM
  - However, one only needs 200/(8*4) = 6.25, rounded up to 7 Warps, to tolerate the memory latency
  - Two Blocks have 16 Warps. The performance can actually be higher!
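The two configurations in this example can be put side by side in Python. This is a rough sketch that combines the register limit and the latency-hiding ratio from the earlier slides; the function and field names are hypothetical, and the 768-thread and 8-Block limits are left out because neither binds here.

```python
import math

SM_REGISTERS = 8192   # registers per SM in G80
WARP_SIZE = 32
ISSUE_CYCLES = 4      # cycles to dispatch one warp instruction in G80
MEM_LATENCY = 200     # global load latency assumed on the slide


def summarize(threads_per_block, regs_per_thread, indep_instrs_per_load):
    """Resident warps versus warps needed to cover one global load."""
    blocks = SM_REGISTERS // (threads_per_block * regs_per_thread)
    resident = blocks * (threads_per_block // WARP_SIZE)
    needed = math.ceil(MEM_LATENCY / (indep_instrs_per_load * ISSUE_CYCLES))
    return {
        "blocks": blocks,
        "resident_warps": resident,
        "warps_needed": needed,
        "latency_hidden": resident >= needed,
    }


if __name__ == "__main__":
    print(summarize(256, 10, 4))
    # {'blocks': 3, 'resident_warps': 24, 'warps_needed': 13, 'latency_hidden': True}
    print(summarize(256, 11, 8))
    # {'blocks': 2, 'resident_warps': 16, 'warps_needed': 7, 'latency_hidden': True}
```

In this simple model both versions have enough warps to cover the load latency, so losing a Block does not by itself cost throughput.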

18 SM Memory Architecture
- Registers in SP
  - 1K total per SP
  - Shared between threads (same per thread in a block)
- Shared memory in SM
  - 16 KB total per SM
  - Shared between blocks
- Global memory
  - Managed by Texture Units: read only
  - Managed by LD/ST ROP units: uncached read/write
[Figure: Blocks of threads t0 t1 t2 ... tm on SM 0 and SM 1, each with an MT IU, above the Texture L1 and the L2. Courtesy: John Nicols, NVIDIA.]

19 Constants
- Immediate address constants
- Indexed address constants
- Constants are stored in DRAM and cached on chip
  - L1 per SM
- A constant value can be broadcast to all threads in a Warp
  - Extremely efficient way of accessing a value that is common for all threads in a Block!
[Figure: SM pipeline: I$ L1, Multithreaded Instruction Buffer, RF, C$ L1, Mem, Operand Select, MAD, SFU.]

20 Textures
- Textures are 2D arrays of values stored in global DRAM
- Textures are cached in L1 and L2
- Read-only access
- Caches are optimized for 2D access:
  - Threads in a warp that follow 2D locality will achieve better memory performance

21 SM Memory Architecture
- Registers in SP
  - 1K total per SP
  - Shared between threads (same per thread in a block)
- Shared memory in SM
  - 16 KB total per SM
  - Shared between blocks
- Global memory
  - Managed by Texture Units: read only
  - Managed by LD/ST ROP units: uncached read/write
[Figure: Blocks of threads t0 t1 t2 ... tm on SM 0 and SM 1, each with an MT IU, above the Texture L1 and the L2. Courtesy: John Nicols, NVIDIA.]

22 Shared Memory
- Each SM has 16 KB of Shared Memory
  - 16 banks of 32-bit words
- CUDA uses Shared Memory as shared storage visible to all threads in a thread block
  - Read and write access
- Not used explicitly for pixel shader programs
  - We dislike pixels talking to each other
[Figure: SM pipeline: I$ L1, Multithreaded Instruction Buffer, RF, C$ L1, Mem, Operand Select, MAD, SFU.]
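The "16 banks of 32-bit words" organization can be illustrated with an address-to-bank mapping. The Python sketch below assumes that consecutive 32-bit words are placed in consecutive banks; the slide gives only the bank count and word size, so the interleaving rule and both function names are assumptions made for illustration.

```python
NUM_BANKS = 16   # from the slide
WORD_BYTES = 4   # 32-bit words, from the slide


def bank_of(byte_address):
    """Bank holding a byte address, assuming word-interleaved banks."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def banks_touched(first_word, stride_words, num_threads=16):
    """Banks hit when thread t reads word first_word + t * stride_words."""
    return [bank_of((first_word + t * stride_words) * WORD_BYTES)
            for t in range(num_threads)]


if __name__ == "__main__":
    print(banks_touched(0, 1))   # [0, 1, 2, ..., 15]: every thread a different bank
    print(banks_touched(0, 16))  # [0, 0, 0, ..., 0]: every thread the same bank
```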
