DATA-LEVEL PARALLELISM IN VECTOR, SIMD AND GPU ARCHITECTURES (PART 2)


Gwendoline Maxwell

1 DATA-LEVEL PARALLELISM IN VECTOR, SIMD AND GPU ARCHITECTURES (PART 2)
Chapter 4; Appendix A (Computer Organization and Design book)

2 OUTLINE
- SIMD Instruction Set Extensions for Multimedia (4.3)
- Graphics Processing Units (4.4)
- Detecting and Enhancing Loop-Level Parallelism (4.5)

3 SIMD EXTENSIONS
- SIMD multimedia extensions grew out of the observation that media applications operate on data types narrower than 32 bits!
  - Pixels (8 bits) and audio samples (8 or 16 bits)
- Partition the wide hardware to handle several smaller operands at little additional cost!
- Like vector ISAs, SIMD instructions operate on vectors of data
- However, SIMD instructions specify fewer operands!
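The partitioning idea can be imitated in software. The sketch below is only an analogy: real SIMD hardware cuts the carry chain inside the adder, whereas this code uses a bit trick to keep carries from crossing lane boundaries. The names `pack`, `unpack`, and `packed_add8` are hypothetical helpers, not part of any instruction set.

```python
LANES = 8                      # eight 8-bit lanes in one 64-bit word
LOW7 = 0x7F7F7F7F7F7F7F7F      # low 7 bits of every lane
HIGH = 0x8080808080808080      # top bit of every lane


def pack(values):
    """Pack eight 8-bit values into one 64-bit integer (lane 0 in the low byte)."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack(word):
    """Split a 64-bit integer back into its eight 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(LANES)]


def packed_add8(a, b):
    """Add eight 8-bit lanes at once; each lane wraps modulo 256 on its own."""
    low = (a & LOW7) + (b & LOW7)    # 7-bit sums: any carry stays in bit 7 of its lane
    return low ^ ((a ^ b) & HIGH)    # add the top bits without carrying into the next lane


pixels = pack([255, 1, 2, 3, 4, 5, 6, 7])
ones = pack([1] * LANES)
print(unpack(packed_add8(pixels, ones)))
```

One wide addition does the work of eight narrow ones, which is the cost argument the slide makes for pixels and audio samples.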

4 SIMD EXTENSIONS
Compared to vector ISAs, SIMD extensions
- Fix the number of data operands in the opcode, while vector ISAs have a vector length register (VLR)
  - Increased number of instructions in SIMD
- Do not support strided access or gather-scatter addressing modes
  - Lower possibility of vectorization
- Do not offer mask registers
- Harder for the compiler to generate SIMD code and more difficult to program in SIMD assembly language!

5 SIMD EXTENSIONS
Implementations
- Intel Multimedia Extensions (MMX) (1996)
  - Repurposed the 64-bit floating-point registers
  - Eight 8-bit integer ops or four 16-bit integer ops simultaneously
- Streaming SIMD Extensions (SSE) (1999)
  - Added separate 128-bit registers
  - Allow sixteen 8-bit, eight 16-bit, or four 32-bit operations simultaneously
  - SSE2, SSE3, and SSE4 added further multimedia instructions
- Advanced Vector Extensions (AVX) (2010)
  - Doubled the width of the registers to 256 bits
These extensions are intended to accelerate carefully written libraries rather than requiring the compiler to generate them.

6 SIMD EXTENSIONS

7 SIMD EXTENSIONS
Why are multimedia SIMD extensions so popular?
- They cost little and are easy to add to a standard arithmetic unit
- Little extra state
- Less memory bandwidth
- Fewer virtual memory problems: fewer operands per transfer, and they are aligned
- Vector architectures had issues with caches!

8 SIMD EXTENSIONS
MIPS SIMD
- 256-bit SIMD instructions
- The suffix 4D indicates FP SIMD instructions that operate on four double-precision operands at once
- Four lanes
- Reuse the floating-point registers as operands for 4D instructions
Example: show the MIPS SIMD code for DAXPY!

9 SIMD EXTENSIONS
        L.D     F0,a        ;load scalar a
        MOV     F1,F0       ;copy a into F1 for SIMD MUL
        MOV     F2,F0       ;copy a into F2 for SIMD MUL
        MOV     F3,F0       ;copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512  ;last address to load
Loop:   L.4D    F4,0(Rx)    ;load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4,F4,F0    ;a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
        L.4D    F8,0(Ry)    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8,F8,F4    ;a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
        S.4D    0(Ry),F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx,Rx,#32   ;increment index to X
        DADDIU  Ry,Ry,#32   ;increment index to Y
        DSUBU   R20,R4,Rx   ;compute bound
        BNEZ    R20,Loop    ;check if done
Not as beneficial as VMIPS! But better than MIPS!
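As a rough model of what the loop above computes, the following sketch processes DAXPY four doubles at a time. It assumes the 512-byte bound means 64 double-precision elements; the function names are hypothetical, and Python lists stand in for the 4D registers.

```python
def daxpy_scalar(a, x, y):
    """Reference DAXPY: one element per loop trip."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def daxpy_simd4(a, x, y, width=4):
    """DAXPY in groups of `width` elements, mirroring the 4D instructions."""
    assert len(x) == len(y) and len(x) % width == 0
    y = list(y)
    trips = 0
    for i in range(0, len(x), width):
        xv = x[i:i + width]                                # L.4D   F4,0(Rx)
        yv = y[i:i + width]                                # L.4D   F8,0(Ry)
        prod = [a * e for e in xv]                         # MUL.4D F4,F4,F0
        y[i:i + width] = [p + e for p, e in zip(prod, yv)]  # ADD.4D, then S.4D
        trips += 1
    return y, trips


x = [float(i) for i in range(64)]    # 64 doubles = 512 bytes
y = [1.0] * 64
result, trips = daxpy_simd4(2.0, x, y)
print(trips, result == daxpy_scalar(2.0, x, y))
```

The SIMD version takes a quarter of the loop trips of the scalar version, but unlike VMIPS it still needs the loop overhead on every group of four elements.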

10 SIMD EXTENSIONS
Roofline Performance Model [Williams, 2009]
- Compares the floating-point performance of variations of SIMD architectures
- Ties floating-point performance, memory performance, and arithmetic intensity together in one graph
- Arithmetic intensity is the ratio of floating-point operations per byte of memory accessed
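A minimal sketch of the roofline bound follows: attainable performance is the smaller of the peak floating-point rate and the peak memory bandwidth times the arithmetic intensity. The machine numbers (100 GFLOP/s, 50 GB/s) are made up for illustration, and the DAXPY byte count assumes 8-byte doubles with no cache reuse.

```python
def arithmetic_intensity(flops, bytes_accessed):
    """Floating-point operations per byte of memory accessed."""
    return flops / bytes_accessed


def attainable_gflops(peak_gflops, peak_bandwidth_gbs, intensity):
    """Roofline: bounded by compute or by memory, whichever is lower."""
    return min(peak_gflops, peak_bandwidth_gbs * intensity)


# DAXPY per element: 1 multiply + 1 add; load X[i], load Y[i], store Y[i].
daxpy_ai = arithmetic_intensity(flops=2, bytes_accessed=3 * 8)

# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 100.0
PEAK_BW_GBS = 50.0

print(daxpy_ai)
print(attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, daxpy_ai))
```

On this hypothetical machine the ridge point sits at 2 FLOPs per byte, so DAXPY at 1/12 FLOP per byte lands far to the left of it and is limited by memory bandwidth rather than by the arithmetic units.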

11 GPUS INTRODUCTION
- By the end of the last century, graphics on a PC were performed using a video graphics array (VGA) controller, i.e. a memory controller and display generator!
- VGAs evolved to include more advanced graphics functions (shading, texture mapping, ...)!
- By 2000, the term GPU was coined to reflect that the graphics device had become a processor!
  - Programmable processors replaced fixed-function graphics logic
  - More precise arithmetic: from integer up to double precision
- GPUs have become massively parallel programmable processors (100s of cores and 1000s of threads)
- GPUs implement all forms of parallelism: multithreading, MIMD, SIMD, and ILP

12 GPUS INTRODUCTION
- Given the hardware invested to do graphics well, how can it be supplemented to improve the performance of a wider range of applications?
- GPU Computing
  - Using a GPU for computing via a parallel programming language and API, without using the traditional graphics API and graphics pipeline
- Basic idea
  - Heterogeneous execution model: the CPU is the host, the GPU is the device
  - Develop a C-like programming language for the GPU
    - Compute Unified Device Architecture (CUDA)
    - OpenCL as a vendor-independent language
  - Unify all forms of GPU parallelism as the CUDA thread
  - The programming model is Single Instruction, Multiple Thread (SIMT)

13 GPUS-CUDA
- Compute Unified Device Architecture (CUDA) is a scalable parallel programming model for the GPU and parallel processors
- CUDA addresses the challenge of heterogeneous systems and various forms of parallelism
- CUDA produces C/C++ for the system processor (host) and a C and C++ dialect for the GPU (device)

14 GPUS-THREADS AND BLOCKS
- A GPU is simply a multiprocessor system composed of multithreaded SIMD processors
- A thread (CUDA thread) is associated with each data element/iteration
- Threads are organized into thread blocks
  - Up to 512 elements or threads per block
  - Each block executes on a multithreaded SIMD processor
  - 32 elements are executed per SIMD thread at a time
- Blocks are organized into a grid
  - Blocks are executed independently and in any order
  - Different blocks cannot communicate directly but can coordinate using atomic memory operations in global memory
- Thread management is done by the GPU hardware, not by applications or the OS

15 GPUS-THREADS AND BLOCKS
Launch n threads, 256 threads per block

16 GPUS NVIDIA ARCHITECTURE
Example: multiplying two 8192-element vectors
- The code (a for loop in this case) that works on all 8192 elements is the grid (vectorized loop)
- The grid is decomposed into thread blocks (body of the vectorized loop)
  - Each has up to 512 elements
  - Need 8192/512 = 16 blocks
- Assuming that SIMD instructions process 32 elements at a time
  - Each thread block has 512/32 = 16 SIMD threads (warps) of 32 CUDA threads each
  - There is a maximum number of SIMD threads that can execute simultaneously per thread block (16 for Tesla, 32 for Fermi)
- A thread block is assigned to a multithreaded SIMD processor by the thread block scheduler
- Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
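The decomposition in this example is plain integer arithmetic, sketched below. The function names `decompose` and `locate` are hypothetical; the block size of 512 and the warp size of 32 are the values used on the slide.

```python
def decompose(n_elements, block_size=512, warp_size=32):
    """Count thread blocks and SIMD threads (warps) for a grid of n_elements."""
    blocks = (n_elements + block_size - 1) // block_size   # round up
    warps_per_block = block_size // warp_size
    return {
        "blocks": blocks,
        "warps_per_block": warps_per_block,
        "total_warps": blocks * warps_per_block,
    }


def locate(i, block_size=512, warp_size=32):
    """Map element index i to (block, warp within block, lane within warp)."""
    block, within_block = divmod(i, block_size)
    warp, lane = divmod(within_block, warp_size)
    return block, warp, lane


print(decompose(8192))
print(locate(0), locate(8191))
```

For the 8192-element example this gives 16 blocks of 16 warps, and the last element lands in block 15, warp 15, lane 31.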

17 GPUS

18 GPUS NVIDIA ARCHITECTURE
Simplified block diagram of a multithreaded SIMD processor. It has 16 SIMD lanes. The SIMD thread scheduler has, say, 48 independent threads of SIMD instructions that it schedules with a table of 48 PCs.

19 GPUS NVIDIA ARCHITECTURE
- The machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions (warp)
- Each SIMD thread
  - Contains exclusively SIMD instructions
  - Has its own PC
  - Runs on a multithreaded SIMD processor
  - Is independent of other threads
- Threads in a processor are scheduled by the SIMD thread scheduler
  - It has a scoreboard to know which threads of SIMD instructions are ready to run
  - It schedules threads of SIMD instructions
- Hence, two levels of scheduling: the thread block scheduler assigns blocks to SIMD processors, and the SIMD thread scheduler picks SIMD threads within a processor!

20 GPUS NVIDIA ARCHITECTURE
- The scheduler selects a ready thread of SIMD instructions and issues an instruction synchronously to all the SIMD lanes executing the SIMD thread.
- Because threads of SIMD instructions are independent, the scheduler may select a different SIMD thread each time

21 GPUS NVIDIA ARCHITECTURE
- An NVIDIA GPU has 32,768 32-bit registers per SIMD processor
  - Divided across the SIMD lanes
  - Each SIMD thread is limited to 64 registers
    - 64 vector registers of 32 32-bit elements
    - 32 vector registers of 32 64-bit elements
  - Fermi has 16 physical SIMD lanes, each containing 2048 registers
- Registers are dynamically allocated when threads are created and freed when the SIMD thread exits
- Note that a CUDA thread is just a vertical cut of a thread of SIMD instructions, corresponding to one element executed by one SIMD lane.
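The register figures above fit together arithmetically, as the sketch below shows. It uses only numbers from the slides (16 lanes, 2048 registers per lane, 32 elements per SIMD thread, at most 64 registers per thread). The resulting thread counts are an assumption-laden upper bound: the sketch ignores every other hardware limit on resident SIMD threads.

```python
LANES = 16              # physical SIMD lanes per Fermi SIMD processor
REGS_PER_LANE = 2048    # 32-bit registers held by each lane
WARP_SIZE = 32          # elements (CUDA threads) per SIMD thread


def total_registers(lanes=LANES, regs_per_lane=REGS_PER_LANE):
    """Size of the register file of one SIMD processor, in 32-bit registers."""
    return lanes * regs_per_lane


def max_resident_simd_threads(regs_per_cuda_thread, warp_size=WARP_SIZE):
    """Upper bound on SIMD threads that fit, considering registers only."""
    regs_per_simd_thread = regs_per_cuda_thread * warp_size
    return total_registers() // regs_per_simd_thread


print(total_registers())
print(max_resident_simd_threads(64), max_resident_simd_threads(32))
```

If every CUDA thread uses the full 64 registers, only 16 SIMD threads fit in the register file; halving the register use doubles that number, which is why register pressure affects how much multithreading is available to hide memory latency.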

22 GPUS NVIDIA ARCHITECTURE
Terminology Summary
- Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
- Warp: a group of threads executed physically in parallel in G80/GT200
- Block: a group of threads that are executed together and form the unit of resource assignment
- Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect
Mapping Summary
- A grid is broken into thread blocks. Blocks are independent and can execute in any order.
- A thread block consists of CUDA threads, every 32 of which form a warp (SIMD thread)
- Threads in a block execute the same program and are assumed to be independent
- Blocks are identified by blockIdx
- Threads are identified by threadIdx (sequential within a block)

23 GPUS NVIDIA ARCHITECTURE
[Figure: the host launches Kernel 1 and Kernel 2; on the device each kernel runs as a grid (Grid 1, Grid 2) of blocks with two-dimensional IDs such as Block (1, 1); each block runs the thread program on threads with IDs such as Thread (3, 1, 0), grouped into SIMD threads (warps). Courtesy: John Nickolls, NVIDIA]

24 NVIDIA GPUS

25 NVIDIA GPUS PERFORMANCE

26 NVIDIA GPUS
NVIDIA GTX280 Specifications
- 933 GFLOPS peak performance
- 10 thread processing clusters (TPC)
- 3 multiprocessors per TPC
- 8 cores per multiprocessor
- registers per multiprocessor
- 16 KB shared memory per multiprocessor
- 64 KB constant cache per multiprocessor
- 6 to 8 KB texture cache per multiprocessor
- 1.3 GHz clock rate
- Single- and double-precision floating-point calculation
- 1 GB GDDR3 dedicated memory

27 NVIDIA GPUS

28 THE FERMI GPU ARCHITECTURE
Each SIMD processor has
- Two SIMD thread schedulers, two instruction dispatch units
- 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
- Thus, two threads of SIMD instructions are scheduled every two clock cycles
- Fast double precision: higher GFLOPS for DAXPY than the previous generation
- Caches for GPU memory: instruction and data L1 caches per SIMD processor and a shared L2
- 64-bit addressing and a unified address space: supports C/C++ pointers
- Error correcting codes: dependability for long-running apps
- Faster context switching: hardware support, 10X faster
- Faster atomic instructions: 5-20X faster than the previous generation

29 THE FERMI GPU ARCHITECTURE

30 THE FERMI GPU ARCHITECTURE

31 GPUS NVIDIA ISA
- The instruction set target of the NVIDIA compilers is an abstraction of the hardware instruction set
  - Parallel Thread Execution (PTX) provides a stable ISA for compilers. The hardware ISA is hidden!
  - PTX uses virtual registers. The compiler assigns the required physical registers.
- General format of a PTX instruction
    opcode.type d, a, b, c
  - a, b, and c are operands while d is the destination
  - Operands are 32-bit or 64-bit registers or constant values
  - Destination d is a register or memory
- Check p. 299 for PTX instructions!

32 GPUS NVIDIA ISA
PTX code for one CUDA thread in DAXPY
    shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 = 2^9)
    add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    shl.u32        R8, R8, 3          ; byte offset (8 bytes per double)
    ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
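The address arithmetic in the first three PTX instructions can be traced with a short model. This is a sketch, assuming 512 threads per block and 8-byte doubles as the shift amounts imply; the function names are hypothetical, and the nested loops merely stand in for the hardware running all CUDA threads.

```python
BLOCK_SHIFT = 9     # 2**9 = 512 threads per block (the first shl)
DOUBLE_SHIFT = 3    # 2**3 = 8 bytes per double (the second shl)


def cuda_thread_index(block_idx, thread_idx):
    """Return (i, byte_offset) as computed by shl / add / shl."""
    i = (block_idx << BLOCK_SHIFT) + thread_idx
    return i, i << DOUBLE_SHIFT


def daxpy_grid(a, x, y, blocks=16, threads_per_block=512):
    """Run the per-thread DAXPY body once for every (block, thread) pair."""
    y = list(y)
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            i, _ = cuda_thread_index(block_idx, thread_idx)
            y[i] = a * x[i] + y[i]      # mul.f64, add.f64, st.global.f64
    return y


print(cuda_thread_index(3, 5))
```

Thread 5 of block 3 therefore handles element 1541 at byte offset 12328. Every CUDA thread executes the same eight instructions and differs only in its `blockIdx` and `threadIdx` values.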

33 GPUS NVIDIA ISA
Conditional Branching
- GPU hardware executes an instruction for all threads in the same warp before moving to the next instruction (SIMT)
  - Works well when all threads in a warp follow the same control flow path!
- It is not uncommon to have conditional branching within a loop!
  - CUDA threads may take different paths: branch divergence!
- Solution: serialize the execution paths
  - Example: if-then-else is executed in two passes
    - One pass for threads executing the THEN path
    - Second pass for threads executing the ELSE path
  - Merge the threads in the warp once completed

34 GPUS NVIDIA ISA
[Illustration: a warp of CUDA threads reaches a branch; pass 1 executes path A (THEN part), pass 2 executes path B (ELSE part), then the threads merge]

35 GPUS NVIDIA ISA
Implementation
- Hardware
  - Internal masks (just like vector processors)
  - Predicate registers (1 bit per SIMD lane)
  - Branch synchronization stack per SIMD lane (nested IF)
- Instruction markers to control masks (*Comp, *Push, *Pop)
- Lanes are enabled or disabled in each pass based on the 1-bit predicate register values.

36 GPUS NVIDIA ISA
Example
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

            ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
            setp.neq.s32   P1, RD0, #0    ; P1 is predicate register
            @!P1, bra      ELSE1, *Push   ; Push old mask, set new mask bits
                                          ; if P1 false, go to ELSE1
            ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
            sub.f64        RD0, RD0, RD2  ; Difference in RD0
            st.global.f64  [X+R8], RD0    ; X[i] = RD0
            @P1, bra       ENDIF1, *Comp  ; complement mask bits
                                          ; if P1 true, go to ENDIF1
    ELSE1:  ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
            st.global.f64  [X+R8], RD0    ; X[i] = RD0
    ENDIF1: <next instruction>, *Pop      ; pop to restore old mask
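The mask handling in this example can be followed with a small software model. It is a sketch of the idea only, not of the actual hardware: one Python list entry stands for one SIMD lane, a list of saved masks stands for the branch synchronization stack, and the function name is hypothetical.

```python
def masked_if_then_else(X, Y, Z):
    """Run `if X[i] != 0: X[i] -= Y[i] else: X[i] = Z[i]` for one warp with lane masks.

    Returns the new X and the number of passes actually executed.
    """
    X = list(X)
    lanes = range(len(X))
    stack = []
    mask = [True] * len(X)                  # all lanes active on entry
    pred = [X[i] != 0 for i in lanes]       # setp.neq: one predicate bit per lane
    passes = 0

    stack.append(mask)                      # *Push: save the old mask
    mask = [m and p for m, p in zip(stack[-1], pred)]
    if any(mask):                           # THEN pass, skipped if no lane needs it
        passes += 1
        for i in lanes:
            if mask[i]:
                X[i] = X[i] - Y[i]

    mask = [m and not p for m, p in zip(stack[-1], pred)]   # *Comp
    if any(mask):                           # ELSE pass, skipped if no lane needs it
        passes += 1
        for i in lanes:
            if mask[i]:
                X[i] = Z[i]

    mask = stack.pop()                      # *Pop: restore the old mask
    return X, passes


print(masked_if_then_else([1, 0, 5, 0], [1, 1, 1, 1], [9, 9, 9, 9]))
print(masked_if_then_else([1, 2, 3, 4], [1, 1, 1, 1], [9, 9, 9, 9]))
```

The first call diverges and needs two passes; in the second call every lane takes the THEN path, so the ELSE pass is skipped entirely.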

37 GPUS NVIDIA ISA
- It is as if each element has its own program counter!
  - Illusion that each CUDA thread is acting independently.
- Vector compilers could do the same tricks!
  - They need scalar instructions to manipulate mask registers
  - GPUs do it at run time!
- What if all threads take the same path?
  - Optimization! When all mask bits are 0, the THEN part is skipped. Similarly, when the mask bits are all 1, the ELSE part is skipped in all threads.
  - A vector compiler cannot make this decision at compile time, because the mask values are only known at run time!

38 GPUS NVIDIA ISA
Conditional Branching Performance
- How frequently does divergence occur?
- In the best case, all masks are the same! Only the THEN or the ELSE part is executed!
- If at least one CUDA thread diverges, we need two passes!
  - 50% efficiency if the THEN and ELSE parts are of equal length
- With nested IF-THEN-ELSE, the cost is higher!
  - Doubly nested: 25%
  - Triply nested: 12.5%
- Active research area for optimization
- Optimization? Avoid divergence when possible
  - if (threadIdx.x > 2)?
  - if (threadIdx.x / WARP_SIZE > 2)?
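Both the efficiency figures and the two candidate conditions can be checked with a short sketch. It assumes equal-length paths that are all taken by some thread, a warp size of 32, and a one-dimensional block of 256 threads; the function names are hypothetical.

```python
WARP_SIZE = 32


def nested_efficiency(depth):
    """Fraction of useful lane work when every path of `depth` nested IFs is taken."""
    return 1 / 2 ** depth


def divergent_warps(predicate, n_threads, warp_size=WARP_SIZE):
    """Return the indices of warps whose threads disagree on `predicate`."""
    diverging = []
    for start in range(0, n_threads, warp_size):
        stop = min(start + warp_size, n_threads)
        outcomes = {predicate(t) for t in range(start, stop)}
        if len(outcomes) > 1:               # both True and False occur in this warp
            diverging.append(start // warp_size)
    return diverging


print([nested_efficiency(d) for d in (1, 2, 3)])
print(divergent_warps(lambda t: t > 2, 256))
print(divergent_warps(lambda t: t // WARP_SIZE > 2, 256))
```

The first condition splits warp 0, because threads 0 to 2 go one way and threads 3 to 31 the other. The second condition depends only on the warp number, so every warp is uniform and no warp diverges.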

39 GPUS NVIDIA MEMORY STRUCTURE
- Private memory
  - Off-chip; recently, also cached in the L1 and L2 caches
  - For the stack and spilled registers
- Local memory
  - On-chip
  - One per multithreaded processor
  - Shared between the threads in a block
  - Dynamically allocated to blocks
- Global memory
  - Off-chip
  - Shared by all processors
  - Accessed by the host

40 GPUS VS. VECTOR PROCESSORS
Both architectures
- Are designed to execute DLP programs
- Have multiple processors
However, architecturally
- GPUs rely on multithreading! (shallow pipelines)
- GPUs have more registers!
- GPUs have many lanes (8-16 vs. 2-8)
- Memory: VPs have explicit unit-stride loads. In GPUs this is implicit (address coalescing).
- Branches: VPs manage masks explicitly in software. GPUs do that at run time!
- Strip-mining in VPs requires the VLR. GPUs iterate the loop until the last iteration and mask off unused lanes.

41 GPUS VS. VECTOR PROCESSORS
- Control unit
  - In VPs, it handles vector and scalar operations
  - GPUs have no control unit, only the thread block scheduler (less power efficient)
- Scalar processor
  - VPs have a separate, simple scalar processor
  - GPUs have none. They use a single SIMD lane and disable the others rather than using the system processor (less power efficient and slower)

42 GPUS VS. VECTOR PROCESSORS

43 GPUS VS. MULTIMEDIA SIMD PROCESSORS

44 READING ASSIGNMENT
- Section 4.7 Putting It All Together
- Appendix A from the Computer Organization and Design textbook


UNIT III DATA-LEVEL PARALLELISM IN VECTOR, SIMD, AND GPU ARCHITECTURES
Flynn's Taxonomy
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD)
  o Vector