COSC 6385 Computer Architecture - Data Level Parallelism (II)


COSC 6385 Computer Architecture - Data Level Parallelism (II), Fall 2013

SIMD Instructions
- Originally developed for multimedia applications
- Same operation executed for multiple data items
- Uses a fixed-length register and partitions the carry chain to allow utilizing the same functional unit for multiple operations
  - E.g. a 64-bit adder can be utilized for two 32-bit add operations simultaneously
- Instructions originally not intended to be used by the compiler, but just for hand-crafting specific operations in device drivers
- All elements in a register have to be on the same memory page to avoid page faults within the instruction
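
A small hedged illustration (not from the original slides): the C sketch below uses the SSE2 intrinsic _mm_add_epi32 so that a single 128-bit SIMD instruction performs four 32-bit additions at once; the array names and contents are invented for the example.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        int a[4] = {1, 2, 3, 4};
        int b[4] = {10, 20, 30, 40};
        int c[4];

        /* Load four 32-bit integers into each 128-bit register */
        __m128i va = _mm_loadu_si128((__m128i const *)a);
        __m128i vb = _mm_loadu_si128((__m128i const *)b);

        /* One SIMD instruction performs four 32-bit additions in parallel */
        __m128i vc = _mm_add_epi32(va, vb);

        _mm_storeu_si128((__m128i *)c, vc);
        for (int i = 0; i < 4; i++)
            printf("%d ", c[i]);    /* prints 11 22 33 44 */
        printf("\n");
        return 0;
    }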

SIMD Instructions (II)
- MMX (Multi-Media Extension) 1996
  - Existing 64-bit floating point registers could be used for eight 8-bit operations or four 16-bit operations
- SSE (Streaming SIMD Extension) 1999
  - Successor to the MMX instructions
  - Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
- SSE2 (2001), SSE3 (2004), SSE4 (2007)
  - Added support for double-precision operations
- AVX (Advanced Vector Extensions) 2010
  - 256-bit registers added

AVX Instructions

AVX Instruction   Description
VADDPD            Add four packed double-precision operands
VSUBPD            Subtract four packed double-precision operands
VMULPD            Multiply four packed double-precision operands
VDIVPD            Divide four packed double-precision operands
VFMADDPD          Multiply and add four packed double-precision operands
VFMSUBPD          Multiply and subtract four packed double-precision operands
VCMPxx            Compare four packed double-precision operands for EQ, NEQ, LT, LTE, GT, GE
VMOVAPD           Move aligned four packed double-precision operands
VBROADCASTSD      Broadcast one double-precision operand to four locations in a 256-bit register
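
A hedged sketch (not part of the slides): the C snippet below uses AVX intrinsics, which compile to packed instructions such as VMULPD and VADDPD from the table above. Array contents are invented for the example, and the code assumes a compiler flag such as -mavx.

    #include <immintrin.h>   /* AVX intrinsics */
    #include <stdio.h>

    int main(void)
    {
        double x[4] = {1.0, 2.0, 3.0, 4.0};
        double y[4] = {0.5, 0.5, 0.5, 0.5};
        double r[4];

        __m256d vx = _mm256_loadu_pd(x);      /* load four packed doubles */
        __m256d vy = _mm256_loadu_pd(y);
        __m256d va = _mm256_set1_pd(2.0);     /* broadcast a, cf. VBROADCASTSD */

        /* r = a*x + y: one packed multiply (VMULPD) and one packed add (VADDPD) */
        __m256d vr = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);

        _mm256_storeu_pd(r, vr);
        for (int i = 0; i < 4; i++)
            printf("%f\n", r[i]);
        return 0;
    }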

Graphics Processing Units (GPU)
- Hardware in graphics units is similar to vector processors
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
- Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Graphics Processing Units (II)
- Using NVIDIA GPUs as an example
- Basic idea:
  - Heterogeneous execution model: the CPU is the host, the GPU is the device
  - Develop a C-like programming language for the GPU
  - Unify all forms of GPU parallelism as CUDA threads
  - Programming model is Single Instruction Multiple Thread (SIMT)
  - GPU hardware handles thread management, not applications or the OS

Example: Vector Addition
Sequential code:

    int main ( int argc, char **argv )
    {
        int i;
        int A[N], B[N], C[N];
        for ( i=0; i<N; i++ ) {
            C[i] = A[i] + B[i];
        }
        return (0);
    }

CUDA: replace the loop by N threads, each executing one element of the vector add operation.

Example: Vector Addition (II)
- CUDA: replace the loop by N threads, each executing one element of the vector add operation
- Question: How does each thread know which element to execute?
  - threadIdx: each thread has an id which is unique within its thread block; it is of type dim3, which is a struct { int x,y,z; }
  - blockDim: total number of threads in the thread block
  - a thread block can be 1-D, 2-D or 3-D

Example: Vector Addition (III)
Initial CUDA kernel:

    void vecadd ( int *d_a, int *d_b, int *d_c )
    {
        int i = threadIdx.x;
        d_c[i] = d_a[i] + d_b[i];
        return;
    }

- Assuming a 1-D thread block -> only the x-dimension is used
- This code is limited by the maximum number of threads in a thread block
  - CUDA 1.3: 512 threads max.
  - if the vector is longer, we have to create multiple thread blocks
- How does the compiler know which code to compile for the CPU and which one for the GPU?
  - A specifier tells the compiler where the function will be executed -> the compiler can generate code for the corresponding processor
  - Executed on the CPU, called from the CPU (default if not specified):
        __host__ void func(...);
  - CUDA kernel to be executed on the GPU, called from the CPU:
        __global__ void func(...);
  - Code to be executed on the GPU, called from the GPU:
        __device__ void func(...);
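
A small hedged sketch (not part of the original slides) showing the three qualifiers together: a __device__ helper called from a __global__ kernel, plus a __host__ launch function; the function names are invented for illustration.

    // __device__ helper: runs on the GPU, callable only from GPU code
    __device__ int add_one ( int x )
    {
        return x + 1;
    }

    // __global__ kernel: runs on the GPU, launched from the CPU
    __global__ void inc_kernel ( int *d_data )
    {
        int i = threadIdx.x;
        d_data[i] = add_one(d_data[i]);
    }

    // __host__ function: ordinary CPU code (the qualifier could be omitted)
    __host__ void launch_inc ( int *d_data, int n )
    {
        inc_kernel<<<1, n>>>(d_data);   // one thread block with n threads
    }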

Example: Vector Addition (IV)
So the CUDA kernel is in reality:

    __global__ void vecadd ( int *d_a, int *d_b, int *d_c )
    {
        int i = threadIdx.x;
        d_c[i] = d_a[i] + d_b[i];
        return;
    }

Note:
- d_a, d_b, and d_c are in global memory
- int i is in the local memory of the thread

If you have multiple thread blocks:

    __global__ void vecadd ( int *d_a, int *d_b, int *d_c )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        d_c[i] = d_a[i] + d_b[i];
        return;
    }

- blockIdx.x: ID of the thread block that this thread is part of
- blockDim.x: number of threads in a thread block
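
If N is not a multiple of the block size, the last block contains threads with no element to work on. A common pattern (a hedged sketch, not from the original slides) is to round the number of blocks up and guard the access with a bounds check; the kernel below passes n as an extra argument for that purpose.

    __global__ void vecadd_guarded ( int *d_a, int *d_b, int *d_c, int n )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n )                        // threads beyond the end do nothing
            d_c[i] = d_a[i] + d_b[i];
    }

    // Host side: round the grid size up so all n elements are covered
    // int threadsPerBlock = 256;
    // int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;
    // vecadd_guarded<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);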

Using more than one element per thread

    __global__ void vecadd ( int *d_a, int *d_b, int *d_c )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j;

        for ( j=i*numelements; j<(i+1)*numelements; j++ )
            d_c[j] = d_a[j] + d_b[j];
        return;
    }

NVIDIA Instruction Set Architecture
- Parallel Thread Execution (PTX) is an abstraction of the hardware instruction set
  - Uses virtual registers
  - Translation to machine code is performed in software
- Example for one iteration of a loop executing y[i] = a*x[i] + y[i] with a block size of 512 threads per block:

    shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512)
    add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (a)
    add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
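
For reference, a hedged CUDA C sketch of the DAXPY loop body that the PTX above corresponds to (one element per thread, block size 512); the variable names are chosen for the example and are not from the slides.

    __global__ void daxpy ( double a, double *x, double *y )
    {
        int i = blockIdx.x * 512 + threadIdx.x;   // blockDim.x is 512 here
        y[i] = a * x[i] + y[i];
    }

    // Launched e.g. as: daxpy<<<n/512, 512>>>(a, d_x, d_y);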

Conditional Branching
- Branch hardware uses internal masks
- Branch synchronization stack to support nested branch instructions
  - Entries consist of masks for each SIMD lane (CUDA thread)
  - Instruction markers manage when a branch diverges into multiple execution paths
    - push on a divergent branch
  - ...and when paths converge
    - act as barriers
    - pop the stack
- For equal-length IF-ELSE conditions, the code will operate at 50% efficiency
  - either the IF or the ELSE part is not executing

Nvidia GT200
- The GT200 is a multi-core chip with a two-level hierarchy
  - focuses on high throughput on data-parallel workloads
- 1st level of hierarchy: 10 Thread Processing Clusters (TPC)
- 2nd level of hierarchy: each TPC has
  - 3 Streaming Multiprocessors (SM) (an SM corresponds to 1 core in a conventional processor)
  - a texture pipeline (used for memory access)
- Global Block Scheduler: issues thread blocks to SMs with available capacity
  - simple round-robin algorithm, but taking resource availability (e.g. of shared memory) into account
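
To make the 50% figure concrete, a hedged sketch (not from the slides): in the kernel below, threads within a warp take different sides of the IF-ELSE depending on their index, so the hardware serializes both paths with masks and each path leaves half of the lanes idle.

    __global__ void divergent ( int *d_data )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if ( i % 2 == 0 )            // even lanes execute this path...
            d_data[i] = d_data[i] * 2;
        else                         // ...odd lanes execute this one
            d_data[i] = d_data[i] + 1;
        // the two paths are serialized within a warp -> ~50% efficiency
    }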

Nvidia GT200
[Figure: GT200 block diagram. Image source: David Kanter, "Nvidia GT200: Inside a Parallel Processor"]

Streaming multi-processor (I)
- Instruction fetch, decode and issue logic
- 8 32-bit ALU units (often referred to as streaming processors (SP), or confusingly called a "core" by Nvidia)
- 8 branch units
  - a thread encountering a branch will stall until it is resolved (no speculation), branch delay: 4 cycles
- Two 64-bit special units for less frequent operations
  - 64-bit operations are 8-12 times slower than 32-bit operations!
- 1 special function unit for unusual instructions
  - transcendental functions, interpolations, reciprocal square roots
  - take anywhere from 16 to 32 cycles to execute

Streaming multi-processor (II)
- Single issue with SIMD capabilities
- Can execute up to 8 thread blocks / 1024 threads concurrently
- Does not support speculative execution or branch prediction
- Instructions are scoreboarded to reduce stalls
- Each SP has access to 2048 register file entries, each 32 bits wide
  - a double-precision number has to utilize two adjacent registers
  - the register file can be used by up to 128 threads concurrently

Streaming multi-processor (III)
[Figure: SM block diagram. Image source: David Kanter, "Nvidia GT200: Inside a Parallel Processor"]

Streaming multi-processor (IV)
- Execution units of an SM run at twice the frequency of the fetch and issue logic as well as memory and registers
- 64KB register file that is partitioned across all SPs
- 16KB shared memory that can be used for communication between the threads running on the SPs of the same SM
  - organized in 4096 entries, 16 banks (= 32-bit bank width)
  - accessing shared memory is as fast as accessing a register!

Load/Store operations
- Generated in SMs, but handled by the SM controller in the TPC
  - the load pipeline shares hardware with the texture pipeline
  - shared by the three SMs
  - mutually exclusive usage of the load and texture pipelines
  - effective address calculation + mapping of 40-bit virtual addresses to physical addresses by the MMU
- Texture cache: 2-D addressing
  - read-only caches without cache coherence
  - entire cache hierarchy invalidated if a data item is modified
  - texture caches are used to save bandwidth and power, not really faster than texture memory
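
A hedged sketch (not part of the slides) of how the per-SM shared memory is used from CUDA: each block stages data in __shared__ memory, synchronizes, and then reads a neighbour's value; the names and block size are chosen for the example.

    #define BLOCK 256

    __global__ void shift_left ( int *d_in, int *d_out )
    {
        __shared__ int buf[BLOCK];            // lives in the SM's shared memory

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = d_in[i];           // each thread loads one element

        __syncthreads();                      // wait until the whole block has loaded

        // read a neighbour's element from fast shared memory
        int next = (threadIdx.x + 1) % BLOCK;
        d_out[i] = buf[next];
    }

    // launched with a block size of BLOCK, e.g. shift_left<<<N/BLOCK, BLOCK>>>(d_in, d_out);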

CUDA Memory Model
[Figure: CUDA memory model]

CUDA Memory Model (II)
- cudaError_t cudaMalloc(void** devPtr, size_t size)
  - Allocates size bytes of device (global) memory pointed to by *devPtr
  - Returns cudaSuccess for no error
- cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  - dst = destination memory address
  - src = source memory address
  - count = bytes to copy
  - kind = type of transfer (HostToDevice, DeviceToHost, DeviceToDevice)
- cudaError_t cudaFree(void* devPtr)
  - Frees memory allocated with cudaMalloc

Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo
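
Since all of these calls return a cudaError_t, a common pattern (a hedged sketch, not from the slides) is to check the return value of every call; the macro name CUDA_CHECK below is invented for illustration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Hypothetical helper macro: abort with a readable message if a CUDA call fails
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
                exit(1);                                                  \
            }                                                             \
        } while (0)

    // Usage:
    // CUDA_CHECK( cudaMalloc((void**)&d_a, N * sizeof(float)) );
    // CUDA_CHECK( cudaMemcpy(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice) );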

Example: Vector Addition (V)

    int main ( int argc, char **argv )
    {
        float a[N], b[N], c[N];
        float *d_a, *d_b, *d_c;

        cudaMalloc( (void**)&d_a, N*sizeof(float));
        cudaMalloc( (void**)&d_b, N*sizeof(float));
        cudaMalloc( (void**)&d_c, N*sizeof(float));

        cudaMemcpy( d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy( d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

        dim3 threadsPerBlock(256);   // 1-D array of threads
        dim3 blocksPerGrid(N/256);   // 1-D grid

        vecadd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

        cudaMemcpy( c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        return 0;
    }

Nvidia Fermi processor
- Next-generation processor from Nvidia
- Removed one level of hierarchy
  - contains 16 SM processors, but no notion of TPCs anymore
- Each SM processor has
  - 32 ALU units (Nvidia "cores", SIMD lanes in the book) compared to 8 on the GT200
    - organized as two units with 16 ALUs each
  - 16 load/store units compared to 1 for three SMs on the GT200
  - 64KB local SRAM that can be split into L1 cache and shared memory (16KB/48KB or 48KB/16KB)
  - 4 special function units compared to 1 in the GT200

Nvidia Fermi SM processor
[Figure: Fermi SM block diagram. Image source: Peter N. Glaskowsky, "Nvidia's Fermi: The First Complete GPU Architecture" (The_First_Complete_GPU_Architecture.pdf)]

Nvidia Fermi processor
- Can manage up to 1,536 threads simultaneously per SM
  - compared to 1,024 per SM on the GT200
- Register file increased to 128KB (32K entries)
- New: modified address space using 40-bit addresses
  - global, shared and local addresses are ranges within that address space
- New: support for atomic read-modify-write operations
- New: support for predicated instructions
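
As a hedged illustration of the atomic read-modify-write support (the kernel itself is not from the slides): a simple count accumulated into a single global counter with atomicAdd, which Fermi-class and later GPUs support on global memory.

    __global__ void count_positive ( const float *d_x, int n, int *d_count )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n && d_x[i] > 0.0f )
            atomicAdd(d_count, 1);   // atomic read-modify-write on global memory
    }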

Similarities and Differences between GPU and Vector Processors
- Memory organization and management
  - GPU: all memory accesses are gather-scatter
    -> special hardware to recognize address coalescing
    -> hides memory latency due to the large number of threads and scoreboarding
  - Vector processor: loading data into a vector register is contiguous by default
    -> special support for gather-scatter operations
    -> costs of load/store operations are amortized due to the large number of elements accessed at once

Similarities and Differences between GPU and Vector Processors (II)
- Processor organization and ISA
  - A vector register holds an entire vector <-> on a GPU the vector is distributed across registers in different ALUs
  - Much higher number of ALUs/threads supported in a GPU than the number of lanes in a vector processor
  - A PTX instruction is similar to a vector instruction
  - Both approaches use mask registers to handle conditional instructions
    -> mask set by the compiler for vector processors
    -> mask set at runtime by hardware for GPUs

Similarities and Differences between GPU and Vector Processors (III)
- A scalar processor executes scalar operations in a vector processor
- A GPU could use the regular CPU for scalar operations
  - high cost of data transfer between GPU and CPU memory
  - scalar code is therefore often executed on the GPU
