COSC 6339 Accelerators in Big Data


Edgar Gabriel
Fall 2018

Motivation
- Programming models such as MapReduce and Spark provide a high-level view of parallelism
  - not easy for all problems, e.g. recursive algorithms, many graph problems, etc.
- How to handle problems that do not have inherent high-level parallelism?
  - sequential processing: time to solution takes too long for large problems
  - exploit low-level parallelism: often groups of very few instructions
- Problem with instruction-level parallelism: costs for exploiting parallelism exceed benefits if using regular threads/processes/tasks

Historic context: SIMD Instructions
- Same operation executed for multiple data items
- Uses a fixed-length register and partitions the carry chain, so that the same functional unit can be used for multiple operations
  - e.g. a 256-bit adder can be used for eight 32-bit add operations simultaneously
- All elements in a register have to be on the same memory page to avoid page faults within the instruction

Comparison of instructions
- Example: add operation of eight 32-bit integers with and without SIMD instructions

Without SIMD:

    LOOP: LOAD  R2, 0(R4)        /* load x(i) */
          LOAD  R0, 0(R6)        /* load y(i) */
          ADD   R2, R2, R0       /* x(i)+y(i) */
          STORE R2, 0(R4)        /* store x(i) */
          ADD   R4, R4, #4       /* increment x */
          ADD   R6, R6, #4       /* increment y */
          BNEQ  R4, R20, LOOP

With SIMD:

          LOAD256  YMM1, 0(R4)        /* loads 256 bits of data */
          LOAD256  YMM2, 0(R6)        /* ditto */
          VADDSP   YMM1, YMM1, YMM2   /* AVX ADD operation */
          STORE256 YMM1, 0(R4)

- Note: these are not the actual Intel assembly instructions and registers
- 3 instructions of the scalar loop are required for managing the loop (i.e. they do not contribute to the actual solution of the problem)
- Branch instructions typically lead to processor stalls, since the processor has to wait for the outcome of the comparison before it can decide which instruction to execute next
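The idea of partitioning the carry chain can be imitated in software on an ordinary 32-bit integer. The following is only a sketch under that analogy, not how a SIMD unit is actually built: it treats one 32-bit word as four 8-bit lanes and stops carries at the lane boundaries. The names add_4x8 and add_4x8_selfcheck are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Adds four unsigned 8-bit lanes packed into one 32-bit word.
 * A carry out of one lane must not spill into its neighbour, so the
 * top bit of every lane is handled separately from the lower 7 bits. */
uint32_t add_4x8(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7F7F7F7Fu;  /* lower 7 bits of every lane */
    const uint32_t top1 = 0x80808080u;  /* top bit of every lane */

    /* Add the lower 7 bits; a carry ends up in the lane's own top bit. */
    uint32_t sum = (a & low7) + (b & low7);

    /* Fold in the top bits with XOR, which drops the carry-out of the lane. */
    return sum ^ ((a ^ b) & top1);
}

/* Quick self-check: 0xFF + 0x01 wraps to 0x00 inside its lane
 * and leaves the neighbouring lanes untouched. */
void add_4x8_selfcheck(void)
{
    assert(add_4x8(0x00FF0000u, 0x00010000u) == 0x00000000u);
}
```

A single call performs four independent additions with one pass through the 32-bit adder, which is the same trade a 256-bit SIMD adder makes for eight 32-bit values.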

SIMD Instructions
- MMX (Multi-Media Extension)
  - Existing 64-bit floating point registers could be used for eight 8-bit operations or four 16-bit operations
- SSE (Streaming SIMD Extension), 1999
  - Successor to the MMX instructions
  - Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
- SSE2 (2001), SSE3 (2004), SSE4 (2007)
  - Added support for double precision operations
- AVX (Advanced Vector Extensions)
  - 256-bit registers added
- AVX-512
  - 512-bit registers added

Graphics Processing Units (GPU)
- Hardware in graphics units is similar to SIMD units
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
- Using NVIDIA GPUs as an example

Graphics Processing Units (II)
- Basic idea:
  - Heterogeneous execution model: the CPU is the host, the GPU is the device
  - Develop a C-like programming language for the GPU
  - Unify all forms of GPU parallelism as CUDA threads
  - Programming model is Single Instruction Multiple Threads (SIMT)
- GPU hardware handles thread management, not applications or the OS

Example: Vector Addition
- Sequential code:

    #define N 1024   /* vector length (example value) */

    int main ( int argc, char **argv )
    {
        int A[N], B[N], C[N];
        int i;

        /* A and B are assumed to be filled with input data here */
        for ( i=0; i<N; i++ ) {
            C[i] = A[i] + B[i];
        }
        return (0);
    }

- CUDA: replace the loop by N threads, each executing one element of the vector add operation

Example: Vector Addition (II)
- CUDA: replace the loop by N threads, each executing one element of the vector add operation
- Question: how does each thread know which element to process?
  - threadIdx: each thread has an ID which is unique within the thread block
    - it has the fields x, y and z, conceptually struct { int x,y,z; } (CUDA declares threadIdx as uint3; dimensions such as blockDim are of type dim3)
  - blockDim: total number of threads in the thread block
    - a thread block can be 1-D, 2-D or 3-D

Example: Vector Addition (III)
- Initial CUDA kernel:

    void vecadd ( int *d_a, int *d_b, int *d_c )
    {
        int i = threadIdx.x;
        d_c[i] = d_a[i] + d_b[i];
        return;
    }

- Assuming a 1-D thread block -> only the x-dimension is used
- This code is limited by the maximum number of threads in a thread block
  - there is an upper limit on the number of threads in one block
  - if the vector is longer, we have to create multiple thread blocks

How does the compiler know which code to compile for the CPU and which for the GPU?
- A specifier tells the compiler where a function will be executed -> the compiler can generate code for the corresponding processor
  - Executed on the CPU, called from the CPU (default if not specified):

        __host__ void func(...);

  - CUDA kernel to be executed on the GPU, called from the CPU:

        __global__ void func(...);

  - CUDA function to be executed on the GPU, called from the GPU:

        __device__ void func(...);

Example: Vector Addition (IV)
- So the CUDA kernel is in reality:

    __global__ void vecadd ( int *d_a, int *d_b, int *d_c )
    {
        int i = threadIdx.x;
        d_c[i] = d_a[i] + d_b[i];
        return;
    }

- Note:
  - d_a, d_b, and d_c are in global memory
  - int i is in the local memory of the thread

If you have multiple thread blocks

    __global__ void vecadd ( int *d_a, int *d_b, int *d_c )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        d_c[i] = d_a[i] + d_b[i];
        return;
    }

- blockIdx.x: ID of the thread block that this thread is part of
- blockDim.x: number of threads in a thread block

Using more than one element per thread

    /* numelements: number of elements handled by each thread,
       assumed to be defined elsewhere */
    __global__ void vecadd ( int *d_a, int *d_b, int *d_c )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j;

        for ( j=i*numelements; j<(i+1)*numelements; j++ )
            d_c[j] = d_a[j] + d_b[j];
        return;
    }
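The index arithmetic of both kernels can be checked on the host in plain C. The sketch below is an emulation, not CUDA code: two nested loops stand in for the blocks and threads that the GPU would run in parallel. The names global_index and vecadd_emulated, the explicit vector length n, and the bounds check are additions made for this illustration; the kernels on the slide assume the vector length matches the launch configuration exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in a 1-D grid. */
size_t global_index(size_t block_idx, size_t block_dim, size_t thread_idx)
{
    assert(thread_idx < block_dim);   /* a thread ID is unique within its block */
    return block_idx * block_dim + thread_idx;
}

/* Host-side emulation of a 1-D grid in which every "thread" handles
 * num_elements consecutive vector elements. */
void vecadd_emulated(const int *a, const int *b, int *c, size_t n,
                     size_t grid_dim, size_t block_dim, size_t num_elements)
{
    for (size_t blk = 0; blk < grid_dim; blk++) {
        for (size_t thr = 0; thr < block_dim; thr++) {
            size_t i = global_index(blk, block_dim, thr);
            for (size_t j = i * num_elements; j < (i + 1) * num_elements; j++) {
                if (j < n)            /* guard: skip indices past the end */
                    c[j] = a[j] + b[j];
            }
        }
    }
}
```

For example, 2 blocks of 2 threads with 3 elements per thread cover indices 0 to 11, so a vector of length 10 is processed completely and the guard discards the two surplus indices.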

Nvidia GT200
- A GT200 is a multi-core chip with a two-level hierarchy
  - focuses on high throughput on data-parallel workloads
- 1st level of the hierarchy: 10 Thread Processing Clusters (TPC)
- 2nd level of the hierarchy: each TPC has
  - 3 Streaming Multiprocessors (SM) (an SM corresponds to 1 core in a conventional processor)
  - a texture pipeline (used for memory access)
- Global Block Scheduler:
  - issues thread blocks to SMs with available capacity
  - simple round-robin algorithm, but taking resource availability (e.g. of shared memory) into account

Nvidia GT200
Image source: David Kanter, Nvidia GT200: Inside a Parallel Processor

Streaming multi-processor (I)
- Instruction fetch, decode and issue logic
- 8 32-bit ALU units (often referred to as Streaming Processors (SP), or confusingly called cores by Nvidia)
- 8 branch units: no branch prediction or speculation, branch delay: 4 cycles
- Can execute up to 8 thread blocks / 1024 threads concurrently
- Each SP has access to 2048 register file entries with 32 bits each
  - a double precision number has to use two adjacent registers
  - the register file can be used by up to 128 threads concurrently

CUDA Memory Model

CUDA Memory Model (II)
- cudaError_t cudaMalloc(void** devPtr, size_t size)
  - Allocates size bytes of device (global) memory, pointed to by *devPtr
  - Returns cudaSuccess for no error
- cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  - dst = destination memory address
  - src = source memory address
  - count = bytes to copy
  - kind = type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)
- cudaError_t cudaFree(void* devPtr)
  - Frees memory allocated with cudaMalloc
- Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo

Example: Vector Addition (V)

    int main ( int argc, char **argv )
    {
        int a[N], b[N], c[N];        /* host vectors; int to match the vecadd kernel */
        int *d_a, *d_b, *d_c;        /* device vectors */

        cudaMalloc( &d_a, N*sizeof(int) );
        cudaMalloc( &d_b, N*sizeof(int) );
        cudaMalloc( &d_c, N*sizeof(int) );

        cudaMemcpy( d_a, a, N*sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( d_b, b, N*sizeof(int), cudaMemcpyHostToDevice );

        dim3 threadsperblock(256);   // 1-D array of threads
        dim3 blockspergrid(N/256);   // 1-D grid; assumes N is a multiple of 256

        vecadd<<<blockspergrid, threadsperblock>>>(d_a, d_b, d_c);

        /* copy the result back: destination (host) first, source (device) second */
        cudaMemcpy( c, d_c, N*sizeof(int), cudaMemcpyDeviceToHost );

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        return 0;
    }
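The grid size in the example is computed with an integer division by 256, which truncates: if the vector length is not a multiple of the block size, the last partial block is never launched. A common remedy is to round up, sketched here in plain C. The helper name blocks_per_grid is hypothetical, and a kernel launched with a rounded-up grid would additionally need a check such as "if (i < n)" so that surplus threads do nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Number of thread blocks needed to cover n elements, rounded up. */
size_t blocks_per_grid(size_t n, size_t threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With 256 threads per block, a vector of 1000 elements needs 4 blocks (1024 threads, 24 of them idle), whereas the truncating division would launch only 3 blocks and leave 232 elements unprocessed.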

Nvidia Tesla V100 GPU
- Most recent Nvidia GPU architecture (as of this lecture, Fall 2018)
- Architecture: each V100 contains 6 GPU Processing Clusters (GPCs)
- Each GPC has
  - 7 Texture Processing Clusters (TPCs)
  - 14 Streaming Multiprocessors (SMs)
- Each SM has
  - 64 32-bit Floating Point cores
  - 64 32-bit Integer cores
  - 32 64-bit Floating Point cores
  - 8 Tensor Cores
  - 4 Texture Units

Nvidia V100 Tensor Cores
- Specifically designed to support neural networks
- Designed to execute D = A x B + C for 4x4 matrices
- Operate on FP16 input data with FP32 accumulation
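The operation D = A x B + C on 4x4 matrices can be written out in scalar C to show what a single Tensor Core operation computes. This is a functional sketch only: it uses float for all values because standard C has no portable FP16 type, so it models the FP32 accumulation but not the rounding of the FP16 inputs. The names TILE and tile_mma are made up for this example, and the matrices are stored row-major in flat arrays.

```c
#include <assert.h>

#define TILE 4   /* Tensor Cores work on 4x4 tiles */

/* d = a * b + c for row-major TILE x TILE matrices. */
void tile_mma(const float *a, const float *b, const float *c, float *d)
{
    assert(d != a && d != b);   /* the output must not overwrite an input operand */

    for (int i = 0; i < TILE; i++) {
        for (int j = 0; j < TILE; j++) {
            float acc = c[i * TILE + j];           /* accumulator starts at C */
            for (int k = 0; k < TILE; k++)
                acc += a[i * TILE + k] * b[k * TILE + j];
            d[i * TILE + j] = acc;
        }
    }
}
```

Each output element needs 4 multiplications and 4 additions, so one 4x4 tile operation amounts to 64 multiply-add pairs.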

NVLink
- GPUs traditionally use a PCIe slot for moving data and instructions from CPU to GPU and between GPUs
  - PCIe 3.0 x8: 8 GB/s bandwidth
  - PCIe 3.0 x16: 16 GB/s bandwidth
  - motherboards are often restricted in the number of PCIe lanes they manage, i.e. using multiple PCIe cards will reduce the bandwidth available for each card
- NVLink: high-speed connection between multiple GPUs
  - higher bandwidth per link (25 GB/s) than PCIe
  - V100 supports up to 6 NVLink links per GPU

NVLink2 multi-GPU, no CPU support
Image source: http://images.nvidia.

NVLink2 with CPU support
- only supports IBM Power 9 CPUs at the moment

Other V100 enhancements
- In earlier GPUs all threads of a group (a warp) executed a single instruction
  - A single program counter was used in combination with an active mask that specified which threads of the warp are active at any given point in time
  - Divergent paths (e.g. if-then-else statements) lead to some threads being inactive
- V100 introduces program counters and call stacks per thread
  - Independent thread scheduling allows the GPU to yield execution of any thread
  - A schedule optimizer dynamically determines how to group active threads into SIMT units
  - Threads can diverge and converge at sub-warp granularity

Google Tensor Processing Unit (TPU)
- Google's DNN ASIC
- Coprocessor on the PCIe bus
- Large software-managed scratchpad
  - Scratchpad: high-speed memory (similar to a cache) whose content is controlled by the application instead of the system
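The effect of a single program counter plus an active mask can be emulated on the host. The sketch below is a simplified model of the earlier (pre-V100) scheme, not a description of the actual hardware: a warp of 32 lanes executes "if (x < 0) x = -x; else x = 2*x;" by issuing both paths one after the other, and the mask decides which lanes commit a result. WARP_SIZE and warp_if_else are names chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Executes the if-then-else for one warp and returns how many passes
 * (one per non-empty branch path) had to be issued. */
int warp_if_else(int *x)
{
    /* Evaluate the condition in every lane and record it in the mask. */
    uint32_t then_mask = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (x[lane] < 0)
            then_mask |= (uint32_t)1 << lane;
    }
    uint32_t else_mask = ~then_mask;
    assert((then_mask & else_mask) == 0);   /* every lane takes exactly one path */

    int passes = 0;
    if (then_mask != 0) {                   /* pass over the then-branch */
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if ((then_mask >> lane) & 1u)
                x[lane] = -x[lane];
        passes++;
    }
    if (else_mask != 0) {                   /* pass over the else-branch */
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if ((else_mask >> lane) & 1u)
                x[lane] *= 2;
        passes++;
    }
    return passes;
}
```

When all lanes agree on the condition only one pass is issued; as soon as a single lane diverges, the warp pays for both paths while part of its lanes sit idle in each.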

TPU Microarchitecture
- The matrix multiply unit contains 256x256 ALUs that can perform 8-bit multiply-and-add operations, generating 16-bit products
- Accumulators can be used for updating partial results
- Weights for matrix multiply operations are stored in an off-chip 8 GiB weight memory and provided through the Weight FIFO
- Intermediate results are held in a 24 MiB on-chip unified memory
- Host server sends instructions over the PCIe bus to the TPU
- Programmable DMA transfers data to or from host memory

Google TPU Architecture
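What a single ALU of the matrix multiply unit contributes can be sketched in C: two 8-bit operands are multiplied into a 16-bit product, which is then added to a wider running sum. The 32-bit accumulator width and the use of signed integers are assumptions made for this example (the slide only says that an accumulator holds partial results), and mac_int8 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply-accumulate over n pairs of 8-bit values.
 * Every product fits in 16 bits; the running sum is kept in 32 bits. */
int32_t mac_int8(const int8_t *x, const int8_t *w, size_t n, int32_t acc)
{
    assert(n == 0 || (x != NULL && w != NULL));

    for (size_t i = 0; i < n; i++) {
        /* The largest magnitude is (-128) * (-128) = 16384, within int16_t. */
        int16_t product = (int16_t)((int16_t)x[i] * (int16_t)w[i]);
        acc += product;   /* accumulate the partial result */
    }
    return acc;
}
```

Three products of 127 x 127 already sum to 48387, which no longer fits in 16 bits; this is why the accumulated partial results need more bits than the individual products.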

TPU Instruction Set Architecture
- No program counter
- No branch instructions
- Instructions contain a repeat field
- Very high CPI (10 - 20)

    TPU Instruction            Function
    Read_Host_Memory           Read data from memory
    Read_Weights               Read weights from memory
    MatrixMultiply/Convolve    Multiply or convolve with the data and weights, accumulate the results
    Activate                   Apply activation functions
    Write_Host_Memory          Write result to memory

TPU microarchitecture
- Goal: hide the costs of the other instructions and keep the Matrix Multiply Unit busy
- Systolic array: 2-D collection of arithmetic units that each independently compute a partial result
- Data arrives at the cells from different directions at regular intervals
- Data flows through the array similar to a wave front -> systolic execution
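The wave-front behaviour of a systolic array can be illustrated with a small host-side simulation of y = W x. This is a deliberately simplified model under stated assumptions, not the TPU's actual data path: cell (i, j) holds the weight W[i][j], the input x[j] is assumed to reach row i of column j in cycle i + j, and the partial sum of row i reaches the same cell in that same cycle, so all cells on one anti-diagonal fire together. The sketch only models which cell fires in which cycle, not the registers that pass the data along. DIM and systolic_matvec are names invented for this example.

```c
#include <assert.h>

#define DIM 3   /* the array has DIM x DIM cells */

/* Computes y = W * x (W row-major) cycle by cycle and
 * returns the number of cycles the wave front needs. */
int systolic_matvec(const int *w, const int *x, int *y)
{
    assert(y != x);   /* partial sums must not overwrite the inputs */

    for (int i = 0; i < DIM; i++)
        y[i] = 0;

    int cycles = 0;
    for (int t = 0; t <= 2 * (DIM - 1); t++) {
        /* In cycle t exactly the cells with i + j == t are active. */
        for (int i = 0; i < DIM; i++) {
            int j = t - i;
            if (j >= 0 && j < DIM)
                y[i] += w[i * DIM + j] * x[j];
        }
        cycles++;
    }
    return cycles;
}
```

In this model the wave front needs 2*DIM - 1 cycles to cross the array, and every cell performs exactly one multiply-add per input vector; with a continuous stream of input vectors, a new wave front could start every cycle and keep all cells busy.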

Systolic execution: Matrix-Vector Example
Image source: https://cloud.google.

Systolic execution: Matrix-Matrix Example

    Architecture                 No. of operations per cycle
    CPU                          A few
    CPU w/ vector extensions     Tens - hundreds
    GPU                          Tens of thousands
    TPU                          Up to 128k

Image source: https://cloud.google.

TPU software
- At this point mostly limited to TensorFlow
- Code that is expected to run on the TPU is compiled using an API that can run on GPU, TPU or CPU
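One way to read the "Up to 128k" entry, assuming it counts the multiply and the add of each ALU in the 256x256 matrix multiply unit as two separate operations, is the following arithmetic. The function name peak_ops_per_cycle is hypothetical, and the factor of 2 is exactly that counting assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Peak operations per cycle of a rows x cols array of multiply-add units,
 * counting the multiply and the add as one operation each. */
uint64_t peak_ops_per_cycle(uint32_t rows, uint32_t cols)
{
    assert(rows > 0 && cols > 0);
    return (uint64_t)rows * cols * 2u;
}
```

For a 256x256 array this gives 65536 multiply-add units and 131072 operations per cycle, i.e. 128 x 1024, if every unit is busy in every cycle.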
