CSCI 402: Computer Architectures
Parallel Processors (2)
Fengguang Song, Department of Computer & Information Science, IUPUI
(Textbook sections 6.6 to end)

Today's Contents
- GPU
- Cluster and its network topology
- The Roofline performance model
- Fallacies, pitfalls
History of GPUs
- In addition to the multicore CPU, the GPU is another type of multiprocessor.
- Early video cards were simple: they used frame buffer memory for video output.
- 3D graphics processing came later, but was only available on high-end computers (e.g., SGI workstations). [Figure: a computer-generated image, 1998]
- Moore's Law => lower cost, higher density. Now 3D graphics cards are on PCs.
- These are called Graphics Processing Units (GPUs): special processors oriented to 3D graphics tasks. They compute vertex and pixel processing, shading, texture mapping, rasterization, etc.

GPUs in the System
[Figure: where the GPU attaches in a system]
GPU Architectures
- The GPU is a highly data-parallel and highly multithreaded processor.
- It uses thread switching to hide memory latency, so it relies much less on multi-level caches.
- Graphics memory is wide and high-bandwidth.
- Current trend: General-Purpose GPUs (GPGPU). We now use heterogeneous CPU/GPU systems: the CPU runs the sequential code, and the GPU runs the parallel code.
- Programming languages/APIs: DirectX, OpenGL; C for Graphics (Cg), High Level Shader Language (HLSL); Compute Unified Device Architecture (CUDA), OpenCL.

GPGPU Example: NVIDIA Tesla
[Figure: Tesla block diagram. Each streaming multiprocessor is a SIMD processor containing 8 streaming processors.]
NVIDIA Volta V100
- In 2018, the NVIDIA Volta V100 delivered very high floating-point and integer performance. Its peak computation rates (based on the GPU Boost clock rate) are:
  - 7.8 TFLOP/s of double-precision floating-point (FP64) performance
  - 15.7 TFLOP/s of single-precision (FP32) performance
  - 125 Tensor-TFLOP/s of mixed-precision matrix-multiply-and-accumulate (for deep learning)

V100 Architecture
- The V100 GPU is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers.
- A full GV100 GPU consists of 6 GPCs, 42 TPCs (each containing two SMs), 84 Volta SMs, and eight 512-bit memory controllers (4096 bits total).
- Each SM has 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, and 8 new Tensor Cores. Each SM also includes four texture units.
- With 84 SMs, a full GV100 GPU has a total of 5376 FP32 cores, 5376 INT32 cores, 2688 FP64 cores, 672 Tensor Cores, and 336 texture units.
- Each memory controller is attached to 768 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory controllers. The full GV100 GPU thus includes a total of 6144 KB of L2 cache.
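These totals follow directly from the per-SM counts; here is a quick sanity check in C (the 1455 MHz boost clock is my assumption based on published V100 specifications, not a number from the slides):

#include <stdio.h>

/* Sanity-check the GV100 totals from the per-SM counts above.
 * Assumption: ~1455 MHz boost clock (typical published V100 spec). */
int main(void)
{
    int sms = 84;
    printf("FP32 cores:    %d\n", sms * 64);   /* 5376 */
    printf("INT32 cores:   %d\n", sms * 64);   /* 5376 */
    printf("FP64 cores:    %d\n", sms * 32);   /* 2688 */
    printf("Tensor Cores:  %d\n", sms * 8);    /* 672  */
    printf("Texture units: %d\n", sms * 4);    /* 336  */
    printf("L2 cache:      %d KB\n", 8 * 768); /* 6144 KB */

    /* Peak FP64: each FP64 core retires one fused multiply-add
     * (2 FLOPs) per cycle. */
    double boost_ghz = 1.455; /* assumed */
    printf("Peak FP64: %.1f TFLOP/s\n",
           sms * 32 * 2 * boost_ghz / 1000.0); /* ~7.8 */
    return 0;
}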
Multicore CPU vs GPU

| Feature                                                      | Multicore with SIMD | GPU          |
|--------------------------------------------------------------|---------------------|--------------|
| Number of SIMD processors                                    | 4-16                | 16-60        |
| # SIMD lanes per processor                                   | 4                   | 16           |
| Hardware multithreading support for SIMD threads (per cycle) | SMT = 2-4           | 32 threads   |
| Typical ratio of single- to double-precision performance     | 2:1                 | 2:1          |
| Largest cache size                                           | 8 MB                | 0.75 MB      |
| Size of memory address                                       | 64-bit              | 64-bit       |
| Size of main memory                                          | up to 256 GB        | 4 GB to 6 GB |
| Memory protection at level of page                           | Yes                 | Yes          |
| Virtual memory                                               | Yes                 | No           |
| Integrated scalar processor with SIMD processor              | Yes                 | No           |
| Cache coherent                                               | Yes                 | No           |

Distributed-Memory Systems and Message Passing
- Each processor (aka compute node) has a private physical address space. This is different from an SMP.
- Hardware sends/receives messages between processors.
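To make message passing concrete, here is a minimal sketch in C using MPI (MPI is not named on the slide; it is simply the standard message-passing library, and the value sent is arbitrary):

/* Minimal message-passing sketch (MPI). Each process owns a private
 * address space; data moves only through explicit send/receive.
 * Build and run: mpicc msg.c -o msg && mpirun -np 2 ./msg */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x = 0.0;
    if (rank == 0) {
        x = 3.14; /* exists only in rank 0's private memory */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);
    }

    MPI_Finalize();
    return 0;
}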
Interconnection Networks
- Different network topologies are possible: a topology is an arrangement of processors, switches, and links.
- Examples: bus, ring, 2D mesh, N-cube (N = 3), fully connected.

How to Model Performance: Common Concepts
- Our performance metric of interest is GFLOP/s (GFLOPs/sec), which can be measured using benchmarks or computational kernels.
- Every computational kernel has an arithmetic intensity: FLOPs per byte of accessed memory (see the worked example below).
- For a given computer, we can determine:
  - Peak GFLOP/s (calculated from the CPU clock rate, or taken from its specification)
  - Peak memory bandwidth (i.e., bytes/sec), measured using the STREAM benchmark: https://www.cs.virginia.edu/stream
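As a worked example (my choice; the slide does not compute one), consider the arithmetic intensity of DAXPY:

/* Worked example (not from the slides): arithmetic intensity of DAXPY,
 * y[i] = a * x[i] + y[i], over n doubles. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i]; /* 2 FLOPs: one multiply, one add */
}
/* Per iteration: 2 FLOPs and 24 bytes of memory traffic
 * (read x[i] + read y[i] + write y[i], 8 bytes each).
 * Arithmetic intensity = 2 / 24 ~ 0.083 FLOPs/byte -- very low,
 * so DAXPY is memory-bandwidth-bound on essentially any machine. */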
The Roofline Performance Model
- Attainable GFLOP/s = min(Peak Memory BW x Arithmetic Intensity, Peak FP Performance)
- The curve has the shape y = min(kx, constant): the arithmetic intensity (x) is program-specific, while the peak memory bandwidth (the slope k) and the peak FP performance (the flat roof) are machine-specific.

Comparing Two Systems: Opteron X2 vs. Opteron X4
- 2 cores vs. 4 cores; 2x the FP performance per core; 2.2 GHz vs. 2.3 GHz.
- Identical memory systems, i.e., the same memory bandwidth.
- Insight: to get higher performance on the X4 than on the X2, you need a high arithmetic intensity, or the working set must fit in the X4's 2 MB L3 cache (to get a higher cache hit rate). A sketch of the calculation follows below.
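A minimal roofline calculation in C. The peak FP numbers are derived from the core counts and clock rates above, assuming 4 FLOPs/cycle/core on the X2 and 8 on the X4; those per-core rates and the 16 GB/s bandwidth are my assumptions, not slide data:

#include <stdio.h>

/* Roofline: attainable performance is the lower of the memory roof
 * (BW x arithmetic intensity) and the compute roof (peak GFLOP/s). */
static double attainable_gflops(double peak_gflops,
                                double peak_bw_gbs,
                                double arith_intensity)
{
    double memory_roof = peak_bw_gbs * arith_intensity;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

int main(void)
{
    double bw = 16.0;             /* GB/s, assumed; same for both CPUs */
    double peak_x2 = 2 * 2.2 * 4; /* 17.6 GFLOP/s, assuming 4 FLOPs/cycle */
    double peak_x4 = 4 * 2.3 * 8; /* 73.6 GFLOP/s, assuming 8 FLOPs/cycle */

    for (double ai = 0.25; ai <= 16.0; ai *= 2)
        printf("AI = %5.2f  X2: %6.2f  X4: %6.2f GFLOP/s\n", ai,
               attainable_gflops(peak_x2, bw, ai),
               attainable_gflops(peak_x4, bw, ai));
    /* At low AI both machines sit on the same memory roof; the X4's
     * higher compute roof only matters once AI is large enough. */
    return 0;
}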
Benchmark Results (CPU vs. GPU)
[Figure: kernel benchmark results; the GPU is not always faster.]

Multi-threaded DGEMM
- Use an OpenMP pragma; the compiler will generate the parallel code.

void dgemm(int n, double *A, double *B, double *C)
{
    /* Distribute the iterations of the outer (sj) loop across threads */
    #pragma omp parallel for
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}
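This code relies on BLOCKSIZE and do_block from the blocked DGEMM seen earlier in the course. A plain scalar version consistent with the call above might look like this (the textbook's actual do_block also uses AVX intrinsics and loop unrolling):

#define BLOCKSIZE 32 /* tile size used by the textbook's blocked DGEMM */

/* Scalar sketch of do_block: C += A * B on one BLOCKSIZE^3 tile.
 * Column-major indexing (element (i,j) stored at i + j*n), matching
 * the textbook's DGEMM. */
static void do_block(int n, int si, int sj, int sk,
                     double *A, double *B, double *C)
{
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j * n];
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}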
Multi-threaded DGEMM Using 1 to 16 Threads
[Figure: DGEMM performance as the thread count grows from 1 to 16]

Fallacies
- Fallacy: Amdahl's Law does not apply to parallel computers because we can achieve linear speedup.
  - It certainly applies (see the next slide); e.g., strong scaling is bounded by the law.
- Fallacy: peak CPU performance reflects observed/actual performance.
  - Marketing people like this approach! In reality you must account for many bottlenecks: pipeline stalls, branch mispredictions, cache misses, memory bus contention, synchronization, etc.
Multi-threaded DGEMM
[Figure: speedup of multi-threaded DGEMM]

Pitfalls
- Pitfall: developing parallel software without taking the multiprocessor into account.
- Example: using a single lock for a shared resource. This serializes all accesses, even ones that could proceed in parallel.
- Hence, use finer-granularity locking, or buffers, to reduce contention (see the sketch below).
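A sketch of finer-granularity (striped) locking in C with pthreads; the counter array and stripe count are hypothetical, chosen only to illustrate the idea:

#include <pthread.h>

#define NBUCKETS 16
#define NSTRIPES 4

/* Hypothetical shared resource: an array of counters. The pitfall
 * would be one global mutex guarding all of them; instead, stripe
 * the buckets across several locks so updates to different stripes
 * proceed in parallel. */
static long counters[NBUCKETS];
static pthread_mutex_t stripe_lock[NSTRIPES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

void increment(int bucket)
{
    pthread_mutex_t *m = &stripe_lock[bucket % NSTRIPES];
    pthread_mutex_lock(m);   /* contends only with its own stripe */
    counters[bucket]++;
    pthread_mutex_unlock(m);
}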
Concluding Remarks
- The goal of multiprocessors: achieve higher performance by using multiple processors.
- The difficulties: developing parallel software, and devising appropriate architectures.
- SaaS is growing, and clusters are a good match for it; parallel multiprocessors will continue to be popular.
- Performance per dollar and performance per Joule drive both mobile computing and WSCs (Warehouse-Scale Computers).

In Summary: Looking Back at DGEMM on an Intel Core i7
- Data-level parallelism via AVX (subword parallelism, SIMD): 3.2x speedup
- More ILP by unrolling the loop four times, giving the CPU hardware more instructions to schedule: 2.0x further speedup
- Cache optimization via blocking: 2.4x further speedup
- TLP (thread-level parallelism) on a 16-core machine: 14x further speedup
- Note: you added only about 24 lines of code, yet the speedup is 200-300x (from Chapter 1 to Chapter 6)!
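As a quick check, these factors multiply out to the total on the final slide: 3.2 x 2.0 x 2.4 ~ 15.4 on a single core, and 15.4 x 14 ~ 215, i.e., roughly the 220x reported.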
DGEMM: Putting It All Together
- Combining all the techniques we have learned in CSCI 402 so far: 220x faster!
- [Figure: cumulative speedup chart; the 15x annotation marks the single-core speedup before multithreading]