SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL
Matthias Bach and David Rohr
Frankfurt Institute for Advanced Studies
Goethe University Frankfurt
I: INTRODUCTION
Scaling DGEMM to Multiple Cayman GPUs and Interlagos Many-core CPUs for HPL, June 15, 2011
LOEWE-CSC: An AMD-based supercomputer
- 786 GPU compute nodes: 2 x 2.1 GHz 12-core AMD Opteron, 1 AMD Radeon 5870 GPU
- 40 high-density compute nodes: 4 x 2.1 GHz 12-core AMD Opteron
- 32 GiB RAM per CPU
- 16.5 TB/s bisectional network bandwidth
- 1.62 PB local storage
- New HPL version and our own DGEMM kernel
- 299 Tflop/s overall: #22 Top500 (November 2010), #8 Green500 (November 2010)
DGEMM AND LINPACK: Optimizing step by step
The given optimization problems:
- Linpack (usually HPL): solves A * x = b by Gaussian elimination with partial pivoting; performance is dominated by DGEMM
- DGEMM (general matrix multiply): C = alpha * A * B + beta * C
The optimization steps:
- GPU DGEMM kernel
- DGEMM on a single-GPU system
- DGEMM on a multi-GPU system
- Moving from Cypress to Cayman
- Tuning HPL for a GPU DGEMM
- Using Interlagos in the future
II: DGEMM KERNELS
DGEMM KERNELS: Matrix blocking
DGEMM = general matrix multiply: C = alpha * A * B + beta * C
A * B on the GPU: A is an n x k matrix, B a k x m matrix, C an n x m matrix
Complexity: 2 * m * n * k floating-point operations and, naively, 2 * m * n * k memory fetches
Blocking: A_il is an a x 1 block, B_lj a 1 x b block, C_ij an a x b block
- Caching A_i and B_j reduces memory fetches
- One block per thread
- Larger blocking requires more registers (e.g. 41 for 8x8)
Block size vs. wavefront count:
- Larger blocking reduces memory fetches
- Smaller blocking increases the number of wavefronts
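The fetch-reduction argument on this slide can be made concrete with a little arithmetic. The following sketch (mine, not from the slides) counts fetches: a thread computing an a x b tile of C reads k * (a + b) elements instead of 2 * k per C element, so total fetches drop from 2*m*n*k to m*n*k*(a+b)/(a*b).

```python
# Sketch: fetch-count arithmetic behind the blocking argument (illustrative).

def fetches_naive(m, n, k):
    # one A fetch and one B fetch per multiply-add
    return 2 * m * n * k

def fetches_blocked(m, n, k, a, b):
    # (m/a)*(n/b) tiles of C, each reading k*(a+b) elements of A and B
    return (m // a) * (n // b) * k * (a + b)

m = n = k = 1024
for a, b in [(4, 4), (8, 8)]:
    ratio = fetches_naive(m, n, k) / fetches_blocked(m, n, k, a, b)
    print(f"{a}x{b} blocking: {ratio:.0f}x fewer fetches")
```

This is why larger blocking saves bandwidth (8x8 fetches half as much as 4x4), while the register cost and reduced wavefront count push in the other direction.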
KERNEL TUNING: Parameters of an optimal kernel
Assume a square C matrix for simplicity. The optimal kernel parameters were determined experimentally:
- 4 x 4 blocking
- B matrix transposed, A matrix not transposed
- Pixel shader kernel, no compute shader
- Output via color buffers, not MemExport
- Output buffer located in either host or GPU memory (depends on GPU and chipset)
- Loop unrolled with an unrolling factor of two
- Texture cache used in tiled mode
- Hardcoded K = 1024
- Identical for Cypress and Cayman
MATRIX SIZE DEPENDENCE OF THE KERNEL: Cypress
Kernel performance depends on:
- K = length of the scalar product
- H = number of lines/rows of the C matrix calculated per kernel launch
Cypress prefers large values for both. Good: K = 1024, H >= 1024 (best >= 3072).
H can be chosen depending on M and N of A and B.
Kernel peak performance: 494 Gflop/s
MATRIX SIZE DEPENDENCE OF THE KERNEL: Cayman
Kernel performance depends on:
- K = length of the scalar product
- H = number of lines/rows of the C matrix calculated per kernel launch
While Cypress prefers large values for both, Cayman prefers smaller values for both, so larger matrices must be cut into blocks.
Kernel peak performance: 617 Gflop/s
III: DGEMM SYSTEM PERFORMANCE
FULL DGEMM: Feeding work to the GPU
The C matrix does not fit in GPU memory:
- Divide the problem into blocks
- Choose optimal problem sizes for the GPU
- The remainder from block construction is processed by the CPU
For each A-submatrix, all B-submatrices are iterated over that fixed A-submatrix. The B-submatrices stay on the GPU; submatrices of A are not necessarily stored (halving the memory need). Still, each matrix is transferred exactly once.
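The iteration order described above can be sketched as a transfer schedule (names are illustrative, not from the caldgemm source): B-submatrices are uploaded once and cached on the GPU, while A-submatrices are streamed in, so every input block crosses the bus exactly once.

```python
# Illustrative transfer schedule: iterate all B-blocks over each fixed A-block.

def transfer_schedule(n_a_blocks, n_b_blocks):
    ops = []
    for i in range(n_a_blocks):
        ops.append(("upload_A", i))          # A-submatrix streamed in once
        for j in range(n_b_blocks):
            if i == 0:
                ops.append(("upload_B", j))  # cached on the GPU afterwards
            ops.append(("kernel", i, j))
    return ops

ops = transfer_schedule(3, 4)
uploads = [op for op in ops if op[0].startswith("upload")]
print(len(uploads))  # 3 A-uploads + 4 B-uploads = 7
```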
GPU SCHEDULING: A simple approach
- The GPU only calculates A * B; the addition of C is done on the host
- The GPU requires a special memory layout, so pre- and postprocessing are required:
  - DivideBuffer: transform A and B as required
  - MergeBuffer: add the GPU result (A * B) and C
- Performance is lost due to GPU idle time
GPU SCHEDULING: Building a pipeline
- A pipeline minimizes GPU idle time
- One divide plus one merge per tile; the DMA transfer must be started before the kernel
- Iterate blocks of B over a fixed A
- Minimize PCIe transfers
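A minimal sketch of what runs concurrently in the steady state of such a pipeline (stage names are mine, not caldgemm's): while the GPU computes tile n, the CPU prepares tile n+1, its DMA transfer is already in flight, and the result of tile n-1 is merged into C.

```python
# Illustrative steady-state pipeline for tile n.

def pipeline_stages(n):
    return [
        ("divide", n + 1),   # CPU reformats the next input tile
        ("dma_in", n + 1),   # transfer started before the kernel needs it
        ("kernel", n),       # GPU computes A*B for the current tile
        ("merge", n - 1),    # CPU adds the previous GPU result into C
    ]

print(pipeline_stages(5))
```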
DGEMM SYSTEM PERFORMANCE: Hiding PCIe
DGEMM KERNELS: Matrix tiling
Using the CPU only for the remainder part of the tiling wastes CPU power. Instead, the workload is distributed among GPU and CPU:
- Phase 1a: a large rectangular block (size chosen such that the CPU finishes slightly before the GPU)
- Phase 1b: the remainder part
- Phase 2: a second rectangular block
- Phase 3: single tiles (work stealing)
The size of the rectangular blocks is chosen based on the vacant matrix size and performance estimations. Processing larger rectangular blocks increases CPU DGEMM performance compared to single tiles. CPU and GPU DGEMM performance are continuously monitored, and the monitored data are used to improve the estimations for the next DGEMM call.
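The phase-1a sizing can be sketched as follows. This is a hedged illustration, not the caldgemm heuristic: the CPU receives a column share proportional to its measured rate, shrunk by a safety factor (the 0.95 and all names are my assumptions) so the CPU finishes slightly before the GPU, and the GPU part stays a whole number of tiles.

```python
# Illustrative phase-1a split from measured CPU/GPU DGEMM rates.

def cpu_block_cols(total_cols, gpu_tile, cpu_gflops, gpu_gflops, safety=0.95):
    share = cpu_gflops / (cpu_gflops + gpu_gflops)   # proportional split
    cols = int(total_cols * share * safety)          # CPU finishes a bit early
    return (cols // gpu_tile) * gpu_tile             # keep GPU part in whole tiles

print(cpu_block_cols(10240, 1024, 150.0, 450.0))  # -> 2048
```

The continuously monitored rates from the slide would feed `cpu_gflops` and `gpu_gflops` for the next DGEMM call.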
DGEMM SYSTEM PERFORMANCE: Matrix size effects
IV: MULTI-GPU DGEMM
MATRIX DISTRIBUTION: Matrix distribution among n GPUs
- The B matrix is split into n parts; each GPU processes only one part of B
- The buffer requirement is reduced
- After the first GPU has processed its part, the remaining tiles are processed in a round-robin fashion
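A simplified sketch of this distribution (the real scheduler is dynamic; this static version is mine): B-blocks are split evenly across the GPUs, and any leftover blocks beyond the even split are handed out round-robin.

```python
# Illustrative static split of B-blocks among n GPUs with round-robin remainder.

def split_b(num_b_blocks, num_gpus):
    base = num_b_blocks // num_gpus
    assignment = {g: list(range(g * base, (g + 1) * base))
                  for g in range(num_gpus)}
    leftover = list(range(num_gpus * base, num_b_blocks))
    for i, j in enumerate(leftover):       # remaining tiles, round-robin
        assignment[i % num_gpus].append(j)
    return assignment

print(split_b(7, 3))  # {0: [0, 1, 6], 1: [2, 3], 2: [4, 5]}
```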
DUAL-GPU PERFORMANCE: Analysis of the dual-GPU version
Accumulated max. perf. is the accumulated DGEMM performance of all contributing processing elements, corrected for the CPU cores used for GPU pre- and postprocessing, to approximate the performance of a best-case implementation. The efficiency is the ratio of the achieved performance to this best-case performance.
MULTI-GPU BANDWIDTH REQUIREMENTS
Multi-GPU DGEMM has tremendous memory and PCI-Express throughput requirements. Reading from and writing to the C matrix requires at least (g = performance in Gflop/s, s = size of an element in bytes, i.e. 8 for double precision):
PCIe: p(k) = g * s / (2 * k)
Host memory: m(k) = 2 * g * s / k
p(1024) = 1.82 * n GB/s, p(2048) = 0.91 * n GB/s
m(1024) = 7.27 * n GB/s, m(2048) = 3.63 * n GB/s
(n = number of GPUs.) Additional throughput is required for the concurrent CPU DGEMM. PCIe performance is sufficient, even for two GPUs behind a PCI-Express switch. Memory performance is possibly insufficient, depending on k.
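The quoted numbers can be checked directly from the formulas, assuming g is the per-GPU DGEMM rate (roughly the 465 Gflop/s system rate from these slides) and s = 8 bytes for double precision; multiply by the GPU count n for node totals.

```python
# Check of the throughput formulas: g in Gflop/s per GPU, s in bytes.

def pcie_gbs(g, k, s=8):
    return g * s / (2 * k)   # C crosses PCIe once per 2k flops/element

def mem_gbs(g, k, s=8):
    return 2 * g * s / k     # host memory sees a read and a write of C

g = 465  # approx. per-GPU system DGEMM rate from the slides
print(round(pcie_gbs(g, 1024), 2), round(pcie_gbs(g, 2048), 2))  # 1.82 0.91
print(round(mem_gbs(g, 1024), 2), round(mem_gbs(g, 2048), 2))    # 7.27 3.63
```

Doubling k halves both requirements, which is why k becomes the critical tuning parameter for multi-GPU runs.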
MULTI-GPU PERFORMANCE
Multi-GPU performance depends greatly on k:
- Larger k decreases the memory bandwidth requirements
- Performance becomes constant for large k
Predictions for quad-GPU and beyond:
- The k parameter becomes even more critical
- GPU memory requirement scales linearly with k, so the usable k range is bounded by GPU memory limitations
TRIPLE-GPU PERFORMANCE
Speed scales almost linearly to three GPUs. The faster 5870 GPU scales worse than the two slower cores of the 5970, which works like two 5850s.
CPU UTILIZATION
CPU utilization per core during DGEMM and HPL:
- Core 0 performs preprocessing and management of the DMA transfers
- One core is sufficient for preprocessing and management of up to two GPUs
- Postprocessing is multithreaded for slow CPUs / dual GPUs
Optimized implementation:
- Preprocessing is multithreaded for more than two GPUs
- Threads for pre- and postprocessing are pinned to the CPU die with the closest connection to the GPU
DGEMM PERFORMANCE
Maximum DGEMM performance achieved:
- 5870 GPU, GPU only: 465 Gflop/s
- Two Magny-Cours 6172 CPUs, one 5870 GPU: 625 Gflop/s
- 8 Nehalem cores at 2.26 GHz, one 5970 GPU: 832 Gflop/s
- Two Magny-Cours 6174 CPUs, three 5870 GPUs: 1432 Gflop/s
V: GOING FROM CYPRESS TO CAYMAN
CAYMAN & DMA: Buffers
- Multiple buffers for the B matrix
- Two buffers for the A matrix (round-robin use)
- Three buffers for the output (round-robin use)
- Transfer to the GPU is performed by the DMA engine
- Two additional page-locked buffers per input matrix on the host side (round-robin use)
- The addition of the DGEMM result is done by the host
PROBLEM: Cayman shows poor performance when using the DMA engine.
CAYMAN & DMA: DMA performance
CPU to GPU:
- Cayman shows full DMA speed for 64-bit transfers
- The kernel must read from a 128-bit buffer for full performance
- A different DMA path (1b) was introduced
GPU to CPU:
- The kernel must write to a 128-bit buffer
- The kernel can write directly to host memory in 128-bit format, bypassing the DMA engine
- Using a 64-bit conversion is infeasible
CAYMAN & DMA: Multi-GPU
- All buffers are replicated on each device
- Processing is split along the B matrix
- Each GPU caches only a part of the B matrix
VI: HPL
HPL (HIGH PERFORMANCE LINPACK)
Linpack iteratively factorizes a dense system of linear equations. Each iteration consists of:
- Panel factorization
- Panel broadcast
- Line swapping & U broadcast
- U-matrix update (DTRSM)
- C-matrix update (DGEMM)
HPL utilizes BLAS routines; the major contribution to the workload is DGEMM (95.6% of total execution time).
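Structurally, the iteration above repeats once per NB-wide panel, and the trailing submatrix updated by DGEMM shrinks as the factorization proceeds; a structure-only sketch (names and counting are mine):

```python
# Structure-only sketch of HPL's right-looking iteration over NB-wide panels.

def hpl_steps(n, nb):
    steps = []
    for col in range(0, n, nb):
        trailing = n - col - nb            # remaining columns right of the panel
        steps.append(("factor+bcast+swap+dtrsm", col))
        if trailing > 0:
            steps.append(("dgemm_update", trailing))  # dominates the runtime
    return steps

print(len(hpl_steps(8192, 1024)))  # 8 panel steps + 7 trailing updates = 15
```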
NAIVE HPL-GPU IMPLEMENTATION
The first, naive HPL-GPU implementation offloads the large DGEMM for the C-matrix update to the GPU. The GPU DGEMM makes up 78% of the total execution time, so the GPU idles 22% of the time. As only DGEMM contributes considerably to the overall calculation effort, HPL performance is limited to 78% of the DGEMM performance.
HIDING FACTORIZATION AND BROADCAST TIME (LOOKAHEAD)
The factorization requires only the first NB columns of the previous iteration to finish, so factorization and broadcast for the next iteration can be processed in parallel with the DGEMM. Two issues occur:
- Running pre- and postprocessing in parallel with the factorization leads to memory congestion
- During the broadcast, only one CPU core is active
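The launch order behind lookahead can be sketched like this (names illustrative, not from the HPL source): the first NB columns of the update are finished first, after which the next panel factorization and broadcast can run concurrently with the remaining bulk update on the GPU.

```python
# Illustrative lookahead launch order for an update of ncols columns.

def lookahead_order(ncols, nb):
    return [
        ("dgemm_update", 0, nb),          # only these columns block the next panel
        ("factor_and_broadcast_next",),   # runs concurrently with the step below
        ("dgemm_update", nb, ncols),      # bulk update, mostly on the GPU
    ]

print(lookahead_order(8192, 1024)[0])  # ('dgemm_update', 0, 1024)
```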
IMPROVED LOOKAHEAD
- The CPU DGEMM is split in two parts to improve utilization during the broadcast
- CPU cores idle intentionally during the factorization to avoid memory congestion and ensure full GPU DGEMM performance
- A binary patch of the AMD driver changes memory policies and decreases the page-fault rate
HIDING PIVOTING TIME (LOOKAHEAD 2)
The pivoting time can also be hidden. HPL performance improves significantly with lookahead.
HPL PERFORMANCE
Multi-node and single-node performance is consistently higher with lookahead.
SINGLE-GPU PEAK PERFORMANCES
Peak performance achieved on LOEWE-CSC (two 6172 Magny-Cours CPUs at 2.1 GHz, one 5870 GPU). Multi-node per-node performance achieves 93.6% of single-node performance.

Discipline              Performance (Gflop/s)   Peak (Gflop/s)   Efficiency
DGEMM kernel            494                     544              90.9%
GPU DGEMM system        465                     544              85.5%
GPU/CPU DGEMM system    625                     745              83.9%
Single-node HPL         563                     745              75.5%
Multi-node HPL          526                     745              70.6%
MULTI-GPU HPL: Concept
- More threads are used for pre- and postprocessing
- A dynamic factorization thread count improves performance, especially for k = NB >= 2048
MULTI-GPU HPL: Benchmarks
First multi-GPU benchmarks (2 x 6174 CPU, 3 x 5870 GPU):
- HPL: 1114 Gflop/s
- DGEMM: 1432 Gflop/s
Multi-GPU efficiency run (2 x 6174 CPU, 3 x V7800 GPU):
- HPL: 1230 Mflop/s per Watt
The efficiency-optimized version offloads as much workload as possible to the GPU, even if this leaves the CPU idling.
VII: INTERLAGOS
INTERLAGOS: Looking ahead
Expectations:
- Optimized BLAS libraries achieve near-peak performance on any CPU
- Since the two cores of a module share a floating-point unit, DGEMM may not have to run on all cores; the remaining cores can be used for GPU pre-/postprocessing
- No big changes necessary; the CPU contribution to the overall performance is small, at least for multi-GPU
Problems:
- Support for 3DNow! was dropped
- GotoBLAS is no longer adapted to new CPUs
- The patches to GotoBLAS for CPU core reservation need to be ported to another BLAS library
HETEROGENEOUS SYSTEMS: Work distribution
- Interlagos as an upgrade option for LOEWE-CSC, as additional nodes
- The traditional distribution targets homogeneous systems
- LOEWE-CSC is already heterogeneous: quad nodes without GPUs
HETEROGENEOUS SYSTEMS: Work distribution
- Group nodes into performance classes
- Size submatrices according to performance
- Skip process rows during allocation
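The sizing step above can be illustrated with a small proportional-share calculation (my notation, not the modified HPL's): each performance class receives a number of matrix blocks per node proportional to its measured per-node rate.

```python
# Illustrative per-class block sizing from relative node performance.

def blocks_per_node_class(total_blocks, perf_per_node, nodes_per_class):
    total = sum(p * n for p, n in zip(perf_per_node, nodes_per_class))
    return [round(total_blocks * p / total) for p in perf_per_node]

# e.g. GPU nodes ~6x as fast as quad CPU-only nodes, 4 GPU nodes + 2 quads:
print(blocks_per_node_class(60, [6.0, 1.0], [4, 2]))  # [14, 2]
```

Rounding each class to whole blocks is where the residual granularity loss on the next slide comes from.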
HETEROGENEOUS SYSTEMS: Benchmarks
Benchmarks on a 6-node setup: 2 quads + 2 GPU nodes @ 750 MHz + 2 GPU nodes @ 700 MHz
- Original version: speed ~6 times quad speed, ~25% granularity loss
- Optimized version: ~3% granularity loss
SUMMARY
SUMMARY
- High-performance DGEMM kernels: 494 Gflop/s, 90.9% of peak
- High-performance CPU/GPU DGEMM with automatic load balancing: 625 Gflop/s, 84% of peak
- Modified HPL keeping the GPU busy: 563 Gflop/s, 76% of peak
- Scaling to multi-GPU nodes and to many nodes
- Minimum granularity loss on heterogeneous systems
Open source:
http://code.compeng.uni-frankfurt.de/projects/caldgemm
http://code.compeng.uni-frankfurt.de/projects/hpl
QUESTIONS
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by the individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.