Thinking Outside of the Tera-Scale Box. Piotr Luszczek

1 Thinking Outside of the Tera-Scale Box Piotr Luszczek

2 Brief History of Tera-flop: ASCI Red

3 Brief History of Tera-flop: 2007 [Timeline: Intel Polaris, ASCI Red]

4 Brief History of Tera-flop: GPGPU [Timeline: GPGPU, Intel Polaris, ASCI Red]

5 Tera-flop: a Dictionary Perspective
   1. Belongs to a supercomputer expert. Consumes 500 kW. Costs over $1M. Requires distributed memory programming. Memory: 1 GiB or more.
   2. Belongs to a gaming geek. Consumes under 500 W. Costs under $200. Requires only a CUDA kernel. Memory: 1 GiB or more.

6 Double-Dip Recession? [Plot: productivity over time. Multicore: Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, ... Hybrid multicore: HMPP, PGI Accelerator, PathScale?]

7 Locality Space in HPC Challenge [Figure: benchmarks placed in locality space: Parallel-Transpose, STREAM, HPL, Mat-Mat-Mul, RandomAccess, FFT]

8 Locality Space in HPC Challenge: Bounds [Figure: the same locality space annotated with bandwidth-bound, compute-bound, and latency-bound regions. Labels: Parallel-Transpose, STREAM, HPL, Mat-Mat-Mul, AORSA2D, PCIe, RandomAccess, FFT]

9 CUDA vs. OpenCL: Similar Software Stacks

10 Matrix-Matrix Multiply: the Simple Form

    for i = 1:M
      for j = 1:N
        for k = 1:T
          C(i, j) += A(i, k) * B(k, j)

   CUBLAS, Volkov et al., MAGMA, Nakasato
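The triple loop on slide 10 can be written out as a small runnable sketch. This is plain Python with 0-based indices, meant only to make the loop order and the flop count concrete; it is not how CUBLAS or MAGMA implement the kernel. The function names `matmul_naive` and `gemm_flops` are made up for this illustration.

```python
def matmul_naive(A, B):
    """Return C = A * B using the i, j, k loop order from the slide."""
    M, T, N = len(A), len(B), len(B[0])
    assert all(len(row) == T for row in A), "inner dimensions must match"
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(T):
                C[i][j] += A[i][k] * B[k][j]
    return C


def gemm_flops(M, N, T):
    """Each innermost iteration performs one multiply and one add."""
    return 2 * M * N * T


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_naive(A, B))   # [[19.0, 22.0], [43.0, 50.0]]
    print(gemm_flops(2, 2, 2))  # 16
```

Dividing `gemm_flops(M, N, T)` by the measured run time gives the kind of flop rate plotted on the following slides.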

11 Fermi C2050: from CUDA to OpenCL [Plot: performance (Gflop/s) vs. problem size. Series: CUDA, texture; CUDA, no texture; OpenCL, no change; OpenCL, good unrolling; OpenCL, bad unrolling]

12 Fermi C2050: Simple OpenCL Porting [Plot: Gflop/s vs. problem size. Series: native OpenCL mat-mul; ported OpenCL mat-mul]

13 ATI 5870: Simple OpenCL Porting [Plot: Gflop/s vs. problem size. Series: native OpenCL mat-mul; ported OpenCL mat-mul (inlined indexing); ported OpenCL]

14 Making Case for Autotuning: Use of Texture [Plot: Gflop/s vs. problem size. Series: Fermi C2050 OpenCL, texture; Fermi C2050 OpenCL, no texture]

15 Making Case for Autotuning: Blocking Factors Provided by NVIDIA
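The autotuning argument on slides 14 and 15 amounts to an empirical search: run the same kernel with several candidate parameters and keep the fastest. A minimal sketch of that search loop follows, using a blocked matrix multiply in plain Python. The function names, the candidate list, and the test matrices are assumptions made for illustration, and pure-Python timings say nothing about GPU behavior; only the structure of the search carries over.

```python
import time


def matmul_blocked(A, B, nb):
    """Blocked C = A * B for square n x n lists; nb is the blocking factor."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                # Multiply one nb x nb block pair and accumulate into C.
                for i in range(ii, min(ii + nb, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + nb, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += a * Bk[j]
    return C


def autotune_block_size(n, candidates, repeats=3):
    """Time every candidate blocking factor; return (best_nb, timings)."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matmul_blocked(A, B, nb)
            best = min(best, time.perf_counter() - start)
        timings[nb] = best  # keep the fastest of the repeated runs
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_nb, timings = autotune_block_size(64, [4, 8, 16, 32])
    print("best blocking factor:", best_nb)
```

An exhaustive search is the simplest possible strategy; a real autotuner would typically also vary parameters such as unrolling depth or texture use and prune the search space.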

16 Mat-Mul Portability Summary

   Achieved performance      Single precision   Double precision
   ATI OpenCL                52%                57%
   ATI IL                    74%                87%
   NVIDIA CUDA               63%                58%
   NVIDIA OpenCL             57%                49%
   Multicore (vendor)        79%                79%
   Multicore (autotuned)     46%                40%

17 Now, let's think outside of the box

18 Recipe for Performance Before Multicore: load/store; instr. scheduling; wider superscalar instr. issue; increase clock frequency; increase number of transistors

19 Recipe for Performance After Multicore: more cores; parallel software

20 Recipe for Future Performance: more cores; better sequential software; DAG runtime

21 From Assembly to DAG: PLASMA

   Typical loop in assembly (sequential instruction stream, reordered by instruction scheduling):

    load R0, #1000
    divf F1, F2, F3
    load R1, #1000
    divf F2, F4, F5
    dec R1
    jump

   LU factorization (sequential task stream, reordered by DAG scheduling):

    for i = 1:N
      GETRF2(A[i, i])
      for j = i+1:N
        TRSM(A[i, i], A[i, j])
        for k = i+1:N
          GEMM(A[k, i], A[i, j], A[k, j])

   [Figure: task DAG with GETRF2, TRSM, and GEMM nodes]
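The step from a sequential task stream to a DAG on slide 21 can be sketched in a few lines: record which tiles each task reads and writes, then add an edge whenever a task touches a tile that an earlier task wrote (or overwrites a tile an earlier task read). The sketch below follows the slide's loop nest and uses the kernel names from its DAG labels. The dependency-tracking scheme and all function names are assumptions made for illustration; this is not the PLASMA runtime.

```python
def lu_task_stream(n):
    """Sequential task stream for the slide's tile LU loop nest (0-based).

    Each task is (name, tiles_read, tiles_updated); updated tiles are
    treated as read-write.
    """
    tasks = []
    for i in range(n):
        tasks.append(("GETRF2", [], [(i, i)]))
        for j in range(i + 1, n):
            tasks.append(("TRSM", [(i, i)], [(i, j)]))
            for k in range(i + 1, n):
                tasks.append(("GEMM", [(k, i), (i, j)], [(k, j)]))
    return tasks


def build_dag(tasks):
    """Return deps[t] = set of earlier task indices that task t must wait for."""
    last_writer = {}  # tile -> index of the last task that updated it
    readers = {}      # tile -> tasks that read it since its last update
    deps = []
    for t, (_, reads, writes) in enumerate(tasks):
        d = set()
        for tile in reads:                   # read after write
            if tile in last_writer:
                d.add(last_writer[tile])
        for tile in writes:
            if tile in last_writer:          # write after write
                d.add(last_writer[tile])
            d.update(readers.get(tile, ()))  # write after read
        for tile in reads:
            readers.setdefault(tile, set()).add(t)
        for tile in writes:
            last_writer[tile] = t
            readers[tile] = set()
        deps.append(d)
    return deps


def wavefronts(deps):
    """Earliest step at which each task can run with unlimited workers."""
    level = []
    for d in deps:
        level.append(1 + max((level[p] for p in d), default=-1))
    return level


if __name__ == "__main__":
    tasks = lu_task_stream(3)
    levels = wavefronts(build_dag(tasks))
    for (name, _, writes), lvl in zip(tasks, levels):
        print(lvl, name, writes[0])
```

For a 3 x 3 tile matrix the stream has 11 tasks but only 7 wavefronts, so a scheduler that follows the DAG can run several TRSM and GEMM tasks concurrently instead of in program order.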

22 Scalability of LU Factorization on Multicore [Plots: panels for 6 cores = 1 socket, 24 cores = 4 sockets, and 8 sockets. Series: PLASMA, MKL, LAPACK. System: AMD Istanbul 2.8 GHz]

23 Combining Multicore and GPU for QR Fact. [Plot: Tflop/s. Series: CPUs + GPU; CPUs; GPU. System: AMD Istanbul 2.8 GHz, 24 cores, and Fermi GTX]

24 Performance on Distributed Memory [Plots: Tflop/s vs. number of cores for QR, LU, and Cholesky. Series: Peak, DPLASMA, Cray LibSci. System: ORNL Kraken, Cray XT5; AMD Istanbul 2.6 GHz, dual-socket 6-core; interconnect: SeaStar, 3D torus]

25 People and Projects
   Projects: MAGMA, PLASMA, DPLASMA and DAGuE
   People: Jack Dongarra, Mathieu Faverge, Jakub Kurzak, Hatem Ltaief, Stan Tomov, Asim YarKhan, Peng Du, Rick Weber, George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010