Moving Towards Exascale with Lessons Learned from GPU Computing. Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urbana-Champaign

Size: px

Start display at page:

Download "Moving Towards Exascale with Lessons Learned from GPU Computing. Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urbana-Champaign"

Muriel Murphy
5 years ago
Views:

1 Moving Towards Exascale with Lessons Learned from GPU Computing Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urana-Champaign

2 Agenda Blue Waters and recent progress in petascale GPU computing Upcoming innovations in exascale computing Architecture Programming Systems Conclusion and discussions

3 Blue Waters - Operational at Illinois since 3/ PF 1.6 PB DRAM $25M 12+ G/sec 1/4/1 G Ethernet Switch 1 GB/sec IB Switch >1 TB/sec WAN Spectra Logic: 3 PBs Sonexion: 26 PBs

4 Two Petascale Computing Systems NCSA ORNL System Attriute Blue Waters Titan Vendors Cray/AMD/NVIDIA Cray/AMD/NVIDIA Processors Interlagos/Kepler Interlagos/Kepler Total Peak Performance (PF) Total Peak Performance (CPU/GPU) 7.7/ /24.5 Numer of CPU Chips 49,54 18,688 Numer of GPU Chips 4,224 18,688 Amount of CPU Memory (TB) Interconnect 3D Torus 3D Torus Amount of On-line Disk Storage (PB) Sustained Disk Transfer (TB/sec) > Amount of Archival Storage Sustained Tape Transfer (GB/sec) 1 7

Blue Waters contains 4,224 Cray XK7 compute nodes. Cray XK7 Nodes Dual-socket Node One AMD Interlagos chip 8 core modules, 32 threads 156.

5 Blue Waters contains 4,224 Cray XK7 compute nodes. Cray XK7 Nodes Dual-socket Node One AMD Interlagos chip 8 core modules, 32 threads GFs peak performance 32 GBs memory 51 GB/s andwidth One NVIDIA Kepler chip 1.3 TFs peak performance 6 GBs GDDR5 memory 25 GB/sec andwidth Gemini Interconnect Same as XE6 nodes

6 Cray XE6 Nodes Blue Waters contains 22,64 Cray XE6 compute nodes. Dual-socket Node Two AMD Interlagos chips 16 core modules, 64 threads 313 GFs peak performance 64 GBs memory 12 GB/sec memory andwidth Gemini Interconnect Router chip & network interface Injection Bandwidth (peak) 9.6 GB/sec per direction

7 Science Area Numer of Teams Codes Struct Grids Unstruct Grids Dense Matrix Sparse Matrix N- Body Monte Carlo FFT PIC Sig I/O Climate and Weather 3 CESM, GCRM, CM1/WRF, HOMME X X X X X Plasmas/ Magnetosphere 2 H3D(M),VPIC, OSIRIS, Magtail/UPIC X X X X Stellar Atmospheres and Supernovae 5 PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS X X X X X X Cosmology 2 Enzo, pgadget X X X Comustion/ Turulence 2 PSDNS, DISTUF X X General Relativity 2 Cactus, Harm3D, LazEV X X Molecular Dynamics 4 AMBER, Gromacs, NAMD, LAMMPS X X X Quantum Chemistry 2 SIAL, GAMESS, NWChem X X X X X Material Science 3 NEMOS, OMEN, GW, QMCPACK Earthquakes/ Seismology 2 AWP-ODC, HERCULES, PLSQR, SPECFEM3D X X X X X X X X Quantum Chromo Dynamics 1 Chroma, MILC, USQCD X X X Social Networks 1 EPISIMDEMICS Evolution 1 Eve Engineering/System of Systems 1 GRIPS,Revisit X Computer Science 1 X X X X X

8 NAMD Initial Production Use Results 1 million atom enchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included 768 nodes, Kepler+Interlagos is 3.9X faster over Interlagos-only 768 nodes, XK7 is 1.8X XE6 Chroma Lattice QCD parameters: grid size of 48 3 x 512 running at the physical values of the quark masses 768 nodes, Kepler+Interlagos is 4.9X faster over Interlagos-only 768 nodes, XK7 is 2.4X XE6 QMCPACK Full run Graphite 4x4x1 (256 electrons), QMC followed y VMC 7 nodes, Kepler+Interlagos is 4.9X faster over Interlagos-only 7 nodes, XK7 is 2.7X XE6

Blue Waters Science Breakthrough Example Determination of the structure of the HIV capsid at atomic-level Collaorative effort of experimental groups, at the U. of Pittsurgh and Vanderilt U.

9 Blue Waters Science Breakthrough Example Determination of the structure of the HIV capsid at atomic-level Collaorative effort of experimental groups, at the U. of Pittsurgh and Vanderilt U., and the Schulten s computational team at the U. of Illinois. 64-million-atom HIV capsid simulation of the process through which the capsid disassemles, releasing its genetic material, is a critical step in HIV infection and a potential target for antiviral drugs.

10 LESSON #1 Production GPU code often involve complex tradeoffs. 1

11 Tridiagonal Solver Implicit finite difference methods, cuic spline interpolation, pre-conditioners An algorithm to find a solution of Ax = d, where A is an n- y-n tridiagonal matrix and d is an n-element vector

12 A Efficient Storage Format for A a 1 c a 1 2 c c 1 c a c c 2 3 Stored as 3 1D arrays No need for column indices No need for row pointers 1 2 3

13 GPU Tridiagonal System Solver Building Blocks Hyrid Methods PCR-Thomas (Kim 211, Davidson 211) PCR-CR (CUSPARSE 212) Etc. CUSPARSE is supported y NVIDIA Thomas (sequential) Cyclic Reduction (1 step) PCR (1 step) a c a c a c e e e e a c a a c a c e e e e a c a c a c e e e e a c a c a a c c e e e e a c a c a c e e e e Interleaved layout after partitioni

14 GPU Performance Advantage Runtime of solving an 8M-row matrix (ms) Our SPIKE-diag_pivoting (GPU) Our SPIKE-Thomas (GPU) Random Diagonally dominant CUSPARSE (GPU) Data transfer (pageale) Data transfer (pinned) MKL dgtsv(sequential, CPU) SAMOS 213

15 Relative Backward Error Matrix type SPIKE-diag_pivoting SPIKE-Thomas CUSPARSE MKL Intel SPIKE Matla E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E-4 1.5E E E E E-5 3.9E E E-5 9.7E E E E E E E E E E E E-4 2.2E E E E E E E E E E E E E E E E E E E E E+6 Nan Nan 1.47E+59 Fail 3.69E Inf Nan Nan Inf Fail 4.68E+171 SA MO S

16 Numerically Stale Partitioning SPIKE (Polizzi et al), diagonal pivoting for numerical staility SX = Y (5) DY = F (6) AiYi = Fi (7) SAMOS 213

17 Fast Transposition on GPU Chang, et al, SC 212 SAMOS 213

18 Dynamic Tiling Runtime (ms) data layout only dynamic tiling (with data layout) x random diagonally dominant zero diagonal SAMOS 213

19 Chang, et al, SC 212 SAMOS 213 Relative Backward Error Matrix type SPIKE-diag_pivoting SPIKE-Thomas CUSPARSE MKL Intel SPIKE Matla E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E-4 1.5E E E E E-5 3.9E E E-5 9.7E E E E E E E E E E E E-4 2.2E E E E E E E E E E E E E E E E E E E E E+6 Nan Nan 1.47E+59 Fail 3.69E Inf Nan Nan Inf Fail 4.68E+171

20 Chang, et al, SC 212 SAMOS 213 GPU Performance Advantage Runtime of solving an 8M matrix (ms) Our SPIKE-diag_pivoting (GPU) Our SPIKE-Thomas (GPU) Random Diagonally dominant CUSPARSE (GPU) Data transfer (pageale) Data transfer (pinned) MKL dgtsv(sequential, CPU)

21 Some Important Trends Driven y moile market Computing devices are quickly ecome SOCs Integrated CPU-GPU-I/O with DRAM Channels Compute outside CPU cores gaining importance Data movement off-chip must e minimized CUDA and HSA (Heterogeneous System Architecture) are leading the trends New OS/runtime and programming systems are emerging to take advantage of these advances 21

22 NEW INNOVATIONS IN EXASCALE COMPUTING

23 UNIFIED ADDRESS SPACE AND SOC DISTRIBUTED COMPUTE for acceleration of at scale data communication and compute - Innovation in HSA and OS

24 State of the Art Data Transfer Behavior Main Memory (DRAM) CPU PCIe Network I/O with Compression engine Disk I/O DMA Device Memory GPU card (or other Accelerator cards)

25 Desired SoC Data Transfer Behavior Main Memory (DRAM) SOC LLC DMA Network I/O with compression engine CPU/GPU Disk I/O

26 UNIFIED COHERENT MEMORY for acceleration of pointer-ased data structures - Innovation in CUDA and HSA

27 Large 3D spatial data structure Legacy Access to Large Structures Legacy SYSTEM MEMORY GPU GPU MEMORY GPU Memory is smaller Have to copy and process in chunks

28 Large 3D spatial data structure Legacy Access to Large Structures Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 GPU MEMORY

29 Large 3D spatial data structure COPY One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL GPU MEMORY Copy of top 2 levels of hierarchy

30 Process One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 FIRST KERNEL GPU MEMORY

31 Process One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 FIRST KERNEL GPU MEMORY

32 Process One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 FIRST KERNEL GPU MEMORY

33 Copy One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL GPU MEMORY Copy of ottom 3 levels of one ranch of the hierarchy

34 Process One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL SECOND KERNEL GPU MEMORY

35 Large 3D spatial data structure Large Spatial Data structure HSA and full OpenCL 2. SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

36 GPU can Traverse entire hierarchy HSA SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

37 GPU can Traverse entire hierarchy HSA SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

38 GPU can Traverse entire hierarchy HSA SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

39 GPU can Traverse entire hierarchy HSA SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

40 LIBRARY-DRIVEN COMPILER OPTIMIZATION Enaling high-performance programming using high-level languages - Innovation in Triolet and Tangram Blue Waters SETAC Meeting - Fe 214 4

41 Current State of Programming Heterogeneous Systems CPU Multicore Xeon Phi GPU FPGA SoC for HPC

42 Current State of Programming Heterogeneous Systems C/FORTRA N CPU Multicore Xeon Phi GPU FPGA SoC for HPC

43 Current State of Programming Heterogeneous Systems C/FORTRAN + Directives (OpenMP/ TBB) CPU Multicore Xeon Phi GPU FPGA SoC for HPC

44 Current State of Programming Heterogeneous Systems C/FORTRAN + Directives (OpenMP/ TBB) + SIMD Intrinsics CPU Multicore Xeon Phi GPU FPGA SoC for HPC

45 Current State of Programming Heterogeneous Systems C/FORTRAN OpenCL + Directives (OpenMP/ TBB) + SIMD Intrinsics CPU Multicore Xeon Phi GPU FPGA SoC for HPC

46 Current State of Programming Heterogeneous Systems C/FORTRAN OpenCL + Directives (OpenMP/ TBB) + SIMD Intrinsics Verilog/VHDL CPU Multicore Xeon Phi GPU FPGA SoC for HPC

47 Programming heterogeneous systems requires too many versions of code! C/FORTRAN OpenCL + Directives (OpenMP/ TBB) + SIMD Intrinsics Verilog/VHDL CPU Multicore Xeon Phi GPU FPGA SoC for HPC

48 Triolet Triolet Compiler C/FORTRAN OpenCL + Directives (OpenMP/ TBB) + SIMD Intrinsics Verilog/VHDL CPU Multicore Xeon Phi GPU FPGA SoC for HPC

49 Programming in Triolet Python Nonuniform FT (real part) SoC for HPC

50 Programming in Triolet Python Nonuniform FT (real part) Inner loop Inner loop ys = [ sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] SoC for HPC

51 Programming in Triolet Python Nonuniform FT (real part) Inner loop Outer loop Inner loop Outer loop ys = [ sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] SoC for HPC

Programming in Triolet Python Nonuniform FT (real part) Inner loop Outer loop Inner loop Outer loop ys = [ sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] map and

52 Programming in Triolet Python Nonuniform FT (real part) Inner loop Outer loop Inner loop Outer loop ys = [ sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] map and reduce style programming no new paradigm to learn Parallel details are implicit easy to use Automated data partitioning, MPI rank generation, MPI messaging, OpenMP, etc. SoC for HPC

53 Lirary-Driven Optimization The asic idea: Compilers are smart (inlining, unoxing,...) Software astractions have no overhead when the compiler can evaluate them statically Just write components as lirary code! Emody optimizations in the lirary code as alternative function implementations, use auto-tuning to select a final arrangement Triolet uses this approach [Rodrigues, et al 14] Prior work demonstrates loop fusion, shared-memory parallelism, vectorization [Coutts et al. 7, Keller et al. 1, Mainland et al. 13]

54 Parallel Loop Fusion Based on Keller et al. 1 sum(map(square, range(k))) range(k) There are k outputs The n th output is n map(square,...) sum(...) Loop over inputs The n th output is square(n th input) Add the n th input to result Fusion There are k iterations In each iteration, add square(n) to result

55 Triolet Productivity and Performance ys = [sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] Lirary functions factor out data decomposition, parallelism, and communication 128-way Speedup (16 cores 8 nodes) Triolet C with MPI+OpenMP C with MPI+OpenM!!"#$!%&"'#&()*+,!%&"'"-.!!/1'2)%%'+"345/1'26//'7689:,!;%&"'#&()*+<.!!/1'2)%%'(=#>5/1'26//'7689:,!;%&"'"-<.!!*)#+$!"#$!(=#>?!@!%&"'"-!@@!?.!!/1'A*=+$5;+"34'B,!C,!/1'1DE,!?,!/1'26//'7689:<.!!/1'A*=+$5;+"34'>,!C,!/1'1DE,!?,!/1'26//'7689:<.!!*)#+$!"#$!*FG#>'+"34'B!@!*4"H-"I5+"34'B,!%&"'#&()*+<.!!*)#+$!"#$!&=--4-'+"34'B!@!*FG#>'+"34'B!J!%&"'#&()*+.!!KH)=$!J>+,!JB+.!!"K!5(=#>?<!L!!!!>+!@!"#&G$'>+.!!!!B+!@!"#&G$'B+.!!M!!4H+4!L!!!!>+!@!%=HH)*5+"34'>!J!+"34)K5KH)=$<<.!!!!B+!@!%=HH)*5+"34'>!J!+"34)K5KH)=$<<.!!M!!KH)=$!J(+'*FG#>!@!%=HH)*5*FG#>'+"34'B!J!+"34)K5KH)=$<<.!!KH)=$!JN+'*FG#>!@!%=HH)*5*FG#>'+"34'B!J!+"34)K5KH)=$<<.!!"K!5(=#>?<!L!!!!"#$!#O)(>4(+!@!%&"'#&()*+PC.!!!!/1'84QG4+$!J(4Q+!@!%=HH)*5#O)(>4(+!J!R!J!+"34)K5/1'84QG4+$<<.!!!!"#$!O.!!!!K)(!5O!@!?.!O!S!#O)(>4(+.!OTT<!L!!!!!!"#$!O)(>4('"-!@!OTC.!!!!!!/1'1+4#-5>+,!+"34'>,!/1'U96VE,!O)(>4('"-,!!!!!!!!!!!!!!!!?,!/1'26//'7689:,!;(4Q+WOX<.!!!!!!/1'1+4#-5B+,!+"34'>,!/1'U96VE,!O)(>4('"-,!!!!!!!!!!!!!!!!?,!/1'26//'7689:,!;(4Q+W#O)(>4(+TOX<.!!!!!!/1'1+4#-5(+!T!O)(>4('"-J*FG#>'+"34'B,!*FG#>'+"34'B,!/1'U96VE,!O)(>4('"-,!!!!!!!!!!!!!!!!?,!/1'26//'7689:,!;(4Q+WYJ#O)(>4(+TOX<.!!!!M!!!!%4%*&N5(+'*FG#>,!(+,!*FG#>'+"34'B!J!+"34)K5KH)=$<<.!!!!/1'7="$=HH5#O)(>4(+JR,!(4Q+,!/1'ZEVE[Z\Z'1]D68\<.!!!!K(445(4Q+<.!!M!!4H+4!L!!!!/1'84*I5>+,!+"34'>,!/1'U96VE,!?,!!!!!!!!!!!!!?,!/1'26//'7689:,!/1'ZEVE[Z'1]D68\<.!!!!/1'84*I5B+,!+"34'>,!/1'U96VE,!?,!!!!!!!!!!!!!?,!/1'26//'7689:,!/1'ZEVE[Z'1]D68\<.!!!!/1'84*I5(+'*FG#>,!*FG#>'+"34'B,!/1'U96VE,!?,!!!!!!!!!!!!!?,!/1'26//'7689:,!/1'ZEVE[Z'1]D68\<.!!M!!^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^!!L!!!!"#$!". _&(=`%=!)%&!&=(=HH4H!K)(!+*F4-GH45+$=$"*<!!!!K)(!5"!@!?.!"!S!*FG#>'+"34'B.!"TT<!L!!!!!!KH)=$!+!@!?.!!!!!!"#$!a.!!!!!!K)(!5a!@!?.!a!S!+"34'>.!aTT<!!!!!!!!+!T@!B+WaX!J!*)+K5(+'*FG#>W"X!J!>+WaX<.!!!!!!N+'*FG#>W"X!@!+.!!!!M!!M!!^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^!!/1']=$F4(5N+'*FG#>,!*FG#>'+"34'B,!/1'U96VE,!N+,!*FG#>'+"34'B,!/1'U96VE,!!!!!!!!!!!!!?,!/1'26//'7689:<.!!K(445(+'*FG#><.!!K(445N+'*FG#><.!!"K!5(=#>?<!L!!!!K(445(+<.!!M!!4H+4!L!!!!K(445B+<.!!!!K(445>+<.!!M Setup Cleanup SoC for HPC

56 Conclusion Blue Waters and related HPC systems have demonstrated that GPU computing can have narrow ut deep impact on science New architectural innovations enale much easier work migration etween CPUs and accelerators New programming systems innovations enale much faster application development and lowered maintenance cost 56

57 IWC SE 21 Acknowledgements D. August (Princeton), S. Baghsorkhi (Illinois), N. Bell (NVIDIA), D. Callahan (Microsoft), J. Cohen (NVIDIA), B. Dally (Stanford), J. Demmel (Berkeley), P. Duey (Intel), M. Frank (Intel), M. Garland (NVIDIA), Isaac Gelado (BSC), M. Gschwind (IBM), R. Hank (Google), J. Hennessy (Stanford), P. Hanrahan (Stanford), M. Houston (AMD), T. Huang (Illinois), D. Kaeli (NEU), K. Keutzer (Berkeley), W. Kramer (NCSA), I. Gelado (NVIDIA), B. Gropp (Illinois), D. Kirk (NVIDIA), D. Kuck (Intel), S. Mahlke (Michigan), T. Mattson (Intel), N. Navarro (UPC), J. Owens (Davis), D. Padua (Illinois), S. Patel (Illinois), Y. Patt (Texas), D. Patterson (Berkeley), C. Rodrigues (Illinois), P. Rogers (AMD), S. Ryoo (ZeroSoft), B. Sander (AMD), K. Schulten (Illinois), B. Smith (Microsoft), M. Snir (Illinois), I. Sung (Illinois), P. Stenstrom (Chalmers), J. Stone (Illinois), S. Stone (Harvard) J. Stratton (Illinois), H. Takizawa (Tohoku), M. Valero (UPC) And many others!

Rethinking Computer Architecture for Energy Limited Computing

Rethinking Computer Architecture for Energy Limited Computing Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urana-Champaign Agenda Blue Waters and recent progress in petascale GPU computing