Moving Towards Exascale with Lessons Learned from GPU Computing. Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urbana-Champaign

Size: px
Start display at page:

Download "Moving Towards Exascale with Lessons Learned from GPU Computing. Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urbana-Champaign"

Transcription

1 Moving Towards Exascale with Lessons Learned from GPU Computing Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urana-Champaign

2 Agenda Blue Waters and recent progress in petascale GPU computing Upcoming innovations in exascale computing Architecture Programming Systems Conclusion and discussions

3 Blue Waters - Operational at Illinois since 3/ PF 1.6 PB DRAM $25M 12+ G/sec 1/4/1 G Ethernet Switch 1 GB/sec IB Switch >1 TB/sec WAN Spectra Logic: 3 PBs Sonexion: 26 PBs

4 Two Petascale Computing Systems NCSA ORNL System Attriute Blue Waters Titan Vendors Cray/AMD/NVIDIA Cray/AMD/NVIDIA Processors Interlagos/Kepler Interlagos/Kepler Total Peak Performance (PF) Total Peak Performance (CPU/GPU) 7.7/ /24.5 Numer of CPU Chips 49,54 18,688 Numer of GPU Chips 4,224 18,688 Amount of CPU Memory (TB) Interconnect 3D Torus 3D Torus Amount of On-line Disk Storage (PB) Sustained Disk Transfer (TB/sec) > Amount of Archival Storage Sustained Tape Transfer (GB/sec) 1 7

5 Blue Waters contains 4,224 Cray XK7 compute nodes. Cray XK7 Nodes Dual-socket Node One AMD Interlagos chip 8 core modules, 32 threads GFs peak performance 32 GBs memory 51 GB/s andwidth One NVIDIA Kepler chip 1.3 TFs peak performance 6 GBs GDDR5 memory 25 GB/sec andwidth Gemini Interconnect Same as XE6 nodes

6 Cray XE6 Nodes Blue Waters contains 22,64 Cray XE6 compute nodes. Dual-socket Node Two AMD Interlagos chips 16 core modules, 64 threads 313 GFs peak performance 64 GBs memory 12 GB/sec memory andwidth Gemini Interconnect Router chip & network interface Injection Bandwidth (peak) 9.6 GB/sec per direction

7 Science Area Numer of Teams Codes Struct Grids Unstruct Grids Dense Matrix Sparse Matrix N- Body Monte Carlo FFT PIC Sig I/O Climate and Weather 3 CESM, GCRM, CM1/WRF, HOMME X X X X X Plasmas/ Magnetosphere 2 H3D(M),VPIC, OSIRIS, Magtail/UPIC X X X X Stellar Atmospheres and Supernovae 5 PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS X X X X X X Cosmology 2 Enzo, pgadget X X X Comustion/ Turulence 2 PSDNS, DISTUF X X General Relativity 2 Cactus, Harm3D, LazEV X X Molecular Dynamics 4 AMBER, Gromacs, NAMD, LAMMPS X X X Quantum Chemistry 2 SIAL, GAMESS, NWChem X X X X X Material Science 3 NEMOS, OMEN, GW, QMCPACK Earthquakes/ Seismology 2 AWP-ODC, HERCULES, PLSQR, SPECFEM3D X X X X X X X X Quantum Chromo Dynamics 1 Chroma, MILC, USQCD X X X Social Networks 1 EPISIMDEMICS Evolution 1 Eve Engineering/System of Systems 1 GRIPS,Revisit X Computer Science 1 X X X X X

8 NAMD Initial Production Use Results 1 million atom enchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included 768 nodes, Kepler+Interlagos is 3.9X faster over Interlagos-only 768 nodes, XK7 is 1.8X XE6 Chroma Lattice QCD parameters: grid size of 48 3 x 512 running at the physical values of the quark masses 768 nodes, Kepler+Interlagos is 4.9X faster over Interlagos-only 768 nodes, XK7 is 2.4X XE6 QMCPACK Full run Graphite 4x4x1 (256 electrons), QMC followed y VMC 7 nodes, Kepler+Interlagos is 4.9X faster over Interlagos-only 7 nodes, XK7 is 2.7X XE6

9 Blue Waters Science Breakthrough Example Determination of the structure of the HIV capsid at atomic-level Collaorative effort of experimental groups, at the U. of Pittsurgh and Vanderilt U., and the Schulten s computational team at the U. of Illinois. 64-million-atom HIV capsid simulation of the process through which the capsid disassemles, releasing its genetic material, is a critical step in HIV infection and a potential target for antiviral drugs.

10 LESSON #1 Production GPU code often involve complex tradeoffs. 1

11 Tridiagonal Solver Implicit finite difference methods, cuic spline interpolation, pre-conditioners An algorithm to find a solution of Ax = d, where A is an n- y-n tridiagonal matrix and d is an n-element vector

12 A Efficient Storage Format for A a 1 c a 1 2 c c 1 c a c c 2 3 Stored as 3 1D arrays No need for column indices No need for row pointers 1 2 3

13 GPU Tridiagonal System Solver Building Blocks Hyrid Methods PCR-Thomas (Kim 211, Davidson 211) PCR-CR (CUSPARSE 212) Etc. CUSPARSE is supported y NVIDIA Thomas (sequential) Cyclic Reduction (1 step) PCR (1 step) a c a c a c e e e e a c a a c a c e e e e a c a c a c e e e e a c a c a a c c e e e e a c a c a c e e e e Interleaved layout after partitioni

14 GPU Performance Advantage Runtime of solving an 8M-row matrix (ms) Our SPIKE-diag_pivoting (GPU) Our SPIKE-Thomas (GPU) Random Diagonally dominant CUSPARSE (GPU) Data transfer (pageale) Data transfer (pinned) MKL dgtsv(sequential, CPU) SAMOS 213

15 Relative Backward Error Matrix type SPIKE-diag_pivoting SPIKE-Thomas CUSPARSE MKL Intel SPIKE Matla E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E-4 1.5E E E E E-5 3.9E E E-5 9.7E E E E E E E E E E E E-4 2.2E E E E E E E E E E E E E E E E E E E E E+6 Nan Nan 1.47E+59 Fail 3.69E Inf Nan Nan Inf Fail 4.68E+171 SA MO S

16 Numerically Stale Partitioning SPIKE (Polizzi et al), diagonal pivoting for numerical staility SX = Y (5) DY = F (6) AiYi = Fi (7) SAMOS 213

17 Fast Transposition on GPU Chang, et al, SC 212 SAMOS 213

18 Dynamic Tiling Runtime (ms) data layout only dynamic tiling (with data layout) x random diagonally dominant zero diagonal SAMOS 213

19 Chang, et al, SC 212 SAMOS 213 Relative Backward Error Matrix type SPIKE-diag_pivoting SPIKE-Thomas CUSPARSE MKL Intel SPIKE Matla E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E-4 1.5E E E E E-5 3.9E E E-5 9.7E E E E E E E E E E E E-4 2.2E E E E E E E E E E E E E E E E E E E E E+6 Nan Nan 1.47E+59 Fail 3.69E Inf Nan Nan Inf Fail 4.68E+171

20 Chang, et al, SC 212 SAMOS 213 GPU Performance Advantage Runtime of solving an 8M matrix (ms) Our SPIKE-diag_pivoting (GPU) Our SPIKE-Thomas (GPU) Random Diagonally dominant CUSPARSE (GPU) Data transfer (pageale) Data transfer (pinned) MKL dgtsv(sequential, CPU)

21 Some Important Trends Driven y moile market Computing devices are quickly ecome SOCs Integrated CPU-GPU-I/O with DRAM Channels Compute outside CPU cores gaining importance Data movement off-chip must e minimized CUDA and HSA (Heterogeneous System Architecture) are leading the trends New OS/runtime and programming systems are emerging to take advantage of these advances 21

22 NEW INNOVATIONS IN EXASCALE COMPUTING

23 UNIFIED ADDRESS SPACE AND SOC DISTRIBUTED COMPUTE for acceleration of at scale data communication and compute - Innovation in HSA and OS

24 State of the Art Data Transfer Behavior Main Memory (DRAM) CPU PCIe Network I/O with Compression engine Disk I/O DMA Device Memory GPU card (or other Accelerator cards)

25 Desired SoC Data Transfer Behavior Main Memory (DRAM) SOC LLC DMA Network I/O with compression engine CPU/GPU Disk I/O

26 UNIFIED COHERENT MEMORY for acceleration of pointer-ased data structures - Innovation in CUDA and HSA

27 Large 3D spatial data structure Legacy Access to Large Structures Legacy SYSTEM MEMORY GPU GPU MEMORY GPU Memory is smaller Have to copy and process in chunks

28 Large 3D spatial data structure Legacy Access to Large Structures Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 GPU MEMORY

29 Large 3D spatial data structure COPY One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL GPU MEMORY Copy of top 2 levels of hierarchy

30 Process One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 FIRST KERNEL GPU MEMORY

31 Process One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 FIRST KERNEL GPU MEMORY

32 Process One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 FIRST KERNEL GPU MEMORY

33 Copy One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL GPU MEMORY Copy of ottom 3 levels of one ranch of the hierarchy

34 Process One Chunk at a Time Legacy SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL SECOND KERNEL GPU MEMORY

35 Large 3D spatial data structure Large Spatial Data structure HSA and full OpenCL 2. SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

36 GPU can Traverse entire hierarchy HSA SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

37 GPU can Traverse entire hierarchy HSA SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

38 GPU can Traverse entire hierarchy HSA SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

39 GPU can Traverse entire hierarchy HSA SYSTEM MEMORY l 1 l 2 l 3 l 4 GPU l 5 KERNEL

40 LIBRARY-DRIVEN COMPILER OPTIMIZATION Enaling high-performance programming using high-level languages - Innovation in Triolet and Tangram Blue Waters SETAC Meeting - Fe 214 4

41 Current State of Programming Heterogeneous Systems CPU Multicore Xeon Phi GPU FPGA SoC for HPC

42 Current State of Programming Heterogeneous Systems C/FORTRA N CPU Multicore Xeon Phi GPU FPGA SoC for HPC

43 Current State of Programming Heterogeneous Systems C/FORTRAN + Directives (OpenMP/ TBB) CPU Multicore Xeon Phi GPU FPGA SoC for HPC

44 Current State of Programming Heterogeneous Systems C/FORTRAN + Directives (OpenMP/ TBB) + SIMD Intrinsics CPU Multicore Xeon Phi GPU FPGA SoC for HPC

45 Current State of Programming Heterogeneous Systems C/FORTRAN OpenCL + Directives (OpenMP/ TBB) + SIMD Intrinsics CPU Multicore Xeon Phi GPU FPGA SoC for HPC

46 Current State of Programming Heterogeneous Systems C/FORTRAN OpenCL + Directives (OpenMP/ TBB) + SIMD Intrinsics Verilog/VHDL CPU Multicore Xeon Phi GPU FPGA SoC for HPC

47 Programming heterogeneous systems requires too many versions of code! C/FORTRAN OpenCL + Directives (OpenMP/ TBB) + SIMD Intrinsics Verilog/VHDL CPU Multicore Xeon Phi GPU FPGA SoC for HPC

48 Triolet Triolet Compiler C/FORTRAN OpenCL + Directives (OpenMP/ TBB) + SIMD Intrinsics Verilog/VHDL CPU Multicore Xeon Phi GPU FPGA SoC for HPC

49 Programming in Triolet Python Nonuniform FT (real part) SoC for HPC

50 Programming in Triolet Python Nonuniform FT (real part) Inner loop Inner loop ys = [ sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] SoC for HPC

51 Programming in Triolet Python Nonuniform FT (real part) Inner loop Outer loop Inner loop Outer loop ys = [ sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] SoC for HPC

52 Programming in Triolet Python Nonuniform FT (real part) Inner loop Outer loop Inner loop Outer loop ys = [ sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] map and reduce style programming no new paradigm to learn Parallel details are implicit easy to use Automated data partitioning, MPI rank generation, MPI messaging, OpenMP, etc. SoC for HPC

53 Lirary-Driven Optimization The asic idea: Compilers are smart (inlining, unoxing,...) Software astractions have no overhead when the compiler can evaluate them statically Just write components as lirary code! Emody optimizations in the lirary code as alternative function implementations, use auto-tuning to select a final arrangement Triolet uses this approach [Rodrigues, et al 14] Prior work demonstrates loop fusion, shared-memory parallelism, vectorization [Coutts et al. 7, Keller et al. 1, Mainland et al. 13]

54 Parallel Loop Fusion Based on Keller et al. 1 sum(map(square, range(k))) range(k) There are k outputs The n th output is n map(square,...) sum(...) Loop over inputs The n th output is square(n th input) Add the n th input to result Fusion There are k iterations In each iteration, add square(n) to result

55 Triolet Productivity and Performance ys = [sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)] Lirary functions factor out data decomposition, parallelism, and communication 128-way Speedup (16 cores 8 nodes) Triolet C with MPI+OpenMP C with MPI+OpenM!!"#$!%&"'#&()*+,!%&"'"-.!!/1'2)%%'+"345/1'26//'7689:,!;%&"'#&()*+<.!!/1'2)%%'(=#>5/1'26//'7689:,!;%&"'"-<.!!*)#+$!"#$!(=#>?!@!%&"'"-!@@!?.!!/1'A*=+$5;+"34'B,!C,!/1'1DE,!?,!/1'26//'7689:<.!!/1'A*=+$5;+"34'>,!C,!/1'1DE,!?,!/1'26//'7689:<.!!*)#+$!"#$!*FG#>'+"34'B!@!*4"H-"I5+"34'B,!%&"'#&()*+<.!!*)#+$!"#$!&=--4-'+"34'B!@!*FG#>'+"34'B!J!%&"'#&()*+.!!KH)=$!J>+,!JB+.!!"K!5(=#>?<!L!!!!>+!@!"#&G$'>+.!!!!B+!@!"#&G$'B+.!!M!!4H+4!L!!!!>+!@!%=HH)*5+"34'>!J!+"34)K5KH)=$<<.!!!!B+!@!%=HH)*5+"34'>!J!+"34)K5KH)=$<<.!!M!!KH)=$!J(+'*FG#>!@!%=HH)*5*FG#>'+"34'B!J!+"34)K5KH)=$<<.!!KH)=$!JN+'*FG#>!@!%=HH)*5*FG#>'+"34'B!J!+"34)K5KH)=$<<.!!"K!5(=#>?<!L!!!!"#$!#O)(>4(+!@!%&"'#&()*+PC.!!!!/1'84QG4+$!J(4Q+!@!%=HH)*5#O)(>4(+!J!R!J!+"34)K5/1'84QG4+$<<.!!!!"#$!O.!!!!K)(!5O!@!?.!O!S!#O)(>4(+.!OTT<!L!!!!!!"#$!O)(>4('"-!@!OTC.!!!!!!/1'1+4#-5>+,!+"34'>,!/1'U96VE,!O)(>4('"-,!!!!!!!!!!!!!!!!?,!/1'26//'7689:,!;(4Q+WOX<.!!!!!!/1'1+4#-5B+,!+"34'>,!/1'U96VE,!O)(>4('"-,!!!!!!!!!!!!!!!!?,!/1'26//'7689:,!;(4Q+W#O)(>4(+TOX<.!!!!!!/1'1+4#-5(+!T!O)(>4('"-J*FG#>'+"34'B,!*FG#>'+"34'B,!/1'U96VE,!O)(>4('"-,!!!!!!!!!!!!!!!!?,!/1'26//'7689:,!;(4Q+WYJ#O)(>4(+TOX<.!!!!M!!!!%4%*&N5(+'*FG#>,!(+,!*FG#>'+"34'B!J!+"34)K5KH)=$<<.!!!!/1'7="$=HH5#O)(>4(+JR,!(4Q+,!/1'ZEVE[Z\Z'1]D68\<.!!!!K(445(4Q+<.!!M!!4H+4!L!!!!/1'84*I5>+,!+"34'>,!/1'U96VE,!?,!!!!!!!!!!!!!?,!/1'26//'7689:,!/1'ZEVE[Z'1]D68\<.!!!!/1'84*I5B+,!+"34'>,!/1'U96VE,!?,!!!!!!!!!!!!!?,!/1'26//'7689:,!/1'ZEVE[Z'1]D68\<.!!!!/1'84*I5(+'*FG#>,!*FG#>'+"34'B,!/1'U96VE,!?,!!!!!!!!!!!!!?,!/1'26//'7689:,!/1'ZEVE[Z'1]D68\<.!!M!!^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^!!L!!!!"#$!". _&(=`%=!)%&!&=(=HH4H!K)(!+*F4-GH45+$=$"*<!!!!K)(!5"!@!?.!"!S!*FG#>'+"34'B.!"TT<!L!!!!!!KH)=$!+!@!?.!!!!!!"#$!a.!!!!!!K)(!5a!@!?.!a!S!+"34'>.!aTT<!!!!!!!!+!T@!B+WaX!J!*)+K5(+'*FG#>W"X!J!>+WaX<.!!!!!!N+'*FG#>W"X!@!+.!!!!M!!M!!^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^!!/1']=$F4(5N+'*FG#>,!*FG#>'+"34'B,!/1'U96VE,!N+,!*FG#>'+"34'B,!/1'U96VE,!!!!!!!!!!!!!?,!/1'26//'7689:<.!!K(445(+'*FG#><.!!K(445N+'*FG#><.!!"K!5(=#>?<!L!!!!K(445(+<.!!M!!4H+4!L!!!!K(445B+<.!!!!K(445>+<.!!M Setup Cleanup SoC for HPC

56 Conclusion Blue Waters and related HPC systems have demonstrated that GPU computing can have narrow ut deep impact on science New architectural innovations enale much easier work migration etween CPUs and accelerators New programming systems innovations enale much faster application development and lowered maintenance cost 56

57 IWC SE 21 Acknowledgements D. August (Princeton), S. Baghsorkhi (Illinois), N. Bell (NVIDIA), D. Callahan (Microsoft), J. Cohen (NVIDIA), B. Dally (Stanford), J. Demmel (Berkeley), P. Duey (Intel), M. Frank (Intel), M. Garland (NVIDIA), Isaac Gelado (BSC), M. Gschwind (IBM), R. Hank (Google), J. Hennessy (Stanford), P. Hanrahan (Stanford), M. Houston (AMD), T. Huang (Illinois), D. Kaeli (NEU), K. Keutzer (Berkeley), W. Kramer (NCSA), I. Gelado (NVIDIA), B. Gropp (Illinois), D. Kirk (NVIDIA), D. Kuck (Intel), S. Mahlke (Michigan), T. Mattson (Intel), N. Navarro (UPC), J. Owens (Davis), D. Padua (Illinois), S. Patel (Illinois), Y. Patt (Texas), D. Patterson (Berkeley), C. Rodrigues (Illinois), P. Rogers (AMD), S. Ryoo (ZeroSoft), B. Sander (AMD), K. Schulten (Illinois), B. Smith (Microsoft), M. Snir (Illinois), I. Sung (Illinois), P. Stenstrom (Chalmers), J. Stone (Illinois), S. Stone (Harvard) J. Stratton (Illinois), H. Takizawa (Tohoku), M. Valero (UPC) And many others!

Rethinking Computer Architecture for Energy Limited Computing

Rethinking Computer Architecture for Energy Limited Computing Rethinking Computer Architecture for Energy Limited Computing Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urana-Champaign Agenda Blue Waters and recent progress in petascale GPU computing

More information

Rethinking Computer Architecture for Throughput Computing

Rethinking Computer Architecture for Throughput Computing Rethinking Computer Architecture for Throughput Computing Wen-mei Hwu University of Illinois, Urbana-Champaign MulticoreWare, Inc. A Few Thoughts on Computer Architecture While in Samos Where are we today?

More information

Scalability, Portability, and Productivity in GPU Computing

Scalability, Portability, and Productivity in GPU Computing Scalability, Portability, and Productivity in GPU Computing Wen-mei Hwu Sanders AMD Chair, ECE and CS University of Illinois, Urbana-Champaign CTO, MulticoreWare Agenda Performance experience in using

More information

Scalability, Portability, and Productivity in GPU Computing

Scalability, Portability, and Productivity in GPU Computing Scalability, Portability, and Productivity in GPU Computing Wen-mei Hwu Sanders AMD Chair, ECE and CS University of Illinois, Urbana-Champaign CTO, MulticoreWare Agenda 4,224 Kepler GPUs in Blue Waters

More information

Why are GPUs so hard to program or are they? Wen-mei Hwu University of Illinois, Urbana-Champaign MulticoreWare

Why are GPUs so hard to program or are they? Wen-mei Hwu University of Illinois, Urbana-Champaign MulticoreWare Why are GPUs so hard to program or are they? Wen-mei Hwu University of Illinois, Urbana-Champaign MulticoreWare Agenda GPU Computing in Blue Waters Library Algorithms Scalability, performance, and numerical

More information

APPLICATION OF ACCELERATORS

APPLICATION OF ACCELERATORS APPLICAION OF ACCELERAORS AND PARALLEL COMPUING O SEQUENCING DAA 1 Wen-mei W. Hwu University of Illinois at Urbana-Champaign With Deming Chen and Jian Ma Blue Waters Computing System Fully Operational

More information

GPU Supercomputing From Blue Waters to Exascale

GPU Supercomputing From Blue Waters to Exascale GPU Supercomputing From Blue Waters to Exascale Wen-mei Hwu Professor, University of Illinois at Urbana-Champaign (UIUC) CTO, MulticoreWare Inc. New BW Configuration Cray System & Storage cabinets: Compute

More information

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing Wen-mei W. Hwu with Tom Jablin, Chris Rodrigues, Liwen Chang, Steven ShengZhou Wu, Abdul

More information

The Next Generation of High Performance Computing. William Gropp

The Next Generation of High Performance Computing. William Gropp The Next Generation of High Performance Computing William Gropp www.cs.illinois.edu/~wgropp Extrapolation is Risky 1989 T 23 years Intel introduces 486DX Eugene Brooks writes Attack of the Killer Micros

More information

Triolet C++/Python productive programming in heterogeneous parallel systems

Triolet C++/Python productive programming in heterogeneous parallel systems Triolet C++/Python productive programming in heterogeneous parallel systems Wen-mei Hwu Sanders AMD Chair, ECE and CS University of Illinois, Urbana-Champaign CTO, MulticoreWare Agenda Performance portability

More information

The Next Generation of High Performance Computing. William Gropp

The Next Generation of High Performance Computing. William Gropp The Next Generation of High Performance Computing William Gropp www.cs.illinois.edu/~wgropp Extrapolation is Risky 1989 T 23 years Intel introduces 486DX Eugene Brooks writes Attack of the Killer Micros

More information

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- How to Build a gtsv for Performance

More information

Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters

Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters Torsten Hoefler With lots of help from the AUS Team and Bill Kramer at NCSA! All used images belong

More information

Raising the Level of Many-core Programming with Compiler Technology

Raising the Level of Many-core Programming with Compiler Technology Raising the Level of Many-core Programming with Compiler Technology meeting a grand challenge Wen-mei Hwu University of Illinois, Urbana-Champaign Agenda Many-core Usage Pattern and Workflow Using Kernels

More information

What is driving heterogeneity in HPC?

What is driving heterogeneity in HPC? What is driving heterogeneity in HPC? Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign with Simon Garcia, and Carl Pearson 1 Agenda RevoluHonary paradigm

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies

Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

Lecture 20: Distributed Memory Parallelism. William Gropp

Lecture 20: Distributed Memory Parallelism. William Gropp Lecture 20: Distributed Parallelism William Gropp www.cs.illinois.edu/~wgropp A Very Short, Very Introductory Introduction We start with a short introduction to parallel computing from scratch in order

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

Accelerating High Performance Computing.

Accelerating High Performance Computing. Accelerating High Performance Computing http://www.nvidia.com/tesla Computing The 3 rd Pillar of Science Drug Design Molecular Dynamics Seismic Imaging Reverse Time Migration Automotive Design Computational

More information

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu Bigger GPUs and Bigger Nodes Carl Pearson (pearson@illinois.edu) PhD Candidate, advised by Professor Wen-Mei Hwu 1 Outline Experiences from working with domain experts to develop GPU codes on Blue Waters

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences

More information

GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester

GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester NVIDIA GPU Computing A Revolution in High Performance Computing GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester John Ashley Senior Solutions Architect

More information

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications TESLA P PERFORMANCE GUIDE Deep Learning and HPC Applications SEPTEMBER 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

Early Experiences Writing Performance Portable OpenMP 4 Codes

Early Experiences Writing Performance Portable OpenMP 4 Codes Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic

More information

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications TESLA P PERFORMANCE GUIDE HPC and Deep Learning Applications MAY 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

Challenges in High Performance Computing. William Gropp

Challenges in High Performance Computing. William Gropp Challenges in High Performance Computing William Gropp www.cs.illinois.edu/~wgropp 2 What is HPC? High Performance Computing is the use of computing to solve challenging problems that require significant

More information

GPU Consideration for Next Generation Weather (and Climate) Simulations

GPU Consideration for Next Generation Weather (and Climate) Simulations GPU Consideration for Next Generation Weather (and Climate) Simulations Oliver Fuhrer 1, Tobias Gisy 2, Xavier Lapillonne 3, Will Sawyer 4, Ugo Varetto 4, Mauro Bianco 4, David Müller 2, and Thomas C.

More information

Experiences with ENZO on the Intel Many Integrated Core Architecture

Experiences with ENZO on the Intel Many Integrated Core Architecture Experiences with ENZO on the Intel Many Integrated Core Architecture Dr. Robert Harkness National Institute for Computational Sciences April 10th, 2012 Overview ENZO applications at petascale ENZO and

More information

Building NVLink for Developers

Building NVLink for Developers Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized

More information

Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.

Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc. Debugging CUDA Applications with Allinea DDT Ian Lumb Sr. Systems Engineer, Allinea Software Inc. ilumb@allinea.com GTC 2013, San Jose, March 20, 2013 Embracing GPUs GPUs a rival to traditional processors

More information

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Sadaf Alam & Thomas Schulthess CSCS & ETHzürich CUG 2014 * Timelines & releases are not precise Top 500

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

Stockholm Brain Institute Blue Gene/L

Stockholm Brain Institute Blue Gene/L Stockholm Brain Institute Blue Gene/L 1 Stockholm Brain Institute Blue Gene/L 2 IBM Systems & Technology Group and IBM Research IBM Blue Gene /P - An Overview of a Petaflop Capable System Carl G. Tengwall

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

Present and Future Leadership Computers at OLCF

Present and Future Leadership Computers at OLCF Present and Future Leadership Computers at OLCF Al Geist ORNL Corporate Fellow DOE Data/Viz PI Meeting January 13-15, 2015 Walnut Creek, CA ORNL is managed by UT-Battelle for the US Department of Energy

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Blue Waters System Overview. Greg Bauer

Blue Waters System Overview. Greg Bauer Blue Waters System Overview Greg Bauer The Blue Waters EcoSystem Petascale EducaIon, Industry and Outreach Petascale ApplicaIons (CompuIng Resource AllocaIons) Petascale ApplicaIon CollaboraIon Team Support

More information

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

An Introduction to OpenACC

An Introduction to OpenACC An Introduction to OpenACC Alistair Hart Cray Exascale Research Initiative Europe 3 Timetable Day 1: Wednesday 29th August 2012 13:00 Welcome and overview 13:15 Session 1: An Introduction to OpenACC 13:15

More information

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic

More information

Blue Waters Super System

Blue Waters Super System Blue Waters Super System Michelle Butler 4/12/12 NCSA Has Completed a Grand Challenge In August, IBM terminated their contract to deliver the base Blue Waters system NSF asked NCSA to propose a change

More information

HPC Saudi Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences. Presented to: March 14, 2017

HPC Saudi Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences. Presented to: March 14, 2017 Creating an Exascale Ecosystem for Science Presented to: HPC Saudi 2017 Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences March 14, 2017 ORNL is managed by UT-Battelle

More information

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015 PERFORMANCE PORTABILITY WITH OPENACC Jeff Larkin, NVIDIA, November 2015 TWO TYPES OF PORTABILITY FUNCTIONAL PORTABILITY PERFORMANCE PORTABILITY The ability for a single code to run anywhere. The ability

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Exascale Challenges and Applications Initiatives for Earth System Modeling

Exascale Challenges and Applications Initiatives for Earth System Modeling Exascale Challenges and Applications Initiatives for Earth System Modeling Workshop on Weather and Climate Prediction on Next Generation Supercomputers 22-25 October 2012 Tom Edwards tedwards@cray.com

More information

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015 ACCELERATED COMPUTING: THE PATH FORWARD Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015 COMMODITY DISRUPTS CUSTOM SOURCE: Top500 ACCELERATED COMPUTING: THE PATH FORWARD It s time to start

More information

RECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016

RECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016 RECENT TRENDS IN GPU ARCHITECTURES Perspectives of GPU computing in Science, 26 th Sept 2016 NVIDIA THE AI COMPUTING COMPANY GPU Computing Computer Graphics Artificial Intelligence 2 NVIDIA POWERS WORLD

More information

Accelerating Molecular Modeling Applications with Graphics Processors

Accelerating Molecular Modeling Applications with Graphics Processors Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference

More information

Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms

Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms http://www.scidacreview.org/0902/images/esg13.jpg Matthew Norman Jeffrey Larkin Richard Archibald Valentine Anantharaj

More information

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber NERSC Site Update National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory Richard Gerber NERSC Senior Science Advisor High Performance Computing Department Head Cori

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

Real Parallel Computers

Real Parallel Computers Real Parallel Computers Modular data centers Background Information Recent trends in the marketplace of high performance computing Strohmaier, Dongarra, Meuer, Simon Parallel Computing 2005 Short history

More information

Technologies and application performance. Marc Mendez-Bermond HPC Solutions Expert - Dell Technologies September 2017

Technologies and application performance. Marc Mendez-Bermond HPC Solutions Expert - Dell Technologies September 2017 Technologies and application performance Marc Mendez-Bermond HPC Solutions Expert - Dell Technologies September 2017 The landscape is changing We are no longer in the general purpose era the argument of

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear

More information

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Fast Tridiagonal Solvers on GPU

Fast Tridiagonal Solvers on GPU Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

Optimising the Mantevo benchmark suite for multi- and many-core architectures

Optimising the Mantevo benchmark suite for multi- and many-core architectures Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA

Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

HPC future trends from a science perspective

HPC future trends from a science perspective HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA

More information

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles

More information

UCX: An Open Source Framework for HPC Network APIs and Beyond

UCX: An Open Source Framework for HPC Network APIs and Beyond UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview

More information

Real Parallel Computers

Real Parallel Computers Real Parallel Computers Modular data centers Overview Short history of parallel machines Cluster computing Blue Gene supercomputer Performance development, top-500 DAS: Distributed supercomputing Short

More information

Exploring Emerging Technologies in the Extreme Scale HPC Co- Design Space with Aspen

Exploring Emerging Technologies in the Extreme Scale HPC Co- Design Space with Aspen Exploring Emerging Technologies in the Extreme Scale HPC Co- Design Space with Aspen Jeffrey S. Vetter SPPEXA Symposium Munich 26 Jan 2016 ORNL is managed by UT-Battelle for the US Department of Energy

More information

Analysis and Visualization Algorithms in VMD

Analysis and Visualization Algorithms in VMD 1 Analysis and Visualization Algorithms in VMD David Hardy Research/~dhardy/ NAIS: State-of-the-Art Algorithms for Molecular Dynamics (Presenting the work of John Stone.) VMD Visual Molecular Dynamics

More information

TACC s Stampede Project: Intel MIC for Simulation and Data-Intensive Computing

TACC s Stampede Project: Intel MIC for Simulation and Data-Intensive Computing TACC s Stampede Project: Intel MIC for Simulation and Data-Intensive Computing Jay Boisseau, Director April 17, 2012 TACC Vision & Strategy Provide the most powerful, capable computing technologies and

More information

Execution Models for the Exascale Era

Execution Models for the Exascale Era Execution Models for the Exascale Era Nicholas J. Wright Advanced Technology Group, NERSC/LBNL njwright@lbl.gov Programming weather, climate, and earth- system models on heterogeneous muli- core plajorms

More information

In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD

In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University

More information

GPU computing at RZG overview & some early performance results. Markus Rampp

GPU computing at RZG overview & some early performance results. Markus Rampp GPU computing at RZG overview & some early performance results Markus Rampp Introduction Outline Hydra configuration overview GPU software environment Benchmarking and porting activities Team Renate Dohmen

More information

NVIDIA Update and Directions on GPU Acceleration for Earth System Models

NVIDIA Update and Directions on GPU Acceleration for Earth System Models NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,

More information

Amazon Web Services: Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud

Amazon Web Services: Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud Amazon Web Services: Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud Summarized by: Michael Riera 9/17/2011 University of Central Florida CDA5532 Agenda

More information

How HPC Hardware and Software are Evolving Towards Exascale

How HPC Hardware and Software are Evolving Towards Exascale How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview

More information

TESLA V100 PERFORMANCE GUIDE May 2018

TESLA V100 PERFORMANCE GUIDE May 2018 TESLA V100 PERFORMANCE GUIDE May 2018 TESLA V100 The Fastest and Most Productive GPU for AI and HPC Volta Architecture Tensor Core Improved NVLink & HBM2 Volta MPS Improved SIMT Model Most Productive GPU

More information

HPC Issues for DFT Calculations. Adrian Jackson EPCC

HPC Issues for DFT Calculations. Adrian Jackson EPCC HC Issues for DFT Calculations Adrian Jackson ECC Scientific Simulation Simulation fast becoming 4 th pillar of science Observation, Theory, Experimentation, Simulation Explore universe through simulation

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012

More information

Introduction to the Intel Xeon Phi on Stampede

Introduction to the Intel Xeon Phi on Stampede June 10, 2014 Introduction to the Intel Xeon Phi on Stampede John Cazes Texas Advanced Computing Center Stampede - High Level Overview Base Cluster (Dell/Intel/Mellanox): Intel Sandy Bridge processors

More information

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Big Data Systems on Future Hardware. Bingsheng He NUS Computing Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big

More information