The Parallel Revolution in Computational Science and Engineering
Applications, Education, Tools, and Impact
Wen-mei Hwu, University of Illinois at Urbana-Champaign
The Energy Behind the Parallel Revolution
- Calculation: 1 TFLOPS (GPU) vs. 100 GFLOPS (CPU)
- Memory bandwidth: 100-150 GB/s vs. 32-64 GB/s
- A 3-year shift
- Multi-core and GPU in every PC: massive volume and potential impact
Courtesy: John Owens
Applications Entry Timeframes
App developers want at least a 3X-5X speedup for end-user perceived value-add.
[Chart: performance vs. time in 24-month generations. Multi-core CPUs (2, 4, 8, 16 cores at 50, 100, 200, 400 GF) reach the apps entry point in 2011; many-core GPUs (G80 at 16 cores/500 GF, G280 at 32 cores/1 TF, then G380 and Larrabee at 64-128 cores/2-4 TF) reached the apps entry point in 2008.]
What are these applications?
- 146X: Interactive visualization of volumetric white matter connectivity
- 36X: Ionic placement for molecular dynamics simulation on GPU
- 19X: Transcoding HD video stream to H.264
- 17X: Simulation in Matlab using a .mex-file CUDA function
- 100X: Astrophysics N-body simulation
- 149X: Financial simulation of LIBOR model with swaptions
- 47X: GLAME@lab, an M-script API for linear algebra operations on GPU
- 20X: Ultrasound medical imaging for cancer diagnostics
- 24X: Highly optimized object-oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences
Courtesy NVIDIA
GPU Graphics Mode
The future of GPUs is programmable processing, so build the architecture around the processor.
[Block diagram: host, input assembler, vertex/geometry/pixel thread issue and setup/raster/ZCull stages feeding an array of streaming processors (SP) with texture fetch (TF) units and L1 caches, backed by L2 caches and frame buffer (FB) partitions.]
GPU CUDA Mode
Processors execute computing threads: a new operating mode and hardware interface for computing.
[Block diagram: host, input assembler, and thread execution manager feeding processor arrays with parallel data caches and texture units, connected through load/store paths to global memory.]
UIUC/NCSA AC Cluster (http://www.ncsa.uiuc.edu/projects/gpucluster/)
- 32 nodes at NCSA, each with 4 GPUs (GTX280, Tesla), 1 FPGA, and 1 quad-core Opteron
  - GPUs donated by NVIDIA; FPGAs donated by Xilinx
  - 128 TFLOPS single precision, 10 TFLOPS double precision
- Coulomb summation: 1.78 TFLOPS/node; 271x speedup vs. an Intel QX6700 CPU core with SSE
A partnership between NCSA and academic departments.
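As an aside, a minimal sketch of a direct Coulomb summation kernel of the kind benchmarked above (the structure, names, and 2D-slice formulation are assumptions for illustration; the actual AC-cluster kernel adds further optimizations such as constant-memory atom tiles):

```cuda
// Minimal direct Coulomb summation sketch (illustrative; the production
// kernel tiles atom data through constant memory and unrolls the loop).
// Each thread computes the electrostatic potential at one point of a 2D slice.
__global__ void coulombPotential(const float4 *atoms,  // x, y, z, charge
                                 int numAtoms,
                                 float *potential,     // gridW x gridH slice
                                 int gridW, int gridH, float spacing)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= gridW || iy >= gridH) return;

    float x = ix * spacing, y = iy * spacing;
    float pot = 0.0f;
    for (int i = 0; i < numAtoms; ++i) {
        float dx = x - atoms[i].x;
        float dy = y - atoms[i].y;
        float dz = atoms[i].z;             // atom height above the slice
        pot += atoms[i].w * rsqrtf(dx * dx + dy * dy + dz * dz);
    }
    potential[iy * gridW + ix] = pot;
}
```

Every thread reads the full atom list, so arithmetic throughput rather than memory bandwidth dominates, which is why this computation scales so well on GPUs.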
Multi-GPU MPI Implementation
GPU cluster: cluster manager and login server, 24-port Topspin 120 Server InfiniBand switch, Netgear ProSafe 24-port Gigabit Ethernet switch, HP xw9400 workstations with NVIDIA Quadro Plex Model IV modules, Lustre file system servers, and 16 cluster compute nodes.
[Plot: performance scaling of the two-point angular correlation code; execution time (sec, log scale 10-1000) vs. number of MPI threads (0-50).]
CUDA: No More Shader Functions
A CUDA application is an integrated CPU+GPU C program: serial or modestly parallel C code executes on the CPU, while highly parallel SPMD kernel C code executes on the GPU.
CPU serial code → GPU parallel kernel (Grid 0): KernelA<<< nblk, ntid >>>(args); → CPU serial code → GPU parallel kernel (Grid 1): KernelB<<< nblk, ntid >>>(args); ...
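A minimal, self-contained sketch of this pattern (the kernel `scale`, its arguments, and the launch parameters are illustrative, not from the slide): serial CPU code sets up data, a grid of SPMD threads runs on the GPU, then serial CPU code resumes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SPMD kernel: every thread in the grid runs this body on its own element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));   // serial CPU code: setup

    int ntid = 256;                           // threads per block
    int nblk = (n + ntid - 1) / ntid;         // blocks per grid
    scale<<<nblk, ntid>>>(d_data, 2.0f, n);   // Grid 0: parallel kernel on GPU
    cudaDeviceSynchronize();                  // back to serial CPU code

    cudaFree(d_data);
    return 0;
}
```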
DRAM Bandwidth Trends Set the Programming Agenda
Random-access bandwidth is 1.2% of peak for DDR3-1600 and 0.8% for GDDR4-1600 (and falling). 3D stacking and optical interconnects are unlikely to help.
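A rough back-of-the-envelope illustration (the access size and row-cycle time are assumed numbers, not from the slide) of why random access lands near 1% of peak: if every isolated random 8-byte access forces a full DRAM row cycle of roughly 50 ns, then

$$\mathrm{BW}_{\text{random}} \approx \frac{8~\text{B}}{50~\text{ns}} = 0.16~\text{GB/s} \approx 1.3\%~\text{of the}~12.8~\text{GB/s peak of a DDR3-1600 channel.}$$

Sequential bursts amortize the row cycle over many transfers; random accesses pay it on every word, which is why coalescing and data layout dominate the programming agenda.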
It is all about applications!
Illinois CUDA Center of Excellence and the IACAT Community
parallel.illinois.edu
NAMD Overlapping Execution (Phillips et al., SC2002)
[Diagram: example configuration of 847 objects, 100,000 atoms, with force computation offloaded to the GPU.]
Objects are assigned to processors and queued as data arrives.
Klaus Schulten and team; over 29,000 registered users.
Actual Timelines from NAMD
Generated using the Charm++ tool Projections.
[Timeline: CPU and GPU rows exchanging coordinates (x) and forces (f); the GPU computes remote forces, then local forces, overlapped with CPU work.]
NAMD on the QP Cluster
[Bar chart: total application time (seconds per step, 0-2 axis) for CPU-only vs. with-GPU runs at 4, 8, 16, 32, and 60 nodes of 2.4 GHz Opterons with Quadro FX 5600 GPUs; the first bars run off scale, labeled 6.76 s (CPU only) and 3.33 s (with GPU). A marked subset of each GPU bar shows the amount of time the GPU is actually doing work; the GPU runs are consistently faster.]
VMD: Visual Molecular Dynamics (120,000 registered users)
Visualization of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry data, and particle systems.
http://www.ks.uiuc.edu/research/vmd/
Molecular Orbital Computation/Display in Molekel, MacMolPlt, and VMD
[Chart: throughput in 10^3 grid points/sec; larger numbers indicate higher performance.]
J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, and K. Schulten, "High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs," 2nd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-2), in press.
Whole-cell Diffusion Modeling
The Zan Luthey-Schulten group (Chemistry, UIUC) is using CUDA to simulate reaction-diffusion in three dimensions on a lattice. GPUs and CUDA enable simulations at cellular length and time scales, enabling study of stochastic biochemical networks in vivo.
Simulations sample the cell state's probability distribution according to the reaction-diffusion master equation. Using size and population distributions from proteomic data, an approximation of an in vivo cellular environment is constructed on a lattice.
[Left: a cell model in which 30% of the total volume is occupied by obstacles. Particles diffuse around the stationary obstacles, reacting according to kinetic rates.]
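For reference, one standard form of the reaction-diffusion master equation being sampled (notation assumed here, not from the slide): $\mathbf{x}$ is the lattice state with $x_\nu^s$ copies of species $s$ in subvolume $\nu$, $a_r$ and $\mathbf{s}_r$ are the propensity and stoichiometry of reaction $r$, $d_s$ is the diffusive jump rate, and $N(\nu)$ are the neighbors of $\nu$:

$$\frac{dP(\mathbf{x},t)}{dt} = \sum_{\nu}\sum_{r}\Bigl[a_r(\mathbf{x}_\nu-\mathbf{s}_r)\,P(\mathbf{x}_\nu-\mathbf{s}_r,t)-a_r(\mathbf{x}_\nu)\,P(\mathbf{x},t)\Bigr] + \sum_{\nu}\sum_{\xi\in N(\nu)}\sum_{s} d_s\Bigl[(x_\xi^{s}+1)\,P(\mathbf{x}+\mathbf{1}_\xi^{s}-\mathbf{1}_\nu^{s},t)-x_\nu^{s}\,P(\mathbf{x},t)\Bigr]$$

The first sum covers reactions within each subvolume; the second covers diffusive jumps of single molecules between neighboring subvolumes.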
Scalability on GPUs
Roberts, Stone, Sepulveda, Hwu, and Luthey-Schulten (2009), The Eighth IEEE International Workshop on High Performance Computational Biology, in press.
Large Eddy Simulations (LES) on GPUs using CUDA
- 3D incompressible Navier-Stokes equations, solved using the fractional-step method
- Geometric multigrid for the pressure-Poisson solver
- Smagorinsky sub-grid scale model
- Red-Black Gauss-Seidel for the linear solver (see the sketch below)
Aaron Shinn, Pratap Vanka, Wen-mei Hwu
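A minimal sketch of one red-black Gauss-Seidel sweep for a 2D Poisson problem (the 2D five-point stencil and the names here are assumptions for illustration; the actual solver is 3D and sits inside a multigrid cycle). Points are colored like a checkerboard so that all points of one color can be updated in parallel:

```cuda
// One red-black Gauss-Seidel sweep for the discrete Poisson equation
// (p_E + p_W + p_N + p_S - 4p) / h^2 = rhs on an nx-by-ny grid.
// color = 0 updates "red" points, color = 1 updates "black" points;
// each point depends only on neighbors of the opposite color, so all
// points of one color can be updated concurrently without races.
__global__ void rbgsSweep(float *p, const float *rhs,
                          int nx, int ny, float h2, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;   // skip the other color this sweep

    int idx = j * nx + i;
    p[idx] = 0.25f * (p[idx - 1] + p[idx + 1] +
                      p[idx - nx] + p[idx + nx] - h2 * rhs[idx]);
}
```

Each iteration launches the kernel twice, once per color, which preserves the Gauss-Seidel convergence behavior while exposing massive parallelism.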
Turbulent Flow in a 3D Square Duct
[Figures: instantaneous flow in a cross-section; mean flow in a cross-section.]
GPU (Tesla C1060) versus 3.0 GHz Intel Xeon Core
[Performance comparisons over the first 100 time steps: a laminar lid-driven cube simulation and turbulent flow in a square duct.]
Cosmological Data Analysis: Two-Point Angular Correlation Function
[Diagram: the observed dataset and random datasets 1 through n feed the two-point angular correlation function.]
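The slides do not spell out the estimator; a commonly used choice for this computation (assumed here for concreteness) is the Landy-Szalay estimator, built from histograms of data-data (DD), data-random (DR), and random-random (RR) pair counts at angular separation θ:

$$\hat{\omega}(\theta) = \frac{DD(\theta) - 2\,DR(\theta) + RR(\theta)}{RR(\theta)}$$

Computing these pair-count histograms over all point pairs is the expensive, highly parallel step, which is what the GPU implementation accelerates.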
Performance on a Single GPU
[Plots: kernel execution time (sec, log scale from 1 to 10^5) and speedup vs. CPU (up to ~250x) as functions of dataset size (1,000 to 100,000), for an AMD Opteron, a Quadro FX 5600, and a GeForce GTX 280 in single and double precision.]
Dylan Roeh, Volodymyr V. Kindratenko, Robert J. Brunner
Tools are coming. There is always hope.
On the eve of the Battle of Helm's Deep
Auto-tuning: Tools Do the Heavy Lifting!
- Programmers are doing too much heavy lifting: too many memory organizational details are exposed to them (example: Sum of Absolute Differences).
- A Pareto-optimal curve based on an analytical performance model of the source code allows automatic identification of the optimal code arrangement.
S. Ryoo et al., "Program Optimization Space Pruning for a Multithreaded GPU," CGO, April 2008.
A Tool Stack for CUDA Programming
- CUDA-auto: implicitly parallel programming with data-structure and function property annotations to enable auto-parallelization
- CUDA-lite: locality-annotation programming to eliminate the need for explicit management of memory types and data transfers
- CUDA-tune: parameterized CUDA programming using auto-tuning and optimization space pruning
- NVIDIA SDK 1.1: 1st-generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers
- Targets: NVIDIA GPUs directly, and IA multi-core & Larrabee via MCUDA/OpenMP
Towards Systematic Code Reuse: MRI Example
Two main approaches to MRI reconstruction:
1. Riemann approximation of the continuous inverse Fourier transform:
$$\hat{\rho} = F^{H} W d, \qquad \hat{\rho}(\mathbf{x}) = \sum_{m=1}^{M} w_m\, d_m\, e^{\,i 2\pi \mathbf{k}_m \cdot \mathbf{x}}$$
2. Solving a regularized inverse problem, e.g.,
$$\hat{\rho} = \arg\min_{\rho} \left\| F\rho - d \right\|_2^2 + R(\rho)$$
Solutions are often derived by solving one or more matrix inversions, e.g.,
$$\hat{\rho} = \left(F^{H} F + W^{H} W\right)^{-1} F^{H} d$$
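A minimal sketch of the Riemann-sum reconstruction above as a CUDA kernel (structure and names assumed for illustration; a production kernel would tile the k-space data through constant memory and restructure the loop). Each thread accumulates one voxel of the complex image over all M k-space samples:

```cuda
// Illustrative inverse-FT summation kernel. Each thread computes one voxel:
// rho_hat(x_n) = sum_m (w_m * d_m) * exp(i * 2*pi * k_m . x_n)
struct KSample { float kx, ky, kz, rRe, rIm; };  // k-coords and w_m * d_m

__global__ void invFT(const KSample *ks, int M,
                      const float *x, const float *y, const float *z,
                      float *rhoRe, float *rhoIm, int N)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;

    float re = 0.0f, im = 0.0f;
    for (int m = 0; m < M; ++m) {
        float arg = 2.0f * 3.14159265f *
                    (ks[m].kx * x[n] + ks[m].ky * y[n] + ks[m].kz * z[n]);
        float s, c;
        sincosf(arg, &s, &c);
        // complex accumulate: (rRe + i*rIm) * (cos(arg) + i*sin(arg))
        re += ks[m].rRe * c - ks[m].rIm * s;
        im += ks[m].rRe * s + ks[m].rIm * c;
    }
    rhoRe[n] = re;
    rhoIm[n] = im;
}
```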
Towards a Reusable Code Base
Certain computations appear frequently in inverse problems, e.g., matrix inversion and matrix multiplication. Yet it is rare that two similar applications have exactly the same structure, e.g.,
$$\left(S^{H} Q S + \lambda I\right)^{-1} S^{H} d \qquad \text{vs.} \qquad \left(W^{H} D W + \lambda I\right)^{-1} W^{H} d$$
with Q Toeplitz, S sparse, D diagonal, and W a contourlet transform.
Better library interfaces and tools are needed to allow for easy code reuse across different applications.
Education is key.
To Learn More
- UIUC ECE498AL: Programming Massively Parallel Processors (http://courses.ece.uiuc.edu/ece498/al/)
  - David Kirk (NVIDIA) and Wen-mei Hwu (UIUC), co-instructors
  - CUDA programming, GPU computing, lab exercises, and projects
  - Lecture slides and voice recordings
- More than 500 students worldwide follow the course each semester.
Graduate Summer School
- Aimed at CSE grad students; full registration
- 180 applicants from 3 continents; 44 accepted from 25 universities on 3 continents; 50+ remote participants
- 9 lectures, 3 keynotes, 1 panel, and a hands-on lab
- Sponsored by UIUC, NCSA, Microsoft, NVIDIA, and VSCSE
- Will be offered again August 10-14, 2009: www.greatlakesconsortium.org/events/gpumulticore
Wen-mei Hwu and David Kirk, co-instructors
Conclusion: A Great Opportunity for Many
- Many-core accelerates science discovery.
- Current challenges are in parallelizing sparse and graph-based computation for very large models: memory bandwidth!
- Incorporating many-core technology into large-scale systems such as Blue Waters presents major challenges in software tool chains.
- Software engineering issues must be better addressed: it is still hard to extract and integrate parallel kernels from real applications.
- Effective code reuse requires a very large library code base and new tools and interfaces.
- Cross-application fertilization is key to future success: a vision for a global developer community.
Thank you! Any questions?