The Parallel Revolution in Computational Science and Engineering

Similar documents
High Performance Computing with Accelerators

Lecture 1: Introduction and Computational Thinking

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX

Accelerating Molecular Modeling Applications with Graphics Processors

designing a GPU Computing Solution

CS 668 Parallel Computing Spring 2011

Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA

Accelerating Cosmological Data Analysis with Graphics Processors Dylan W. Roeh Volodymyr V. Kindratenko Robert J. Brunner

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Threading Hardware in G80

Intro to GPU s for Parallel Computing

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory

In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD

QP: A Heterogeneous Multi-Accelerator Cluster

Accelerating NAMD with Graphics Processors

GPU Acceleration of Molecular Modeling Applications

Graphics Processor Acceleration and YOU

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin

Multilevel Summation of Electrostatic Potentials Using GPUs

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

John E. Stone. S03: High Performance Computing with CUDA. Heterogeneous GPU Computing for Molecular Modeling

Many-core Computing. Can compilers and tools do the heavy lifting? Wen-mei Hwu

Case Study - Computational Fluid Dynamics (CFD) using Graphics Processing Units

M02: High Performance Computing with CUDA. Overview

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

The Future of GPU Computing

Introduction to CUDA (1 of n*)

Administrivia. Administrivia. Administrivia. CIS 565: GPU Programming and Architecture. Meeting

ENABLING NEW SCIENCE GPU SOLUTIONS

Accelerating Scientific Applications with GPUs

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

Analysis and Visualization Algorithms in VMD

Parallel Computing: What has changed lately? David B. Kirk

GPU Computation CSCI 4239/5239 Advanced Computer Graphics Spring 2014

Georgia Institute of Technology, August 17, Justin W. L. Wan. Canada Research Chair in Scientific Computing

Accelerating Molecular Modeling Applications with GPU Computing

Tesla GPU Computing A Revolution in High Performance Computing

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Lecture 1: Gentle Introduction to GPUs

2009: The GPU Computing Tipping Point. Jen-Hsun Huang, CEO

Project Kickoff CS/EE 217. GPU Architecture and Parallel Programming

Using The CUDA Programming Model

CUDA Experiences: Over-Optimization and Future HPC

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

CIS 665: GPU Programming. Lecture 2: The CUDA Programming Model

Master Informatics Eng.

Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters

Numerical Algorithms on Multi-GPU Architectures

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

Accelerating Computational Biology by 100x Using CUDA. John Stone Theoretical and Computational Biophysics Group, University of Illinois

GPU-Accelerated Analysis of Large Biomolecular Complexes

Lecture 1. Introduction Course Overview

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

NAMD, CUDA, and Clusters: Taking GPU Molecular Dynamics Beyond the Deskop

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

Hardware/Software Co-Design

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Mattan Erez. The University of Texas at Austin

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata

Spring 2011 Prof. Hyesoon Kim

GPU Programming and Architecture: Course Overview

Using GPUs to compute the multilevel summation of electrostatic forces

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

General Purpose GPU Computing in Partial Wave Analysis

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

Spring 2009 Prof. Hyesoon Kim

Software and Performance Engineering for numerical codes on GPU clusters

Simulating Life at the Atomic Scale

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

GPGPU introduction and network applications. PacketShaders, SSLShader

Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Large scale Imaging on Current Many- Core Platforms

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

High Performance Computing. David B. Kirk, Chief Scientist, t NVIDIA

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D

GPU-Accelerated Visualization and Analysis of Petascale Molecular Dynamics Simulations

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging

ECE 8823: GPU Architectures. Objectives

Computing on GPU Clusters

Accelerating Molecular Modeling Applications with GPU Computing

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

OpenMP Optimization and its Translation to OpenGL

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi

GPU-Accelerated Analysis of Petascale Molecular Dynamics Simulations

Interactive Supercomputing for State-of-the-art Biomolecular Simulation

Refactoring NAMD for Petascale Machines and Graphics Processors

Application of GPU technology to OpenFOAM simulations

Transcription:

The Parallel Revolution in Computational Science and Engineering applications, education, tools, and impact Wen-mei Hwu University of Illinois, Urbana-Champaign

The Energy Behind Parallel Revolution Calculation: 1 TFLOPS vs. 100 GFLOPS Memory Bandwidth: 100-150 150 GB/s vs. 32-64 GB/s 3 year shift Courtesy: John Owens Multi-core and GPU in every PC massive volume and potential impact 2 Courtesy: John Owens

Applications Entry Timeframes App developers want at least 3X-5X for end-user perceived value-add Apps entry point (2011) 50 GF 100 GF 200 GF 400 GF 2-core 4-core 8-core 16-core Multi-core G80 16-cores 500 GF Apps entry point (2008) G280 32-cores 1TF 3 G380 Larrabee Many-core 64-cores 128-cores 2 TF 4 TF Time 24-month generations

What are these applications? 146X 36X 19X 17X 100X Interactive visualization of volumetric white matter connectivity Ionic placement for molecular dynamics simulation on GPU Transcoding HD video stream to H.264 Simulation in Matlab using.mex. file CUDA function Astrophysics N-N body simulation Courtesy NVIDIA 149X 47X 20X 24X 30X Financial simulation of LIBOR model with swaptions GLAME@lab: : An M-script API for linear Algebra operations on GPU Ultrasound medical imaging for cancer diagnostics 4 Highly optimized object oriented molecular dynamics Cmatch exact string matching to find similar proteins and gene sequences

GPU Graphics Mode The future of GPUs is programmable processing So build the architecture around the processor Host Input Assembler Vtx Thread Issue Geom Thread Issue Setup / Rstr / ZCull Pixel Thread Issue SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP Thread Processor L2 L2 L2 L2 L2 L2 FB FB FB FB FB FB 5

GPU CUDA Mode Processors execute computing threads New operating mode/hw interface for computing Host Input Assembler Thread Execution Manager Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Texture Texture Texture Texture Texture Texture Texture Texture Load/store Load/store Load/store Load/store Load/store Load/store Global Memory 6

UIUC/NCSA AC Cluster u 32 nodes s 4-GPU (GTX280, Tesla), 1-FPGA, 1 quad-core Opteron node at NCSA s GPUs donated by NVIDIA s FPGA donated by Xilinx s 128 TFLOPS single precision, 10 TFLOPS double precision u Coulomb Summation: s 1.78 TFLOPS/node s 271x speedup vs. Intel QX6700 CPU core w/ SSE UIUC/NCSA AC Cluster http://www.ncsa.uiuc.edu/projects/gpucluster/ A partnership between NCSA and academic departments. 7

Multi-GPU MPI Implementation GPU cluster Performance scaling Cluster manager and Login server 24-port Topspin 120 Server InfiniBand switch Netgear Prosafe 24- port Gigabit Ethernet switch HP xw9400 workstation with NVIDIA Quadro Plex Model IV modules Lustre file system servers 16 cluster compute nodes exeution time (sec) 1000 100 10 0 10 20 30 40 50 number of MPI threads Two Point Angular Correlation Imagination Wen-mei Unbound W. Hwu University of Illinois at Urbana-Champaign 8

CUDA - No more shader functions. CUDA integrated CPU+GPU application C program Serial or modestly parallel C code executes on CPU Highly parallel SPMD kernel C code executes on GPU CPU Serial Code Grid 0 GPU Parallel Kernel KernelA<<< nblk, ntid >>>(args);... CPU Serial Code Grid 1 GPU Parallel Kernel KernelB<<< nblk, ntid >>>(args);... 9

DRAM Bandwidth Trends Sets Programming Agenda Random access BW 1.2% of peak for DDR3-1600, 0.8% for GDDR4-1600 (and falling) 3D stacking and optical interconnects will unlikely help. 10

It is all about applications! Illinois CUDA Center of Excellence and the IACAT Community parallel.illinois.edu

NAMD Overlapping Execution Phillips et al., SC2002. Example Configuration 847 objects 100,000 108 Offload to GPU Objects are assigned to processors and queued as data arrives. Klaus Schulten and team, over 29,000 registered users 12

Actual Timelines from NAMD Generated using Charm++ tool Projections CPU GPU x Remote Force f Local Force f x f x f 13

NAMD on QP Cluster total application time seconds per step 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 6.76 3.33 CPU only with GPU GPU subset faster 4 8 16 32 60 2.4 GHz Opteron + Quadro FX 5600 Amount of time the GPU is actually doing work 14

VMD Visual Molecular Dynamics (120,000 registered users) Visualization of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry data, particle systems http://www.ks.uiuc.edu/research/vmd www.ks.uiuc.edu/research/vmd/ 15

Molecular Orbital Computation/Display - Molekel, MacMolPlt,, and VMD Units: 10 3 grid points/sec Larger numbers indicate higher performance. High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, and K. Schulten, 2nd Workshop on General-Purpose Computation on Graphics Pricessing Units (GPGPU-2) (in press) 16

Whole-cell Diffusion Modeling Zan Luthey-Schulten group (Chemistry, UIUC) is using CUDA to simulate reaction diffusion in three dimensions on a lattice. GPUs and CUDA enable simulations at cellular length and time scales, enabling study of stochastic biochemical networks in vivo. Simulations sample the cell state s s probability distribution according to the reaction-diffusion master equation. Using size and population distributions from proteomic data, an approximation of an in vivo cellular environment is constructed on a lattice. [left] A cell model in which 30% of the total volume is occupied by obstacles. 17 Particle diffuse around the stationary obstacles, reacting according to kinetic rates.

Scalability on GPUs Roberts, Stone, Sepulveda, Hwu, Luthey Schulten (2009) The Eighth IEEE International Workshop on High Performance Computational Biology, in press. 18

Large Eddy Simulations (LES) on GPUs using CUDA 3D incompressible Navier-Stokes equations Solved using fractional-step method. Used geometric multigrid for pressure-poisson solver. Used Smagorinsky sub-grid scale model. Used Red-Black Gauss-Seidel for linear solver. Aaron Shinn, Pratap Vanka, Wen-mei Hwu 19

Turbulent flow in 3D square duct Instantaneous flow in cross-section Mean flow in cross-section 20

GPU (Tesla C1060) versus 3.0 GHz Intel Xeon Core laminar lid-driven cube simulation First 100 time steps turbulent flow in a square duct 21

Cosmological Data Analysis: Two Point Angular Correlation Function observed data random dataset 1 random dataset n R Two-point angular correlation function Imagination Wen-mei Unbound W. Hwu University of Illinois at Urbana-Champaign 22

Performance on a Single GPU Execution time Speedup kernel execution time (sec) 1.E+05 1.E+04 1.E+03 1.E+02 1.E+01 AMD Opteron Quadro FX 5600 GeForce GTX 280 (SP) GeForce GTX 280 (DP) speedup vs CPU 250 200 150 100 50 Quadro FX 5600 GeForce GTX 280 (SP) GeForce GTX 280 (DP) 1.E+00 1000 10000 100000 dataset size 0 1000 10000 100000 dataset size Dylan Roeh, Volodymyr V. Kindratenko, Robert J. Brunner 23

Tools are coming. There is always hope. In the eve of the Battle of the Helms Deep

Auto-tuning tuning Tools do heavy lifting! Programmers are doing too much heavy lifting Sum of Absolute Differences Too many memory organizational details are exposed to the programmers u Pareto optimal curve based on analytical performance model of the e source code allows automatic identification of optimal code arrangement S. Ryoo, et al, Program Optimization Space Pruning for a Multithreaded GPU, CGO, April 2008. 25

Implicitly parallel programming with data structure and function property annotations to enable auto parallelization Locality annotation programming to eliminate need for explicit management of memory types and data transfers CUDA-auto CUDA-lite Parameterized CUDA programming using auto-tuning and optimization space pruning CUDA-tune 1 st generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers MCUDA/ OpenMP IA multi-core & Larrabe NVIDIA SDK 1.1 NVIDIA GPU 26

Towards Systematic Code Reuse MRI Example Two main approaches to MRI Reconstruction: Riemann approximation of the continuous inverse FT: M = H = m= 1 ρˆ F we d ( ) wdme [ ] i2 Solve a regularized inverse problem, e.g., 2 Solutions often derived by solving one or more matrix inversions,, e.g., m ( ) ρˆ = arg min Fρ d + R ρ ρ 2 ( ) 1 ˆ H H ρ = F F+ H F d πk m x 27

Towards a Reusable Code Base Certain computations appear frequently in inverse problems, e.g., Matrix inversion Matrix multiplication It is rare that two similar applications have exactly the same structure ( H ) S QS+ H 1 d ( H ) + 1 I W DW d Toeplitz Sparse Diagonal Contourlet Tx Better library interfaces and tools are needed to allow for easy code reuse across different applications. 28

Education is key.

To Learn More u UIUC ECE498AL Programming Massively Parallel Processors (http://courses.ece.uiuc.edu/ece498/al/ http://courses.ece.uiuc.edu/ece498/al/) s David Kirk (NVIDIA) and Wen-mei Hwu (UIUC) co-instructors s CUDA programming, GPU computing, lab exercises, and projects s Lecture slides and voice recordings u More than 500 students worldwide follow the course each semester. 30

Graduate Summer School Aimed at CSE grad students Full registration 180 applicants from 3 continents 44 accepted from 25 universities, 3 continents 50+ remote participants 9 lectures, 3 keynotes, 1 panel, hands-on lab Sponsored by UIUC, NCSA, Microsoft, NVIDIA, VSCSE Will be offered again, August 10-14 14 2009 www.greatlakesconsortium.org/events/gpumulticore Wen-mei Hwu and David Kirk Co-instructor 31

Conclusion - A Great Opportunity for Many Many-core accelerates science discovery Current challenges are in parallelizing sparse and graph- based computation for very large models memory bandwidth! Incorporating many-core technology into large scale systems such as Blue Waters presents major challenges in software tool chains Software engineering issues must be better addressed It is still hard to extract and integrate parallel kernels from real applications Effective code reuse requires a very large library code base and new tools and interfaces. Cross-application fertilization key to future success vision for a global developer community 32

Thank you! Any questions? 33