The Parallel Revolution in Computational Science and Engineering

Size: px
Start display at page:

Download "The Parallel Revolution in Computational Science and Engineering"

Transcription

1 The Parallel Revolution in Computational Science and Engineering applications, education, tools, and impact Wen-mei Hwu University of Illinois, Urbana-Champaign

2 The Energy Behind Parallel Revolution Calculation: 1 TFLOPS vs. 100 GFLOPS Memory Bandwidth: GB/s vs GB/s 3 year shift Courtesy: John Owens Multi-core and GPU in every PC massive volume and potential impact 2 Courtesy: John Owens

3 Applications Entry Timeframes App developers want at least 3X-5X for end-user perceived value-add Apps entry point (2011) 50 GF 100 GF 200 GF 400 GF 2-core 4-core 8-core 16-core Multi-core G80 16-cores 500 GF Apps entry point (2008) G cores 1TF 3 G380 Larrabee Many-core 64-cores 128-cores 2 TF 4 TF Time 24-month generations

4 What are these applications? 146X 36X 19X 17X 100X Interactive visualization of volumetric white matter connectivity Ionic placement for molecular dynamics simulation on GPU Transcoding HD video stream to H.264 Simulation in Matlab using.mex. file CUDA function Astrophysics N-N body simulation Courtesy NVIDIA 149X 47X 20X 24X 30X Financial simulation of LIBOR model with swaptions : An M-script API for linear Algebra operations on GPU Ultrasound medical imaging for cancer diagnostics 4 Highly optimized object oriented molecular dynamics Cmatch exact string matching to find similar proteins and gene sequences

5 GPU Graphics Mode The future of GPUs is programmable processing So build the architecture around the processor Host Input Assembler Vtx Thread Issue Geom Thread Issue Setup / Rstr / ZCull Pixel Thread Issue SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP SP TF L1 SP Thread Processor L2 L2 L2 L2 L2 L2 FB FB FB FB FB FB 5

6 GPU CUDA Mode Processors execute computing threads New operating mode/hw interface for computing Host Input Assembler Thread Execution Manager Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Texture Texture Texture Texture Texture Texture Texture Texture Load/store Load/store Load/store Load/store Load/store Load/store Global Memory 6

7 UIUC/NCSA AC Cluster u 32 nodes s 4-GPU (GTX280, Tesla), 1-FPGA, 1 quad-core Opteron node at NCSA s GPUs donated by NVIDIA s FPGA donated by Xilinx s 128 TFLOPS single precision, 10 TFLOPS double precision u Coulomb Summation: s 1.78 TFLOPS/node s 271x speedup vs. Intel QX6700 CPU core w/ SSE UIUC/NCSA AC Cluster A partnership between NCSA and academic departments. 7

8 Multi-GPU MPI Implementation GPU cluster Performance scaling Cluster manager and Login server 24-port Topspin 120 Server InfiniBand switch Netgear Prosafe 24- port Gigabit Ethernet switch HP xw9400 workstation with NVIDIA Quadro Plex Model IV modules Lustre file system servers 16 cluster compute nodes exeution time (sec) number of MPI threads Two Point Angular Correlation Imagination Wen-mei Unbound W. Hwu University of Illinois at Urbana-Champaign 8

9 CUDA - No more shader functions. CUDA integrated CPU+GPU application C program Serial or modestly parallel C code executes on CPU Highly parallel SPMD kernel C code executes on GPU CPU Serial Code Grid 0 GPU Parallel Kernel KernelA<<< nblk, ntid >>>(args);... CPU Serial Code Grid 1 GPU Parallel Kernel KernelB<<< nblk, ntid >>>(args);... 9

10 DRAM Bandwidth Trends Sets Programming Agenda Random access BW 1.2% of peak for DDR3-1600, 0.8% for GDDR (and falling) 3D stacking and optical interconnects will unlikely help. 10

11 It is all about applications! Illinois CUDA Center of Excellence and the IACAT Community parallel.illinois.edu

12 NAMD Overlapping Execution Phillips et al., SC2002. Example Configuration 847 objects 100, Offload to GPU Objects are assigned to processors and queued as data arrives. Klaus Schulten and team, over 29,000 registered users 12

13 Actual Timelines from NAMD Generated using Charm++ tool Projections CPU GPU x Remote Force f Local Force f x f x f 13

14 NAMD on QP Cluster total application time seconds per step CPU only with GPU GPU subset faster GHz Opteron + Quadro FX 5600 Amount of time the GPU is actually doing work 14

15 VMD Visual Molecular Dynamics (120,000 registered users) Visualization of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry data, particle systems

16 Molecular Orbital Computation/Display - Molekel, MacMolPlt,, and VMD Units: 10 3 grid points/sec Larger numbers indicate higher performance. High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, and K. Schulten, 2nd Workshop on General-Purpose Computation on Graphics Pricessing Units (GPGPU-2) (in press) 16

17 Whole-cell Diffusion Modeling Zan Luthey-Schulten group (Chemistry, UIUC) is using CUDA to simulate reaction diffusion in three dimensions on a lattice. GPUs and CUDA enable simulations at cellular length and time scales, enabling study of stochastic biochemical networks in vivo. Simulations sample the cell state s s probability distribution according to the reaction-diffusion master equation. Using size and population distributions from proteomic data, an approximation of an in vivo cellular environment is constructed on a lattice. [left] A cell model in which 30% of the total volume is occupied by obstacles. 17 Particle diffuse around the stationary obstacles, reacting according to kinetic rates.

18 Scalability on GPUs Roberts, Stone, Sepulveda, Hwu, Luthey Schulten (2009) The Eighth IEEE International Workshop on High Performance Computational Biology, in press. 18

19 Large Eddy Simulations (LES) on GPUs using CUDA 3D incompressible Navier-Stokes equations Solved using fractional-step method. Used geometric multigrid for pressure-poisson solver. Used Smagorinsky sub-grid scale model. Used Red-Black Gauss-Seidel for linear solver. Aaron Shinn, Pratap Vanka, Wen-mei Hwu 19

20 Turbulent flow in 3D square duct Instantaneous flow in cross-section Mean flow in cross-section 20

21 GPU (Tesla C1060) versus 3.0 GHz Intel Xeon Core laminar lid-driven cube simulation First 100 time steps turbulent flow in a square duct 21

22 Cosmological Data Analysis: Two Point Angular Correlation Function observed data random dataset 1 random dataset n R Two-point angular correlation function Imagination Wen-mei Unbound W. Hwu University of Illinois at Urbana-Champaign 22

23 Performance on a Single GPU Execution time Speedup kernel execution time (sec) 1.E+05 1.E+04 1.E+03 1.E+02 1.E+01 AMD Opteron Quadro FX 5600 GeForce GTX 280 (SP) GeForce GTX 280 (DP) speedup vs CPU Quadro FX 5600 GeForce GTX 280 (SP) GeForce GTX 280 (DP) 1.E dataset size dataset size Dylan Roeh, Volodymyr V. Kindratenko, Robert J. Brunner 23

24 Tools are coming. There is always hope. In the eve of the Battle of the Helms Deep

25 Auto-tuning tuning Tools do heavy lifting! Programmers are doing too much heavy lifting Sum of Absolute Differences Too many memory organizational details are exposed to the programmers u Pareto optimal curve based on analytical performance model of the e source code allows automatic identification of optimal code arrangement S. Ryoo, et al, Program Optimization Space Pruning for a Multithreaded GPU, CGO, April

26 Implicitly parallel programming with data structure and function property annotations to enable auto parallelization Locality annotation programming to eliminate need for explicit management of memory types and data transfers CUDA-auto CUDA-lite Parameterized CUDA programming using auto-tuning and optimization space pruning CUDA-tune 1 st generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers MCUDA/ OpenMP IA multi-core & Larrabe NVIDIA SDK 1.1 NVIDIA GPU 26

27 Towards Systematic Code Reuse MRI Example Two main approaches to MRI Reconstruction: Riemann approximation of the continuous inverse FT: M = H = m= 1 ρˆ F we d ( ) wdme [ ] i2 Solve a regularized inverse problem, e.g., 2 Solutions often derived by solving one or more matrix inversions,, e.g., m ( ) ρˆ = arg min Fρ d + R ρ ρ 2 ( ) 1 ˆ H H ρ = F F+ H F d πk m x 27

28 Towards a Reusable Code Base Certain computations appear frequently in inverse problems, e.g., Matrix inversion Matrix multiplication It is rare that two similar applications have exactly the same structure ( H ) S QS+ H 1 d ( H ) + 1 I W DW d Toeplitz Sparse Diagonal Contourlet Tx Better library interfaces and tools are needed to allow for easy code reuse across different applications. 28

29 Education is key.

30 To Learn More u UIUC ECE498AL Programming Massively Parallel Processors ( s David Kirk (NVIDIA) and Wen-mei Hwu (UIUC) co-instructors s CUDA programming, GPU computing, lab exercises, and projects s Lecture slides and voice recordings u More than 500 students worldwide follow the course each semester. 30

31 Graduate Summer School Aimed at CSE grad students Full registration 180 applicants from 3 continents 44 accepted from 25 universities, 3 continents 50+ remote participants 9 lectures, 3 keynotes, 1 panel, hands-on lab Sponsored by UIUC, NCSA, Microsoft, NVIDIA, VSCSE Will be offered again, August Wen-mei Hwu and David Kirk Co-instructor 31

32 Conclusion - A Great Opportunity for Many Many-core accelerates science discovery Current challenges are in parallelizing sparse and graph- based computation for very large models memory bandwidth! Incorporating many-core technology into large scale systems such as Blue Waters presents major challenges in software tool chains Software engineering issues must be better addressed It is still hard to extract and integrate parallel kernels from real applications Effective code reuse requires a very large library code base and new tools and interfaces. Cross-application fertilization key to future success vision for a global developer community 32

33 Thank you! Any questions? 33

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Wen-mei Hwu with David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, Kuangwei Huang Overview

More information

Accelerating Molecular Modeling Applications with Graphics Processors

Accelerating Molecular Modeling Applications with Graphics Processors Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference

More information

designing a GPU Computing Solution

designing a GPU Computing Solution designing a GPU Computing Solution Patrick Van Reeth EMEA HPC Competency Center - GPU Computing Solutions Saturday, May the 29th, 2010 1 2010 Hewlett-Packard Development Company, L.P. The information contained

More information

CS 668 Parallel Computing Spring 2011

CS 668 Parallel Computing Spring 2011 CS 668 Parallel Computing Spring 2011 Prof. Fred Annexstein @proffreda fred.annexstein@uc.edu Office Hours: 11-1 MW or by appointment Tel: 513-556-1807 Meeting: TuTh 2:00-3:25 in RecCenter 3240 Lecture

More information

Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA

Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois

More information

Accelerating Cosmological Data Analysis with Graphics Processors Dylan W. Roeh Volodymyr V. Kindratenko Robert J. Brunner

Accelerating Cosmological Data Analysis with Graphics Processors Dylan W. Roeh Volodymyr V. Kindratenko Robert J. Brunner Accelerating Cosmological Data Analysis with Graphics Processors Dylan W. Roeh Volodymyr V. Kindratenko Robert J. Brunner University of Illinois at Urbana-Champaign Presentation Outline Motivation Digital

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Threading Hardware in G80

Threading Hardware in G80 ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &

More information

Intro to GPU s for Parallel Computing

Intro to GPU s for Parallel Computing Intro to GPU s for Parallel Computing Goals for Rest of Course Learn how to program massively parallel processors and achieve high performance functionality and maintainability scalability across future

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory

GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Presentation Outline NVIDIA

More information

In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD

In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University

More information

QP: A Heterogeneous Multi-Accelerator Cluster

QP: A Heterogeneous Multi-Accelerator Cluster 10th LCI International Conference on High-Performance Clustered Computing March 10-12, 2009; Boulder, Colorado QP: A Heterogeneous Multi-Accelerator Cluster Michael Showerman 1, Jeremy Enos, Avneesh Pant,

More information

Accelerating NAMD with Graphics Processors

Accelerating NAMD with Graphics Processors Accelerating NAMD with Graphics Processors James Phillips John Stone Klaus Schulten Research/namd/ NAMD: Practical Supercomputing 24,000 users can t all be computer experts. 18% are NIH-funded; many in

More information

GPU Acceleration of Molecular Modeling Applications

GPU Acceleration of Molecular Modeling Applications GPU Acceleration of Molecular Modeling Applications James Phillips John Stone Research/gpu/ NAMD: Practical Supercomputing 25,000 users can t all be computer experts. 18% are NIH-funded; many in other

More information

Graphics Processor Acceleration and YOU

Graphics Processor Acceleration and YOU Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 11 The Graphics Processing Unit Mattan Erez The University of Texas at Austin Outline What is a GPU? Why should

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and

More information

Multilevel Summation of Electrostatic Potentials Using GPUs

Multilevel Summation of Electrostatic Potentials Using GPUs Multilevel Summation of Electrostatic Potentials Using GPUs David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

John E. Stone. S03: High Performance Computing with CUDA. Heterogeneous GPU Computing for Molecular Modeling

John E. Stone. S03: High Performance Computing with CUDA. Heterogeneous GPU Computing for Molecular Modeling S03: High Performance Computing with CUDA Heterogeneous GPU Computing for Molecular Modeling John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology

More information

Many-core Computing. Can compilers and tools do the heavy lifting? Wen-mei Hwu

Many-core Computing. Can compilers and tools do the heavy lifting? Wen-mei Hwu Many-core Computing Can compilers and tools do the heavy lifting? Wen-mei Hwu FCRP GSRC, Illinois UPCRC, Illinois CUDA CoE, IACAT, IMPACT University of Illinois, Urbana-Champaign Outline Parallel application

More information

Case Study - Computational Fluid Dynamics (CFD) using Graphics Processing Units

Case Study - Computational Fluid Dynamics (CFD) using Graphics Processing Units - Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Summer School 2009: Many-Core Processors for Science and Engineering Applications,

More information

M02: High Performance Computing with CUDA. Overview

M02: High Performance Computing with CUDA. Overview Overview Speakers Ian Buck Massimiliano Fatica Patrick Legresley Paulius Micikevicius Scott Morton Jim Phillips John Stone NVIDIA NVIDIA NVIDIA NVIDIA Hess Corporation University of Illinois Urbana-Champaign

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

The Future of GPU Computing

The Future of GPU Computing The Future of GPU Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University November 18, 2009 The Future of Computing Bill Dally Chief Scientist

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Agenda Introduction to CUDA (1 of n*) GPU architecture review CUDA First of two or three dedicated classes Joseph Kider University of Pennsylvania CIS 565 - Spring 2011 * Where n is 2 or 3 Acknowledgements

More information

Administrivia. Administrivia. Administrivia. CIS 565: GPU Programming and Architecture. Meeting

Administrivia. Administrivia. Administrivia. CIS 565: GPU Programming and Architecture. Meeting CIS 565: GPU Programming and Architecture Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider and Patrick Cozzi Meeting Monday and Wednesday 6:00 7:30pm Moore 212 Recorded lectures upon

More information

ENABLING NEW SCIENCE GPU SOLUTIONS

ENABLING NEW SCIENCE GPU SOLUTIONS ENABLING NEW SCIENCE TESLA BIO Workbench The NVIDIA Tesla Bio Workbench enables biophysicists and computational chemists to push the boundaries of life sciences research. It turns a standard PC into a

More information

Accelerating Scientific Applications with GPUs

Accelerating Scientific Applications with GPUs Accelerating Scientific Applications with GPUs John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ Workshop on Programming Massively Parallel

More information

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei H. Hwu

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

Analysis and Visualization Algorithms in VMD

Analysis and Visualization Algorithms in VMD 1 Analysis and Visualization Algorithms in VMD David Hardy Research/~dhardy/ NAIS: State-of-the-Art Algorithms for Molecular Dynamics (Presenting the work of John Stone.) VMD Visual Molecular Dynamics

More information

Parallel Computing: What has changed lately? David B. Kirk

Parallel Computing: What has changed lately? David B. Kirk Parallel Computing: What has changed lately? David B. Kirk Supercomputing 2007 Future Science and Engineering Breakthroughs Hinge on Computing Computational Geoscience Computational Chemistry Computational

More information

GPU Computation CSCI 4239/5239 Advanced Computer Graphics Spring 2014

GPU Computation CSCI 4239/5239 Advanced Computer Graphics Spring 2014 GPU Computation CSCI 4239/5239 Advanced Computer Graphics Spring 2014 Solutions to Parallel Processing Message Passing (distributed) MPI (library) Threads (shared memory) pthreads (library) OpenMP (compiler)

More information

Georgia Institute of Technology, August 17, Justin W. L. Wan. Canada Research Chair in Scientific Computing

Georgia Institute of Technology, August 17, Justin W. L. Wan. Canada Research Chair in Scientific Computing Real-Time Rigid id 2D-3D Medical Image Registration ti Using RapidMind Multi-Core Platform Georgia Tech/AFRL Workshop on Computational Science Challenge Using Emerging & Massively Parallel Computer Architectures

More information

Accelerating Molecular Modeling Applications with GPU Computing

Accelerating Molecular Modeling Applications with GPU Computing Accelerating Molecular Modeling Applications with GPU Computing John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

2009: The GPU Computing Tipping Point. Jen-Hsun Huang, CEO

2009: The GPU Computing Tipping Point. Jen-Hsun Huang, CEO 2009: The GPU Computing Tipping Point Jen-Hsun Huang, CEO Someday, our graphics chips will have 1 TeraFLOPS of computing power, will be used for playing games to discovering cures for cancer to streaming

More information

Project Kickoff CS/EE 217. GPU Architecture and Parallel Programming

Project Kickoff CS/EE 217. GPU Architecture and Parallel Programming CS/EE 217 GPU Architecture and Parallel Programming Project Kickoff David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012 University of Illinois, Urbana-Champaign! 1 Two flavors Application Implement/optimize

More information

Using The CUDA Programming Model

Using The CUDA Programming Model Using The CUDA Programming Model Leveraging GPUs for Application Acceleration Dan Ernst, Brandon Holt University of Wisconsin Eau Claire 1 What is (Historical) GPGPU? General Purpose computation using

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay Introduction to CUDA Lecture originally by Luke Durant and Tamas Szalay Today CUDA - Why CUDA? - Overview of CUDA architecture - Dense matrix multiplication with CUDA 2 Shader GPGPU - Before current generation,

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

CIS 665: GPU Programming. Lecture 2: The CUDA Programming Model

CIS 665: GPU Programming. Lecture 2: The CUDA Programming Model CIS 665: GPU Programming Lecture 2: The CUDA Programming Model 1 Slides References Nvidia (Kitchen) David Kirk + Wen-Mei Hwu (UIUC) Gary Katz and Joe Kider 2 3D 3D API: API: OpenGL OpenGL or or Direct3D

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters

Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters Craig Steffen csteffen@ncsa.uiuc.edu NCSA Innovative Systems Lab First International Green Computing Conference Workshop

More information

Numerical Algorithms on Multi-GPU Architectures

Numerical Algorithms on Multi-GPU Architectures Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications

More information

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation Ray Browell nvidia Technology Theater SC12 1 2012 ANSYS, Inc. nvidia Technology Theater SC12 HPC Revolution Recent

More information

Accelerating Computational Biology by 100x Using CUDA. John Stone Theoretical and Computational Biophysics Group, University of Illinois

Accelerating Computational Biology by 100x Using CUDA. John Stone Theoretical and Computational Biophysics Group, University of Illinois Accelerating Computational Biology by 100x Using CUDA John Stone Theoretical and Computational Biophysics Group, University of Illinois GPU Computing Commodity devices, omnipresent in modern computers

More information

GPU-Accelerated Analysis of Large Biomolecular Complexes

GPU-Accelerated Analysis of Large Biomolecular Complexes GPU-Accelerated Analysis of Large Biomolecular Complexes John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign

More information

Lecture 1. Introduction Course Overview

Lecture 1. Introduction Course Overview Lecture 1 Introduction Course Overview Welcome to CSE 260! Your instructor is Scott Baden baden@ucsd.edu Office: room 3244 in EBU3B Office hours Week 1: Today (after class), Tuesday (after class) Remainder

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

NAMD, CUDA, and Clusters: Taking GPU Molecular Dynamics Beyond the Deskop

NAMD, CUDA, and Clusters: Taking GPU Molecular Dynamics Beyond the Deskop NAMD, CUDA, and Clusters: Taking GPU Molecular Dynamics Beyond the Deskop James Phillips Research/gpu/ Beckman Institute University of Illinois at Urbana-Champaign Theoretical and Computational Biophysics

More information

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge

More information

Hardware/Software Co-Design

Hardware/Software Co-Design 1 / 27 Hardware/Software Co-Design Miaoqing Huang University of Arkansas Fall 2011 2 / 27 Outline 1 2 3 3 / 27 Outline 1 2 3 CSCE 5013-002 Speical Topic in Hardware/Software Co-Design Instructor Miaoqing

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V: Principles in Computer Architecture Parallelism and Locality Fall 2008 Lecture 10 The Graphics Processing Unit Mattan Erez The University of Texas at Austin Outline What is a GPU? Why should we

More information

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage cane be also pipelined The slowest of the pipeline stage determines the rendering speed. Frames per second (fps) Executes on

More information

GPU Programming and Architecture: Course Overview

GPU Programming and Architecture: Course Overview Lectures GPU Programming and Architecture: Course Overview Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012 Monday and Wednesday 9-10:30am Moore 212 Lectures will be recorded Image from http://pinoytutorial.com/techtorial/geforce-gtx-580-vs-amd-radeon-hd-6870-review-and-comparison-conclusion/

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015 PERFORMANCE PORTABILITY WITH OPENACC Jeff Larkin, NVIDIA, November 2015 TWO TYPES OF PORTABILITY FUNCTIONAL PORTABILITY PERFORMANCE PORTABILITY The ability for a single code to run anywhere. The ability

More information

Spring 2009 Prof. Hyesoon Kim

Spring 2009 Prof. Hyesoon Kim Spring 2009 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage cane be also pipelined The slowest of the pipeline stage determines the rendering speed. Frames per second (fps) Executes on

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

Simulating Life at the Atomic Scale

Simulating Life at the Atomic Scale Simulating Life at the Atomic Scale James Phillips Beckman Institute, University of Illinois Research/namd/ Beckman Institute University of Illinois at Urbana-Champaign Theoretical and Computational Biophysics

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

GPGPU introduction and network applications. PacketShaders, SSLShader

GPGPU introduction and network applications. PacketShaders, SSLShader GPGPU introduction and network applications PacketShaders, SSLShader Agenda GPGPU Introduction Computer graphics background GPGPUs past, present and future PacketShader A GPU-Accelerated Software Router

More information

Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies

Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Large scale Imaging on Current Many- Core Platforms

Large scale Imaging on Current Many- Core Platforms Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function

More information

High Performance Computing. David B. Kirk, Chief Scientist, t NVIDIA

High Performance Computing. David B. Kirk, Chief Scientist, t NVIDIA High Performance Computing David B. Kirk, Chief Scientist, t NVIDIA A History of Innovation Invented the Graphics Processing Unit (GPU) Pioneered programmable shading Over 2000 patents* 1995 1999 2002

More information

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D HPC with GPU and its applications from Inspur Haibo Xie, Ph.D xiehb@inspur.com 2 Agenda I. HPC with GPU II. YITIAN solution and application 3 New Moore s Law 4 HPC? HPC stands for High Heterogeneous Performance

More information

GPU-Accelerated Visualization and Analysis of Petascale Molecular Dynamics Simulations

GPU-Accelerated Visualization and Analysis of Petascale Molecular Dynamics Simulations GPU-Accelerated Visualization and Analysis of Petascale Molecular Dynamics Simulations John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging Saoni Mukherjee, Nicholas Moore, James Brock and Miriam Leeser September 12, 2012 1 Outline Introduction to CT Scan, 3D reconstruction

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Computing on GPU Clusters

Computing on GPU Clusters Computing on GPU Clusters Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009

More information

Accelerating Molecular Modeling Applications with GPU Computing

Accelerating Molecular Modeling Applications with GPU Computing Accelerating Molecular Modeling Applications with GPU Computing John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia

More information

OpenMP Optimization and its Translation to OpenGL

OpenMP Optimization and its Translation to OpenGL OpenMP Optimization and its Translation to OpenGL Santosh Kumar SITRC-Nashik, India Dr. V.M.Wadhai MAE-Pune, India Prasad S.Halgaonkar MITCOE-Pune, India Kiran P.Gaikwad GHRIEC-Pune, India ABSTRACT For

More information

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Computational Fluid Dynamics (CFD) using Graphics Processing Units Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores

More information

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

More information

GPU-Accelerated Analysis of Petascale Molecular Dynamics Simulations

GPU-Accelerated Analysis of Petascale Molecular Dynamics Simulations GPU-Accelerated Analysis of Petascale Molecular Dynamics Simulations John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois

More information

Interactive Supercomputing for State-of-the-art Biomolecular Simulation

Interactive Supercomputing for State-of-the-art Biomolecular Simulation Interactive Supercomputing for State-of-the-art Biomolecular Simulation John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Refactoring NAMD for Petascale Machines and Graphics Processors

Refactoring NAMD for Petascale Machines and Graphics Processors Refactoring NAMD for Petascale Machines and Graphics Processors James Phillips Research/namd/ NAMD Design Designed from the beginning as a parallel program Uses the Charm++ idea: Decompose the computation

More information

Application of GPU technology to OpenFOAM simulations

Application of GPU technology to OpenFOAM simulations Application of GPU technology to OpenFOAM simulations Jakub Poła, Andrzej Kosior, Łukasz Miroslaw jakub.pola@vratis.com, www.vratis.com Wroclaw, Poland Agenda Motivation Partial acceleration SpeedIT OpenFOAM

More information