Accelerating Biomolecular Modeling with CUDA and GPU Clusters

Size: px
Start display at page:

Download "Accelerating Biomolecular Modeling with CUDA and GPU Clusters"

Transcription

1 Accelerating Biomolecular Modeling with CUDA and GPU Clusters James Phillips John Stone Klaus Schulten Research/gpu/

2 Beckman Institute University o Illinois at Urbana-Champaign Theoretical and Computational Biophysics Group

3 National Center or Supercomputing Applications

4 Computational Microscopy Ribosome: synthesizes proteins rom genetic inormation, target or antibiotics Silicon nanopore: bionanodevice or sequencing DNA eiciently

5 NAMD: Practical Supercomputing 35,000 users can t all be computer experts. 18% are NIH-unded; many in other countries have downloaded more than one version. User experience is the same on all platorms. No change in input, output, or coniguration iles. Run any simulation on any number o processors. Precompiled binaries available when possible. Desktops and laptops setup and testing x86 and x86-64 Windows, and Macintosh Allow both shared-memory and network-based parallelism. Linux clusters aordable workhorses x86, x86-64, and Itanium processors Gigabit ethernet, Myrinet, IniniBand, Quadrics, Altix, etc Phillips et al., J. Comp. Chem. 26: , 2005.

6 Our Goal: Practical Acceleration Broadly applicable to scientiic computing Programmable by domain scientists Scalable rom small to large machines Broadly available to researchers Price driven by commodity market Low burden on system administration Sustainable perormance advantage Perormance driven by Moore s law Stable market and supply chain

7 Acceleration Options or NAMD Outlook in : FPGA reconigurable computing (with NCSA) Diicult to program, slow loating point, expensive Cell processor (NCSA hardware) Relatively easy to program, expensive ClearSpeed (direct contact with company) Limited memory and memory bandwidth, expensive MDGRAPE Inlexible and expensive Graphics processor (GPU) Program must be expressed as graphics operations

8 CUDA: Practical Perormance November 2006: NVIDIA announces CUDA or G80 GPU. CUDA makes GPU acceleration usable: Developed and supported by NVIDIA. No masquerading as graphics rendering. New shared memory and synchronization. No OpenGL or display device hassles. Multiple processes per card (or vice versa). Resource and collaborators make it useul: Experience rom VMD development David Kirk (Chie Scientist, NVIDIA) Wen-mei Hwu (ECE Proessor, UIUC) Fun to program (and drive) Stone et al., J. Comp. Chem. 28: , 2007.

9 VMD Visual Molecular Dynamics Visualization and analysis o molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, User extensible with scripting and plugins Research/vmd/

10 CUDA Acceleration in VMD Electrostatic ield calculation, ion placement 20x to 44x aster Molecular orbital calculation and display 100x to 120x aster Imaging o gas migration pathways in proteins with implicit ligand sampling 20x to 30x aster

11 Apples to Oranges Perormance Results: Molecular Orbital Kernels Kernel Intel QX6700 CPU ICC-SSE (SSE intrinsics) Intel Core2 Duo CPU OpenCL scalar Intel QX6700 CPU ICC-SSE (SSE intrinsics) Intel Core2 Duo CPU OpenCL vec4 Cell OpenCL vec4*** no constant Radeon 4870 OpenCL scalar Radeon 4870 OpenCL vec4 GeForce GTX 285 OpenCL vec4 GeForce GTX 285 CUDA 2.1 scalar GeForce GTX 285 OpenCL scalar GeForce GTX 285 CUDA 2.0 scalar Cores Runtime (s) Speedup CUDA results demonstrate perormance variability with compiler revisions, and that with vendor eort, OpenCL has the potential to match the perormance o other APIs.

12 NAMD Hybrid Decomposition Kale et al., J. Comp. Phys. 151: , Spatially decompose data and communication. Separate but related work decomposition. Compute objects acilitate iterative, measurement-based load balancing system.

13 NAMD Code is Message-Driven No receive calls as in message passing Messages sent to object entry points Incoming messages placed in queue Priorities are necessary or perormance Execution generates new messages Implemented in Charm++ on top o MPI Can be emulated in MPI alone Charm++ provides tools and idioms Parallel Programming Lab:

14 System Noise Example Timeline rom Charm++ tool Projections

15 NAMD Overlapping Execution Phillips et al., SC2002. Oload to GPU Objects are assigned to processors and queued as data arrives.

16 Message-Driven CUDA? No, CUDA is too coarse-grained. CPU needs ine-grained work to interleave and pipeline. GPU needs large numbers o tasks submitted all at once. No, CUDA lacks priorities. FIFO isn t enough. Perhaps in a uture interace: Stream data to GPU. Append blocks to a running kernel invocation. Stream data out as blocks complete. Fermi looks very promising!

17 Nonbonded Forces on CUDA GPU Start with most expensive calculation: direct nonbonded interactions. Decompose work into pairs o patches, identical to NAMD structure. GPU hardware assigns patch-pairs to multiprocessors dynamically. Force computation on single multiprocessor (GeForce 8800 GTX has 16) Texture Unit Force Table Interpolation 8kB cache 16kB Shared Memory Patch A Coordinates & Parameters 32-way SIMD Multiprocessor multiplexed threads 32kB Registers Patch B Coords, Params, & Forces Constants Exclusions 8kB cache 768 MB Main Memory, no cache, 300+ cycle latency Stone et al., J. Comp. Chem. 28: , 2007.

18 texture<loat4> orce_table; constant unsigned int exclusions[]; shared atom jatom[]; atom iatom; // per-thread atom, stored in registers loat4 iorce; // per-thread orce, stored in registers or ( int j = 0; j < jatom_count; ++j ) { loat dx = jatom[j].x - iatom.x; loat dy = jatom[j].y - iatom.y; loat dz = jatom[j].z - iatom.z; loat r2 = dx*dx + dy*dy + dz*dz; i ( r2 < cuto2 ) { } loat4 t = texetch(orce_table, 1./sqrt(r2)); bool excluded = alse; int indexdi = iatom.index - jatom[j].index; i ( abs(indexdi) <= (int) jatom[j].excl_maxdi ) { indexdi += jatom[j].excl_index; excluded = ((exclusions[indexdi>>5] & (1<<(indexdi&31)))!= 0); } loat = iatom.hal_sigma + jatom[j].hal_sigma; // sigma *= *; // sigma^3 *= ; // sigma^6 *= ( * t.x + t.y ); // sigma^12 * i.x - sigma^6 * i.y *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon; loat qq = iatom.charge * jatom[j].charge; i ( excluded ) { = qq * t.w; } // PME correction else { += qq * t.z; } // Coulomb iorce.x += dx * ; iorce.y += dy * ; iorce.z += dz * ; iorce.w += 1.; // interaction count or energy Nonbonded Forces CUDA Code } Stone et al., J. Comp. Chem. 28: , Force Interpolation Exclusions Parameters Accumulation

19 Overlapping GPU and CPU with Communication GPU x Remote Force Local Force CPU Remote Local Local Update x Other Nodes/Processes x One Timestep

20 Remote Forces Forces on atoms in a local patch are local Forces on atoms in a remote patch are remote Calculate remote orces irst to overlap orce communication with local orce calculation Not enough work to overlap with position communication Remote Patch Local Patch Local Patch Remote Patch Remote Patch Remote Patch Work done by one processor

21 Actual Timelines rom NAMD Generated using Charm++ tool Projections CPU GPU x Remote Force Local Force x x

22 NCSA 4+4 QuadroPlex Cluster aster 2.4 GHz Opteron + Quadro FX 5600

23 Current GPU Clusters at NCSA Lincoln Production system available via the standard NCSA/TeraGrid HPC allocation AC Experimental system available or exploring GPU computing

24 CUDA/OpenCL Wrapper Library Basic operation principle: Use /etc/ld.so.preload to overload (intercept) a subset o CUDA/OpenCL unctions, e.g. {cu cuda}{get Set}Device, clgetdeviceids, etc Purpose: Enables controlled GPU device visibility and access, extending resource allocation to the workload manager Prove or disprove eature useulness, with the hope o eventual uptake or reimplementation o proven eatures by the vendor Provides a platorm or rapid implementation and testing o HPC relevant eatures not available in NVIDIA APIs Features: NUMA Ainity mapping Sets thread ainity to CPU core nearest the gpu device Shared host, multi-gpu device encing Only GPUs allocated by scheduler are visible or accessible to user GPU device numbers are virtualized, with a ixed mapping to a physical device per user environment User always sees allocated GPU devices indexed rom 0

25 CUDA/OpenCL Wrapper Library Features (cont d): Device Rotation (deprecated) Virtual to Physical device mapping rotated or each process accessing a GPU device Allowed or common execution parameters (e.g. Target gpu0 with 4 processes, each one gets separate gpu, assuming 4 gpus available) CUDA 2.2 introduced compute-exclusive device mode, which includes allback to next device. Device rotation eature may no longer needed Memory Scrubber Independent utility rom wrapper, but packaged with it Linux kernel does no management o GPU device memory Must run between user jobs to ensure security between users Availability NCSA/UoI Open Source License

26 CUDA Memtest 4GB o Tesla GPU memory is not ECC protected Hunt or sot error Features Full re-implementation o every test included in memtest86 Random and ixed test patterns, error reports, error addresses, test speciication notiication Includes additional stress test or sotware and hardware errors Usage scenarios Hardware test or deective GPU memory chips CUDA API/driver sotware bugs detection Hardware test or detecting sot errors due to non-ecc memory No sot error detected in 2 years x 4 gig o cumulative runtime Availability NCSA/UoI Open Source License

27 GPU Node Pre/Post Allocation Sequence Pre-Job (minimized or rapid device acquisition) Assemble detected device ile unless it exists Sanity check results Checkout requested GPU devices rom that ile Initialize CUDA wrapper shared memory segment with unique key or user (allows user to ssh to node outside o job environment and have same gpu devices visible) Post-Job Use quick memtest run to veriy healthy GPU state I bad state detected, mark node oline i other jobs present on node I no other jobs, reload kernel module to heal node (or CUDA 2.2 driver bug) Run memscrubber utility to clear gpu device memory Notiy o any ailure events with job details Terminate wrapper shared memory segment Check-in GPUs back to global ile o detected devices

28 AMD Opteron Tesla Linux Cluster AC HP xw9400 workstation 2216 AMD Opteron 2.4 GHz dual socket dual core 8 GB DDR2 Ininiband QDR Tesla S1070 1U 4-GPU Server 1.3 GHz Tesla T10 processors 4x4 GB GDDR3 SDRAM Cluster Servers: 32 Accelerator Units: 32 (128 GPUS, 128 TF SP, 10 TF DP) PCIe x16 PCIe interace T10 DRA M QDR IB T10 IB HP xw9400 workstation DRA M T10 Tesla S1070 PCIe interace DRA M T10 DRA M PCI-X PCIe x16 Nallatech H101 FPGA card

29 Intel 64 Tesla Linux Cluster Lincoln Dell PowerEdge 1955 server Intel 64 (Harpertown) 2.33 GHz dual socket quad core 16 GB DDR2 Ininiband SDR Tesla S1070 1U 4-GPU Server 1.3 GHz Tesla T10 processors 4x4 GB GDDR3 SDRAM Cluster Servers: 192 Accelerator Units: 96 (384 GPUs, 384 TF SP, 32 TF DP) Dell PowerEdge 1955 server PCIe x8 PCIe interace T10 DRAM SDR IB T10 DRAM IB Tesla S1070 Dell PowerEdge 1955 server PCIe interace T10 DRAM SDR IB PCIe x8 T10 DRAM

30 NCSA 8+2 Lincoln Cluster How to share a GPU among 4 CPU cores? Send all GPU work to one process? Coordinate via messages to avoid conlict? Or just hope or the best?

31 NCSA Lincoln Cluster Perormance (8 Intel cores and 2 NVIDIA Telsa GPUs per node) STMV (1M atoms) s/step ~2.8 CPU cores 2 GPUs = 24 cores 4 GPUs GPUs 16 GPUs

32 NCSA Lincoln Cluster Perormance (8 cores and 2 GPUs per node) STMV s/step ~5.6 ~2.8 2 GPUs = 24 cores 4 GPUs 8 GPUs 16 GPUs 8 GPUs = 96 CPU cores CPU cores

33 No GPU Sharing (Ideal World) GPU 1 x Remote Force Local Force GPU 2 x Remote Force Local Force

34 GPU Sharing (Desired) x Remote Force Local Force Client 1 x Remote Force Local Force Client 2

35 GPU Sharing (Feared) x Remote Force Local Force Client 1 x Remote Force Local Force Client 2

36 GPU Sharing (Observed) x Remote Force Local Force Client 1 x Remote Force Local Force Client 2

37 GPU Sharing (Explained) CUDA is behaving reasonably, but Force calculation is actually two kernels Longer kernel writes to multiple arrays Shorter kernel combines output Possible solutions: Modiy CUDA to be less air (please!) Use locks (atomics) to merge kernels (not G80) Explicit inter-client coordination

38 Inter-client Communication First identiy which processes share a GPU Need to know physical node or each process GPU-assignment must reveal real device ID Threads don t eliminate the problem Production code can t make assumptions Token-passing is simple and predictable Rotate clients in ixed order High-priority, yield, low-priority, yield,

39 Token-Passing GPU-Sharing GPU1 GPU2 Local Remote Local Remote

40 GPU-Sharing with PME Local Remote Local Rem

41 Weakness o Token-Passing GPU is idle while token is being passed Busy client delays itsel and others Next strategy requires threads: One process per GPU, one thread per core Funnel CUDA calls through a single stream No local work until all remote work is queued Typically unnels MPI as well

42 Recent NAMD GPU Developments Production eatures in 2.7b2 release: Full electrostatics with PME 1-4 exclusions Constant-pressure simulation Improved orce accuracy: Patch-centered atom coordinates Increased precision o orce interpolation Perormance enhancements (in progress): Recursive bisection within patch on 32-atom boundaries Block-based pairlists based on sorted atoms Sort blocks in order o decreasing work

43 GPU-Accelerated NAMD Plans Serial perormance Target NVIDIA Fermi architecture Revisit GPU kernel design decisions made in 2007 Improve perormance o remaining CPU code Parallel scaling Target NSF Track 2D Keeneland cluster at ORNL Finer-grained work units on GPU (eature o Fermi) One process per GPU, one thread per CPU core Dynamic load balancing o GPU work Wider range o simulation options and eatures

44 Conclusions and Outlook CUDA today is suicient or Single-GPU acceleration (the mass market) Coarse-grained multi-gpu parallelism Enough work per call to spin up all multiprocessors Improvements in CUDA are needed or Assigning GPUs to processes Sharing GPUs between processes Fine-grained multi-gpu parallelism Fewer blocks per call than chip has multiprocessors Moving data between GPUs (same or dierent node) Eager to test Fermi architecture and eatures!

45 Acknowledgements Theoretical and Computational Biophysics Group, University o Illinois at Urbana-Champaign Pro. Wen-mei Hwu, Chris Rodrigues, IMPACT Group, University o Illinois at Urbana-Champaign Mike Showerman, Jeremy Enos, NCSA David Kirk, Massimiliano Fatica, others at NVIDIA UIUC NVIDIA CUDA Center o Excellence NIH support: P41-RR05969 Research/gpu/

Experience with NAMD on GPU-Accelerated Clusters

Experience with NAMD on GPU-Accelerated Clusters Experience with NAMD on GPU-Accelerated Clusters James Phillips John Stone Klaus Schulten Research/gpu/ Outline Motivational images o NAMD simulations Why all the uss about GPUs? What is message-driven

More information

NAMD, CUDA, and Clusters: Taking GPU Molecular Dynamics Beyond the Deskop

NAMD, CUDA, and Clusters: Taking GPU Molecular Dynamics Beyond the Deskop NAMD, CUDA, and Clusters: Taking GPU Molecular Dynamics Beyond the Deskop James Phillips Research/gpu/ Beckman Institute University of Illinois at Urbana-Champaign Theoretical and Computational Biophysics

More information

Accelerating NAMD with Graphics Processors

Accelerating NAMD with Graphics Processors Accelerating NAMD with Graphics Processors James Phillips John Stone Klaus Schulten Research/namd/ NAMD: Practical Supercomputing 24,000 users can t all be computer experts. 18% are NIH-funded; many in

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory

GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Presentation Outline NVIDIA

More information

GPU Acceleration of Molecular Modeling Applications

GPU Acceleration of Molecular Modeling Applications GPU Acceleration of Molecular Modeling Applications James Phillips John Stone Research/gpu/ NAMD: Practical Supercomputing 25,000 users can t all be computer experts. 18% are NIH-funded; many in other

More information

Graphics Processor Acceleration and YOU

Graphics Processor Acceleration and YOU Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have

More information

Accelerating Scientific Applications with GPUs

Accelerating Scientific Applications with GPUs Accelerating Scientific Applications with GPUs John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ Workshop on Programming Massively Parallel

More information

Refactoring NAMD for Petascale Machines and Graphics Processors

Refactoring NAMD for Petascale Machines and Graphics Processors Refactoring NAMD for Petascale Machines and Graphics Processors James Phillips Research/namd/ NAMD Design Designed from the beginning as a parallel program Uses the Charm++ idea: Decompose the computation

More information

Accelerating Computational Biology by 100x Using CUDA. John Stone Theoretical and Computational Biophysics Group, University of Illinois

Accelerating Computational Biology by 100x Using CUDA. John Stone Theoretical and Computational Biophysics Group, University of Illinois Accelerating Computational Biology by 100x Using CUDA John Stone Theoretical and Computational Biophysics Group, University of Illinois GPU Computing Commodity devices, omnipresent in modern computers

More information

Simulating Life at the Atomic Scale

Simulating Life at the Atomic Scale Simulating Life at the Atomic Scale James Phillips Beckman Institute, University of Illinois Research/namd/ Beckman Institute University of Illinois at Urbana-Champaign Theoretical and Computational Biophysics

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Accelerating Molecular Modeling Applications with Graphics Processors

Accelerating Molecular Modeling Applications with Graphics Processors Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference

More information

Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA

Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois

More information

Multilevel Summation of Electrostatic Potentials Using GPUs

Multilevel Summation of Electrostatic Potentials Using GPUs Multilevel Summation of Electrostatic Potentials Using GPUs David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at

More information

Improving NAMD Performance on Volta GPUs

Improving NAMD Performance on Volta GPUs Improving NAMD Performance on Volta GPUs David Hardy - Research Programmer, University of Illinois at Urbana-Champaign Ke Li - HPC Developer Technology Engineer, NVIDIA John Stone - Senior Research Programmer,

More information

Analysis and Visualization Algorithms in VMD

Analysis and Visualization Algorithms in VMD 1 Analysis and Visualization Algorithms in VMD David Hardy Research/~dhardy/ NAIS: State-of-the-Art Algorithms for Molecular Dynamics (Presenting the work of John Stone.) VMD Visual Molecular Dynamics

More information

In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD

In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University

More information

Immersive Out-of-Core Visualization of Large-Size and Long-Timescale Molecular Dynamics Trajectories

Immersive Out-of-Core Visualization of Large-Size and Long-Timescale Molecular Dynamics Trajectories Immersive Out-of-Core Visualization of Large-Size and Long-Timescale Molecular Dynamics Trajectories J. Stone, K. Vandivort, K. Schulten Theoretical and Computational Biophysics Group Beckman Institute

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

John E. Stone. S03: High Performance Computing with CUDA. Heterogeneous GPU Computing for Molecular Modeling

John E. Stone. S03: High Performance Computing with CUDA. Heterogeneous GPU Computing for Molecular Modeling S03: High Performance Computing with CUDA Heterogeneous GPU Computing for Molecular Modeling John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

NAMD GPU Performance Benchmark. March 2011

NAMD GPU Performance Benchmark. March 2011 NAMD GPU Performance Benchmark March 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Dell, Intel, Mellanox Compute resource - HPC Advisory

More information

GPU Particle-Grid Methods: Electrostatics

GPU Particle-Grid Methods: Electrostatics GPU Particle-Grid Methods: Electrostatics John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign Research/gpu/

More information

NAMD Serial and Parallel Performance

NAMD Serial and Parallel Performance NAMD Serial and Parallel Performance Jim Phillips Theoretical Biophysics Group Serial performance basics Main factors affecting serial performance: Molecular system size and composition. Cutoff distance

More information

Accelerating Molecular Modeling Applications with GPU Computing

Accelerating Molecular Modeling Applications with GPU Computing Accelerating Molecular Modeling Applications with GPU Computing John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Petascale Molecular Dynamics Simulations on GPU-Accelerated Supercomputers

Petascale Molecular Dynamics Simulations on GPU-Accelerated Supercomputers Petascale Molecular Dynamics Simulations on GPU-Accelerated Supercomputers James Phillips Beckman Institute, University of Illinois Research/namd/ GTC 2012 NAMD: Scalable Molecular Dynamics 2002 Gordon

More information

Interactive Supercomputing for State-of-the-art Biomolecular Simulation

Interactive Supercomputing for State-of-the-art Biomolecular Simulation Interactive Supercomputing for State-of-the-art Biomolecular Simulation John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

S05: High Performance Computing with CUDA. Case Study: Molecular Visualization and Analysis

S05: High Performance Computing with CUDA. Case Study: Molecular Visualization and Analysis Case Study: Molecular Visualization and Analysis John Stone NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/research/gpu/ Outline What speedups can be expected Fluorescence

More information

GPU-Accelerated Analysis of Large Biomolecular Complexes

GPU-Accelerated Analysis of Large Biomolecular Complexes GPU-Accelerated Analysis of Large Biomolecular Complexes John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign

More information

Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters

Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters Craig Steffen csteffen@ncsa.uiuc.edu NCSA Innovative Systems Lab First International Green Computing Conference Workshop

More information

GPU-Accelerated Analysis of Petascale Molecular Dynamics Simulations

GPU-Accelerated Analysis of Petascale Molecular Dynamics Simulations GPU-Accelerated Analysis of Petascale Molecular Dynamics Simulations John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois

More information

High Performance Molecular Simulation, Visualization, and Analysis on GPUs

High Performance Molecular Simulation, Visualization, and Analysis on GPUs High Performance Molecular Simulation, Visualization, and Analysis on GPUs John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

ECE 498AL. Lecture 21-22: Application Performance Case Studies: Molecular Visualization and Analysis

ECE 498AL. Lecture 21-22: Application Performance Case Studies: Molecular Visualization and Analysis ECE 498AL Lecture 21-22: Application Performance Case Studies: Molecular Visualization and Analysis Guest Lecture by John Stone Theoretical and Computational Biophysics Group NIH Resource for Macromolecular

More information

NAMD Performance Benchmark and Profiling. February 2012

NAMD Performance Benchmark and Profiling. February 2012 NAMD Performance Benchmark and Profiling February 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource -

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

DELIVERABLE D5.5 Report on ICARUS visualization cluster installation. John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS)

DELIVERABLE D5.5 Report on ICARUS visualization cluster installation. John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS) DELIVERABLE D5.5 Report on ICARUS visualization cluster installation John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS) 02 May 2011 NextMuSE 2 Next generation Multi-mechanics Simulation Environment Cluster

More information

GPU-Accelerated Visualization and Analysis of Petascale Molecular Dynamics Simulations

GPU-Accelerated Visualization and Analysis of Petascale Molecular Dynamics Simulations GPU-Accelerated Visualization and Analysis of Petascale Molecular Dynamics Simulations John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Visualizing Biomolecular Complexes on x86 and KNL Platforms: Integrating VMD and OSPRay

Visualizing Biomolecular Complexes on x86 and KNL Platforms: Integrating VMD and OSPRay Visualizing Biomolecular Complexes on x86 and KNL Platforms: Integrating VMD and OSPRay John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Porting Performance across GPUs and FPGAs

Porting Performance across GPUs and FPGAs Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE

More information

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

VMD: Immersive Molecular Visualization and Interactive Ray Tracing for Domes, Panoramic Theaters, and Head Mounted Displays

VMD: Immersive Molecular Visualization and Interactive Ray Tracing for Domes, Panoramic Theaters, and Head Mounted Displays VMD: Immersive Molecular Visualization and Interactive Ray Tracing for Domes, Panoramic Theaters, and Head Mounted Displays John E. Stone Theoretical and Computational Biophysics Group Beckman Institute

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

GPU Histogramming: Radial Distribution Functions

GPU Histogramming: Radial Distribution Functions GPU Histogramming: Radial Distribution Functions John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign

More information

ENABLING NEW SCIENCE GPU SOLUTIONS

ENABLING NEW SCIENCE GPU SOLUTIONS ENABLING NEW SCIENCE TESLA BIO Workbench The NVIDIA Tesla Bio Workbench enables biophysicists and computational chemists to push the boundaries of life sciences research. It turns a standard PC into a

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Threading Hardware in G80

Threading Hardware in G80 ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

GPU Clusters for High-Performance Computing

GPU Clusters for High-Performance Computing GPU Clusters for High-Performance Computing Volodymyr V. Kindratenko #1, Jeremy J. Enos #1, Guochun Shi #1, Michael T. Showerman #1, Galen W. Arnold #1, John E. Stone *2, James C. Phillips *2, Wen-mei

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,

More information

Optimisation Myths and Facts as Seen in Statistical Physics

Optimisation Myths and Facts as Seen in Statistical Physics Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY

More information

NAMD Performance Benchmark and Profiling. November 2010

NAMD Performance Benchmark and Profiling. November 2010 NAMD Performance Benchmark and Profiling November 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: HP, Mellanox Compute resource - HPC Advisory

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

QP: A Heterogeneous Multi-Accelerator Cluster

QP: A Heterogeneous Multi-Accelerator Cluster 10th LCI International Conference on High-Performance Clustered Computing March 10-12, 2009; Boulder, Colorado QP: A Heterogeneous Multi-Accelerator Cluster Michael Showerman 1, Jeremy Enos, Avneesh Pant,

More information

What does Heterogeneity bring?

What does Heterogeneity bring? What does Heterogeneity bring? Ken Koch Scientific Advisor, CCS-DO, LANL LACSI 2006 Conference October 18, 2006 Some Terminology Homogeneous Of the same or similar nature or kind Uniform in structure or

More information

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Bifrost GPU architecture and the ARM Mali-G71 GPU The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our

More information

The Parallel Revolution in Computational Science and Engineering

The Parallel Revolution in Computational Science and Engineering The Parallel Revolution in Computational Science and Engineering applications, education, tools, and impact Wen-mei Hwu University of Illinois, Urbana-Champaign The Energy Behind Parallel Revolution Calculation:

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

GPU Accelerated Visualization and Analysis in VMD and Recent NAMD Developments

GPU Accelerated Visualization and Analysis in VMD and Recent NAMD Developments GPU Accelerated Visualization and Analysis in VMD and Recent NAMD Developments John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University

More information

Addressing Heterogeneity in Manycore Applications

Addressing Heterogeneity in Manycore Applications Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction

More information

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

High Performance Computing (HPC) Introduction

High Performance Computing (HPC) Introduction High Performance Computing (HPC) Introduction Ontario Summer School on High Performance Computing Scott Northrup SciNet HPC Consortium Compute Canada June 25th, 2012 Outline 1 HPC Overview 2 Parallel Computing

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications

Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications Sep 2009 Gilad Shainer, Tong Liu (Mellanox); Jeffrey Layton (Dell); Joshua Mora (AMD) High Performance Interconnects for

More information

Harnessing GPU speed to accelerate LAMMPS particle simulations

Harnessing GPU speed to accelerate LAMMPS particle simulations Harnessing GPU speed to accelerate LAMMPS particle simulations Paul S. Crozier, W. Michael Brown, Peng Wang pscrozi@sandia.gov, wmbrown@sandia.gov, penwang@nvidia.com SC09, Portland, Oregon November 18,

More information

Analyzing Performance and Power of Applications on GPUs with Dell 12G Platforms. Dr. Jeffrey Layton Enterprise Technologist HPC

Analyzing Performance and Power of Applications on GPUs with Dell 12G Platforms. Dr. Jeffrey Layton Enterprise Technologist HPC Analyzing Performance and Power of Applications on GPUs with Dell 12G Platforms Dr. Jeffrey Layton Enterprise Technologist HPC Why GPUs? GPUs have very high peak compute capability! 6-9X CPU Challenges

More information

Accelerating Molecular Modeling Applications with GPU Computing

Accelerating Molecular Modeling Applications with GPU Computing Accelerating Molecular Modeling Applications with GPU Computing John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

2009: The GPU Computing Tipping Point. Jen-Hsun Huang, CEO

2009: The GPU Computing Tipping Point. Jen-Hsun Huang, CEO 2009: The GPU Computing Tipping Point Jen-Hsun Huang, CEO Someday, our graphics chips will have 1 TeraFLOPS of computing power, will be used for playing games to discovering cures for cancer to streaming

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Wen-mei Hwu with David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, Kuangwei Huang Overview

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information