Hybrid Architectures Why Should I Bother?

Size: px
Start display at page:

Download "Hybrid Architectures Why Should I Bother?"

Transcription

1 Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering, July 8 19,

2 The Simulation Pipeline v a l i d a t i o n phenomenon, process etc. mathematical model numerical algorithm simulation code results to interpret statement tool modelling numerical treatment parallel implementation visualization embedding Computer Simulations in Science and Engineering, July 8 19,

3 Parallel Computing Faster, Bigger, More Why parallel high performance computing: Response time: compute a problem in 1 p time speed up engineering processes real-time simulations (tsunami warning?) Problem size: compute a p-times bigger problem simulation of large-/multi-scale phenomena maximal problem size that fits into the machine validation of smaller, operational models Throughput: compute p problems at once case and parameter studies, statistical risk scenarios, etc. (hazard maps, data base for tsunami warning,... ) massively distributed computing (SETI@home, e.g.) Computer Simulations in Science and Engineering, July 8 19,

4 Part I High Performance Computing in CSE Past(?) and Present Trends Computer Simulations in Science and Engineering, July 8 19,

5 The Seven Dwarfs of HPC dwarfs = key algorithmic kernels in many scientific computing applications P. Colella (LBNL), 2004: 1. dense linear algebra 2. sparse linear algebra 3. spectral methods 4. N-body methods 5. structured grids 6. unstructured grids 7. Monte Carlo Tsunami & storm-surge simulation: usually PDE solvers on structured or unstructured meshes SWE: a simple shallow water solver on Cartesian grids Computer Simulations in Science and Engineering, July 8 19,

6 Computational Science Demands a New Paradigm Computational simulation must meet three challenges to become a mature partner of theory and experiment (Post & Votta, 2005) 1. performance challenge exponential growth of performance, massively parallel architectures 2. programming challenge new (parallel) programming models 3. prediction challenge careful verification and validation of codes; towards reproducible simulation experiments Computer Simulations in Science and Engineering, July 8 19,

7 Four Horizons for Enhancing the Performance of Parallel Simulations Based on Partial Differential Equations (David Keyes, 2000) 1. Expanded Number of Processors in 2000: 1000 cores; in 2010: 200,000 cores 2. More Efficient Use of Faster Processors PDF working-sets, cache efficiency 3. More Architecture-Friendly Algorithms improve temporal/spatial locality 4. Algorithms Delivering More Science per Flop adaptivity (in space and time), higher-order methods, fast solvers Computer Simulations in Science and Engineering, July 8 19,

8 Performance Development in Supercomputing (source: Computer Simulations in Science and Engineering, July 8 19,

9 Top 500 ( June 2013 Computer Simulations in Science and Engineering, July 8 19,

10 Top 500 Spotlights Tianhe-2 and K Computer Tianhe-2/MilkyWay-2 Intel Xeon Phi (NUDT) 3.1 mio cores(!) Intel Ivy Bridge and Xeon Phi Linpack benchmark: 33.8 PFlop/s 17 MW power(!!) Knights Corner / Intel Xeon Phi / Intel MIC as accelerator 61 cores, roughly GHz Titan Cray XK7, NVIDIA K20x (ORNL) 18,688 compute nodes; 300,000 Opteron cores 18,688 NVIDIA Tesla K20 GPUs Linpack benchmark: 17.6 PFlop/s 8.2 MW power Computer Simulations in Science and Engineering, July 8 19,

11 Top 500 Spotlights Sequoia and K Computer Sequoia IBM BlueGene/Q (LLNL) 98,304 compute nodes; 1.6 mio cores Linpack benchmark: 17.1 PFlop/s 8 MW power K Computer SPARC64 (RIKEN, Kobe) 88,128 processors; 705,024 cores Linpack benchmark: PFlop/s 12 MW power SPARC64 VIIIfx 2.0GHz (8-core CPU) Computer Simulations in Science and Engineering, July 8 19,

12 Performance Development in Supercomputing (source: Computer Simulations in Science and Engineering, July 8 19,

13 International Exascale Software Project Roadmap Towards an Exa-Flop/s Platform in 2018 ( 1. technology trends concurrency, reliability, power consumption,... blueprint of an exascale system: 10-billion-way concurrency, 100 million to 1 billion cores, 10-to-100-way concurrency per core, hundreds of cores per die, science trends climate, high-energy physics, nuclear physics, fusion energy sciences, materials science and chemistry, X-stack (software stack for exascale) energy, resiliency, heterogeneity, I/O and memory 4. Polito-economic trends exascale systems run by government labs, used by CSE scientists Computer Simulations in Science and Engineering, July 8 19,

14 Exascale Roadmap Aggressively Designed Strawman Architecture Level What Perform. Power RAM FPU FPU, regs,. instr.-memory 1.5 Gflops 30mW 4 FPUs, L1 6 Gflops 141mW Proc. Chip 742 cores, L2/L3, Interconn. 4.5 Tflops 214W Node Proc. chip, DRAM 4.5 Tflops 230W 16 GB Group 12 proc. chips, routers 54 Tflops 3.5KW 192 GB rack 32 groups 1.7 Pflops 116KW 6.1 TB System 583 racks 1 Eflops 67.7MW 3.6 PB approx. 285,000 cores per rack; 166 mio cores in total Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems Computer Simulations in Science and Engineering, July 8 19,

15 Exascale Roadmap Should You Bother? Your department s compute cluster in 5 years? a Petaflop System! one rack of the Exaflop system using the same/similar hardware extrapolated example machine: peak performance: 1.7 PFlop/s 6 TB RAM, 60 GB cache memory total concurrency : number of cores: 280, 000 number of chips: 384 Source: ExaScale Software Study: Software Challenges in Extreme Scale Systems Computer Simulations in Science and Engineering, July 8 19,

16 Your Department s PetaFlop/s Cluster in 5 Years? Tianhe-1A (Tianjin, China; Top500 # 10 ) 14,336 Xeon X5670 CPUs 7,168 Nvidia Tesla M2050 GPUs Linpack benchmark: 2.6 PFlop/s 4 MW power Stampede (Intel, Top500 # 6) 102,400 cores (incl. Xeon Phi: MIC/ many integrated cores ) Linpack benchmark: 5 PFlop/s Knights Corner / Intel Xeon Phi / Intel MIC as accelerator 61 cores, roughly GHz wider vector FP units: 64 bytes (i.e., 16 floats, 8 doubles) 4.5 MW power Computer Simulations in Science and Engineering, July 8 19,

17 Free Lunch is Over ( )... actually already over for quite some time! Speedup of your software can only come from parallelism: clock speed of CPU has stalled instruction-level parallelism per core has stalled number of cores is growing size of vector units is growing ( ) Quote and image taken from: H. Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb s Journal 30(3), March Computer Simulations in Science and Engineering, July 8 19,

18 Manycore CPU Intel MIC Architecture Intel MIC Architecture: An Intel Co-Processor Architecture FIXED FUNCTION LOGIC VECTOR VECTOR VECTOR IA CORE IA CORE IA CORE INTERPROCESSOR NETWORK COHERENT COHERENT COHERENT CACHE CACHE CACHE COHERENT COHERENT COHERENT CACHE CACHE CACHE INTERPROCESSOR NETWORK VECTOR VECTOR VECTOR IA CORE IA CORE IA CORE VECTOR IA CORE COHERENT CACHE COHERENT CACHE VECTOR IA CORE MEMORY and I/O INTERFACES Many cores and many, many more threads Standard IA programming and memory model (source: Intel/K. Skaugen SC 10 keynote presentation) Computer Simulations in Science and Engineering, July 8 19,

19 Manycore CPU Intel MIC Architecture (2) Diagram of a Knights Corner core: Figure 4: Knights Corner (source: An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors) The cores are in-order dual issue x86 processor cores which trace some history to the original Computer Simulations in Science and Engineering, July 8 19, Pentium design, but with the addition of 64-bit support, four hardware threads per core, power

20 cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The Technische Universität 512 CUDA München cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread GPGPU global scheduler NVIDIA distributes thread Fermi blocks to SM thread schedulers. Fermi s 16 SM are positioned (source: around NVIDIA a common Fermi Whitepaper) L2 cache. Each SM is a vertical rectangular strip that contain an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache). Computer Simulations in Science and Engineering, July 8 19,

21 cessor Technische Universität München GPGPU NVIDIA Fermi (2) eneration SM introduces several ral innovations that make it not only the erful SM yet built, Third Generation but also Streaming the most able and efficient. Multiprocessor Performance CUDA cores eatures 32 CUDA s a fourfold ver prior SM ach CUDA has a fully integer arithmetic The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient. CUDA 512 High Performance CUDA cores Each SM features 32 CUDA Dispatch Port CUDA processors a fourfold Operand Collector Dispatch Port Operand Collector increase over prior SM designs. Each CUDA FP Unit INT Unit processor has a fully FP Unit INT Unit pipelined integer arithmetic Result Queue logic unit (ALU) and floating point unit (FPU). Prior GPUs Result used Queue IEEE floating point arithmetic. The Fermi architecture implements the new IEEE floating-point standard, providing the fused multiply-add (FMA) ALU) and floating (FPU). Prior GPUs instruction used for both IEEE single and double precision arithmetic. FMA improves over a multiply-add int arithmetic. The Fermi architecture ts the new IEEE loss of precision in the addition. floating-point FMA is more (source: NVIDIA accurate Fermi than Whitepaper) performing the operations providing the fused multiply-add (FMA) for both single and double precision. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no Warp Scheduler Dispatch Unit Fermi Streaming Multiprocessor (SM) separately. GT200 implemented double precision FMA. Warp Scheduler Dispatch Unit In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly Computer Simulations in Science and Engineering, July 8 19, 2013 Interconnect Network 21 designed integer ALU supports full 32-bit precision for all instructions, consistent with standard Warp Scheduler Dispatch Unit Instruction Cache Register File (32,768 x 32-bit) Interconnect Network 64 KB Shared Memory / L1 Cache Uniform Cache Register File (32,768 x 32-bit) Warp Scheduler Dispatch Unit SFU SFU SFU SFU SFU SFU SFU SFU

22 GPGPU NVIDIA Fermi (3) General Purpose Graphics Processing Unit: 512 CUDA cores Memory Subsystem Innovations improved double precision performance shared vs. global program memory behavior. new: L1 und L2 cache (768 KB) trend from GPU towards CPU? NVIDIA Parallel DataCache TM with Configurable L1 and Unified L2 Cache Working with hundreds of GPU computing applications from various industries, we learned that while Shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to Shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both Shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy adapts to both types of Adding a true cache hierarchy for load / store operations presented significant challenges. Traditional GPU architectures support a read-only load path for texture operations and a write-only export path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read after write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write / export path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data. The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2 Computer Simulations in Science and Engineering, cache Julythat 8 19, services 2013all operations (load, store and texture). The per-sm L1 cache is 22

23 Parallel Computing Paradigms Not exactly sure how the hardware will look like... (CPU-style, GPU-style, something new?) However: massively parallel programming required revival of vector computing several/many FPUs performing the same operation hybrid/heterogenous achitectures different kind of cores; dedicated accelerator hardware different access to memory cache and cache coherency small amount of memory per core our concern in this course: data parallelism vectorisation (and a look into GPU computing) Computer Simulations in Science and Engineering, July 8 19,

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Steve Scott, Tesla CTO SC 11 November 15, 2011

Steve Scott, Tesla CTO SC 11 November 15, 2011 Steve Scott, Tesla CTO SC 11 November 15, 2011 What goal do these products have in common? Performance / W Exaflop Expectations First Exaflop Computer K Computer ~10 MW CM5 ~200 KW Not constant size, cost

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

Intel Many Integrated Core (MIC) Architecture

Intel Many Integrated Core (MIC) Architecture Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Accelerating High Performance Computing.

Accelerating High Performance Computing. Accelerating High Performance Computing http://www.nvidia.com/tesla Computing The 3 rd Pillar of Science Drug Design Molecular Dynamics Seismic Imaging Reverse Time Migration Automotive Design Computational

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

Parallel Computer Architecture - Basics -

Parallel Computer Architecture - Basics - Parallel Computer Architecture - Basics - Christian Terboven 19.03.2012 / Aachen, Germany Stand: 15.03.2012 Version 2.3 Rechen- und Kommunikationszentrum (RZ) Agenda Processor

More information

PART I - Fundamentals of Parallel Computing

PART I - Fundamentals of Parallel Computing PART I - Fundamentals of Parallel Computing Objectives What is scientific computing? The need for more computing power The need for parallel computing and parallel programs 1 What is scientific computing?

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs Introduction Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University How to.. Process terabytes

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Oliver Meister November 7 th 2012 Tutorial Parallel Programming and High Performance Computing, November 7 th 2012 1 References D. Kirk, W. Hwu: Programming Massively Parallel Processors,

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

John Hengeveld Director of Marketing, HPC Evangelist

John Hengeveld Director of Marketing, HPC Evangelist MIC, Intel and Rearchitecting for Exascale John Hengeveld Director of Marketing, HPC Evangelist Intel Data Center Group Dr. Jean-Laurent Philippe, PhD Technical Sales Manager & Exascale Technical Lead

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Fundamentals Parallel Architectures, Models, and Languages Michael Bader Winter 2013/2014 Fundamentals Parallel Architectures, Models, and Languages, Winter 2013/2014 1

More information

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories HPC Benchmarking Presentations: Jack Dongarra, University of Tennessee & ORNL The HPL Benchmark: Past, Present & Future Mike Heroux, Sandia National Laboratories The HPCG Benchmark: Challenges It Presents

More information

8/28/12. CSE 820 Graduate Computer Architecture. Richard Enbody. Dr. Enbody. 1 st Day 2

8/28/12. CSE 820 Graduate Computer Architecture. Richard Enbody. Dr. Enbody. 1 st Day 2 CSE 820 Graduate Computer Architecture Richard Enbody Dr. Enbody 1 st Day 2 1 Why Computer Architecture? Improve coding. Knowledge to make architectural choices. Ability to understand articles about architecture.

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

HIGH-PERFORMANCE COMPUTING

HIGH-PERFORMANCE COMPUTING HIGH-PERFORMANCE COMPUTING WITH NVIDIA TESLA GPUS Timothy Lanfear, NVIDIA WHY GPU COMPUTING? Science is Desperate for Throughput Gigaflops 1,000,000,000 1 Exaflop 1,000,000 1 Petaflop Bacteria 100s of

More information

Parallel Programming

Parallel Programming Parallel Programming Introduction Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Acknowledgements Prof. Felix Wolf, TU Darmstadt Prof. Matthias

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing

More information

Race to Exascale: Opportunities and Challenges. Avinash Sodani, Ph.D. Chief Architect MIC Processor Intel Corporation

Race to Exascale: Opportunities and Challenges. Avinash Sodani, Ph.D. Chief Architect MIC Processor Intel Corporation Race to Exascale: Opportunities and Challenges Avinash Sodani, Ph.D. Chief Architect MIC Processor Intel Corporation Exascale Goal: 1-ExaFlops (10 18 ) within 20 MW by 2018 1 ZFlops 100 EFlops 10 EFlops

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Numerical Simulation on the GPU

Numerical Simulation on the GPU Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle

More information

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29 Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION

More information

Thread and Data parallelism in CPUs - will GPUs become obsolete?

Thread and Data parallelism in CPUs - will GPUs become obsolete? Thread and Data parallelism in CPUs - will GPUs become obsolete? USP, Sao Paulo 25/03/11 Carsten Trinitis Carsten.Trinitis@tum.de Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) Institut für

More information

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear

More information

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU. Basics of s Basics Introduction to Why vs CPU S. Sundar and Computing architecture August 9, 2014 1 / 70 Outline Basics of s Why vs CPU Computing architecture 1 2 3 of s 4 5 Why 6 vs CPU 7 Computing 8

More information

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

arxiv: v1 [physics.comp-ph] 4 Nov 2013

arxiv: v1 [physics.comp-ph] 4 Nov 2013 arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department

More information

Fabio AFFINITO.

Fabio AFFINITO. Introduction to High Performance Computing Fabio AFFINITO What is the meaning of High Performance Computing? What does HIGH PERFORMANCE mean??? 1976... Cray-1 supercomputer First commercial successful

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Programming GPUs with CUDA. Prerequisites for this tutorial. Commercial models available for Kepler: GeForce vs. Tesla. I.

Programming GPUs with CUDA. Prerequisites for this tutorial. Commercial models available for Kepler: GeForce vs. Tesla. I. Programming GPUs with CUDA Tutorial at 1th IEEE CSE 15 and 13th IEEE EUC 15 conferences Prerequisites for this tutorial Porto (Portugal). October, 20th, 2015 You (probably) need experience with C. You

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

CSE5351: Parallel Procesisng. Part 1B. UTA Copyright (c) Slide No 1

CSE5351: Parallel Procesisng. Part 1B. UTA Copyright (c) Slide No 1 Slide No 1 CSE5351: Parallel Procesisng Part 1B Slide No 2 State of the Art In Supercomputing Several of the next slides (or modified) are the courtesy of Dr. Jack Dongarra, a distinguished professor of

More information

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles

More information

NVIDIA Fermi Architecture

NVIDIA Fermi Architecture Administrivia NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 4 grades returned Project checkpoint on Monday Post an update on your blog beforehand Poster

More information

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com History of GPUs

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

Overview of Tianhe-2

Overview of Tianhe-2 Overview of Tianhe-2 (MilkyWay-2) Supercomputer Yutong Lu School of Computer Science, National University of Defense Technology; State Key Laboratory of High Performance Computing, China ytlu@nudt.edu.cn

More information

HPC Technology Trends

HPC Technology Trends HPC Technology Trends High Performance Embedded Computing Conference September 18, 2007 David S Scott, Ph.D. Petascale Product Line Architect Digital Enterprise Group Risk Factors Today s s presentations

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Parallel Programming on Ranger and Stampede

Parallel Programming on Ranger and Stampede Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Parallel Computing Accelerators John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk This is the 50,000 ft. view of the parallel computing landscape. We want

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

GPU Computing with Fornax. Dr. Christopher Harris

GPU Computing with Fornax. Dr. Christopher Harris GPU Computing with Fornax Dr. Christopher Harris ivec@uwa CAASTRO GPU Training Workshop 8-9 October 2012 Introducing the Historical GPU Graphics Processing Unit (GPU) n : A specialised electronic circuit

More information

NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas

NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -

More information

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms Liang Men, Miaoqing Huang, John Gauch Department of Computer Science and Computer Engineering University of Arkansas {mliang,mqhuang,jgauch}@uark.edu

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

A study on linear algebra operations using extended precision floating-point arithmetic on GPUs

A study on linear algebra operations using extended precision floating-point arithmetic on GPUs A study on linear algebra operations using extended precision floating-point arithmetic on GPUs Graduate School of Systems and Information Engineering University of Tsukuba November 2013 Daichi Mukunoki

More information

Godson Processor and its Application in High Performance Computers

Godson Processor and its Application in High Performance Computers Godson Processor and its Application in High Performance Computers Weiwu Hu Institute of Computing Technology, Chinese Academy of Sciences Loongson Technologies Corporation Limited hww@ict.ac.cn 1 Contents

More information

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MIC workshop Guillaume Colin de Verdière MARCH 17 TH, 2015 MIC Workshop PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France March 17th, 2015 Overview Context

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012

Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012 Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product

More information

Performance Benefits of NVIDIA GPUs for LS-DYNA

Performance Benefits of NVIDIA GPUs for LS-DYNA Performance Benefits of NVIDIA GPUs for LS-DYNA Mr. Stan Posey and Dr. Srinivas Kodiyalam NVIDIA Corporation, Santa Clara, CA, USA Summary: This work examines the performance characteristics of LS-DYNA

More information

Sparse Linear Algebra in CUDA

Sparse Linear Algebra in CUDA Sparse Linear Algebra in CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 22 nd 2017 Table of Contents Homework - Worksheet 2

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

The End of Denial Architecture and The Rise of Throughput Computing

The End of Denial Architecture and The Rise of Throughput Computing The End of Denial Architecture and The Rise of Throughput Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University Outline! Performance = Parallelism

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information