GPU for HPC. October 2010


GPU for HPC
Simone Melchionna (EPFL/EDMX) simone.melchionna@epfl.ch
Jonas Latt (EPFL/EDMX) jonas.latt@epfl.ch
Francis Lapique (EPFL/DIT) francis.lapique@epfl.ch
October 2010

Moore's law: in the old days, computing power increased exponentially

The free lunch is over: no further increase of clock rate

A limit to clock rate: power consumption (figure: power cost vs. clock rate)

Another limit: memory access

Check Point: The free lunch is over: no further automatic increase of CPU frequency. Our only chance to keep up with Moore's law: parallel programming.

Needs
- Must use all cores efficiently
- Careful data and memory management
- Must rethink software design
- Must rethink algorithms
- Must learn new skills!

GPU (Graphics Processing Unit)
- PC hardware dedicated to 3D graphics
- Massively parallel SIMD processor
- Performance pushed by the game industry

Games and Graphics

Computer Games
- PC games business: $11 billion/year market ('08)
- 111 million GPUs shipped in 2008
- 1/3 of all PCs have more than one GPU
- High-end GPUs sell for around $300

GPGPU: General-Purpose computing on the GPU
- Started in the computer graphics research community
- Maps computational problems to the graphics rendering pipeline

Speed-ups

Why GPU computing?
- GPU is fast: massively parallel
  - CPU: ~4 cores @ 3.2 GHz (Intel Quad Core)
  - GPU: ~30 cores @ 1.3 GHz (NVIDIA GT200)
- Programmable: NVIDIA CUDA, DirectX Compute Shader, OpenCL
- High-precision floating-point support: 64-bit floating point (IEEE 754)
- Inexpensive desktop supercomputer: NVIDIA Tesla C1060: ~1 Tflops @ $1000

NVIDIA: Company History
- 1993: NVIDIA is founded by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem.
- 1995: NVIDIA introduces the NV1, the first mainstream multimedia processor.
- 1997: NVIDIA introduces the RIVA 128 (Real-time Interactive Video Animation) 3-D graphics chip, the first high-performance, 128-bit Direct3D processor.
- 1999: NVIDIA goes public in January.
- 2000: Microsoft Corporation selects NVIDIA to provide the graphics processors for its forthcoming gaming console, the Xbox.
- 2001: NVIDIA introduces the GeForce3, the industry's first programmable graphics processor.
- 2006: The CUDA project is announced together with the G80 in November; a public beta of the CUDA SDK is released in February 2007.

CPU vs GPU: FLOPS

CPU vs GPU: Memory Bandwidth

CPU vs GPU Power Consumption: Flops per Watt. Green500 list: ranks computers by the rate of computation delivered for every watt of power consumed.

To understand this difference between CPU and GPU, let's investigate the architecture of a CPU.

Example: AMD Opteron

Why are CPUs so complicated? Instruction-level parallelism ("superscalar" processors): more than one instruction is executed during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.

Instruction-level parallelism
- Compiler extracts the best performance, reordering instructions if necessary.
- Out-of-order CPU execution avoids delays while waiting for read/write or earlier operations.
- Branch prediction minimises delays due to conditional branching (loops, if-then-else).
- Memory hierarchy delivers data to registers fast enough to feed the processor.
These all limit the number of pipelines that can be used and increase chip complexity: perhaps 90% of an Intel chip is devoted to control and data?

Comparison: GPU is much simpler than CPU
Intel Core 2 / Xeon / i7:
- 4 MIMD cores
- few registers, multilevel caches
- 5-10 GB/s bandwidth to main memory
NVIDIA GTX 280:
- 240 cores, arranged as 30 units each with 8 SIMD cores
- lots of registers, almost no cache
- 5 GB/s bandwidth to host processor (PCIe x16 gen 2)
- 140 GB/s bandwidth to graphics memory

Comparison: GPU is much simpler than CPU
GPU:
- Up to 240 cores on a single chip
- Simplified logic (minimal caching, no out-of-order execution, no branch prediction)
- Most of the chip is devoted to floating-point computation
- Usually arranged as multiple units, with each unit being effectively a vector unit
- Very high bandwidth (up to 140 GB/s) to graphics memory (up to 4 GB)

Multi-threaded parallelism on CPU: two completely independent instruction streams. 2 cores = 2 simultaneous instruction streams.

Thread-level parallelism on GPU: common instruction stream for groups of functional units.

NVIDIA GeForce GTX 285 core
- Groups of 32 threads share instruction streams (called warps)
- Up to 32 groups are simultaneously interleaved
- Up to 1024 fragment contexts can be stored

NVIDIA GeForce GTX 285: there are 30 of these cores on the GTX 285, for over 30,000 threads!

SIMD vs MIMD
MIMD (Multiple Instruction / Multiple Data):
- each core operates independently
- each can be working with different code, performing different operations on entirely different data
SIMD (Single Instruction / Multiple Data):
- all cores execute the same instruction at the same time, but work on different data
- only one instruction decoder is needed to control all cores
- functions like a vector unit

Summary: two ways of handling parallelism
CPU: instruction-level parallelism with branch prediction.
GPU: MIMD model for thread-level parallelism across cores. Simplified hardware, no branch prediction. Processor is packed full of ALUs (by sharing an instruction stream across groups of threads). SIMD execution model.

CPU-style memory: CPU cores run efficiently when data is resident in cache (reduce latency, provide high bandwidth).

GPU-style memory: more ALUs, no traditional cache hierarchy. Need a high-bandwidth connection to memory.

GPU-style memory
On a high-end GPU:
- 11x the compute performance of a high-end CPU
- only 6x the bandwidth to feed it
- no complicated cache hierarchy
The GPU memory system is designed for throughput:
- wide bus (150 GB/s)
- repack/reorder/interleave memory requests to maximize use of the memory bus

Data Throughput

What is CUDA? CUDA (Compute Unified Device Architecture) is NVIDIA's unified hardware and software specification for parallel computation. As an enabling hardware and software technology, CUDA makes it possible to use the many computing cores in a graphics processor to perform general-purpose mathematical calculations, achieving dramatic speedups in computing performance.

Books, links
- CUDA 2.x Programming Guide, NVIDIA
- GPU Gems 3 by Hubert Nguyen (Hardcover, Aug 12, 2007)
- Introduction to Parallel Computing (2nd Edition) by Ananth Grama, George Karypis, Vipin Kumar, and Anshul Gupta (Hardcover, Jan 26, 2003)
- CUDA Zone: Education

GPGPU/CUDA Application Fields

Performance/Development: Streaming SIMD Extensions