Why GPUs? Robert Strzodka (MPII), Dominik Göddeke (TUDo), Dominik Behr (AMD)


Why GPUs? Robert Strzodka (MPII), Dominik Göddeke (TUDo), Dominik Behr (AMD). Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, 2009. www.gpgpu.org/ppam2009

Overview: Computation / Bandwidth / Power; GPU Characteristics

Data Processing in General
[Diagram: data flows IN and OUT between memory and the processor.] Two fundamental obstacles: the memory wall and the lack of parallelism.

Old and New Wisdom in Computer Architecture
Old: power is free, transistors are expensive. New: power wall. Power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on).
Old: multiplies are slow, memory access is fast. New: memory wall. Multiplies are fast, memory is slow (about 200 clocks to DRAM versus 4 clocks for a floating-point multiply).
Old: increase instruction-level parallelism via compilers and hardware innovation (out-of-order execution, speculation, VLIW, ...). New: ILP wall. Diminishing returns on more ILP hardware; explicit thread and data parallelism must be exploited.
New: power wall + memory wall + ILP wall = brick wall. (Slide courtesy of Christos Kozyrakis.)
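As a back-of-the-envelope check of the memory wall, the slide's own numbers (about 200 clocks per DRAM access versus 4 clocks per floating-point multiply) imply how much arithmetic a kernel must perform per memory access before compute, rather than bandwidth, becomes the limit. A minimal sketch, using only those illustrative figures:

```c
/* Back-of-the-envelope memory-wall arithmetic, using the slide's
 * illustrative figures: ~200 clocks per DRAM access, ~4 clocks per
 * floating-point multiply. Not measured values. */
int multiplies_per_dram_access(int dram_clocks, int multiply_clocks) {
    /* how many multiplies fit in the time of one memory access */
    return dram_clocks / multiply_clocks;
}
```

With the quoted figures this gives 200 / 4 = 50 multiplies per memory access, which is why bandwidth rather than raw compute limits most kernels.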

Uniprocessor Performance (SPECint)
[Chart from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006: SPECint performance relative to the VAX-11/780, 1978 to 2006; growth of 25%/year until the mid-1980s, then 52%/year, then flattening after about 2002.] Sea change in chip design: multiple cores or processors per chip. (Slide courtesy of Christos Kozyrakis.)

SW Performance: FeatFlow 1993-2008
[Chart: best and average FeatFlow performance, 1993 to 2008.] An 80x speedup in 16 years from hardware, for free. But hardware peak performance gained 1000x over the same period, and since 2006 serial performance has stagnated. There is no future for serial code: parallelism is indispensable.

Instruction-Stream-Based Processing
[Diagram: an instruction stream drives the processor; data moves between memory and the processor through a cache.]

Instruction- and Data-Streams
Addition of 2D arrays: C = A + B.
Instruction stream processing the data:

  for(y=0; y<height; y++)
    for(x=0; x<width; x++) {
      C[y][x] = A[y][x] + B[y][x];
    }

Data streams undergoing a kernel operation:

  inputstreams(a,b);
  outputstream(c);
  kernelprogram(op_add);
  processstreams();
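The two views above can be sketched side by side in plain C. The stream-style names below (kernel_add, process_streams) are only illustrative stand-ins for a stream API, and the scheduler loop stands in for hardware that would launch one thread per element:

```c
#include <stddef.h>

#define W 4
#define H 3

/* Instruction-stream view: one instruction stream walks all the data. */
void add_loop(float C[H][W], float A[H][W], float B[H][W]) {
    for (size_t y = 0; y < H; y++)
        for (size_t x = 0; x < W; x++)
            C[y][x] = A[y][x] + B[y][x];
}

/* Data-stream view: a kernel operates on one element, independently of
 * all others. Hypothetical API names, for illustration only. */
static void kernel_add(float *c, const float *a, const float *b) {
    *c = *a + *b;
}

void process_streams(float C[H][W], float A[H][W], float B[H][W]) {
    /* these loops stand in for the hardware scheduler: every (x, y)
     * invocation is independent and could run in parallel */
    for (size_t y = 0; y < H; y++)
        for (size_t x = 0; x < W; x++)
            kernel_add(&C[y][x], &A[y][x], &B[y][x]);
}
```

Both produce the same result; the difference is that the second form exposes the element-wise independence that a stream processor or GPU exploits.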

Data-Stream-Based Processing
[Diagram: data streams from memory through parallel pipelines in the processor and back to memory; a separate configuration input sets up the pipelines.]

Architectures: Data-Processor Locality
Field Programmable Gate Array (FPGA): compute by configuring Boolean functions and local memory.
Processor Array / Multi-core Processor: assemble many (simple) processors and memories on one chip.
Processor-in-Memory (PIM): insert processing elements directly into RAM chips.
Stream Processor: create data locality through a hierarchy of memories.
Graphics Processing Unit (GPU): hide data access latencies by keeping thousands of threads in flight.
GPUs often excel in the performance/price ratio.

Overview: Computation / Bandwidth / Power; GPU Characteristics

The GPU is a Fast, Highly Multi-Threaded Processor
Input arrays: n-dimensional. Output arrays: n-dimensional. Thousands of parallel threads are started in groups of m, e.g. 32. Each group operates in a SIMD fashion, with predication if necessary. In general the threads are independent, but certain collections of groups may use on-chip memory to exchange data.
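SIMD execution with predication, as described above, can be sketched on the CPU: all lanes of a group execute both paths of a branch, and a per-lane predicate masks the writes so each lane keeps only the result it needs. The group width of 32 follows the slide; everything else is an illustrative sketch, not real GPU hardware behavior:

```c
#include <stdbool.h>

#define GROUP 32  /* SIMD group width, per the slide (e.g. 32) */

/* Per-lane absolute value via predication: no lane branches on its own;
 * instead every lane evaluates the condition, and the conditional write
 * is masked by the per-lane predicate. */
void simd_abs(float v[GROUP]) {
    bool pred[GROUP];
    /* step 1: all lanes evaluate the branch condition */
    for (int lane = 0; lane < GROUP; lane++)
        pred[lane] = (v[lane] < 0.0f);
    /* step 2: all lanes execute the "then" path; the predicate
     * decides whether the result is actually written back */
    for (int lane = 0; lane < GROUP; lane++)
        if (pred[lane])
            v[lane] = -v[lane];
}
```

The cost model this implies: when lanes in a group disagree on a branch, the group pays for both paths, which is why divergence inside a SIMD group is expensive.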

Input and Output Arrays
Single-threaded: input and output arrays may overlap. Multi-threaded: input and output arrays should not overlap. [Diagram: overlapping versus disjoint input/output regions.]
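Why overlap is harmless with one thread but dangerous with many can be sketched by running the same per-element "thread body" in different orders; with disjoint arrays every order gives the same answer, but in place the result depends on scheduling, which on a GPU is unspecified. The helper names below are hypothetical:

```c
#define N 5

/* One "thread" per element i: out[i] = in[i] + in[i-1]. */
void thread_body(int i, float *out, const float *in) {
    out[i] = (i > 0) ? in[i] + in[i - 1] : in[i];
}

/* Run the N thread bodies sequentially in a given order, standing in
 * for whatever order the hardware happens to schedule them. */
void run_threads(const int order[], float *out, const float *in) {
    for (int k = 0; k < N; k++)
        thread_body(order[k], out, in);
}
```

With out and in disjoint, ascending and descending orders agree. With out == in, an ascending order reads already-overwritten neighbors and a descending order does not, so the two schedules produce different results: exactly the hazard the slide warns about.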

Native Memory Layout: Data Locality
General memory: 1D input, 1D output; other dimensions are addressed with offsets. Texture memory: 2D input, 2D output; other dimensions are addressed with offsets. [Diagram: locality color-coded from red (near) to blue (far).]
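The phrase "other dimensions with offsets" means a logically 2D or 3D array is flattened into the linear native layout. A minimal sketch of the standard row-major offset computation (the function names are my own):

```c
#include <stddef.h>

/* Row-major 2D indexing: element (x, y) of a width-W array lives at
 * linear offset y*W + x, so x-neighbors are adjacent (near) while
 * y-neighbors are a full row apart (far), matching the slide's
 * red/blue locality coloring. */
size_t idx2d(size_t x, size_t y, size_t width) {
    return y * width + x;
}

/* The same idea one dimension up: a 3D array flattened to 1D. */
size_t idx3d(size_t x, size_t y, size_t z, size_t width, size_t height) {
    return (z * height + y) * width + x;
}
```

This is also why 2D texture memory helps: its layout keeps small 2D neighborhoods close in memory, instead of only the x-direction.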

GPUs are Optimized for Local Data Access
Memory access types: cached, sequential, random. CPU (Pentium 4): large cache, few processing elements, optimized for spatial and temporal data reuse. GPU (GeForce 7800 GTX): small cache, many processing elements, optimized for sequential (streaming) data access. (Chart courtesy of Ian Buck.)

Configuration Overhead
[Chart: small workloads are configuration-limited, large workloads are computation-limited. Chart courtesy of Ian Buck.]

Bandwidth in a CPU-GPU System

Sparse MatVec on a Tensor Product Grid
13 GFLOP/s (single precision) with GPGPU techniques on a GeForce 8800 GTX. 46 GFLOP/s (single precision) at 140 GB/s with CUDA on a GeForce GTX 280.
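As an illustration of the kind of kernel behind these numbers: on a tensor product grid the sparse matrix has a small fixed set of bands (for example, five bands at offsets -W, -1, 0, +1, +W for a 5-point stencil), so the matrix-vector product reduces to a few offset reads per row. The band-array layout below is an assumption for illustration, not the authors' actual data structure; on the GPU each row would be handled by one thread:

```c
/* Banded sparse matrix-vector product y = A*x, with the matrix stored
 * as nbands arrays of per-row coefficients plus their column offsets.
 * Illustrative layout; a GPU version would map one thread per row so
 * that the reads of x are mostly sequential across the band. */
void banded_matvec(int n, int nbands, const int *offset,
                   const float *const *band, const float *x, float *y) {
    for (int row = 0; row < n; row++) {   /* one GPU thread per row */
        float sum = 0.0f;
        for (int b = 0; b < nbands; b++) {
            int col = row + offset[b];
            if (col >= 0 && col < n)      /* bands truncated at edges */
                sum += band[b][row] * x[col];
        }
        y[row] = sum;
    }
}
```

Because the band offsets are fixed, no column-index array is read per entry, which keeps the kernel bandwidth-friendly: exactly the sequential access pattern the earlier slides say GPUs are optimized for.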

Conclusions
Parallelism is now indispensable for further performance increases. For most applications, data-processor locality plays an important role. GPUs offer a fast, inexpensive solution, but understanding the parallel tradeoffs is crucial.