Parallel Systems I: The GPU Architecture. Jan Lemeire, 2012-2013.


Sequential program on a CPU pipeline: sequential pipelined execution, plus instruction-level parallelism (ILP) through a superscalar pipeline and out-of-order execution. Limitations: the maximum ILP (about 4) has been reached, the maximal clock frequency has been reached, and branch prediction, caching, forwarding, etc. are already exploited. Is more improvement possible?

This is The End?

Way out: parallel processors relying on explicit parallel software.

Chuck Moore, "Data Processing in Exascale-Class Computer Systems", The Salishan Conference on High Speed Computing, 2011.

Parallel processors: GPUs (processor arrays).

Thus: The Future looks Parallel.

Conclusions: the sequential world has changed into a parallel world.

GPU vs CPU Peak Performance Trends: GPU peak performance has grown aggressively; the hardware has kept up with Moore's law. 1995: 5,000 triangles/second with an 800,000-transistor GPU. 2010: 350 million triangles/second with a 3-billion-transistor GPU. (Source: NVIDIA)

Graphical Processing Units (GPUs). A many-core GPU delivers 1-3 TeraFlop/second (e.g., 94 fps on an AMD Tahiti Pro), instead of 10-20 GigaFlop/second for a multi-core CPU. (Courtesy: John Owens. Figure 1.1: the enlarging performance gap between GPUs and CPUs.)

Supercomputing for free: FASTRA at the University of Antwerp, a collection of 8 graphics cards in one PC. FASTRA: 8 cards = 8 x 128 processors = 4000 euro (http://fastra.ua.ac.be). It reaches similar performance to the university's supercomputer (512 regular desktop PCs), which cost 3.5 million euro in 2005.

Why are GPUs faster? They devote their transistors to computation.

GPU processor pipeline: ±24 stages, in-order execution, no branch prediction, no forwarding, no register renaming. Memory system: relatively small; until recently, no caching. On the other hand: many more registers (see later). New concepts: 1. multithreading, 2. SIMD.

Overhead of multithreading on a CPU: 1 process/thread is active per core. Activating another thread requires a context switch: stop program execution and flush the pipeline (let all instructions finish); save the state of the process/thread into the Process Control Block (registers, program counter, and operating-system-specific data); restore the state of the activated thread; restart program execution and refill the pipeline.

Hardware threads: several modern CPUs support typically 2 hardware threads per core. Extra hardware is devoted to process state, so the context switch is done by the hardware with (almost) no overhead, within 1 cycle. Instructions from different threads are in flight simultaneously.

Multithreading: performing multiple threads of execution in parallel. Registers, PC, etc. are replicated for fast switching between threads. Fine-grain multithreading: switch threads after each cycle, interleaving instruction execution; if one thread stalls, others are executed. Coarse-grain multithreading: only switch on a long stall (e.g., an L2-cache miss); this simplifies the hardware but doesn't hide short stalls (e.g., data hazards).

Simultaneous multithreading (SMT): in a multiple-issue, dynamically scheduled processor, instructions are scheduled from multiple threads; instructions from independent threads execute whenever function units are available; within a thread, dependencies are handled by scheduling and register renaming. Example: Intel Pentium 4 HT, with two threads: duplicated registers, shared function units and caches.

Multithreading example [figure].

Benefits of fine-grained multithreading: independent instructions in the pipeline (no bubbles), and more time between instructions of the same thread, giving the possibility of latency hiding (hiding memory accesses). If the pipeline is kept full, forwarding and branch prediction are not necessary.

Instruction and Data Streams: an alternate classification.

                             Instruction streams
    Data streams     Single                          Multiple
    Single           SISD: Intel Pentium 4           MISD: no examples today
    Multiple         SIMD: SSE instructions of x86   MIMD: Intel Xeon e5345

SPMD (Single Program, Multiple Data): a parallel program on a MIMD computer, with conditional code for different processors.

SIMD: operate elementwise on vectors of data, e.g., the MMX and SSE instructions in x86 (multiple data elements in 128-bit-wide registers). All processors execute the same instruction at the same time, each with a different data address, etc. This simplifies synchronization and reduces instruction-control hardware. It works best for highly data-parallel applications.
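To make the 128-bit register idea concrete, here is a minimal host-side C sketch using the SSE intrinsics mentioned above (the function name add_simd is ours, and n is assumed to be a multiple of 4):

    #include <xmmintrin.h>   /* SSE intrinsics (host-side x86 code) */

    /* Minimal SIMD sketch: add two float arrays four elements at a time,
       one SSE instruction per group of four. Assumes n is a multiple of 4. */
    void add_simd(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);          /* load 4 floats */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb)); /* one add does all 4 */
        }
    }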

Vector processors: highly pipelined function units that stream data from/to vector registers; data is collected from memory into the registers, and results are stored from the registers back to memory. Example: a vector extension to MIPS with 32 registers of 64 elements (64-bit elements). Vector instructions: lv, sv (load/store vector), addv.d (add vectors of double), addvs.d (add a scalar to each element of a vector of double). This significantly reduces instruction-fetch bandwidth.

Example: DAXPY (Y = a * X + Y).

Conventional MIPS code:

          l.d    $f0,a($sp)       ;load scalar a
          addiu  r4,$s0,#512      ;upper bound of what to load
    loop: l.d    $f2,0($s0)       ;load x(i)
          mul.d  $f2,$f2,$f0      ;a * x(i)
          l.d    $f4,0($s1)       ;load y(i)
          add.d  $f4,$f4,$f2      ;a * x(i) + y(i)
          s.d    $f4,0($s1)       ;store into y(i)
          addiu  $s0,$s0,#8       ;increment index to x
          addiu  $s1,$s1,#8       ;increment index to y
          subu   $t0,r4,$s0       ;compute bound
          bne    $t0,$zero,loop   ;check if done

Vector MIPS code:

          l.d     $f0,a($sp)      ;load scalar a
          lv      $v1,0($s0)      ;load vector x
          mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
          lv      $v3,0($s1)      ;load vector y
          addv.d  $v4,$v2,$v3     ;add y to product
          sv      $v4,0($s1)      ;store the result
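For reference, both instruction sequences implement the following loop (a plain C sketch; the fixed length of 64 matches the 64-element vector registers above):

    /* DAXPY: Y = a*X + Y over 64 double-precision elements.
       The scalar MIPS bound of #512 is 64 elements x 8 bytes each. */
    void daxpy(double a, const double *x, double *y)
    {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }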

Vector vs. scalar: vector architectures and compilers simplify data-parallel programming. The explicit statement that there are no loop-carried dependences reduces checking in hardware; regular access patterns benefit from interleaved and burst memory; control hazards are avoided by avoiding loops. Vector instructions are more general than ad-hoc media extensions (such as MMX and SSE) and are a better match for compiler technology.

Now: the GPU architecture.

One Streaming Multiprocessor: hardware threads are grouped in warps of 32 threads, executing in lockstep (SIMD). In each cycle, a warp is selected and its next instruction is put in the pipeline. PS: caches only in newer models.
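The per-cycle selection step can be pictured with a toy round-robin scheduler model in C (purely illustrative, not how the hardware is actually specified; the warp count and the ready flag are hypothetical):

    #define NUM_WARPS 48        /* warps resident on one SM (hypothetical) */

    typedef struct {
        int pc;                 /* per-warp program counter */
        int ready;              /* 0 while stalled, e.g., on a memory access */
    } Warp;

    /* Each cycle: pick the next ready warp (round robin) and issue the
       instruction at its PC; all 32 threads of that warp execute it in lockstep. */
    int select_warp(Warp warps[NUM_WARPS], int last)
    {
        for (int k = 1; k <= NUM_WARPS; k++) {
            int w = (last + k) % NUM_WARPS;
            if (warps[w].ready)
                return w;       /* issue warps[w].pc++ into the pipeline */
        }
        return -1;              /* no warp ready: pipeline bubble this cycle */
    }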


Peak GPU performance. GPUs consist of Streaming Multiprocessors (MPs), each grouping a number of Scalar Processors (SPs). Nvidia GTX 280: 30 MPs x 8 SPs/MP x 2 FLOPs/instruction/SP x 1 instruction/clock x 1.3 GHz = 624 GFlops. Nvidia Tesla C2050: 14 MPs x 32 SPs/MP x 2 FLOPs/instruction/SP x 1 instruction/clock x 1.15 GHz = 1030 GFlops.
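The same calculation as code (a trivial sketch restating the slide's formula; the C2050 result, 1030.4, is rounded to 1030 on the slide):

    /* Peak GFLOPS = MPs x SPs/MP x FLOPs/instruction x instructions/clock x clock (GHz) */
    double peak_gflops(int mps, int sps_per_mp, int flops_per_instr, double clock_ghz)
    {
        return mps * sps_per_mp * flops_per_instr * 1 /* instr/clock */ * clock_ghz;
    }
    /* peak_gflops(30,  8, 2, 1.30) =  624.0 GFlops (GTX 280)
       peak_gflops(14, 32, 2, 1.15) = 1030.4 GFlops (Tesla C2050) */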

Example: pixel transformation. The kernel is executed by each thread, each thread processing 1 pixel. The GPU is a zero-overhead thread processor.

    usgn_8 transform(usgn_8 in, sgn_16 gain, sgn_16 gain_divide, sgn_8 offset)
    {
        sgn_32 x;
        x = (in * gain / gain_divide) + offset;
        if (x < 0)   x = 0;
        if (x > 255) x = 255;
        return x;
    }
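To show how such a per-pixel kernel looks on an actual GPU, here is a hedged CUDA sketch (not the course's own code: the kernel name, the grid configuration, and the mapping of the slide's usgn_8/sgn_16/sgn_32 types onto stdint.h types are assumptions):

    #include <stdint.h>

    /* Device version of the slide's transform(): gain/offset with clamping to [0,255]. */
    __device__ uint8_t transform(uint8_t in, int16_t gain, int16_t gain_divide, int8_t offset)
    {
        int32_t x = (in * gain / gain_divide) + offset;
        if (x < 0)   x = 0;
        if (x > 255) x = 255;
        return (uint8_t)x;
    }

    /* One thread per pixel: thread i processes pixel i. */
    __global__ void transform_kernel(const uint8_t *in, uint8_t *out, int n,
                                     int16_t gain, int16_t gain_divide, int8_t offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       /* guard: the grid may be larger than the image */
            out[i] = transform(in[i], gain, gain_divide, offset);
    }

    /* Host-side launch with, e.g., 256 threads per block (error handling omitted):
       transform_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n, gain, gain_divide, offset); */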


Example: real-time image processing. Images of 20 megapixels: pixel rescaling, lens correction, pattern detection. The CPU gives only 4 fps; the next generation of machines needs 50 fps.

CPU: 4 fps. GPU: 70 fps.

However: this processing power is not for free.

Obstacle 1: hard(er) to implement.

Obstacle 2: hard(er) to get efficiency.


X86 to GPU. (GUDI: a combined GP-GPU/FPGA desktop system for accelerating image processing applications, 18 May 2011.)

Example 1: convolution. Parallelism: +++, locality: ++, work/pixel: ++. A CUDA sketch follows below.
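To illustrate why convolution scores so high on parallelism (every output pixel is independent) and on locality (neighboring threads read overlapping input), here is a minimal CUDA sketch; the window size K, the border clamping, and the row-major layout are assumptions:

    #define K 5                              /* assumed convolution window (K x K) */

    /* Each thread computes one output pixel from a K x K input neighborhood. */
    __global__ void convolve(const float *in, float *out, const float *coef,
                             int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float sum = 0.0f;
        for (int dy = 0; dy < K; dy++)
            for (int dx = 0; dx < K; dx++) {
                int ix = min(max(x + dx - K/2, 0), width - 1);   /* clamp at borders */
                int iy = min(max(y + dy - K/2, 0), height - 1);
                sum += coef[dy * K + dx] * in[iy * width + ix];
            }
        out[y * width + x] = sum;
    }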