ECSE 425 Lecture 25: Multi-threading

ECSE 425 Lecture 25: Multi-threading (H&P Chapter 3)

Last Time: theoretical and practical limits of ILP; the instruction window; branch prediction; register renaming.

Today: multi-threading (Chapter 3.5), and a summary of ILP: is there a best architecture? (Chapter 3.6).

Single-Threaded Superscalar Machine [figure: issue slots vs. time in processor cycles, with idle slots marked]. Superscalar: multiple, dynamic issue; multiple FUs; speculation. The whole processor stalls when an instruction stalls, leaving issue slots idle.

Beyond Single-Thread ILP: parallelism is abundant in many applications, e.g., database or scientific codes and medical imaging. Explicit parallelism comes as thread-level parallelism or data-level parallelism. A thread is a process with its own instructions and data; it may be part of a parallel program of multiple processes, or an independent program. Each thread has its own state. Data-level parallelism: identical operations applied to lots of data.

Thread-Level Parallelism (TLP): ILP exploits implicitly parallel operations within a loop or straight-line code segment. TLP exploits explicit parallelism: multiple threads of execution that are inherently parallel. The goal is to use multiple instruction streams to improve (1) the throughput of computers that run many programs, and (2) the execution time of multi-threaded programs. TLP can be more cost-effective to exploit than ILP. Multithreading allows multiple threads to share the FUs of a single processor in an overlapping fashion; the processor must duplicate the independent state of each thread.
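The per-thread state that must be duplicated can be sketched in a few lines; this is a toy model with illustrative field names, not any real ISA:

```python
from dataclasses import dataclass, field

# Minimal sketch of the per-thread state a multithreaded processor duplicates:
# each hardware thread gets its own PC and architectural registers, while the
# functional units and caches remain shared. Field names are hypothetical.
@dataclass
class ThreadContext:
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)  # 32 architectural registers

# Four hardware thread contexts sharing one core.
contexts = [ThreadContext() for _ in range(4)]
contexts[1].pc = 0x400  # each thread advances through its own instruction stream
print(contexts[0].pc, contexts[1].pc)
```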

Multi-Threading Strategies [figure: issue slots vs. time for superscalar and coarse-grained multithreading, with threads 1-4 and idle slots marked]

Coarse-Grained Multi-Threading: switches threads only on costly stalls, such as L2 misses. Advantages: fast thread-switching isn't needed, and it doesn't slow down individual threads. Disadvantages: it can't hide stalls due to short dependencies, and since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen, so the new thread must fill the pipeline before its instructions can complete. This start-up overhead means coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time. Used in the IBM AS/400.
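The "pipeline refill << stall time" condition can be made concrete with a toy utilization model; all cycle counts below are hypothetical:

```python
# Toy utilization sketch for coarse-grained multithreading: a single thread
# pays the full miss latency on every long stall, while a coarse-grained
# design switches threads and pays only the pipeline-refill time. The
# technique wins exactly when refill << stall. Cycle counts are hypothetical.
def utilization(work_cycles, penalty_cycles):
    """Fraction of cycles spent doing useful work per stall interval."""
    return work_cycles / (work_cycles + penalty_cycles)

single_thread = utilization(50, penalty_cycles=100)  # pays full L2-miss latency
coarse_grained = utilization(50, penalty_cycles=5)   # pays only pipeline refill
print(round(single_thread, 2), round(coarse_grained, 2))
```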

Multi-Threading Strategies [figure: issue slots vs. time for superscalar, coarse-grained, and fine-grained multithreading, with threads 1-5 and idle slots marked]

Fine-Grained Multi-Threading: switches between threads on each instruction, causing the execution of multiple threads to be interleaved. Switching is usually done in a round-robin fashion, skipping stalled threads, so the CPU must be able to switch threads every clock cycle. Advantage: it can hide both short and long stalls, since instructions from other threads execute when a thread stalls. Disadvantage: it slows down the execution of individual threads; a thread ready to execute without stalls is delayed by instructions from other threads. Used in Sun's Niagara.
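The round-robin policy that skips stalled threads can be sketched as a small scheduler; the stall pattern below is a hypothetical workload:

```python
# Toy round-robin issue sketch for fine-grained multithreading: each cycle,
# issue one instruction from the next ready thread, skipping stalled threads.
def fine_grained_schedule(stalled, n_threads, n_cycles):
    """stalled(t, cycle) -> True if thread t cannot issue this cycle."""
    trace, t = [], 0
    for cycle in range(n_cycles):
        for _ in range(n_threads):           # look for a ready thread
            if not stalled(t, cycle):
                trace.append(t)              # thread t issues this cycle
                t = (t + 1) % n_threads      # round-robin to the next thread
                break
            t = (t + 1) % n_threads
        else:
            trace.append(None)               # all threads stalled: idle slot
    return trace

# Thread 1 is stalled on cycles 0-3 (e.g., waiting on a cache miss);
# threads 0 and 2 cover those cycles, so no issue slot goes idle.
trace = fine_grained_schedule(lambda t, c: t == 1 and c < 4, n_threads=3, n_cycles=6)
print(trace)  # [0, 2, 0, 2, 0, 1]
```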


Can We Exploit Both ILP and TLP? TLP and ILP exploit different parallel structure. Could a processor designed for ILP also exploit TLP? Functional units sit idle in a data path designed for ILP because of stalls or dependences in the code. Could TLP provide independent instructions to keep the processor busy during stalls? Could TLP employ the functional units that would otherwise lie idle when insufficient ILP exists?

Simultaneous Multithreading (SMT): leverages multiple issue and dynamic scheduling to simultaneously exploit both TLP and ILP. There are enough functional units to support multiple threads. With register renaming and dynamic scheduling, instructions from independent threads can be issued together, and out-of-order execution and completion impose no inter-thread ordering. The processor must maintain architectural state for each thread: separate PCs, renaming tables, and reorder buffers.
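The key SMT idea, filling a cycle's issue slots from whichever threads have ready instructions, can be sketched as a toy issue model; the per-cycle ready counts are hypothetical:

```python
# Toy sketch of SMT issue: each cycle the processor has several issue slots
# and may fill them with ready instructions from *any* thread, so one
# thread's ILP shortfall is covered by another thread's instructions.
def smt_issue(ready_counts, slots):
    """ready_counts: per-cycle tuples of ready-instruction counts per thread."""
    filled = 0
    for per_thread in ready_counts:
        avail = slots
        for n in per_thread:            # greedily drain each thread's ready pool
            take = min(n, avail)
            filled += take
            avail -= take
    return filled

# 4-wide issue, 2 threads, 3 cycles: alone, thread 0 fills only 2+1+4 = 7 of
# 12 slots; with SMT, thread 1's instructions fill most of the idle slots.
cycles = [(2, 3), (1, 2), (4, 1)]
print(smt_issue(cycles, slots=4))  # 11
```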

Multi-Threading Strategies [figure: issue slots vs. time for superscalar, coarse-grained, fine-grained, and simultaneous multithreading, with threads 1-5 and idle slots marked]

Single-Threaded Performance in SMT: SMT is a fine-grained multi-threading technique, and interleaving instructions degrades single-threaded performance. One solution is to primarily execute instructions from a preferred thread (the preferred designation can rotate), but this limits the availability of instructions from other threads when the preferred thread stalls.

Other Design Challenges in SMT: a larger register file is needed to hold multiple contexts, without increasing clock cycle time. This is difficult especially in instruction issue, where more candidate instructions need to be considered, and in instruction completion, where choosing which instructions to commit may be challenging. Cache and TLB conflicts: threads compete for fixed resources, and designers must ensure that conflict misses do not degrade performance.

SMT Example: IBM Power5. More and larger structures compared with the Power4: increased associativity of the L1 I-cache and ITLB; per-thread load and store queues; larger L2 and L3 caches; per-thread instruction prefetch and buffering; virtual registers increased from 152 to 240; larger issue queues.

IBM Power5 SMT Performance: Power5 with SMT vs. Power5 without SMT on the SPECRate2000 suite: SPECintRate speedup 1.23, SPECfpRate speedup 1.16. FP applications show the smallest gains, due to cache conflicts.

Head-to-Head ILP Competition:

| Processor | Microarchitecture | Fetch/Issue/Execute | FUs | Clock (GHz) | Transistors | Die size | Power |
|---|---|---|---|---|---|---|---|
| Intel Pentium 4 Extreme | Speculative, dynamically scheduled; deeply pipelined; SMT | 3/3/4 | 7 int., 1 FP | 3.8 | 125 M | 122 mm² | 115 W |
| AMD Athlon 64 FX-57 | Speculative, dynamically scheduled | 3/3/4 | 6 int., 3 FP | 2.8 | 114 M | 115 mm² | 104 W |
| IBM Power5 (1 CPU only) | Speculative, dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8 | 6 int., 2 FP | 1.9 | 200 M | 300 mm² (est.) | 80 W (est.) |
| Intel Itanium 2 | Statically scheduled; VLIW-style | 6/5/11 | 9 int., 2 FP | 1.6 | 592 M | 423 mm² | 130 W |

Which architecture is the best?
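As a quick cross-check, derived metrics such as power density follow directly from the die-size and power columns above (the Power5 figures are estimates, as noted):

```python
# Derived metrics computed from the quoted transistor counts, die sizes, and
# power numbers for the four processors (Power5 die size and power are
# estimates, per the table).
chips = {
    "Pentium 4 Extreme": (125, 122, 115),  # (Mtransistors, mm^2, W)
    "Athlon 64 FX-57":   (114, 115, 104),
    "Power5 (1 core)":   (200, 300, 80),
    "Itanium 2":         (592, 423, 130),
}
for name, (mtrans, area, power) in chips.items():
    # Power density (W/mm^2) and transistor density (Mtransistors/mm^2).
    print(f"{name}: {power / area:.2f} W/mm^2, {mtrans / area:.2f} Mtrans/mm^2")
```

The deeply pipelined, high-clock Pentium 4 comes out highest in power per unit area, which foreshadows the power-limited discussion below.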

Performance on SPECint2000 [figure]

Performance on SPECfp2000 [figure]

Efficiency in Silicon Area and Power (rank by metric):

| Metric | Power5 | Athlon 64 | Pentium 4 | Itanium 2 |
|---|---|---|---|---|
| Int/Trans | 4 | 2 | 1 | 3 |
| FP/Trans | 4 | 2 | 1 | 3 |
| Int/area | 4 | 2 | 1 | 3 |
| FP/area | 4 | 2 | 1 | 3 |
| Int/Watt | 4 | 3 | 1 | 2 |
| FP/Watt | 2 | 4 | 3 | 1 |

No Silver Bullet for ILP: there is no obvious leader. On SPECint, the AMD Athlon leads in performance, followed by the Pentium 4, Itanium 2, and Power5. On SPECfp, the Itanium 2 and Power5 dominate the Athlon and Pentium 4. In efficiency, the Itanium 2 is the least efficient, the Athlon and Pentium 4 are cost-efficient, and the Power5 is the most energy-efficient.

Additional Limits on ILP: doubling issue rates from 3-6 instructions per clock to 6-12 requires that processors be able to issue 3 or 4 data memory accesses per cycle, resolve 2 or 3 branches per cycle, rename and access more than 20 registers per cycle, and fetch 12 to 24 instructions per cycle. Implementing this likely means sacrificing clock rate: e.g., among the four processors above, the Itanium 2 (VLIW) is the widest-issue processor, yet it has the slowest clock rate and consumes the most power!

Additional Limits on ILP, Cont'd: modern processors are mostly power-limited, and increasing performance also increases power. The energy-efficiency question is: does performance grow faster than power? Multiple-issue techniques are energy-inefficient: logic overhead grows faster than the issue rate, and there is a growing gap between peak and sustained issue rates. Performance grows with sustained performance, but power grows with peak performance. Speculation is inherently inefficient.
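The energy-efficiency question reduces to a ratio test; the 20%/50% numbers below are purely illustrative:

```python
# A design change is a net energy-efficiency win only if performance grows
# faster than power, i.e., perf_ratio / power_ratio > 1. The ratios used in
# the example are hypothetical, not measurements of any real processor.
def energy_efficiency_gain(perf_ratio, power_ratio):
    return perf_ratio / power_ratio

# A wider-issue design that gains 20% performance for 50% more power is a
# net efficiency loss:
print(energy_efficiency_gain(1.2, 1.5))  # 0.8
```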

Summary: the power inefficiency and complexity of exploiting ILP seem to limit CPUs to 3-6-wide issue. Exploiting explicit parallelism, whether data-level or thread-level, is the next step. Coarse- vs. fine-grained multi-threading: switching on big stalls vs. switching every clock cycle. Simultaneous multi-threading is fine-grained multithreading that exploits both ILP and TLP. The application-specific balance of ILP and TLP is decided by the marketplace.

Next Time: Multiprocessors and Thread-Level Parallelism. Read Chapter 4.1!