Simultaneous Multithreading Architecture


Simultaneous Multithreading Architecture
Virendra Singh, Indian Institute of Science, Bangalore
Lecture-32, SE-273: Processor Design

For most applications, most execution units lie idle in an 8-way superscalar. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995. (Apr 12, 2010, SE-273@SERC)

Do both ILP and TLP?
- TLP and ILP exploit two different kinds of parallel structure in a program.
- Could a processor oriented at ILP also exploit TLP? Functional units are often idle in a datapath designed for ILP because of stalls or dependences in the code.
- Could TLP be used as a source of independent instructions that keep the processor busy during stalls?
- Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

Simultaneous Multi-threading (figure: cycle-by-cycle issue slots): with one thread on 8 units (M M FX FX FP FP BR CC), many slots go empty in each cycle; with two threads sharing the same 8 units, slots left empty by one thread are filled by the other. Legend: M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes.
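The slot-filling idea in the figure can be sketched as a toy model. This is an illustration only: the 8-wide issue width matches the slide, but the per-cycle ILP numbers for each thread are made up, not from the lecture.

```python
# Toy model of SMT issue-slot filling: threads take turns claiming the
# leftover slots of the threads ahead of them in the same cycle.

SLOTS = 8  # 8-way issue, matching the "8 units" in the figure

def utilization(per_cycle_ilp):
    """Fraction of issue slots filled when threads share one 8-wide core.

    per_cycle_ilp[t][c] = number of instructions thread t could issue in
    cycle c (its available ILP that cycle).
    """
    cycles = len(per_cycle_ilp[0])
    filled = 0
    for c in range(cycles):
        free = SLOTS
        for thread in per_cycle_ilp:          # SMT: next thread fills leftover slots
            take = min(free, thread[c])
            filled += take
            free -= take
    return filled / (SLOTS * cycles)

# A single thread rarely has 8 independent instructions ready (made-up traces):
thread_a = [3, 1, 5, 2, 0, 4, 2, 6]
thread_b = [4, 6, 2, 3, 5, 1, 6, 2]

one_thread = utilization([thread_a])
two_threads = utilization([thread_a, thread_b])
print(one_thread, two_threads)  # two threads fill far more of the 64 slots
```

With these traces, one thread fills about 36% of the slots while two threads together fill about 81%, which is the whole argument of the figure.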

Simultaneous Multithreading (SMT)
- Key insight: a dynamically scheduled processor already has many HW mechanisms to support multithreading:
  - A large set of virtual registers that can be used to hold the register sets of independent threads.
  - Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads.
  - Out-of-order completion allows the threads to execute out of order and get better utilization of the HW.
- Just add a per-thread renaming table and keep separate PCs.
- Independent commitment can be supported by logically keeping a separate reorder buffer for each thread.
Source: Microprocessor Report, December 6, 1999, "Compaq Chooses SMT for Alpha"
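The per-thread renaming table can be sketched minimally. This is illustrative only: the register counts and names below are assumptions, not from the lecture; the point is that each thread has its own architectural-to-physical map while all threads draw from one shared pool of physical registers.

```python
# Minimal sketch of per-thread register renaming over a shared physical
# register pool (sizes and numbering are made up for illustration).

NUM_ARCH_REGS = 4
free_list = list(range(100, 112))  # shared pool of physical registers

# One rename table per thread: architectural reg -> physical reg
rename = [{r: None for r in range(NUM_ARCH_REGS)} for _ in range(2)]

def rename_dest(thread, arch_reg):
    """Allocate a fresh physical register for a destination write."""
    phys = free_list.pop(0)
    rename[thread][arch_reg] = phys
    return phys

# Both threads write architectural register r1, but renaming gives each
# write a unique physical register, so the datapath never confuses them:
p_a = rename_dest(0, 1)   # thread 0's r1
p_b = rename_dest(1, 1)   # thread 1's r1
print(p_a, p_b, p_a != p_b)
```

This is exactly why SMT is cheap on top of an out-of-order core: the rename hardware already exists, and only the table (and PC) is duplicated per thread.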

Multithreaded Categories (figure: issue slots, one row per processor cycle): Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading, with slots belonging to Thread 1 through Thread 5 or left idle. Only SMT fills slots from multiple threads within a single cycle.

Design Challenges in SMT
- Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  - A preferred-thread approach tries to sacrifice neither throughput nor single-thread performance.
  - Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls.
- A larger register file is needed to hold multiple contexts.
- Clock cycle time must not be affected, especially in:
  - Instruction issue: more candidate instructions need to be considered.
  - Instruction completion: choosing which instructions to commit may be challenging.
- Ensure that cache and TLB conflicts generated by SMT do not degrade performance.

SMT Implementation

Design Challenges in SMT - Fetch
- The most expensive resource is the cache port, which is limited to accessing contiguous memory locations.
- It is unlikely that multiple threads read from contiguous, or even spatially local, addresses.
- Either provide a dedicated fetch stage per thread, or time-share a single port in a fine-grained or coarse-grained manner.
- The cost of dual-porting the cache is quite high and not justifiable; time sharing is the feasible solution.
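The two time-sharing options for the single fetch port can be sketched as schedules. This is a hedged illustration: the two-thread count, cycle counts, and the modeled miss cycle are assumptions for the example, not details from the lecture.

```python
# Sketch of time-sharing one fetch port among threads.
from itertools import cycle

def fine_grained_schedule(num_threads, cycles):
    """Fine-grained sharing: a round-robin arbiter switches every cycle."""
    owner = cycle(range(num_threads))
    return [next(owner) for _ in range(cycles)]

def coarse_grained_schedule(miss_cycles, num_threads, cycles):
    """Coarse-grained sharing: a thread keeps the port until it misses."""
    owner, schedule = 0, []
    for c in range(cycles):
        schedule.append(owner)
        if c in miss_cycles:                 # switch only on a (modeled) miss
            owner = (owner + 1) % num_threads
    return schedule

fine = fine_grained_schedule(num_threads=2, cycles=6)
coarse = coarse_grained_schedule(miss_cycles={2}, num_threads=2, cycles=6)
print(fine)    # [0, 1, 0, 1, 0, 1]: alternate on the one cache port
print(coarse)  # [0, 0, 0, 1, 1, 1]: thread 0 holds the port until its miss
```

Either schedule needs only the existing single cache port, which is the slide's point: arbitration in time is far cheaper than dual-porting the cache.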

Design Challenges in SMT - Fetch (contd.)
- The other expensive resource is the branch predictor.
- Multiporting the branch predictor is equivalent to halving its effective size, so time sharing makes more sense.
- Certain elements of the branch predictor rely on serial semantics and may not perform well with multiple threads:
  - The return address stack relies on LIFO (stack) behaviour.
  - The global BHR may not perform well; the BHR needs to be replicated per thread.
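Why the return address stack must be replicated per thread can be shown with a small sketch. This is illustrative only: the return addresses are made up; the point is that interleaving calls and returns from two threads on one shared stack breaks its call/return pairing.

```python
# Sketch: a shared return address stack (RAS) mispredicts under SMT,
# while per-thread replicated stacks stay correct (addresses are made up).

shared_ras = []
per_thread_ras = {0: [], 1: []}

def call(thread, return_addr):
    shared_ras.append(return_addr)
    per_thread_ras[thread].append(return_addr)

def ret(thread):
    """Return (shared-RAS prediction, per-thread-RAS prediction)."""
    return shared_ras.pop(), per_thread_ras[thread].pop()

call(0, 0x400)   # thread 0 makes a call
call(1, 0x800)   # thread 1 calls before thread 0 has returned
shared_pred, private_pred = ret(0)   # now thread 0 returns
print(hex(shared_pred), hex(private_pred))  # shared predicts 0x800: wrong
```

The shared stack pops thread 1's return address for thread 0's return; the replicated stack pops the right one. The same interference argument applies to a shared global BHR, whose history bits get polluted by the other thread's branches.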


Design Challenges in SMT - Decode
- Primary tasks: identify source operands and destinations, and resolve dependences.
- Instructions from different threads are never dependent on each other.
- Example: two 4-wide decoders operating in parallel have much less logic complexity than one 8-wide decoder, at a tradeoff in single-thread performance.

Design Challenges in SMT - Rename
- Allocate physical registers: map architectural registers (ARs) to physical registers (PRs).
- It makes sense to share the logic that maintains the free list of physical registers.
- AR numbers are disjoint across threads, so the mapping can be partitioned per thread.
- This gives high bandwidth at lower cost than multiporting, but limits single-thread performance.

Design Challenges in SMT - Issue
- Tomasulo's algorithm: wakeup and select clearly improve performance.
- Selection may depend on instructions from multiple threads.
- Wakeup is limited to intra-thread interaction, so it makes sense to partition the issue window per thread.
- Partitioning limits the performance of a single thread.
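The intra-thread wakeup observation can be sketched as follows. This is an illustration under assumptions: the window size, tag values, and entry format are invented for the example, not from the lecture.

```python
# Sketch of a per-thread-partitioned issue window. Because producer/consumer
# dependences never cross threads, a completed result tag only needs to be
# broadcast within its own partition, so each partition needs comparators
# for its own slots only (not every slot against every result).

# Each partition holds (dest_tag, [waiting source tags]) entries.
windows = {0: [(10, [5]), (11, [10])],      # thread 0: entry 11 waits on 10
           1: [(20, [7])]}                  # thread 1: unaffected by thread 0

def wakeup(thread, produced_tag):
    """Broadcast a completed result tag inside one partition only."""
    woken = []
    for dest, srcs in windows[thread]:
        if produced_tag in srcs:
            srcs.remove(produced_tag)
            if not srcs:                    # all operands ready: can issue
                woken.append(dest)
    return woken

woken = wakeup(0, 10)   # thread 0's result 10 completes
print(woken)            # only thread 0's dependent entry wakes up
```

Thread 1's partition never sees the broadcast, which is the hardware saving; the cost, as the slide notes, is that one thread can use only its own partition of the window.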

Design Challenges in SMT - Execute
- Shared execution units clearly improve performance.
- The bypass network must span the in-flight instructions of all threads.
- Memory: keep a separate load/store queue per thread.