Simultaneous Multithreading Processor


Simultaneous Multithreading Processor

Paper presented: "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor"

Presenter: James Lue. Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf

Processors for parallelism
- Superscalar and multithreaded processors: architectures for increasing parallelism
- Superscalar: instruction-level parallelism
- Multithreaded processor: thread-level parallelism

Limits of the superscalar: which parts should be improved? Simulation of unused issue cycles on a pure superscalar ("processor busy" = utilized issue slots; the rest are wasted issue slots).
Simulation results:
1. IPC < 1.5
2. No single dominant cause of waste across all applications

Problems with superscalar and multithreaded processors
(Figure: issue slots colored by thread, showing vertical waste, where an entire cycle issues nothing, and horizontal waste, where some slots in a cycle go unused. Source: Computer Architecture: A Quantitative Approach, 4th ed.)

Problem with these experimental results: the model is too idealized (huge number of register-file ports, overly complex scheduling, etc.). A more realistic model is needed!

Goal
- Minimize the architectural impact on the superscalar needed to build SMT
- Minimal performance impact on single-thread applications
- High throughput (IPC) with multiple threads
- Be able to issue, each cycle, from the threads that use the processor most efficiently

Base architecture: derived from an out-of-order superscalar
- Fetch/decode width: up to 8 instructions per cycle (fetch chooses a thread that has no I-cache miss and does not conflict with the others)
- Architectural registers: 8 (threads) * 32 (registers in the ISA)
- Renaming registers: >100
- Instruction queues: 32 entries each
- Functional units: 6 integer (4 can handle loads/stores), 3 floating point
- Caches: multibanked I-cache and D-cache
- Per-thread mechanisms: return stack, instruction retirement, IQ flush, trap handling, and thread IDs in the IQ and BTB

The larger register file affects the pipeline:
1. The latency of accessing a large register file is high, so register access is broken into 2 cycles.
2. Two pipeline stages (register read and writeback) access the register file, so 2 extra pipeline stages are added in total.
3. Effects of the longer pipeline: increased branch-misprediction penalty; one more bypass level for writeback; instructions stay in the pipeline longer, so more instructions are in flight and the registers may not be enough for them.

Adding pipeline stages does not increase the latency between instructions, except for loads. Optimistic issue: a load actually needs 2 cycles, but its dependents are scheduled as if it took one (they do not wait the full 2 cycles). If the load suffers a cache miss or bank conflict, the optimistically issued dependent is squashed and re-issued later.
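The optimistic-issue and squash/re-issue sequence can be illustrated with a toy timeline. This is a minimal Python sketch under simplified assumptions: the function name, event strings, and cycle numbers are illustrative, not the paper's exact hardware timing.

```python
def schedule_load_dependent(load_hits_cache: bool):
    """Toy model of optimistic issue for a load and one dependent instruction.

    The load actually needs 2 cycles, but the dependent is scheduled
    assuming a 1-cycle (hit) latency. On a miss or bank conflict, the
    dependent is squashed and re-issued after the load data returns.
    Returns a list of (cycle, event) pairs.
    """
    trace = [(0, "issue load"),
             (1, "issue dependent (optimistic: assume hit)")]
    if load_hits_cache:
        trace.append((2, "dependent completes"))
    else:
        trace.append((1, "squash dependent (miss/bank conflict)"))
        trace.append((3, "re-issue dependent"))
        trace.append((4, "dependent completes"))
    return trace
```

On a hit the dependent finishes immediately after the load; on a miss the wasted slot shows up as the squash event, which is exactly the waste the OPT_LAST issue policy later tries to push out of the way.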

Performance of the base architecture: only ~4 IPC for SMT, and ~3 IPC for fine-grained multithreading (MT).

Improving the base architecture
Improve fetching:
1. Partition the fetch unit: fetch instructions from multiple threads.
2. Fetch policy: fetch useful instructions.
3. Unblock the fetch unit: prevent the conditions that stall it.
Improve issuing:
1. Issue policy.

Partitioning the fetch unit. Scheme RR.t.i: round-robin fetch from t threads per cycle, up to i instructions per thread.
- RR.1.8: baseline (like fine-grained MT)
- RR.2.4: 2 cache output buses, each 4 instructions wide; needs additional circuitry for the multiple output buses
- RR.4.2: more expensive changes than RR.2.4; suffers from thread shortage (fewer available threads than needed to fill the fetch bandwidth)
- RR.2.8: best; fetch as many instructions as possible from the first thread, then the second thread fills up to 8. High throughput, but it can fill the IQ.
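A rough software model of the RR.2.8 idea follows. This is a sketch only: real hardware fetches cache lines, not per-thread counters, and the function name and data layout are invented for illustration.

```python
def fetch_rr_2_8(threads, start, bandwidth=8):
    """Sketch of the RR.2.8 fetch scheme: each cycle, visit threads in
    round-robin order starting at `start`; take as many instructions as
    possible from the first runnable thread, then let a second thread
    fill the remaining fetch slots (total fetch bandwidth = 8).

    `threads` maps thread id -> number of fetchable instructions.
    Returns a list of (thread_id, n_fetched) pairs.
    """
    order = list(range(start, len(threads))) + list(range(start))
    fetched, remaining, used = [], bandwidth, 0
    for tid in order:
        if used == 2 or remaining == 0:  # at most 2 threads per cycle
            break
        take = min(threads[tid], remaining)
        if take > 0:
            fetched.append((tid, take))
            remaining -= take
            used += 1
    return fetched
```

For example, with thread 0 able to supply 5 instructions and thread 1 six, `fetch_rr_2_8([5, 6, 2, 0], start=0)` takes 5 from thread 0 and lets thread 1 fill the remaining 3 slots.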

Fetch policy. Factors to consider:
1. Don't fetch threads following wrong paths (branch misprediction).
2. Don't fetch threads whose instructions sit in the IQ for a very long time.
Policies:
- BRCOUNT: favor threads with the fewest unresolved branches (1st factor)
- MISSCOUNT: favor threads with the fewest outstanding D-cache misses (2nd factor)
- ICOUNT: favor threads with the fewest instructions in decode, rename, and the IQ (2nd factor)
- IQPOSN: don't favor threads that have old instructions in the IQ (2nd factor)
ICOUNT performs best.
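The ICOUNT heuristic above can be sketched as a simple selection function. This is an illustrative Python sketch: the `in_flight` dictionary is a stand-in for the hardware's decode/rename/IQ occupancy counters, and the function name is invented.

```python
def pick_threads_icount(in_flight, n=2):
    """Sketch of the ICOUNT fetch policy: each cycle, give fetch priority
    to the threads with the fewest instructions in the decode, rename,
    and instruction-queue stages.

    `in_flight` maps thread id -> instruction count in those stages.
    Returns the ids of the `n` highest-priority threads.
    """
    return sorted(in_flight, key=lambda tid: in_flight[tid])[:n]
```

With counts `{0: 12, 1: 3, 2: 7, 3: 9}`, threads 1 and 2 win fetch slots; a thread that clogs the IQ automatically deprioritizes itself, which is why ICOUNT both tracks the 2nd factor and balances the front-end load.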

Unblocking the fetch unit. Conditions that block the fetch unit:
1. IQ full.
2. I-cache miss.
Schemes to address them:
1. BIGQ: IQ size is limited by search time, so double the IQ but search only its first 32 entries; the extra space buffers instructions when the IQ would otherwise overflow.
2. ITAG: look up the I-cache tag one cycle early; on a miss, fetch from another thread while the missing line is loaded into the cache (fetch that thread again only once the line is present). Requires an extra pipeline stage.

Which instructions to issue? Issue priority aims to avoid wasting issue slots.
Causes of issue-slot waste:
1. Wrong-path instructions.
2. Optimistically issued instructions (load-dependent instructions issued a cycle before the D-cache hit/miss is known; on a miss or bank conflict, the optimistically issued instruction is squashed and the slot is wasted).
Policies:
- OLDEST_FIRST: deepest instructions in the IQ first (default)
- OPT_LAST: issue optimistically issued instructions as late as possible
- SPEC_LAST: issue speculative instructions as late as possible
- BRANCH_FIRST: issue branches as early as possible (to identify mispredicted branches quickly)
OPT_LAST and SPEC_LAST need multiple passes over the IQ. Result: not much difference from OLDEST_FIRST.
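As a toy illustration of these priorities, the ordering can be modeled as a sort over IQ entries. This is a Python sketch: the `age`/`opt` fields are assumptions, and a single stable sort stands in for the hardware's multiple passes over the IQ.

```python
def issue_order(iq, policy="OLDEST_FIRST"):
    """Sketch of IQ issue priority. `iq` is a list of dicts like
    {"age": cycles_in_iq, "opt": bool}; higher age = deeper in the IQ.

    OLDEST_FIRST: deepest (oldest) instructions first (the default).
    OPT_LAST: same, but optimistically issued instructions go last
    (modeled here with a secondary sort key instead of a second pass).
    """
    if policy == "OLDEST_FIRST":
        return sorted(iq, key=lambda i: -i["age"])
    if policy == "OPT_LAST":
        # False sorts before True, so non-optimistic entries come first.
        return sorted(iq, key=lambda i: (i["opt"], -i["age"]))
    raise ValueError(policy)
```

Comparing the two orders on the same IQ contents shows how little the priority shuffling changes: only the optimistically issued entries move, matching the paper's finding that the variants barely differ from OLDEST_FIRST.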

Bottlenecks
- Issue bandwidth: with infinite FUs, throughput increases only 0.5%.
- IQ size: pressure is reduced by ICOUNT; a 64-entry IQ increases throughput only 1%.
- Fetch bandwidth: fetching up to 16 instructions => 5.7 IPC (an 8% increase). This is the bottleneck! Fetching up to 16 instructions with a 64-entry IQ and 140 registers => 6.1 IPC.
- Branch prediction: doubling the BHT and BTB sizes increases throughput 2%.
- Speculative execution: wrong-path instructions are rare; mechanisms that prevent speculative execution reduce throughput.
- Memory/cache: no bank conflicts (infinite-bandwidth cache) increases throughput 3%.
- Register file: infinite excess registers increase throughput 2%; ICOUNT already reduces the clog/strain on the register file.

Summary
- No heavy changes to a conventional superscalar.
- No great impact on single-thread performance.
- Achieves 5.4 IPC with 8 threads, a 2.5x improvement over the superscalar.
These improvements come from:
1. Partitioning the fetch bandwidth among threads.
2. Being able to select the threads that use resources most efficiently.

SMT in implemented processors
- Hyper-Threading in Intel processors
- Implemented in Atom, Intel Core i3/i7, Itanium, Pentium 4, and Xeon
- 2 threads per core
- Needs OS support and optimization
http://en.wikipedia.org/wiki/hyper-threading

SMT implemented processors: Hyper-Threading architecture
(Figure: one physical processor exposes two logical processors, which share the caches, execution units, branch prediction, control logic, and buses. Source: Intel Technology Journal, Volume 6, Issue 1, February 14, 2002)

SMT implemented processors: NetBurst microarchitecture (Pentium 4 and Pentium D)
- Uses optimistic issue for loads.
(Figure color coding of resource sharing between logical processors:)
- Red, partitioned: favor faster threads, similar to ICOUNT.
- Green, threshold: limit a resource to one logical processor so it cannot block the other's access, while still finding maximum parallelism.
- Blue, fully shared: mainly the caches; data and code are shared.
Source: "Hyper-Threading Technology in the NetBurst Microarchitecture"

Thank you