Lecture 12: Multithreaded Processors
Department of Electrical Engineering, Stanford University
http://eeclass.stanford.edu/ee382a

The Big Picture
- Previous lectures: core design for single-thread performance, with a focus on instruction-level parallelism (ILP)
- Single-thread performance can't scale indefinitely: it is hard to keep increasing IPC or frequency
- Industry has shifted to architectures that exploit Thread-Level Parallelism (TLP) and Data-Level Parallelism (DLP)
- In the rest of the course: multithreaded processors (today) and multicores (TLP); GPUs and vector processors (DLP)

Limits of Single-Thread Performance
- Fundamental ILP limitations in the program: control dependences and data dependences, magnified by memory latency
- Power consumption: ILP techniques must be energy-efficient; power constrains clock frequency scaling
- Complexity of advanced superscalar designs with diminishing returns: higher design, verification, and manufacturing costs
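
The switching-power expression from the slide, written out cleanly (this is the standard CMOS dynamic-power relation; the factor names follow the slide):

```latex
P_{\text{switch}} = C_{\text{cap}} \cdot N_{\text{switch}} \cdot V_{dd}^{2} \cdot f
```

Here C_cap is the switched capacitance per node, N_switch the number of nodes switching per cycle, V_dd the supply voltage, and f the clock frequency. The quadratic dependence on V_dd is what makes frequency scaling (which typically requires raising the voltage) so costly in power.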

Limits of Single-Thread Performance (figure slide)

Thread-Level Parallelism
- Thread-Level Parallelism: parallelism obtained by executing multiple instruction streams (threads) simultaneously
- Coarser grain, and much easier to exploit than ILP; requires abundant threads, so software should be parallel
- Two benefits from having multiple threads:
  - Parallel execution: execute several threads in parallel (e.g., on different processors)
  - Throughput improvement: avoid idle time on a stall by switching to another thread
- Today's lecture: exploiting TLP in a processor pipeline

Reminder: Processes and Threads
- Process: an instance of a program executing in a system
  - The OS supports concurrent execution of multiple processes
  - Each process has its own address space
  - A process can have one or more threads; threads in the same process share its address space
  - Two different processes can partially share their address spaces to communicate
- Thread: an independent control stream within a process
  - Private state: PC, registers (integer, FP), stack, thread-local storage
  - Shared state: heap, address space (VM structures)
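
To make the private/shared split concrete, here is a minimal sketch in C with POSIX threads (my own example, not from the slides): each thread gets a private stack variable, while both see the same heap allocation and must synchronize on it.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int *shared_counter;               /* heap: visible to all threads in the process */

static void *worker(void *arg) {
    int local = 0;                 /* stack: private to this thread               */
    for (int i = 0; i < 1000; i++) {
        local++;                   /* private state: no sharing, no race          */
        __sync_fetch_and_add(shared_counter, 1); /* shared state: needs atomics   */
    }
    printf("local = %d (always 1000), shared so far = %d\n", local, *shared_counter);
    return NULL;
}

int main(void) {
    shared_counter = calloc(1, sizeof *shared_counter);
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("final shared = %d\n", *shared_counter);  /* 2000 */
    free(shared_counter);
    return 0;
}
```

Compile with `cc -pthread`; each thread's `local` reaches exactly 1000 because stacks are private, while `shared_counter` reflects both threads' work.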

Reminder: Classic OS Context Switch
- A timer interrupt stops a program mid-execution (precisely)
- The OS saves the context of the stopped thread: PC, GPRs, and more
  - Shared state such as physical pages is not saved
- The OS restores the context of a previously stopped thread (everything except the PC)
- The OS uses a return-from-exception to jump to the restart PC
- The restored thread has no idea it was interrupted, removed, and later restored and restarted
- A switch takes a few hundred cycles, amortized over the execution quantum
- Questions: What latencies can you hide using OS context switching for threads? How much faster would a user-level thread switch be? What are the implications?
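
As a point of comparison for the "user-level switch" question, a minimal user-level context switch in C using the POSIX <ucontext.h> API (illustrative only; this is not how the kernel switches, but it shows the same save-registers/restore-PC pattern without a kernel entry):

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, thread_ctx;
static char stack[64 * 1024];              /* private stack for the new context */

static void user_thread(void) {
    printf("user thread: running\n");
    swapcontext(&thread_ctx, &main_ctx);   /* save our regs/PC, restore main's  */
    printf("user thread: resumed, done\n");
}

int main(void) {
    getcontext(&thread_ctx);               /* initialize the context struct     */
    thread_ctx.uc_stack.ss_sp   = stack;
    thread_ctx.uc_stack.ss_size = sizeof stack;
    thread_ctx.uc_link          = &main_ctx;   /* where to go on return         */
    makecontext(&thread_ctx, user_thread, 0);

    swapcontext(&main_ctx, &thread_ctx);   /* switch: save main, run the thread */
    printf("main: back after first switch\n");
    swapcontext(&main_ctx, &thread_ctx);   /* resume the thread where it left off */
    printf("main: thread finished\n");
    return 0;
}
```

The switch itself is just a register save/restore plus a jump, which is why user-level switches cost tens of cycles rather than the hundreds an OS switch costs.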

Multithreaded Processors
- Motivation: hardware is underutilized on stalls
  - Memory accesses (misses), pipeline hazards, control hazards, synchronization, I/O
- Instead of reducing stalls, switch to running a different thread while the original thread is stalled
  - Latency tolerance (vs. avoidance)
  - Improves throughput and hardware utilization
  - Does not improve single-thread latency
- Needs hardware support for fast context switching

HW Support for Multithreading: The Options (figure slide)

Coarse-Grain MT
- Lead: a processor becomes idle when a thread runs into a cache miss. Why not switch to another thread?
- Problem: a software switch takes hundreds to thousands of cycles
- Solution: a faster context switch in hardware
  - Replicate the hardware context registers: PC, GPRs, control/status, page-table base pointer; eliminates copying
  - Share some resources by tagging them with thread IDs: cache, BTB, and TLB match tags; eliminates cold starts
- A hardware context switch takes only a few cycles:
  - Flush the pipeline
  - Choose the next ready thread as the currently active one
  - Start fetching instructions from this thread
- Where does the penalty come from? Can it be eliminated?
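
A toy simulation of the switch-on-miss policy (all names and numbers are illustrative, not from the slides): on a cache miss the core pays a few cycles of pipeline-flush penalty, switches to the next ready hardware context, and hides the miss latency with that thread's work.

```c
#include <stdio.h>

#define NTHREADS      4
#define FLUSH_PENALTY 3      /* cycles lost flushing the pipeline on a switch */
#define MISS_LATENCY  100    /* cycles until a missing thread is ready again  */

long ready_at[NTHREADS];     /* cycle at which each HW context becomes ready  */

/* Toy workload: pretend a thread misses on every 20th cycle it executes. */
int executes_and_misses(long cycle) { return cycle % 20 == 19; }

int main(void) {
    long cycle = 0, busy = 0;
    int cur = 0;
    while (cycle < 10000) {
        if (ready_at[cur] <= cycle) {
            busy++;                               /* one useful instruction    */
            if (executes_and_misses(cycle)) {
                ready_at[cur] = cycle + MISS_LATENCY;
                cur = (cur + 1) % NTHREADS;       /* coarse-grain: switch now  */
                cycle += FLUSH_PENALTY;           /* pay the flush penalty     */
            }
        } else {
            cur = (cur + 1) % NTHREADS;           /* look for a ready thread   */
        }
        cycle++;
    }
    printf("utilization: %.1f%%\n", 100.0 * busy / cycle);
    return 0;
}
```

Setting NTHREADS to 1 in this model shows the baseline: the core simply idles for MISS_LATENCY cycles on every miss.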

Simple Multithreaded Processor
- Multiple threads supported in hardware, with multiple hardware contexts
- Context-switch events with coarse-grain MT:
  - High-latency events: cache misses, I/O operations
  - A maximum cycle count, for fairness

Coarse-Grain Multithreading: Thread States
- 4 potential states: {ready, not ready} x {using a HW context, swapped out}
- A user-level runtime or the OS can manage swapping, based on readiness, fairness, or priorities
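
The 2x2 state space as a sketch in C (the names are mine, not from the slides):

```c
enum readiness { READY, NOT_READY };            /* can the thread run now?    */
enum residency { IN_HW_CONTEXT, SWAPPED_OUT };  /* does it hold a HW context? */

struct thread_state {
    enum readiness ready;
    enum residency where;
};

/* A swapped-out thread that becomes ready is a candidate to swap in;
 * a not-ready thread holding a HW context is a candidate to swap out. */
int should_swap_in (struct thread_state t) { return t.ready == READY     && t.where == SWAPPED_OUT;   }
int should_swap_out(struct thread_state t) { return t.ready == NOT_READY && t.where == IN_HW_CONTEXT; }
```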

Fine-Grain Multithreading
- Lead: when a pipelined processor stalls on a RAW dependence, the execution stage idles. Why not switch to another thread?
- Problem: context switching must be instantaneous to have any advantage
- Solution: a 1-cycle context switch
  - Multiple hardware contexts
  - An instruction buffer, to dispatch an instruction from another thread on the next cycle
- Multiple threads co-exist in the pipeline

Example: Tera MTA
- Instruction latency hiding: worst-case instruction latency is 128 cycles (may need 128 threads!)
- On every cycle, select a ready thread (i.e., one whose last instruction has finished) from a pool of up to 128 threads
- Benefits: no forwarding logic, no interlocks, and no cache
- Problem: single-thread performance
- (Figure: pipeline diagram showing instructions from threads T1-T8 interleaved round-robin through pipeline stages A-H, with T1 issuing again after T8)
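
A sketch of the per-cycle thread-select logic described above (a minimal model under assumed parameters, not the actual MTA implementation): each cycle, pick the next thread whose previous instruction has completed.

```c
#include <stdio.h>

#define NTHREADS     128
#define INST_LATENCY 8   /* illustrative; the MTA worst case is 128 cycles */

long done_at[NTHREADS];  /* cycle at which each thread's last instruction finishes */

/* Round-robin scan for a ready thread; returns -1 if none is ready. */
int select_thread(long cycle, int last) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last + i) % NTHREADS;
        if (done_at[t] <= cycle) return t;
    }
    return -1;
}

int main(void) {
    int last = NTHREADS - 1;
    long issued = 0;
    for (long cycle = 0; cycle < 1000; cycle++) {
        int t = select_thread(cycle, last);
        if (t >= 0) {                      /* issue one instruction from thread t */
            done_at[t] = cycle + INST_LATENCY;
            last = t;
            issued++;
        }                                  /* else: bubble; too few ready threads */
    }
    printf("issued %ld instructions in 1000 cycles\n", issued);
    return 0;
}
```

With enough threads to cover the instruction latency, the issue slot is never wasted; with fewer, bubbles appear, which is exactly the single-thread-performance problem the slide notes.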

Example: Sun UltraSPARC T1
- A fine-grain multithreaded design, with multiple processors on a chip
- 4 threads per CPU, round-robin switching
- A thread is blocked on stalls: mul, div, loads, ...

Example: Sun UltraSPARC T2
- Similar to T1, but 8-way threaded
- An 8-stage dual pipeline can execute 2 threads at a time
- Stages: fetch, cache, pick, decode, execute, memory, bypass, writeback

UltraSPARC T2: Instruction Fetch Logic
- Fetch up to 4 instructions from the I-cache
- Priority to the least-recently fetched ready thread
- Predict not-taken (5-cycle penalty)
- Threads are separated into two groups; 1 instruction is issued per group
  - Priority to the least-recently picked ready thread
- The decode stage resolves LSU conflicts

UltraSPARC T2: Further Details
- Load miss queue: one miss per thread
- Store buffer: 8 stores per thread (write-through L1 cache)
- The MMU can handle up to 8 TLB misses, for instructions or data, for each thread
  - Hardware page-table walk

Simultaneous Multithreading (SMT, aka Hyper-Threading)
- Lead: not all resources in all stages are busy in a superscalar processor. Why not execute multiple threads in the same pipeline stage?
- Problem: how can we support multiple active threads?
- Solution: a variety of design options
  - Which pipeline stages to share? The rest of the stages are replicated
  - How to share resources?
    - Time sharing: one thread per cycle
    - Spatial partitioning: dedicate a portion of a resource to each thread

Regular OOO vs. SMT OOO: The Alpha Approach (figure slide)

SMT Resource Sharing
- What are the tradeoffs here? Complexity, resource utilization, interference

SMT Design Options (1)
- Fetch
  - How many threads to fetch from each cycle? If more than one: a multi-ported I-cache and replicated I-fetch buffers
  - Which threads to fetch from?
  - Branch predictor: the BTB is shared; the BHR and RAS are replicated. Why?
- Decode
  - Shared decoding logic: O(n^2) dependence checks
  - Spatial partitioning, at the cost of single-thread performance

SMT Design Options (2)
- Rename / register file
  - Need per-thread map tables
  - More architectural registers, hence a larger (A)RF
  - Sharing registers? A multi-ported register file, the same as in a non-SMT superscalar
- Issue
  - Shared wakeup logic for operands; O(n^2) instruction selection
  - Spatial partitioning, at the cost of single-thread performance
- Execution
  - Very natural to share
  - Pipelined vs. dedicated multi-cycle ALUs

SMT Design Options (3)
- Memory
  - A multi-ported L1D? Separate load/store queues?
  - Load forwarding/bypassing must be thread-aware
- Retire
  - Separate reorder buffers?
  - Per-thread instruction retirement

SMT Design Options: How Many Threads?
- As the number of HW threads increases:
  - Need a larger register file
  - Statically partitioned/replicated resources: lower utilization, lower single-thread performance
  - Shared resources: better utilization, but interference and fairness issues
- More threads != higher aggregate throughput!
- Scheduling optimizations:
  - Adapt the number of running threads depending on interference
  - Co-schedule threads that have good interference behavior (symbiotic threads)

SMT Processors
- Alpha EV8: would have been the 1st SMT CPU had it not been cancelled
  - 8-wide superscalar with support for 4-way SMT
  - SMT mode: like 4 CPUs with shared caches and TLBs
  - Replicated HW: PCs, registers (different maps, shared physical registers)
  - Shared: instruction queue, caches, TLBs, branch predictors, ...
- Pentium 4 HT: 1st commercial SMT CPU (2 threads)
  - SMT threads share caches, FUs, and predictors (5% area increase)
  - Statically partitioned instruction window and ROB
  - Replicated RAS and first-level global branch history table
  - Shared second-level branch history table, tagged with logical processor IDs
- IBM POWER5, IBM POWER6: 2-way SMT
- Intel Nehalem (Core i7): 4-wide superscalar, 2-way SMT

Fetch Policies for SMT
- Fetch policies try to identify the best thread to fetch from. Policies:
  - Fewest unresolved branches (BRCOUNT)
  - Fewest misses in flight (MISSCOUNT)
  - Fewest instructions in the decode/rename/issue stages (ICOUNT)
- What is the motivation for each policy?
- ICOUNT typically does best. Reasons:
  - It gives priority to more efficient threads
  - It avoids thread starvation
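
A sketch of ICOUNT selection (a minimal model with names of my own choosing, not taken from the slides): each cycle, fetch from the thread with the fewest instructions currently in the front end.

```c
#include <limits.h>
#include <stdio.h>

#define NTHREADS 4

/* Per-thread count of instructions currently in decode/rename/issue. */
int front_end_count[NTHREADS] = {12, 3, 7, 3};
int fetch_stalled[NTHREADS]   = { 0, 0, 1, 0};  /* e.g., waiting on an I-miss */

/* ICOUNT: pick the fetchable thread with the fewest front-end instructions.
 * Fast threads drain the front end quickly, so they keep winning fetch slots;
 * a clogged thread naturally loses priority but is never locked out. */
int icount_select(void) {
    int best = -1, best_count = INT_MAX;
    for (int t = 0; t < NTHREADS; t++)
        if (!fetch_stalled[t] && front_end_count[t] < best_count) {
            best = t;
            best_count = front_end_count[t];
        }
    return best;  /* -1 if no thread can fetch this cycle */
}

int main(void) {
    printf("fetch from thread %d this cycle\n", icount_select()); /* thread 1 */
    return 0;
}
```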

Long-Latency Stall Tolerance with SMT
- SMT does not necessarily work well for long-latency stalls
  - E.g., a thread misses to main memory (around 300 cycles)
  - The CPU continues issuing instructions from the stalled thread, eventually clogging the instruction window
- Improved policies (see the sketch below):
  - Detect the long-latency stall
  - Completely stop issuing instructions from that thread
  - Flush the stalled thread's instructions (works surprisingly well)
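
A sketch of the detect-and-gate idea (illustrative names and threshold; the flush variant would additionally squash the thread's in-flight instructions):

```c
/* Illustrative issue gating on a long-latency miss (names are mine). */
#define LONG_MISS_THRESHOLD 30   /* cycles outstanding before we call it "long" */

struct hw_thread {
    int miss_outstanding_cycles; /* 0 if no outstanding D-cache miss            */
    int issue_enabled;
};

void update_stall_policy(struct hw_thread *t) {
    if (t->miss_outstanding_cycles > LONG_MISS_THRESHOLD)
        t->issue_enabled = 0;    /* stop issuing so the stalled thread cannot
                                    clog the shared instruction window; a
                                    stronger policy also flushes the thread    */
    else
        t->issue_enabled = 1;
}
```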

Summary
- Multithreaded processors implement hardware support to share the pipeline across multiple threads
  - Stall tolerance
  - Better resource utilization
- 3 types of multithreading:
  - Coarse-grain (switch on long-latency stalls)
  - Fine-grain (switch every cycle)
  - Simultaneous (instructions from multiple threads each cycle)
- Design options in SMT processors