EECS 452 Lecture 9 TLP Thread-Level Parallelism


EECS 452 Lecture 9: TLP (Thread-Level Parallelism)
Instructor: Gokhan Memik, EECS Dept., Northwestern University
This lecture is adapted from slides by Iris Bahar (Brown), James Hoe (CMU), and John Shen (CMU -> Intel -> Nokia).

Overview of ILP
When executing a program, how many independent operations can be performed in parallel?
How have we taken advantage of ILP so far?
  Pipelining (including superpipelining): overlap different stages of different instructions; limited by the divisibility of an instruction and by ILP
  Superscalar: overlap the processing of different instructions in all stages; limited by ILP
  VLIW: limited by ILP
What methods have we employed to increase ILP?
  Dynamic/static register renaming -> reduces WAW and WAR hazards
  Dynamic/static instruction scheduling -> reduces RAW hazards
  Predictions/speculation to optimistically break dependences
    Covered: branch prediction, memory dependence prediction
    Not covered: value prediction
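To make the renaming point concrete, here is a small illustrative C fragment (not from the slides; the variable names are made up). The reuse of x creates WAW and WAR hazards; giving the second use a fresh name leaves only the true RAW dependences, so the two chains can proceed in parallel.

    /* A minimal sketch of how register renaming removes false dependences. */
    double renaming_example(double a, double b, double c,
                            double d, double e, double f) {
        /* Before renaming: x is reused, so the two writes form a WAW hazard
           and the read of x in y forms a WAR hazard with the second write. */
        double x, y, z;
        x = a + b;      /* write x                                    */
        y = x * c;      /* read x: true RAW dependence                */
        x = d + e;      /* write x: WAW with the first write, WAR with the read */
        z = x * f;      /* read x: true RAW dependence                */

        /* After renaming: the second use of x gets a fresh name (x2), so only
           the true RAW chains remain and they can execute in parallel. */
        double x1, x2, y2, z2;
        x1 = a + b;
        y2 = x1 * c;
        x2 = d + e;
        z2 = x2 * f;

        return y + z + y2 + z2;   /* keep the work observable */
    }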

Thread-Level Parallelism
The average processor actually executes several programs (a.k.a. processes, threads of control, etc.) at the same time: time multiplexing.
The instructions from these different threads have a lot of parallelism: thread-level parallelism.
Taking advantage of thread-level parallelism by concurrent execution can improve the overall throughput of the processor, but NOT the turn-around time of any single thread.
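A minimal sketch of thread-level parallelism in software, assuming POSIX threads (the work function and thread count are illustrative, not from the slides). The two threads are independent, so concurrent execution can raise total throughput, even though neither individual thread finishes any sooner.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative independent work: each thread computes its own sum. */
    static void *work(void *arg) {
        long id = (long)arg;
        double sum = 0.0;
        for (long i = 0; i < 100000000L; i++)
            sum += (double)(i ^ id);
        printf("thread %ld done (sum=%g)\n", id, sum);
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, work, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        return 0;
    }

Compile with, e.g., cc -pthread tlp_demo.c.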

Context Switch
Time-multiplexed multiprocessing on uniprocessors started back in 1962.
Even concurrent execution by time-multiplexing improves throughput. How?
  A single thread would effectively idle the processor when spin-waiting for I/O to complete (e.g. disk, keyboard, mouse); it can spin for thousands to millions of cycles at a time.
  A thread should instead go to sleep when waiting on I/O and let other threads use the processor, a.k.a. a context switch.
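A sketch of the two behaviors described above, assuming a POSIX-style blocking read (device_ready and the surrounding code are purely illustrative). Spin-waiting keeps the processor busy doing nothing; the blocking call lets the OS put the thread to sleep and switch to another runnable thread.

    #include <unistd.h>

    static volatile int device_ready;   /* illustrative device-status flag */

    void wait_by_spinning(void) {
        /* The thread occupies the CPU for the entire wait:
           thousands to millions of wasted cycles. */
        while (!device_ready)
            ;   /* spin */
    }

    void wait_by_blocking(int fd, char *buf, size_t n) {
        /* read() blocks in the kernel; the OS marks this thread as sleeping
           and context-switches to another thread, so the processor does
           useful work during the I/O. */
        (void)read(fd, buf, n);
    }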

Context Switching (cont.)
A context is all of the processor (plus machine) state associated with a particular process:
  programmer-visible state: program counter, register file contents, memory contents
  and some invisible state: control and status registers, page table base pointer, page tables
  What about the caches (virtually vs. physically addressed), BTB and TLB entries?
Classic context switching:
  a timer interrupt stops a program mid-execution (at a precise point)
  the OS saves the context of the stopped thread
  the OS restores the context of a previously stopped thread (all except the PC)
  the OS uses a return from exception to jump to the restarting PC
  The restored thread has no idea it was interrupted, removed, later restored, and restarted.
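A software sketch of the classic switch, with hypothetical structure and function names (real kernels do the register save/restore in hand-written assembly):

    /* Hypothetical per-thread context record kept in OS-private memory. */
    struct context {
        unsigned long pc;          /* program counter of the stopped thread */
        unsigned long gpr[32];     /* architected register file contents    */
        unsigned long status;      /* control/status register               */
        unsigned long pt_base;     /* page table base pointer               */
    };

    /* Placeholder prototypes: the real versions are low-level assembly. */
    void save_registers(struct context *c);
    void restore_registers(const struct context *c);
    void return_from_exception(unsigned long pc);

    /* Called from the timer-interrupt handler, which stopped the current
       thread at a precise point. */
    void context_switch(struct context *stopped, const struct context *next) {
        save_registers(stopped);            /* 1. save the stopped thread       */
        restore_registers(next);            /* 2. restore a previously stopped one */
        return_from_exception(next->pc);    /* 3. jump to the restarting PC;
                                                  the thread never notices.    */
    }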

Saving and Restoring Context
Saving:
  Context information that occupies unique resources must be copied and saved to a special memory region belonging exclusively to the OS, e.g. program counter, register file contents, control/status registers.
  Context information that occupies commodity resources just needs to be hidden from the other threads, e.g. active memory pages can be left in physical memory, but their page translations must be removed (and remembered).
Restoring is the opposite of saving.
The act of saving and restoring is performed by the OS in software; it can take a few hundred cycles per switch, but the cost is amortized over the execution quantum.
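For a rough sense of the amortization (the numbers are illustrative, not from the slides): at 1 GHz with a 10 ms execution quantum, a quantum is about 10,000,000 cycles, so a few hundred cycles of switch overhead is on the order of 300 / 10,000,000, i.e. well below 0.01% of execution time.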

Fast Context Switches
A processor becomes idle when a thread runs into a cache miss. Why not switch to another thread?
  A cache miss lasts only tens of cycles, but it costs the OS at least 64 cycles just to save and restore the 32 GPRs.
Solution: fast context switching in hardware.
  Replicate the hardware context registers (PC, GPRs, control/status, page table base pointer): eliminates copying.
  Allow multiple contexts to share some resources (i.e., include the process ID in the cache, BTB and TLB match tags): eliminates cold starts.
  A hardware context switch then takes only a few cycles: set the PID register to the next process ID and select the corresponding set of hardware context registers to be active.
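A behavioral sketch of the hardware scheme (structure and field names are made up; a real design selects the bank in the fetch and register-read logic, not in software). The point is that "switching" is just changing which replicated bank the PID register selects, with no copying and no cache/TLB flush.

    #define NUM_HW_CONTEXTS 4

    struct hw_context {
        unsigned long pc;
        unsigned long gpr[32];
        unsigned long status;
        unsigned long pt_base;
    };

    /* Replicated context registers, one bank per hardware thread. */
    static struct hw_context ctx[NUM_HW_CONTEXTS];
    static unsigned active_pid;                 /* selects the live bank */

    /* A fast context switch is only this: no state is copied, and cache,
       BTB and TLB entries stay valid because they are tagged with the
       process ID rather than flushed. */
    static inline void hw_context_switch(unsigned next_pid) {
        active_pid = next_pid;                  /* takes effect in a few cycles */
    }

    static inline unsigned long active_pc(void) {
        return ctx[active_pid].pc;              /* bank chosen by the PID register */
    }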

Really Fast Context Switches
When a pipelined processor stalls due to a RAW dependence between instructions, the execution stage is idling. Why not switch to another thread?
  Not only do you need hardware contexts, switching between contexts must be instantaneous to have any advantage!
  If this can be done, you don't need complicated forwarding logic to avoid stalls.
  RAW dependences and long-latency operations (multiplies, cache misses) do not cause throughput loss.
  Multithreading is a latency-hiding technique.

Fine-Grain Multithreading
Suppose instruction processing can be divided into several stages, but some stages have very long latencies. Then we could either:
  run the whole pipeline at the speed of the slowest stage, or
  superpipeline the longer stages, but then back-to-back dependences cannot be forwarded.
Fine-grain multithreading offers a third option: interleave instructions from different threads cycle by cycle, so consecutive pipeline slots belong to independent threads and a thread's own back-to-back dependences never need forwarding.
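A toy round-robin (barrel-processor style) interleaving, purely illustrative: with NUM_THREADS hardware threads issued in rotation, two consecutive instructions of the same thread are NUM_THREADS cycles apart in the pipeline, so their RAW dependence needs no forwarding.

    #include <stdio.h>

    #define NUM_THREADS 4

    int main(void) {
        int next_instr[NUM_THREADS] = {0};
        /* Each cycle issues from the next thread in round-robin order. */
        for (int cycle = 0; cycle < 12; cycle++) {
            int t = cycle % NUM_THREADS;
            printf("cycle %2d: issue instruction %d of thread %d\n",
                   cycle, next_instr[t]++, t);
        }
        return 0;
    }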

Really, Really Fast Context Switches
A superscalar processor's datapath must be over-resourced: it has more functional units than the available ILP can keep busy, because the units are not universal.
  Current 4- to 8-way designs only achieve an IPC of 2 to 3, so some units must be idling in each cycle.
Why not switch to another thread?

Simultaneous Multithreading
Dynamic and flexible sharing of the functional units between multiple threads:
  increases utilization
  increases throughput
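A behavioral sketch of SMT-style issue (the data structure and the greedy thread order are made up): each cycle, the issue logic fills its slots with ready instructions from any thread, so slots that one thread cannot use go to another.

    #define ISSUE_WIDTH  4
    #define NUM_THREADS  4

    struct instr_queue { int ready_count; };   /* per-thread, illustrative */

    /* One cycle of issue: take ready instructions from whichever threads
       have them, up to the machine's issue width. */
    static int smt_issue_one_cycle(struct instr_queue q[NUM_THREADS]) {
        int issued = 0;
        for (int t = 0; t < NUM_THREADS && issued < ISSUE_WIDTH; t++) {
            while (q[t].ready_count > 0 && issued < ISSUE_WIDTH) {
                q[t].ready_count--;            /* issue one instruction of thread t */
                issued++;
            }
        }
        return issued;   /* slots go idle only if no thread had ready work */
    }

Real SMT fetch and issue policies are fairer than this greedy per-thread loop, prioritizing threads so that no single thread monopolizes the shared queues.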

Digital/Compaq/HP Alpha EV8
Technology:
  1.2 ~ 2.0 GHz
  250 million transistors (mostly in the caches)
  0.125 um CMOS with copper interconnect, 1.2 V Vdd
  1100 signal pins (flip chip), and probably about as many power and ground pins
Architecture:
  8-wide superscalar with support for 4-way SMT; exploits both ILP and thread-level parallelism
  On-chip router and directory support for building glueless 512-way cc-NUMA SMPs

EV8: Superscalar to SMT
In SMT mode, it is as if there are 4 processors on a chip that share their caches and TLB.
Replicated hardware contexts:
  program counter
  architected registers (actually just the renaming table, since architected registers and rename registers come from the same physical pool)
Shared resources:
  rename register pool (larger than needed by one thread)
  instruction queue (i.e., unified reservation stations)
  caches, TLB, branch predictors
The dynamic superscalar execution pipeline is more or less unchanged.
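The same division, written as an illustrative data-structure sketch (sizes and field names are invented, not the EV8's actual parameters): per-thread state is small, while the expensive structures are pooled and shared.

    #define SMT_THREADS 4

    struct per_thread_state {           /* replicated, one per hardware thread */
        unsigned long pc;
        unsigned char rename_map[32];   /* architected -> physical register    */
    };

    struct shared_core_state {          /* one copy, shared by all threads     */
        unsigned long phys_regs[512];   /* single physical register pool       */
        /* unified instruction queue (reservation stations), caches, TLB,
           branch predictors, ... */
    };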

SMT Issues and Summary
Adding SMT to a superscalar:
  Single-thread performance is slightly worse due to overhead (longer pipeline, longer combinational delays).
  Shared resources can be over-utilized: contention for instruction and data memory bandwidth, interference in the caches, TLB and BTBs.
  But remember that multithreading can hide some of these penalties.
For a given design point, SMT should be more efficient than a plain superscalar if thread-level parallelism is available.
High-degree SMT faces scalability problems similar to those of wide superscalars:
  needs numerous I-cache and D-cache ports
  needs numerous register file read and write ports
  the dynamic renaming and reordering logic is not any simpler

MT Architectures