Lecture 18: Multithreading and Multicores


18-447 Lecture 18: Multithreading and Multicores
James C. Hoe, Dept of ECE, CMU, April 1, 2009

Announcements and handouts: Handout #13, Project 4 (on Blackboard); "Design Challenges of Technology Scaling," Shekhar Borkar, IEEE Micro, 1999 (on Blackboard).

Instruction-Level Parallelism

When executing a program, how many independent instructions can be performed in parallel?

How to take advantage of ILP:
- Pipelining (including superpipelining): overlap different stages from different instructions; limited by the divisibility of an instruction and by ILP.
- Superscalar (including VLIW): overlap the processing of different instructions in all stages; limited by ILP.

How to increase ILP:
- Dynamic/static register renaming: reduces WAW and WAR hazards.
- Dynamic/static instruction scheduling: reduces RAW hazards.
- Use predictions to optimistically break dependences.

Thread-Level Parallelism

The average processor actually executes several programs (a.k.a. processes, threads of control, etc.) at the same time, by time multiplexing. The instructions from these different threads have lots of parallelism. Taking advantage of thread-level parallelism, i.e. by concurrent execution, can improve the overall throughput of the processor (but not the turn-around time of any one thread).

Assumption: a single thread cannot use the full performance potential of the processor; peak performance is always higher than average, so you must overprovision to achieve your average performance target.

Classic Time-multiplex Multiprocessing

Time-multiplex multiprocessing on uniprocessors started back in 1962, to enable sharing. Even concurrent execution by time-multiplexing improves throughput: a single thread would effectively idle the processor when spin-waiting for a new event or for I/O to complete, and it can spin-wait for thousands to millions of cycles at a time. (Figure: a single thread's timeline alternates compute phases with long waiting-for-I/O gaps.) A thread should just go to sleep when waiting and let other threads use the processor. (Figure: the same processor timeline interleaves compute1 and compute2, filling the I/O gaps.) Keep in mind, this is very coarse-grain interleaving.
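To make the spin-wait vs. sleep contrast concrete, here is a minimal C sketch, not from the lecture: the flag, lock, and function names are illustrative, and a real driver would signal io_cond from its completion handler.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical I/O-completion flag shared with a device-handling thread. */
static volatile bool io_done = false;
static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  io_cond = PTHREAD_COND_INITIALIZER;

/* Spin-waiting: the thread keeps the processor busy doing nothing,
 * possibly for thousands to millions of cycles. */
void wait_for_io_spin(void) {
    while (!io_done) {
        /* burn cycles */
    }
}

/* Sleeping: the thread gives up the processor, so the OS can
 * time-multiplex another runnable thread onto it until the I/O completes. */
void wait_for_io_sleep(void) {
    pthread_mutex_lock(&io_lock);
    while (!io_done)
        pthread_cond_wait(&io_cond, &io_lock);
    pthread_mutex_unlock(&io_lock);
}
```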

Classic Context Switching

A context is all of the processor (plus machine) state associated with a particular process:
- programmer-visible state: program counter, register file contents, memory contents
- and some invisible state: control and status registers, page table base pointers, page tables
What about cache, BTB and TLB entries?

Classic context switching:
- An interrupt stops a program mid-execution (precisely); a thread can also voluntarily give up control by a syscall.
- The OS saves away the context of the stopped thread.
- The OS restores the context of a previously stopped thread.
- The OS uses a return-from-exception to jump to the restarting PC.
The restored thread has no idea it was interrupted, removed, later restored and restarted.

Saving and Restoring Context

Context information that occupies unique resources must be copied and saved to memory by the OS, e.g. PC, GPRs, control/status registers. Context information that occupies commodity resources just needs to be hidden from the other threads, e.g. active pages in memory can be left in place but hidden via address translation (more on this when we talk about VM protection). Restoring is the opposite of saving.

The act of saving and restoring is performed by the OS in software; it can take a few hundred cycles per switch, but the cost is amortized over a long execution quantum. (If you want the full story, take a real OS course!)
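A minimal C sketch of what the OS copies on a classic software switch, assuming a 32-register machine like the one used in this course; the struct and field names are illustrative, not an actual OS interface.

```c
#include <stdint.h>

/* Per-thread context the OS must copy to/from memory on a switch:
 * the state that lives in unique processor resources. */
typedef struct {
    uint32_t pc;        /* program counter                */
    uint32_t gpr[32];   /* general-purpose register file  */
    uint32_t status;    /* control/status register        */
    uint32_t pt_base;   /* page table base pointer        */
    /* Memory pages and cache/BTB/TLB contents are NOT copied:
     * pages are commodity resources hidden via address translation,
     * and the cache/BTB/TLB simply restart cold (or are tagged,
     * as on the next slide). */
} context_t;

/* Copying roughly 36 words out and 36 words back in (plus the trap
 * and dispatch work) is why a switch costs a few hundred cycles;
 * the actual register save/restore is done in assembly. */
void context_switch(context_t *save, const context_t *restore);
```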

Fast Context Switches

A processor becomes idle when a thread runs into a cache miss. Why not switch to another thread? A cache miss lasts only tens of cycles, but it costs at least 64 instructions just to save and restore the 32 GPRs.

Solution: fast context switch in hardware.
- Replicate the hardware context registers (PC, GPRs, control/status, page table base pointer): eliminates copying.
- Allow multiple contexts to share some resources, i.e. include the process ID in the cache, BTB and TLB match tags: eliminates cold starts.
- A hardware context switch then takes only a few cycles: set the PID register to the next process ID and select the corresponding set of hardware context registers to be active.

Really Fast Context Switches

When a pipelined processor stalls due to a RAW dependence between instructions, the execution stage is idling. Why not switch to another thread? Not only do you need hardware contexts, switching between contexts must be instantaneous to have any advantage!! If this can be done, you don't need complicated forwarding logic to avoid stalls: RAW dependences and long-latency operations (multiply, cache misses) do not cause throughput performance loss. Multithreading is a latency-hiding technique.

Fine-grain Multithreading (hide latency)

Superpipeline revisited: a deeper pipeline increases frequency and performance, but back-to-back dependences cannot be forwarded, so there is no performance gain without ILP. (Figure: a superpipelined timeline where Inst 0 and Inst 1 each pass through stages Fa Fb Da Db Ea Eb Wa Wb; a dependent instruction cannot immediately follow its producer.)

What about a 2-way multithreaded superpipeline? (Figure: instructions from thread T1 and thread T2 are interleaved cycle by cycle, so adjacent pipeline slots never hold dependent instructions from the same thread.)

Barrel Instruction Pipeline (hide latency)

On each cycle, select a ready thread from the scheduling pool [HEP, Smith]:
- only one instruction per context in flight at once
- on a long-latency stall, remove the context from scheduling
This actually makes pipelining simpler: there are no data dependences in flight, hence no stalls or forwarding, and no penalty in making the pipeline deeper for frequency or complexity, assuming there are many threads waiting to run. (Figure: instructions from threads T1 through T8 rotate through pipeline stages A through H, one thread entering per cycle.)
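A behavioral C sketch of the barrel-style thread select described above (a cycle-level model, not a hardware description; all names are illustrative): each cycle pick the next ready context round-robin, with at most one instruction per context in flight, and long-latency stalls removing a context from the pool by clearing its ready bit.

```c
#include <stdbool.h>

#define NTHREADS 8

typedef struct {
    bool ready;      /* false while waiting on a long-latency event        */
    bool in_flight;  /* true while this context's instruction is in the pipe */
} hw_context_t;

static hw_context_t ctx[NTHREADS];
static int last_issued = NTHREADS - 1;

/* Called once per cycle: returns the thread-id to fetch/issue from,
 * or -1 if no context is ready (the pipeline gets a bubble). */
int barrel_select(void) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last_issued + i) % NTHREADS;
        if (ctx[t].ready && !ctx[t].in_flight) {
            last_issued = t;
            ctx[t].in_flight = true;   /* only one instruction per context */
            return t;
        }
    }
    return -1;  /* with many runnable threads this rarely happens */
}
```

When its instruction retires, a context clears in_flight and becomes eligible again, so no two in-flight instructions can ever depend on each other.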

Multithreading and Out-of-order (better utilization)

(Figure: an out-of-order datapath with a fetch unit, OOO dispatch unit, reorder buffer, and functional units: FDiv unpipelined (16 cyc), FMult (4 cyc), FAdd (2 cyc), ALU1, ALU2, Load/Store (variable latency).)

A superscalar processor datapath must be over-resourced: it has more functional units than ILP because the units are not universal. Current 4- to 8-way designs only achieve an IPC of 2 to 3, so some units must be idling in each cycle. Why not switch to another thread?

Simultaneous Multi-Threading [Eggers, et al.]

(Figure: contexts A through Z each have their own fetch unit, OOO dispatch unit and reorder buffer, but share the common pool of functional units.)

Dynamic and flexible sharing of the functional units between multiple threads increases utilization and increases throughput.

The rise of multicores

Transistor Scaling

(Figure: transistor cross-sections illustrating gate-length scaling, from http://www.intel.com/museum/online/circuits.htm; for scale, the distance between silicon atoms is ~500 pm.)

Basic Scaling Theory

Planned scaling occurs in discrete nodes, where each is ~0.7x of the previous in linear dimension, e.g., 90nm, 65nm, 45nm, 32nm, 22nm, 15nm, ... The End.

Take the same design and reduce the linear dimensions by 0.7x (aka a "gate shrink"); this leads to:
- delay = 0.7x, frequency = 1.43x
- capacitance = 0.7x
- die area = 0.5x
- Vdd = 0.7x (if constant field) or Vdd = 1x (if constant voltage)
- power = C V^2 f = 0.5x (if constant field), BUT power = 1x (if constant voltage)

Take the same area, then:
- transistor count = 2x
- power = 1x (constant field), power = 2x (constant voltage)
[Refer to the Shekhar Borkar article for the more complete story.]

Moore's Law

"The number of transistors that can be economically integrated shall double every 24 months." [http://www.intel.com/research/silicon/mooreslaw.htm]
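As a worked check of the bullets above, carrying the 0.7x shrink factors through the dynamic power formula:

\[
P = C V^2 f \;\Rightarrow\; P' = (0.7C)(0.7V)^2(1.43f) \approx 0.49\,C V^2 f \approx 0.5\,P \quad \text{(constant field)}
\]
\[
P' = (0.7C)\,V^2\,(1.43f) \approx 1.0\,P \quad \text{(constant voltage, } V' = V\text{)}
\]

Filling the same die area with 2x the transistors then doubles these figures, giving the 1x (constant field) and 2x (constant voltage) totals quoted on the slide.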

Moore's Law Performance

Microprocessor performance has been doubling about every 24 months. Is this Moore's Law? Are we doing well? According to scaling theory, we should get:
- constant complexity: 1x transistors at 1.43x frequency, i.e. 1.43x performance at 0.5x power
- max complexity: 2x transistors at 1.43x frequency, i.e. 2.8x performance at constant power
Instead, we have been getting (for high-performance CPUs) ~2x transistors at ~2x frequency, which is how we get about ~2x performance at ~2x power.

Performance (In)efficiency

To hit the expected performance target on single-thread microprocessors, we had been pushing frequency harder by deepening pipelines, and we used the 2x transistors to build more complicated microarchitectures so the fast/deep pipelines don't stall (i.e., caches, branch prediction, superscalar, out-of-order). The consequence of this performance inefficiency: power ran into the limit of economical PC cooling [ITRS]. (Figure: power-trend chart with Intel Tejas projected at ~150W. Guess what happened next.) [From Shekhar Borkar, IEEE Micro, July 1999]

Multicore Performance Efficiency

PC-class cooling and packaging technologies cap per-die CPU power at ~150W. Going forward, we must still deliver 2x performance every 24 months, but do it without increasing power; in other words, we must go faster and at the same time use less energy per instruction. How do you do that?

How about just using the 2x transistors per technology node to double the number of cores? That gives 2x the aggregate performance without even having to increase frequency, slowing down or even stopping the power climb. *** This only works well when we have sufficient parallel workloads to keep all cores busy. *** The uncore portion of processors becomes very important.

Chip-Multiprocessor

(Figure: several cores, each with a private cache, connected by a fat interconnect to a big L2 and a bigger L3.)

Current CMPs adopt the familiar SMP paradigm; future design focuses on the uncore: how to support interprocessor communication, and how to support programmability.
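A sketch of the arithmetic behind the core-doubling argument, combining it with the constant-field numbers from the scaling slide (this calculation is an added illustration, not from the slides, and it assumes frequency is held constant rather than raised 1.43x):

\[
P'_{\text{core}} = (0.7C)(0.7V)^2 f \approx 0.34\,P_{\text{core}},
\qquad
P_{\text{2 cores}} \approx 2 \times 0.34\,P_{\text{core}} \approx 0.69\,P_{\text{core}}
\]

Aggregate throughput doubles while power stays at or below the old budget, so energy per instruction falls to roughly a third; but only if parallel work keeps both cores busy.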

Proj 4: Multithreading Multicore

Checkpoint 1: Caches

(Figure: the IF ID EX MEM WB pipeline with a combinational I-cache feeding fetch over a 32-bit path and a D-cache in MEM, 0.5 KB each, both connected to memory over 128-bit buses; memory has a 10-cycle latency.)

On a cache miss: 1. stall the previous stages, 2. insert bubbles for the next stages. Real memory is even slower: ~150 cycles for a 3.0 GHz CPU with DDR2.
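A behavioral C sketch of that stall/bubble rule (the struct, enum, and signal names are illustrative stand-ins, not the project's actual Verilog interface): while a miss is outstanding, hold every stage at or before the missing one and feed bubbles into the stages after it.

```c
#include <stdbool.h>

typedef struct {
    bool icache_miss_pending;  /* fetch is waiting on the I-cache */
    bool dcache_miss_pending;  /* MEM is waiting on the D-cache   */
} pipe_state_t;

/* Per-cycle control for the 5-stage IF ID EX MEM WB pipeline:
 * decide which stage latches hold their value (stall) and which
 * receive a NOP (bubble). */
void miss_control(const pipe_state_t *s,
                  bool stall[5],    /* stall[i]: stage i re-latches its input */
                  bool bubble[5])   /* bubble[i]: stage i receives a NOP      */
{
    enum { IF, ID, EX, MEM, WB };
    for (int i = IF; i <= WB; i++) { stall[i] = false; bubble[i] = false; }

    if (s->dcache_miss_pending) {        /* D-cache miss: freeze IF..MEM and */
        for (int i = IF; i <= MEM; i++)  /* let WB drain with bubbles behind */
            stall[i] = true;
        bubble[WB] = true;
    } else if (s->icache_miss_pending) { /* I-cache miss: freeze fetch only, */
        stall[IF] = true;                /* bubbles flow down from decode    */
        bubble[ID] = true;
    }
}
```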

Selected cache details

Both caches are direct-mapped, with 128-bit cache lines, write-back. Cache hits are indicated by p_data_ready in the same cycle as the read/write request; misses take 10 or more cycles. The instruction cache is read-only and returns the entire cache line. Modify the caches to output hit/miss statistics. There is a new memory module with a wider data bus to accommodate the caches.

Using the Cache

Performing a cache read:
- Provide the address on p_addr_in, and assert p_re.
- Maintain the inputs until p_data_ready is asserted. Cache hit: within the same cycle (combinational). Cache miss: at least 10 cycles to read from memory.
- Then check p_addr_out, and get the data from p_data_out.
(Figure: read-hit and read-miss timing waveforms, showing p_addr_in held at A1, p_re asserted, p_data_ready asserting combinationally on a hit or several cycles later on a miss, and p_addr_out/p_data_out presenting A1/D1.)
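A behavioral sketch of driving that read handshake, written in C as if stepping a cycle-level model of the cache. The p_* signal names follow the slide; the cache_io_t struct and step_cycle function are stand-ins for the project's Verilog module and clock, so treat them as assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-in for the cache's port signals, sampled once per clock. */
typedef struct {
    uint32_t p_addr_in;     /* address from the pipeline            */
    bool     p_re;          /* read enable                          */
    bool     p_data_ready;  /* hit: same cycle; miss: >= 10 cycles  */
    uint32_t p_addr_out;
    uint32_t p_data_out;
} cache_io_t;

void step_cycle(cache_io_t *io);   /* evaluates one clock of the model */

/* Issue a read and hold the inputs until the cache answers. */
uint32_t cache_read(cache_io_t *io, uint32_t addr) {
    io->p_addr_in = addr;          /* provide the address...           */
    io->p_re      = true;          /* ...and assert p_re               */
    do {
        step_cycle(io);            /* keep the inputs stable each cycle */
    } while (!io->p_data_ready);   /* hit: ready this cycle; miss: later */
    io->p_re = false;
    /* p_addr_out echoes the address; the data is on p_data_out. */
    return io->p_data_out;
}
```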

Using the Cache

Performing a cache write:
- Provide p_addr_in and p_data_in, and assert p_we.
- Maintain the inputs until p_data_ready and p_writing are asserted. Cache hit: within the same cycle (combinational). Cache miss: at least 10 cycles to bring the block from memory.
- Note: p_writing is asserted 1 cycle after p_data_ready.
(Figure: write-hit and write-miss timing waveforms, showing p_addr_in/p_data_in held at A1/D1, p_we asserted, and p_data_ready followed one cycle later by p_writing.)

Structural Hazard on D-cache Writes

On a cache hit, the cache needs an extra cycle to perform the write, so the D-cache is unavailable for 2 cycles. If a ST is followed by a memory instruction: structural hazard! How to deal with this? In the Decode stage, detect the hazard, i.e., the instruction in E is a ST and the instruction in D is a memory instruction; on a hazard, inject one bubble. This allows the older ST to complete before executing the subsequent memory instruction.
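A minimal C sketch of that decode-stage check (the opcode enum is illustrative): if the instruction in E is a store and the instruction in D needs the D-cache, inject one bubble so the older store gets its second write cycle first.

```c
#include <stdbool.h>

typedef enum { OP_ALU, OP_LOAD, OP_STORE, OP_OTHER } op_t;

static bool is_mem_op(op_t op) { return op == OP_LOAD || op == OP_STORE; }

/* Decode-stage structural-hazard check for the 2-cycle store hit:
 * returns true when one bubble must be injected between D and E. */
bool dcache_write_hazard(op_t inst_in_E, op_t inst_in_D) {
    return inst_in_E == OP_STORE && is_mem_op(inst_in_D);
}
```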

Checkpoint 1: Pipeline Changes

I-cache: addressed at a cache-block granularity; stall fetch on a miss and propagate bubbles.
D-cache: addressed at a word granularity; stall on load/store misses; stall on (2-cycle) store hits; be deliberate when enabling reads and writes.
As before, interlock for load dependences (no delay slot): stall on a load dependency.

Checkpoint 2: Multithreading

We will make one pipeline run two programs/threads simultaneously:
- Add a second set of architectural state registers (PC and register file) for thread 0 and thread 1.
- Instructions in the pipeline are tagged by a thread-id, 0 or 1; use the instruction's thread-id to choose which set of architectural state to access. What about memory state? (more later)
- Keep the pipeline full *without forwarding*: the decode stage tries to issue an instruction from thread 0 unless it encounters a dependency or structural hazard; while thread 0 is stalled, decode issues from thread 1; if both thread 0 and thread 1 stall, the decode stage stalls; as soon as thread 0 is able to advance, return to issuing from thread 0.
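A behavioral C sketch of that issue policy, as a function deciding which thread decode issues each cycle (the stall inputs would come from the dependency and structural-hazard checks; the names are illustrative, not the project's actual interface):

```c
#include <stdbool.h>

/* Returns 0 or 1, the thread-id decode issues this cycle, or -1 to stall.
 * Thread 0 has strict priority; thread 1 only fills cycles in which
 * thread 0 cannot advance. */
int decode_select(bool t0_has_inst, bool t0_stalled,
                  bool t1_has_inst, bool t1_stalled)
{
    if (t0_has_inst && !t0_stalled)
        return 0;                  /* as soon as thread 0 can go, it goes   */
    if (t1_has_inst && !t1_stalled)
        return 1;                  /* cover thread 0's stall with thread 1  */
    return -1;                     /* both stalled (or empty): issue a bubble */
}
```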

How do we fetch?

Separate IRs for thread 0 and thread 1: ideally both are full on each cycle so the decode stage can choose a thread according to hazards; if one IR is empty, decode should try to issue from the other thread. (Figure: IR0 and IR1, each with a valid bit, feeding the decode stage.)

The fetch stage needs to be synchronized with decode to replenish the IR that is issuing, except:
- if IR0 is empty, always try to fetch into IR0 next (even if thread 1 is issuing)
- if IR1 is empty, it is only fetched if IR0 is not empty and thread 0 is not issuing

Checkpoint 3: Multicores

Two cores have private I-caches (read-only); the two I-caches share a common instruction memory port. The two cores also share a common D-cache interface. This requires a 2-port fair arbiter: if both cores need to access memory in the same cycle, one side is told to stall, and the core that is stalled is guaranteed to go on the next round. (Figure: one core running threads 0 and 1 and another running threads 2 and 3, each with its own I-cache, sharing the D-cache and a single 128-bit port to memory.)
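A behavioral C sketch of such a 2-port fair arbiter (names illustrative; it assumes a stalled core keeps its request asserted the following cycle, which is what the stall implies): on a tie, the grant alternates, so the core stalled this round wins the next one.

```c
#include <stdbool.h>

static int last_winner = 1;   /* so core 0 wins the very first tie */

/* Per-cycle arbitration between core 0 and core 1 for the shared port.
 * Returns the granted core (0 or 1), or -1 if neither requests;
 * *stall_other is set when the losing core must be stalled. */
int arbitrate(bool req0, bool req1, bool *stall_other) {
    *stall_other = false;
    if (req0 && req1) {
        int winner = 1 - last_winner;  /* the core stalled last time wins */
        last_winner = winner;
        *stall_other = true;
        return winner;
    }
    if (req0) { last_winner = 0; return 0; }
    if (req1) { last_winner = 1; return 1; }
    return -1;
}
```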

Deliverables

All project checkpoints are due by 4/30; no late credit, partial credit by completed checkpoints:
- Checkpoint 1: 90 points
- Checkpoint 2: 240 points
- Checkpoint 3: 120 points

Extra Credit (50 max):
- Checkpoint 1 checked off by 4/9: 10 points
- Checkpoints 1&2 checked off by 4/21: 20 points
- Checkpoints 1&2&3 checked off by 4/28: 20 points