CS252 Graduate Computer Architecture Spring 2014 Lecture 13: Multithreading


Krste Asanovic, krste@eecs.berkeley.edu, http://inst.eecs.berkeley.edu/~cs252/sp14

Last Time in Lecture 12: Synchronization and Memory Models
- Producer-consumer versus mutual exclusion
- Sequential consistency
- Relaxed memory models
- Fences
- Atomic memory operations
- Non-blocking synchronization

Multithreading
It is difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control. Many workloads can make use of thread-level parallelism (TLP):
- TLP from multiprogramming (run independent sequential jobs)
- TLP from multithreaded applications (run one job faster using parallel threads)
Multithreading uses TLP to improve utilization of a single processor.

Multithreading
How can we guarantee no dependencies between instructions in a pipeline? One way is to interleave execution of instructions from different program threads on the same pipeline. Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
T1: LD   x1, 0(x2)
T2: ADD  x7, x1, x4
T3: XORI x5, x4, 12
T4: SD   0(x7), x5
T1: LD   x5, 12(x1)
(Pipeline diagram: each instruction occupies stages F D X M W in consecutive cycles t0-t9.)
The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.
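The timing claim above can be checked with a small sketch (my own model, not the lecture's code): with 4 threads in fixed round-robin interleave on a 5-stage pipe, a thread's instruction reaches write-back before that thread's next instruction reaches decode.

```python
# Fixed 4-way interleave on a 5-stage pipeline (F D X M W).
# With round-robin issue, thread T's instruction i writes back (W)
# before T's instruction i+1 reads registers in decode (D),
# so no bypassing or interlocks are needed.

STAGES = ["F", "D", "X", "M", "W"]

def schedule(num_threads, instrs_per_thread):
    """Return {(thread, instr): {stage: cycle}} for a fixed interleave."""
    timing = {}
    for i in range(instrs_per_thread):
        for t in range(num_threads):
            issue = i * num_threads + t   # one instruction issued per cycle, round-robin
            timing[(t, i)] = {s: issue + k for k, s in enumerate(STAGES)}
    return timing

timing = schedule(num_threads=4, instrs_per_thread=2)
# Thread 0: instruction 0 writes back in cycle 4; instruction 1
# decodes (reads the register file) in cycle 5 -- after write-back.
assert timing[(0, 0)]["W"] < timing[(0, 1)]["D"]
```

With fewer than 4 threads on this 5-stage pipe the guarantee breaks, which is why the slide specifies a non-bypassed pipeline fully covered by the interleave.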

CDC 6600 Peripheral Processors (Cray, 1964)
First multithreaded hardware:
- 10 virtual I/O processors
- Fixed interleave on a simple pipeline
- Pipeline has a 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline
(Diagram: multiple PCs and register files feed a shared I$/IR/X/Y/D$ pipeline, selected by a thread-select signal.)
The thread select has to be carried down the pipeline to ensure the correct state bits are read/written at each pipe stage. Appears to software (including the OS) as multiple, albeit slower, CPUs.

Multithreading Costs
Each thread requires its own user state:
- PC
- GPRs
It also needs its own system state:
- Virtual-memory page-table-base register
- Exception-handling registers
Other overheads:
- Additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity)
- More OS overhead to schedule more threads (where do all these threads come from?)
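The per-thread state listed above can be summarized as a small sketch (field names are mine, not any real ISA's):

```python
# Illustrative model of the state a multithreaded core must replicate
# per hardware thread: user state (PC, GPRs) plus system state
# (page-table base, exception-handling registers).
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    # User state
    pc: int = 0
    gprs: list = field(default_factory=lambda: [0] * 32)
    # System state
    page_table_base: int = 0
    exception_regs: dict = field(default_factory=dict)

contexts = [ThreadContext() for _ in range(4)]  # a 4-way multithreaded core
```

Everything else (caches, TLBs, execution units) is shared, which is where the conflict overheads in the slide come from.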

Thread Scheduling Policies
Fixed interleave (CDC 6600 PPUs, 1964):
- Each of N threads executes one instruction every N cycles
- If a thread is not ready to go in its slot, insert a pipeline bubble
Software-controlled interleave (TI ASC PPUs, 1971):
- OS allocates S pipeline slots amongst N threads
- Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
Hardware-controlled thread scheduling (HEP, 1982):
- Hardware keeps track of which threads are ready to go
- Picks the next thread to execute based on a hardware priority scheme
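Two of the policies above can be sketched as slot-selection functions (a minimal model of my own, not hardware description):

```python
# Which thread issues in a given cycle, under two of the policies above.

def fixed_interleave(ready, n_threads, cycle):
    """CDC 6600 style: thread (cycle mod N) owns the slot; bubble if not ready."""
    t = cycle % n_threads
    return t if ready[t] else None        # None = pipeline bubble

def hw_priority(ready, priority):
    """HEP style: pick the highest-priority thread that is ready to go."""
    candidates = [t for t, r in enumerate(ready) if r]
    return max(candidates, key=lambda t: priority[t], default=None)

ready = [True, False, True, True]
assert fixed_interleave(ready, 4, cycle=1) is None       # thread 1 stalled -> bubble
assert hw_priority(ready, priority=[1, 9, 5, 2]) == 2    # best ready thread wins
```

The contrast is the point: fixed interleave wastes the slot of a stalled thread, while hardware scheduling fills every slot as long as any thread is ready.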

Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in the main CPU:
- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to the Tera MTA (Multithreaded Architecture)

Tera MTA (1990-)
- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
  - No data cache
  - Sustains one main memory access per cycle per processor
- GaAs logic in the prototype, 1 kW/processor @ 260 MHz
  - Second version CMOS, MTA-2, 50 W/processor
  - New version, XMT, fits into an AMD Opteron socket, runs at 500 MHz

MTA Pipeline
(Diagram: instruction fetch feeds an issue pool; instructions flow through write, memory, and retry pools via the interconnection network and memory pipeline.)
Every cycle, one VLIW instruction from one active thread is launched into the pipeline. The instruction pipeline is 21 cycles long; memory operations incur ~150 cycles of latency.
Assuming a single thread issues one instruction every 21 cycles, and the clock rate is 260 MHz, what is single-thread performance? The effective single-thread issue rate is 260/21 = 12.4 MIPS.
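The arithmetic above is worth a back-of-the-envelope check (my own sketch of the slide's numbers):

```python
# Single-thread issue rate on the MTA: one instruction per thread per
# pipeline depth, at the given clock rate.
clock_mhz = 260.0
pipeline_depth = 21      # cycles from issue to completion
mem_latency = 150        # approximate cycles for a memory operation

single_thread_mips = clock_mhz / pipeline_depth
assert round(single_thread_mips, 1) == 12.4

# This is why the MTA needs many threads: with ~150-cycle memory latency
# and no data cache, on the order of 100+ concurrent threads are needed
# to keep one instruction launching every cycle, and the hardware
# provides up to 128 active threads per processor.
</antml_code_placeholder>```
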

Coarse-Grain Multithreading
The Tera MTA was designed for supercomputing applications with large data sets and low locality:
- No data cache
- Many parallel threads needed to hide large memory latency
Other applications are more cache friendly:
- Few pipeline bubbles if the cache mostly hits
- Just add a few threads to hide occasional cache-miss latencies
- Swap threads on cache misses

MIT Alewife (1990)
- Modified SPARC chips: register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss

IBM PowerPC RS64-IV (2000)
Commercial coarse-grain multithreading CPU, based on PowerPC with a quad-issue, in-order, five-stage pipeline. Each physical CPU supports two virtual CPUs. On an L2 cache miss, the pipeline is flushed and execution switches to the second thread:
- The short pipeline minimizes the flush penalty (4 cycles), small compared to memory access latency
- Flushing the pipeline simplifies exception handling

Oracle/Sun Niagara processors
The target is datacenters running web servers and databases, with many concurrent requests. These provide multiple simple cores, each with multiple hardware threads, for reduced energy per operation, though with much lower single-thread performance:
- Niagara-1 [2004]: 8 cores, 4 threads/core
- Niagara-2 [2007]: 8 cores, 8 threads/core
- Niagara-3 [2009]: 16 cores, 8 threads/core
- T4 [2011]: 8 cores, 8 threads/core
- T5 [2012]: 16 cores, 8 threads/core

Oracle/Sun Niagara-3, "Rainbow Falls" (2009)
(Die photo.)

Simultaneous Multithreading (SMT) for OoO Superscalars
The techniques presented so far have all been vertical multithreading, where each pipeline stage works on one thread at a time. SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. This gives better utilization of machine resources.

For most apps, most execution units lie idle in an OoO superscalar
(Figure: execution-unit utilization for an 8-way superscalar.)
From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.

Superscalar Machine Efficiency
(Diagram: instruction issue slots over time, for a machine of issue width 4.)
A completely idle cycle is vertical waste; a partially filled cycle, i.e., IPC < 4, is horizontal waste.

Vertical Multithreading
(Diagram: a second thread interleaved cycle-by-cycle into the issue slots.)
Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste (partially filled cycles, i.e., IPC < 4).
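The vertical/horizontal waste distinction can be made concrete with a small sketch (my own model of the issue-slot diagrams, not from the paper):

```python
# Measure vertical vs. horizontal waste in an issue-slot trace:
# one entry per cycle giving how many of the ISSUE_WIDTH slots were filled.

ISSUE_WIDTH = 4

def waste(issued_per_cycle):
    """Return (vertical_waste, horizontal_waste) in empty slots."""
    vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
    horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle
                     if 0 < n < ISSUE_WIDTH)
    return vertical, horizontal

# One thread: some cycles are completely idle (vertical waste).
assert waste([3, 0, 2, 0, 4]) == (8, 3)

# Interleaving a second thread cycle-by-cycle fills the idle cycles:
# vertical waste disappears, horizontal waste remains.
assert waste([3, 2, 2, 3, 4]) == (0, 6)
```

SMT goes one step further than this: by mixing threads within a single cycle, it attacks the remaining horizontal waste as well.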

Chip Multiprocessing (CMP)
What is the effect of splitting into multiple processors?
- Reduces horizontal waste,
- leaves some vertical waste, and
- puts an upper limit on the peak throughput of each thread.

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
Interleave multiple threads into multiple issue slots with no restrictions.

O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
- Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
- Utilize the wide out-of-order superscalar processor's issue queue to find instructions to issue from multiple threads
- The OoO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize the whole machine

SMT adaptation to parallelism type
For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads. For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).

Pentium-4 Hyperthreading (2002)
First commercial SMT design (2-way SMT). Logical processors share nearly all resources of the physical processor:
- Caches, execution units, branch predictors
The die area overhead of hyperthreading is ~5%. When one logical processor is stalled, the other can make progress:
- No logical processor can use all entries in the queues when two threads are active
A processor running only one active software thread runs at approximately the same speed with or without hyperthreading. Hyperthreading was dropped on the OoO P6-based follow-ons to the Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with the Nehalem generation of machines in 2008. Intel Atom (in-order x86 core) has two-way vertical multithreading.
- "Hyperthreading" == SMT for Intel OoO cores, vertical multithreading for Intel in-order cores

IBM Power 4
Single-threaded predecessor to the Power 5. 8 execution units in the out-of-order engine, each of which may issue an instruction each cycle.

Power 4 vs. Power 5
(Diagram: pipeline comparison.) Power 5 adds 2 commits (architected register sets) and 2 fetches (PC) with 2 initial decodes.

Power 5 data flow
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Initial Performance of SMT
- Pentium 4 Extreme SMT yields a 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on the Pentium 4, each of the 26 SPEC benchmarks paired with every other (26^2 runs) gives speedups from 0.90 to 1.58; the average was 1.20
- Power 5, 8-processor server: 1.23x faster for SPECint_rate with SMT, 1.16x faster for SPECfp_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most apps gained some
  - Floating-point apps had the most cache conflicts and the least gains

Icount Choosing Policy
Fetch from the thread with the fewest instructions in flight. Why does this enhance throughput?
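The ICOUNT heuristic above is simple enough to sketch directly (a minimal model of my own, not the paper's implementation):

```python
# ICOUNT fetch policy: each cycle, fetch from the thread with the fewest
# instructions in flight. A thread with few in-flight instructions is
# draining the machine quickly; a thread with many is likely stalled and
# would only clog the shared issue queue with unissuable instructions.

def icount_pick(in_flight):
    """in_flight[t] = instructions in flight for thread t; return chosen thread."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

# Thread 1 has the least work in flight, so it gets the fetch slot.
assert icount_pick([12, 3, 7]) == 1
```

This answers the slide's question: favoring fast-moving threads keeps any single stalled thread from monopolizing the shared instruction window.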

Summary: Multithreaded Categories
(Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; slots are colored by Threads 1-5 or marked as idle.)

Acknowledgements
This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)