ILP Ends, TLP Begins. ILP Limits via an Oracle

ILP Ends TLP Begins

Today's topics:
- Explore a perfect machine: unlimited budget, to see where ILP goes. Answer: not far enough.
- Look to TLP & multi-threading for help. Everything has its issues; we'll look at some of them.
- Apology: a bit more data than usual. Try not to yawn LOUDLY.

ILP Limits via an Oracle

Suspend reality and think of a perfect machine:
- infinite number of rename registers
  » no WAR or WAW hazards
  » for a window size of n: n² − n comparisons for each register field
- perfect branch & jump prediction
  » unbounded buffer of instructions available for execution
- perfect address alias analysis
  » independent loads can be moved ahead of stores
- perfect L1 caches
  » hit in 1 cycle
- as many execution units (XUs) as will ever be needed
  » no structural stalls

Infinite cost is unrealistic, so simulate rather than build: that allows exploration of just how far we can get with ILP on a perfect machine and sequential code (a minimal sketch of the idea follows below).
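
The oracle's question can be made concrete with a toy calculation: once renaming, prediction, and alias analysis are perfect, only true (RAW) dependences remain, so the oracle's ILP is just the trace length divided by the longest RAW dependence chain. A minimal Python sketch under that assumption (the three-address trace format and register names are made up for illustration, not the study's simulator):

    # Minimal sketch: with perfect renaming, prediction, and alias analysis,
    # only true (RAW) dependences remain, so the oracle's ILP is trace length
    # divided by the longest RAW dependence chain.
    # Instructions here are hypothetical (dest, src1, src2, ...) register tuples.

    def oracle_ilp(trace):
        depth = {}          # register -> cycle in which its value is ready
        critical_path = 0
        for dest, *srcs in trace:
            ready = 1 + max((depth.get(s, 0) for s in srcs), default=0)
            depth[dest] = ready
            critical_path = max(critical_path, ready)
        return len(trace) / critical_path if critical_path else 0.0

    # A chain of dependent ops has ILP ~1; fully independent ops scale freely.
    chain = [("r1", "r0"), ("r2", "r1"), ("r3", "r2"), ("r4", "r3")]
    parallel = [("r1", "r0"), ("r2", "r0"), ("r3", "r0"), ("r4", "r0")]
    print(oracle_ilp(chain), oracle_ilp(parallel))   # -> 1.0 and 4.0

Real benchmarks land between the two extremes, which is what the data that follows explores.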

IBM Power5

- Most advanced superscalar processor to date
- 4 fetch, 6 issue
- 88 integer rename regs, 88 floating-point rename regs
- pipeline has over 200 instructions in flight
  » including 32 loads and 32 stores
- Not quite the Oracle, but on the way
  » consumes a lot of power; the target is the blade-server segment

ILP w/ Infinite Window Size

- Looks great when compared w/ today's IPC < 3
  » if you can ignore the infinite cost

Limit Window Size

- ILP shrinks rapidly as the window shrinks
- comparison cost (5-bit compares, 3 register fields per instruction; the arithmetic is spelled out in the sketch below):
  » 2K window = ~4M × 3 compares
  » 512 = 785K compares
  » 128 = 49K compares
  » 32 = 3K compares
- plus the compares happen every cycle
- conclusion: 32 is doable but watty (power-hungry); can improve by maybe 3x at great cost
  » (remember this is still a perfect machine, just w/ a limited window)

Switch to Half-Infinity

- You do the math: for the remaining data, assume a 2K window size
  » ~12M 5-bit compares every clock
  » >10x bigger than anything that's been built
- 64 issue
  » ~10x more than anything real
- Why choose this? Given the other restrictions it won't be a limit
  » can you say "easier to simulate"? I knew you could
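
A quick check of those figures, assuming the n² − n comparisons-per-register-field count from the oracle slide and three register fields (two sources plus one destination) per window entry; this is back-of-the-envelope arithmetic, not a model of real wakeup logic:

    # Back-of-the-envelope check of the per-cycle comparison counts quoted above,
    # assuming (n^2 - n) compares per register field and 3 fields per entry
    # (two sources + one destination).
    for n in (32, 128, 512, 2048):
        per_field = n * n - n
        total = 3 * per_field
        print(f"window {n:4d}: {per_field:>9,} per field, {total:>10,} total")

    # window   32:       992 per field,      2,976 total  (~3K)
    # window  128:    16,256 per field,     48,768 total  (~49K)
    # window  512:   261,632 per field,    784,896 total  (~785K)
    # window 2048: 4,192,256 per field, 12,576,768 total  (~4M x 3, i.e. ~12.6M)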

Look at Semi-Real Branch Prediction

- Tournament predictor: 8K entries; jump predictor: 2K entries
  » 48K bits and a 3% mispredict rate, which is very good, just expensive
- Standard: 512-entry 2-bit predictor (sketched below)
- Conclusion: you have to predict
  » integer codes are a problem, yet highly important in modern data-center apps
  » nobody makes much money on floating point; sad reality

Limiting Rename Registers

- Integer codes remain problematic
- the $ problem with FP remains, but ILP looks good if you don't care about $
- conclusion: need around 64 renamed registers to make much of a difference
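
For reference, the "standard" 2-bit predictor mentioned above is a table of 2-bit saturating counters indexed by branch-address bits. A minimal sketch (the 512-entry size matches the slide; the indexing and initial state are illustrative choices, not the study's exact configuration):

    # Minimal 2-bit saturating-counter branch predictor, as referenced above.
    # 512 entries indexed by low-order branch-address bits; counter values
    # 0-1 predict not-taken, 2-3 predict taken.

    class TwoBitPredictor:
        def __init__(self, entries=512):
            self.entries = entries
            self.table = [1] * entries          # start weakly not-taken

        def _index(self, pc):
            return (pc >> 2) % self.entries     # drop byte offset, then index

        def predict(self, pc):
            return self.table[self._index(pc)] >= 2    # True = predict taken

        def update(self, pc, taken):
            i = self._index(pc)
            if taken:
                self.table[i] = min(3, self.table[i] + 1)
            else:
                self.table[i] = max(0, self.table[i] - 1)

    # Usage: predict, resolve the branch, then train the counter.
    bp = TwoBitPredictor()
    for outcome in [True, True, False, True, True]:    # a mostly-taken branch
        guess = bp.predict(0x400100)
        bp.update(0x400100, outcome)

The 2-bit hysteresis keeps a loop-closing branch predicted taken across its single not-taken exit, which is the main advantage over a 1-bit scheme.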

Alias Analysis Influence

- GL/STK model: heap references conflict, but nothing else does
- Inspection model: what can the compiler do by examining the code?
- (a sketch of the underlying store/load check follows below)

Ambitious but Possible?

- HAL: 1 better than IBM in all letters
- 64 issue, no restrictions
  » this one is actually ridiculous, but it does focus on ILP limits rather than structural stalls
- 1K-entry tournament predictor
  » this has been done
- perfect disambiguation
  » note: close to possible for small window sizes; impractical for large windows
- 64-register rename pool
  » ~100 have been done
- What do we get? Al the Harpy says: a really good heater
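
"Perfect disambiguation" means a load is hoisted above earlier stores exactly when their addresses differ; real hardware must treat an unresolved store address conservatively. A minimal sketch of that check (the window representation is illustrative):

    # Minimal memory-disambiguation check: a load may hoist above the earlier
    # stores in the window only if it conflicts with none of them. With perfect
    # disambiguation every store address is known; real hardware must treat an
    # unresolved address (None here) as a conflict and wait, or speculate.

    def can_hoist_load(load_addr, earlier_store_addrs):
        for s in earlier_store_addrs:
            if s is None or s == load_addr:
                return False
        return True

    # The oracle sees resolved, non-conflicting addresses and hoists the load;
    # a real machine stalls on the unresolved store in the second case.
    print(can_hoist_load(0x1000, [0x2000, 0x3000]))   # True  -> load can move up
    print(can_hoist_load(0x1000, [0x2000, None]))     # False -> must wait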

And Voilà!

- 3-4x improvement, for a machine nobody will buy: too hot, too costly
- Interesting study & clear conclusion: ILP is already past the point of diminishing returns
  » the programmer is going to need to help out w/ exposing parallelism
  » need a different type of HW support for parallelism

Enter TLP

- Again, not a new idea: it has been around for > 10 years
  » Tullsen (UW, 1995) publishes the SMT idea
  » TERA MTA & IBM Pulsar show up in the late 90s; both are multithreaded
- Thread vs. process confusion
  » a process runs in its own virtual memory space: no shared memory, lots of OS protection & overhead, communication via message-like channels (e.g. pipes in Unix)
  » threads share memory, and therefore need synchronization
  » both are independent entities, with their own sets of registers and process state
- The TLP difference (illustrated in the sketch below)
  » multiple threads can run concurrently or interleaved on the same processor
  » processes run one at a time, with a context switch in between
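
A small illustration of the sharing difference, using Python's threading and multiprocessing modules: threads see each other's writes to an ordinary object (and therefore need a lock), while each process mutates its own copy of the address space unless an explicit channel or shared segment is set up.

    # Threads share one address space (so they must synchronize); processes
    # each get their own copy unless an explicit channel is set up.
    import threading, multiprocessing

    counter = {"value": 0}
    lock = threading.Lock()

    def bump():
        with lock:                    # threads share 'counter', so serialize the update
            counter["value"] += 1

    if __name__ == "__main__":
        threads = [threading.Thread(target=bump) for _ in range(2)]
        for t in threads: t.start()
        for t in threads: t.join()
        print("after threads:  ", counter["value"])    # 2: both updates are visible

        procs = [multiprocessing.Process(target=bump) for _ in range(2)]
        for p in procs: p.start()
        for p in procs: p.join()
        print("after processes:", counter["value"])    # still 2: children changed their own copies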

Multi-Threading

Two variants:
- fine-grained MT, e.g. TERA
  » round-robin walk through the threads: next cycle, next thread
  » TERA: 128 threads, built-in 128-cycle load-use delay
    - the basic idea was to cover main-memory latency and do away w/ caches
    - great if you can put every app into a 128-thread mold; it failed, and Burton goes to the dark side (a.k.a. Microsoft)
- coarse-grained MT, e.g. IBM Pulsar
  » sometimes called "switch on miss"
  » basic idea: anytime something bad happens (a TLB or L2 miss), switch to the next runnable thread
    - some sort of fairness policy is required; usually just round-robin suffices
  » similar goal: hide the performance effect of long stalls
(both switch policies are sketched below)

Simultaneous Multithreading

- Idea: multiple independent threads increase the number of parallel instructions to issue
- [Figure: issue slots per cycle for Threads 1-4 plus idle slots, comparing Superscalar, Fine-Grained Multithreading, and Simultaneous Multithreading]
  » superscalar: not enough ILP, and idle on a cache miss
  » FGMT: not enough ILP in any one thread
  » SMT improves, since it draws on a broader set of independent instructions
    - programmer-supplied parallelism
    - takes advantage of dynamic-issue superscalar tactics
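
A minimal sketch contrasting the two switch policies just described: fine-grained rotates to the next ready thread every cycle, while coarse-grained stays on one thread until it stalls on a long-latency event. The ready/miss model is illustrative.

    # Thread-switch policies for the two MT variants above.
    # Fine-grained: rotate to the next ready thread every cycle.
    # Coarse-grained ("switch on miss"): stay put until the running thread takes
    # a long-latency miss, then rotate.

    def fine_grained_next(current, ready):
        """ready[t] is True if thread t can issue this cycle."""
        n = len(ready)
        for step in range(1, n + 1):
            cand = (current + step) % n
            if ready[cand]:
                return cand
        return current                      # nothing ready: stall in place

    def coarse_grained_next(current, ready, missed):
        if not missed and ready[current]:
            return current                  # keep running the same thread
        return fine_grained_next(current, ready)    # round-robin gives basic fairness

    ready = [True, True, False, True]
    print(fine_grained_next(0, ready))                  # -> 1 (switch every cycle)
    print(coarse_grained_next(0, ready, missed=False))  # -> 0 (no miss: keep going)
    print(coarse_grained_next(0, ready, missed=True))   # -> 1 (L2/TLB miss: switch)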

SMT Resource Perspective

- Each thread has its own:
  » PC and next PC (the next PC is needed for exceptions)
  » private logical registers, and its own mapping to renamed physical registers
  » ROB (if shared, a stall in one thread will stall the others)
- Shared:
  » branch predictor (a larger size will be needed)
  » main memory ports, TLB, page table: an artifact of shared memory
    - more threads does increase memory pressure; the biggest problem is single-ported L1$'s
- (a data-structure view of the split follows below)

SMT Pipeline Structure

[Figure: per-thread front ends feeding one execution engine. Shared front-end: I-cache, branch predictor. Private front-end: PC, rename, ROB. Shared execution engine: registers, IQ, FUs, D-cache.]
- What about the RAS and LSQ?
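
One way to see the split is as it would appear in a simple simulator's data structures: a replicated per-thread context holding the private state, and a single shared-core object referenced by every context. This is a sketch with illustrative field names, not a description of any real design.

    # Sketch of the private/shared resource split described above, as it might
    # appear in a simple SMT simulator. Field names are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class ThreadContext:            # replicated per hardware thread
        pc: int = 0
        next_pc: int = 0            # kept separately so exceptions restart cleanly
        rename_map: dict = field(default_factory=dict)   # logical -> physical reg
        rob: list = field(default_factory=list)          # private ROB avoids cross-thread stalls

    @dataclass
    class SharedCore:               # one instance, shared by all contexts
        branch_predictor: object = None    # shared, so it must be sized up
        phys_regs: list = field(default_factory=lambda: [0] * 256)
        tlb: dict = field(default_factory=dict)
        l1d_ports: int = 1          # the single-ported L1 D$ is the biggest squeeze

    core = SharedCore()
    threads = [ThreadContext() for _ in range(4)]   # a 4-way SMT core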

SMT Issues

- Single-thread performance goes down: competition w/ other threads for resources
  » resource utilization goes up, hence throughput goes up
- Fetch who? Which thread has priority?
  » unless set by the user, the dynamic critical path can't be known in a small-window setting
  » setting LSQ and ROB partition sizes is one way of implementing a priority in the later stages; not so simple in Fetch
- ICOUNT
  » widely accepted heuristic: fetch from each thread so as to roughly equalize its share of processor resources (sketched below)
  » better methods are possible, BUT beware of creeping complexity: power and validation costs can fall off a cliff

4 Modern-ish Processors

- Pentium 4 Extreme: speculative dynamic issue, deep pipeline, 2-way SMT; fetch/issue/ex 3/3/4; 7 int + 1 FP XUs; 3.8 GHz; 125M transistors, 122 mm²; 115 W
- Athlon 64 FX-57: speculative dynamic issue; fetch/issue/ex 3/3/4; 6 int + 3 FP XUs; 2.8 GHz; 114M transistors, 115 mm²; 104 W
- 1 core of Power5: speculative dynamic issue, SMT; fetch/issue/ex 8/4/8; 6 int + 2 FP XUs; 1.9 GHz; 200M transistors, 300 mm²; 80 W
- Itanium 2: EPIC, mostly static scheduling; fetch/issue/ex 6/5/11; 9 int + 2 FP XUs; 1.6 GHz; 592M transistors, 423 mm²; 130 W

Notes: Power5 is dual-core; the area, transistor count, and power are estimated for a single core, and the large die size is due to the 9 MB on-chip L3 cache.
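
A minimal sketch of the ICOUNT heuristic named above: each cycle, give the fetch slot to the thread with the fewest instructions currently sitting in the decode/rename/issue stages, which automatically throttles threads that are hogging the shared front end. The occupancy counts are illustrative.

    # Minimal ICOUNT fetch heuristic: fetch from the thread with the fewest
    # instructions occupying the front end and issue queue this cycle, so no
    # single thread can hog the shared resources.

    def icount_pick(occupancy):
        """occupancy[t] = instructions thread t has in the front end and IQ."""
        return min(range(len(occupancy)), key=lambda t: occupancy[t])

    # Thread 2 is the least represented, so it gets the fetch slot this cycle.
    print(icount_pick([12, 9, 4, 15]))    # -> 2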

SPECint2000

[Figure: SPECint2000 results (chart not reproduced in the transcription).]

SPECfp2000

[Figure: SPECfp2000 results (chart not reproduced in the transcription).]

Efficiency

- Note: as we move into multi-core, perf/watt becomes the critical efficiency metric
- Even better: energy-delay product, since that is architecture- and workload-specific
- (a worked example of both metrics follows below)

Concluding Remarks

- SMT is a big boost to the ILP game
  » it reuses previous skills in dynamic superscalar architecture
- rules of thumb
  » double the threads: ~1.6x performance gain for a ~10-15% power increase
  » where does the curve saturate? depends on the workload; ~4 threads/core seems to be a sweet spot; time will tell
  » Sun Niagara: 8 cores, 8 threads/core; simpler cores, however, and it performs well in the data center
- Next: leave the processor side and examine the memory side
  » both need to be balanced and done right to win
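
A small worked example of the two metrics and the rule of thumb above, using hypothetical baseline numbers (1 unit of work in 100 s at 100 W): perf/watt is throughput divided by power, energy-delay product is energy times execution time (lower is better), and a 1.6x speedup for ~15% more power improves both.

    # Worked example of the efficiency metrics above, using hypothetical
    # baseline numbers and the slide's rule of thumb: doubling threads gives
    # ~1.6x performance for ~15% more power.

    def metrics(work, seconds, watts):
        perf = work / seconds
        energy = watts * seconds                 # joules
        edp = energy * seconds                   # energy-delay product (lower is better)
        return perf / watts, edp

    base_pw, base_edp = metrics(1.0, 100.0, 100.0)
    smt_pw,  smt_edp  = metrics(1.0, 100.0 / 1.6, 100.0 * 1.15)

    print(f"perf/watt improves {smt_pw / base_pw:.2f}x")    # ~1.39x
    print(f"EDP improves       {base_edp / smt_edp:.2f}x")  # ~2.23x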