Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)


Problem 1

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction. (Ad. Op / St. Op: cycle when the address / store-data operand becomes ready; Ad. Val: the effective address; Ad. Cal / Mem. Acc: cycles when the address calculation and the memory access happen.)

                  Ad. Op   St. Op   Ad. Val   Ad. Cal   Mem. Acc
  LD  R1  [R2]      3                abcd
  LD  R3  [R4]      6                adde
  ST  R5  [R6]      4        7       abba
  LD  R7  [R8]      2                abce
  ST  R9  [R10]     8        3       abba
  LD  R11 [R12]     1                abba

Problem 1 (solution)

                  Ad. Op   St. Op   Ad. Val   Ad. Cal   Mem. Acc
  LD  R1  [R2]      3                abcd        4          5
  LD  R3  [R4]      6                adde        7          8
  ST  R5  [R6]      4        7       abba        5        commit
  LD  R7  [R8]      2                abce        3          6
  ST  R9  [R10]     8        3       abba        9        commit
  LD  R11 [R12]     1                abba        2         10

Problem 2

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction.

                  Ad. Op   St. Op   Ad. Val   Ad. Cal   Mem. Acc
  LD  R1  [R2]      3                abcd
  LD  R3  [R4]      6                adde
  ST  R5  [R6]      5        7       abba
  LD  R7  [R8]      2                abce
  ST  R9  [R10]     1        4       abba
  LD  R11 [R12]     2                abba

Problem 2 (solution)

                  Ad. Op   St. Op   Ad. Val   Ad. Cal   Mem. Acc
  LD  R1  [R2]      3                abcd        4          5
  LD  R3  [R4]      6                adde        7          8
  ST  R5  [R6]      5        7       abba        6        commit
  LD  R7  [R8]      2                abce        3          7
  ST  R9  [R10]     1        4       abba        2        commit
  LD  R11 [R12]     2                abba        3          5

Problem 3

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume memory dependence prediction.

                  Ad. Op   St. Op   Ad. Val   Ad. Cal   Mem. Acc
  LD  R1  [R2]      3                abcd
  LD  R3  [R4]      6                adde
  ST  R5  [R6]      4        7       abba
  LD  R7  [R8]      2                abce
  ST  R9  [R10]     8        3       abba
  LD  R11 [R12]     1                abba

Problem 3 (solution)

                  Ad. Op   St. Op   Ad. Val   Ad. Cal   Mem. Acc
  LD  R1  [R2]      3                abcd        4          5
  LD  R3  [R4]      6                adde        7          8
  ST  R5  [R6]      4        7       abba        5        commit
  LD  R7  [R8]      2                abce        3          4
  ST  R9  [R10]     8        3       abba        9        commit
  LD  R11 [R12]     1                abba        2         3/10

With dependence prediction, each load accesses memory right after its own address calculation. The last load reads memory speculatively in cycle 3; when the second store computes its matching address (abba) in cycle 9, the load is squashed and re-executes in cycle 10 (hence the 3/10 entry).
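The no-prediction answers follow a small set of timing rules, captured in the minimal C sketch below (not from the lecture; the rules are an assumption that reproduces the tables above). Assumed rules: address calculation finishes one cycle after the address operand is ready; a store writes the cache at commit; a load walks the earlier stores from youngest to oldest, waiting for each one's address calculation, and either forwards from the closest store with a matching address (also waiting for that store's data) or, if none match, reads the cache; the access happens one cycle after everything it waited on. Run on Problem 1's LSQ it prints the load timings above (the same rules also reproduce Problem 2). With prediction (Problem 3), a load would instead access one cycle after its own address calculation and replay on a mispredicted conflict.

#include <stdio.h>
#include <string.h>

typedef struct {
    int is_store;        /* 1 for ST, 0 for LD                        */
    int addr_op_ready;   /* cycle the address operand becomes ready   */
    int data_op_ready;   /* stores only: cycle the data becomes ready */
    const char *addr;    /* effective address, e.g. "abba"            */
} LsqEntry;

static int imax(int a, int b) { return a > b ? a : b; }

int main(void) {
    LsqEntry lsq[] = {   /* Problem 1's LSQ, oldest entry first */
        {0, 3, 0, "abcd"}, {0, 6, 0, "adde"}, {1, 4, 7, "abba"},
        {0, 2, 0, "abce"}, {1, 8, 3, "abba"}, {0, 1, 0, "abba"},
    };
    int n = sizeof lsq / sizeof lsq[0];

    for (int i = 0; i < n; i++) {
        int ad_cal = lsq[i].addr_op_ready + 1;            /* Ad. Cal column */
        if (lsq[i].is_store) {
            printf("ST %-4s  Ad.Cal=%d  Mem.Acc=commit\n", lsq[i].addr, ad_cal);
            continue;
        }
        /* Walk earlier stores youngest-first: wait for each address calc;
         * the first (closest) matching store forwards its data, and any
         * store older than it no longer matters. */
        int ready = ad_cal;
        for (int j = i - 1; j >= 0; j--) {
            if (!lsq[j].is_store) continue;
            ready = imax(ready, lsq[j].addr_op_ready + 1);     /* its Ad.Cal */
            if (strcmp(lsq[j].addr, lsq[i].addr) == 0) {       /* forwarding */
                ready = imax(ready, lsq[j].data_op_ready);
                break;
            }
        }
        printf("LD %-4s  Ad.Cal=%d  Mem.Acc=%d\n", lsq[i].addr, ad_cal, ready + 1);
    }
    return 0;
}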

Thread-Level Parallelism

Motivation: a single thread leaves a processor under-utilized most of the time; by doubling processor area, single-thread performance barely improves.

Strategies for thread-level parallelism:

  Multiple threads share the same large processor: reduces under-utilization, efficient resource allocation -- Simultaneous Multi-Threading (SMT)

  Each thread executes on its own mini processor: simple design, low interference between threads -- Chip Multi-Processing (CMP) or multi-core

How are Resources Shared?

[Figure: each box represents an issue slot for a functional unit; peak throughput is 4 IPC. Columns of cycles are shown for a superscalar processor, fine-grained multithreading, and simultaneous multithreading, with slots filled by Threads 1-4 or left idle.]

A superscalar processor has high under-utilization: there is not enough work every cycle, especially when there is a cache miss.

Fine-grained multithreading can only issue instructions from a single thread in a cycle: it cannot fill every slot every cycle, but cache misses can be tolerated.

Simultaneous multithreading can issue instructions from any thread every cycle, so it has the highest probability of finding work for every issue slot.

What Resources are Shared?

Multiple threads are simultaneously active (in other words, a new thread can start without a context switch).

For correctness, each thread needs its own PC, IFQ, and logical registers (and its own mappings from logical to physical registers).

For performance, each thread could have its own ROB/LSQ (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing means better utilization of resources.

Each additional thread costs a PC, IFQ, rename table, and ROB -- cheap!

Pipeline Structure

[Figure: each thread has a private front end (I-cache, branch predictor, rename, ROB), and all threads feed one shared execution engine (issue queue, physical registers, FUs, D-cache).]

Resource Sharing

Thread-1 code:                     Thread-2 code:
  R1 <- R1 + R2                      R2 <- R1 + R2
  R3 <- R1 + R4                      R5 <- R1 + R2
  R5 <- R1 + R3                      R3 <- R5 + R3

After instruction fetch and rename (into one shared physical register file):

  Thread-1:                          Thread-2:
  P65 <- P1 + P2                     P76 <- P33 + P34
  P66 <- P65 + P4                    P77 <- P33 + P76
  P67 <- P65 + P66                   P78 <- P77 + P35

All six renamed instructions then sit together in the shared issue queue and compete for the FUs.
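As a small illustration (not from the lecture), the sketch below reproduces Thread-1's renaming, assuming initial mappings R1->P1, R2->P2, R4->P4 and a free list that happens to hand out P65, P66, P67 next. Thread-2 renames the same way through its own map table (P33/P34/P35 -> P76..P78 on the slide); only the map tables are per-thread, while the physical register file is shared.

#include <stdio.h>

#define NUM_LOGICAL 32

static int map_table[NUM_LOGICAL];   /* logical -> physical, one table per thread */
static int next_free = 65;           /* next physical register on the free list   */

/* Rename "rd <- rs1 + rs2": read the sources' current mappings, then
 * allocate a fresh physical register for the destination. */
static void rename_add(int rd, int rs1, int rs2) {
    int p1 = map_table[rs1], p2 = map_table[rs2];
    int pd = next_free++;
    printf("P%d <- P%d + P%d\n", pd, p1, p2);
    map_table[rd] = pd;               /* later readers of rd now see pd */
}

int main(void) {
    map_table[1] = 1; map_table[2] = 2; map_table[4] = 4;  /* assumed initial map */
    rename_add(1, 1, 2);   /* R1 <- R1 + R2  =>  P65 <- P1 + P2   */
    rename_add(3, 1, 4);   /* R3 <- R1 + R4  =>  P66 <- P65 + P4  */
    rename_add(5, 1, 3);   /* R5 <- R1 + R3  =>  P67 <- P65 + P66 */
    return 0;
}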

Performance Implications of SMT

Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by trying to prioritize one thread.

While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT) fetches such that each thread has an equal share of processor resources.

With eight threads in a processor with many resources, SMT yields throughput improvements of a factor of roughly 2-4.

Pentium 4 Hyper-Threading

Two threads: the Linux operating system operates as if it is executing on a two-processor system.

When there is only one available thread, it behaves like a regular single-threaded superscalar processor.

Statically divided resources: ROB, LSQ, issue queue -- a slow thread will not cripple throughput (though static division might not scale).

Dynamically shared: trace cache and decode (fine-grained multi-threaded, round-robin), FUs, data cache, branch predictor.

Multi-Programmed Speedup

sixtrack and eon do not degrade their partners (small working sets?), while swim and art degrade their partners (cache contention?).

Best combination: swim & sixtrack. Worst combination: swim & art.

Static partitioning ensures low interference -- the worst slowdown is 0.9.

The Cache Hierarchy

[Figure: the memory hierarchy -- core, L1, L2, L3, and off-chip memory.]

Problem 1: Memory Access Time

Assume a program that has cache access times of 1 cycle (L1), 10 cycles (L2), 30 cycles (L3), and 300 cycles (memory), and MPKIs (misses per thousand instructions) of 20 (L1), 10 (L2), and 5 (L3). Should you get rid of the L3?

Problem 1: Memory Access Time (solution)

Access cycles per 1000 instructions:

  With L3:    1000 x 1 + 20 x 10 + 10 x 30 + 5 x 300 = 3000
  Without L3: 1000 x 1 + 20 x 10 + 10 x 300          = 4200

Keep the L3: it saves 1200 cycles per thousand instructions.
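A quick check of this arithmetic as a few hypothetical lines of C (the 1000 x 1 term assumes every one of the 1000 instructions pays the 1-cycle L1 access, and each MPKI value pays the latency of the next level down):

#include <stdio.h>

int main(void) {
    /* latencies (cycles) and misses per 1000 instructions from the problem */
    int l1 = 1, l2 = 10, l3 = 30, mem = 300;
    int mpki_l1 = 20, mpki_l2 = 10, mpki_l3 = 5;

    int with_l3    = 1000 * l1 + mpki_l1 * l2 + mpki_l2 * l3 + mpki_l3 * mem;
    int without_l3 = 1000 * l1 + mpki_l1 * l2 + mpki_l2 * mem;

    printf("with L3: %d cycles, without L3: %d cycles\n", with_l3, without_l3);
    return 0;   /* prints 3000 and 4200 -- keeping the L3 wins */
}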

Accessing the Cache

[Figure: a byte address such as 101000 indexing a direct-mapped data array. With 8-byte words (3 offset bits) and 8 sets (3 index bits), the index bits select one set; each address maps to a unique location in the cache.]

The Tag Array

[Figure: the same direct-mapped cache (byte address 101000) with a tag array alongside the data array. The high-order bits of the address are stored as the tag; on an access, the stored tag is compared against the address's tag bits to detect a hit.]
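As a small illustration (not from the slides), here is how the byte address above splits into offset, index, and tag, assuming the configuration of the previous slide: 8-byte blocks (3 offset bits) and 8 sets (3 index bits).

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 3   /* 8-byte blocks */
#define INDEX_BITS  3   /* 8 sets        */

int main(void) {
    uint32_t addr   = 0x28;   /* binary 101000, the address on the slide */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                 /* 000 -> 0 */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* 101 -> 5 */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* 0 here   */
    printf("offset=%u index=%u tag=%u\n", offset, index, tag);
    return 0;
}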

Increasing Line Size

[Figure: byte address 10100000 split into tag, index, and offset for a 32-byte cache line (block) size.]

A large cache line size means a smaller tag array and fewer misses because of spatial locality.

Associativity

[Figure: byte address 10100000 accessing a 2-way set-associative cache (Way-1 and Way-2); the tags of both ways are read and compared in parallel.]

Set associativity means fewer conflicts, but wasted power, because multiple data and tag arrays are read on every access.

Problem 2

Assume a direct-mapped cache with just 4 sets. Assume that block A maps to set 0, B to 1, C to 2, D to 3, E to 0, and so on. For the following access pattern, estimate the hits and misses:

A B B E C C A D B F A E G C G A

Problem 2 (solution)

A B B E C C A D B F A E G C G A
M M H M M H M M H M H M M M M M

Problem 3

Assume a 2-way set-associative cache with just 2 sets. Assume that block A maps to set 0, B to 1, C to 0, D to 1, E to 0, and so on. For the following access pattern, estimate the hits and misses:

A B B E C C A D B F A E G C G A

Problem 3 (solution)

A B B E C C A D B F A E G C G A
M M H M M H M M H M H M M M H M
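A minimal sketch (not from the lecture) of the simulation behind Problems 2 and 3, assuming LRU replacement (the answers above are consistent with it). Blocks A..G map to set (block index) mod (number of sets); sets=4, ways=1 gives the direct-mapped result and sets=2, ways=2 the 2-way result.

#include <stdio.h>
#include <string.h>

#define MAX_WAYS 2

static void simulate(int sets, int ways, const char *pattern) {
    char cache[4][MAX_WAYS];                 /* cache[set][way], MRU first */
    memset(cache, 0, sizeof cache);
    for (const char *p = pattern; *p; p++) {
        int set = (*p - 'A') % sets;
        int hit_way = -1;
        for (int w = 0; w < ways; w++)
            if (cache[set][w] == *p) hit_way = w;
        /* On a hit, promote the block to MRU; on a miss, shift everything
         * down (evicting the LRU block in the last way) and insert at MRU. */
        int start = (hit_way >= 0) ? hit_way : ways - 1;
        for (int w = start; w > 0; w--)
            cache[set][w] = cache[set][w - 1];
        cache[set][0] = *p;
        printf("%c:%s ", *p, hit_way >= 0 ? "H" : "M");
    }
    printf("\n");
}

int main(void) {
    const char *pattern = "ABBECCADBFAEGCGA";
    simulate(4, 1, pattern);   /* Problem 2: direct-mapped, 4 sets */
    simulate(2, 2, pattern);   /* Problem 3: 2-way, 2 sets         */
    return 0;
}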

Problem 4

A 64 KB 16-way set-associative data cache array with a 64-byte line size; assume a 40-bit address.

  How many sets?
  How many index bits, offset bits, tag bits?
  How large is the tag array?

Problem 4 (solution)

A 64 KB 16-way set-associative data cache array with a 64-byte line size; assume a 40-bit address.

  How many sets?  64 KB / 64 B = 1024 blocks; 1024 / 16 ways = 64 sets
  How many index bits (6), offset bits (6), tag bits (40 - 6 - 6 = 28)?
  How large is the tag array?  28 bits x 1024 blocks = 28 Kb (3.5 KB)
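The same arithmetic as a few hypothetical lines of C (counting tag bits only, ignoring valid/dirty bits, as the slide does):

#include <stdio.h>

/* integer log2 for power-of-two inputs */
static int log2i(long x) { int b = 0; while (x > 1) { x >>= 1; b++; } return b; }

int main(void) {
    long cache_bytes = 64 * 1024, line_bytes = 64, ways = 16, addr_bits = 40;

    long blocks = cache_bytes / line_bytes;      /* 1024 blocks             */
    long sets   = blocks / ways;                 /* 64 sets                 */
    int offset_bits = log2i(line_bytes);         /* 6                       */
    int index_bits  = log2i(sets);               /* 6                       */
    long tag_bits   = addr_bits - offset_bits - index_bits;   /* 28         */
    long tag_array  = tag_bits * blocks;         /* 28 Kb, i.e. 3.5 KB      */

    printf("sets=%ld offset=%d index=%d tag=%ld bits, tag array=%ld bits\n",
           sets, offset_bits, index_bits, tag_bits, tag_array);
    return 0;
}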
