Computer Organization CS 231-01: Improving Performance

Computer Organization CS 231-01: Improving Performance
Dr. William H. Robinson
November 8, 2004
http://eecs.vanderbilt.edu/courses/cs231/

"Money's only important when you don't have any." - Sting

Topics
- Cache
- Scoreboarding

Opportunity for (Easy) Points
- Prepare for Quiz #4

Three Generic Data Hazards
Read After Write (RAW): instruction J tries to read an operand before instruction I writes it.
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a (true) dependence, in compiler nomenclature. This hazard results from an actual need for communication between the instructions.

Three Generic Data Hazards (cont.)
Write After Read (WAR): instruction J writes an operand before instruction I reads it.
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an anti-dependence by compiler writers; it results from reuse of the name r1. It can't happen in the MIPS 5-stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.

Write After Write (WAW): instruction J writes an operand before instruction I writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an output dependence by compiler writers; this also results from reuse of the name r1. It can't happen in the MIPS 5-stage pipeline because all instructions take 5 stages and writes are always in stage 5. We will see WAR and WAW in more complicated pipelines.

What is a Cache?
Small, fast storage used to improve the average access time to slow memory. It exploits spatial and temporal locality. In computer architecture, almost everything is a cache:
- Registers: a cache on variables
- First-level cache: a cache on the second-level cache
- Second-level cache: a cache on memory
- Memory: a cache on disk (virtual memory)
Moving down the hierarchy (Proc/Regs, L1 cache, L2 cache, Memory, Disk/Tape, etc.), each level is bigger; moving up, each level is faster.

The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time. There are two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
For the last 15 years, hardware has relied on locality for speed.
(Adapted from David Patterson's CS 252 lecture notes. Copyright 2001 UCB.)
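The three hazard definitions above can be sketched as a small classifier. This is a hypothetical illustration, not from the lecture: each instruction is modeled as a destination register plus a set of source registers.

```python
# Hypothetical sketch: classifying the three generic data hazards between
# an earlier instruction I and a later instruction J.
# Each instruction is modeled as (destination register, source registers).

def classify_hazards(instr_i, instr_j):
    """Return the set of hazards J has with the earlier instruction I."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = set()
    if dest_i in srcs_j:    # J reads what I writes -> true dependence
        hazards.add("RAW")
    if dest_j in srcs_i:    # J writes what I reads -> anti-dependence
        hazards.add("WAR")
    if dest_i == dest_j:    # J writes what I writes -> output dependence
        hazards.add("WAW")
    return hazards

# The RAW example from the slides: I: add r1,r2,r3   J: sub r4,r1,r3
print(classify_hazards(("r1", {"r2", "r3"}), ("r4", {"r1", "r3"})))  # {'RAW'}
```

Note that the classifier only names the dependence; whether it becomes an actual hazard depends on the pipeline, as the MIPS 5-stage discussion above explains.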

Four Questions for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (Block placement) Fully associative, set associative, or direct mapped.
- Q2: How is a block found if it is in the upper level? (Block identification) Tag/block.
- Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU.
- Q4: What happens on a write? (Write strategy) Write back or write through (with a write buffer).

Types of Caches
- Direct-mapped: a cache line has only one possible location; simple to implement.
- Set-associative: a cache line has a set of possible locations (e.g., 4-way set-associative).
- Fully-associative: a cache line can be placed anywhere; costly in hardware to check the entire cache for a hit.

Block Identification
The block address divides into fields:
  | Cache Tag | Cache Index | Block Offset |
Tags are generally checked in parallel for speed. The block offset is not needed for the comparison, and checking the index is redundant (it already selected the set). A valid bit is also needed.

Replacement Policy
- Random: spreads allocation uniformly; implemented with a pseudorandom number generator.
- Least Recently Used (LRU): the block not accessed for the longest time is replaced; complex to implement.
- First In First Out (FIFO): replaces the oldest block; approximates LRU.
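The tag/index/offset split can be shown concretely. This is a minimal sketch with assumed parameters (32-byte blocks, 256 sets, i.e. a direct-mapped 8 KB cache); none of these numbers come from the slides.

```python
# Hypothetical sketch: splitting an address into tag / index / offset
# for a direct-mapped cache. Parameters are assumptions for illustration.

BLOCK_BYTES = 32   # -> 5 block-offset bits
NUM_SETS = 256     # -> 8 index bits

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # log2 of a power of two
INDEX_BITS = NUM_SETS.bit_length() - 1

def split_address(addr):
    offset = addr & (BLOCK_BYTES - 1)             # byte within the block
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)  # selects the set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)      # compared for a hit
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))  # 0x91a2 0xb3 0x18
```

Only the tag is stored and compared; the index is implied by which set was read, which is why checking it would be redundant.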

Write Policy: Write-Through vs. Write-Back
- Write-through: all writes update both the cache and the underlying memory/cache. Cached data can always be discarded, since the most up-to-date data is in memory. Cache control bit: only a valid bit.
- Write-back: writes update only the cache. Cached data can't simply be discarded; it may have to be written back to memory. Cache control bits: both valid and dirty bits.

Other advantages:
- Write-through: memory (or other processors) always has the latest data; simpler cache management.
- Write-back: much lower bandwidth, since data is often overwritten multiple times; better tolerance of long-latency memory.

Write Policy 2: Write Allocate vs. No-Write Allocate
What happens on a write miss?
- Write allocate: allocate a new cache line in the cache. This usually means doing a read miss to fill in the rest of the cache line.
- No-write allocate (or "write-around"): simply send the write data through to the underlying memory/cache; don't allocate a new cache line.

Terminology
- Hit: the memory location is found in the cache.
- Miss: the memory location is not in the cache.
Misses fall into three categories:
- Compulsory: the cache is empty at the start of a program.
- Conflict: another valid cache line is currently stored in the location.
- Capacity: the cache isn't large enough to hold the entire working set of the program.
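The bandwidth advantage of write-back can be made concrete. A hypothetical sketch (not from the slides): count writes reaching the next memory level for one block that stays resident until a final eviction.

```python
# Hypothetical sketch: writes to the next memory level for a block that is
# stored to num_writes times and then evicted. Assumes the block stays
# resident in the cache the whole time.

def memory_writes(num_writes, policy):
    if policy == "write-through":
        return num_writes                 # every store also goes to memory
    elif policy == "write-back":
        # stores only set the dirty bit; one write-back at eviction
        return 1 if num_writes > 0 else 0
    raise ValueError(policy)

for policy in ("write-through", "write-back"):
    print(policy, memory_writes(100, policy))  # 100 vs. 1
```

This is why write-back needs the dirty bit: without it, the single eviction write-back could not be skipped for clean blocks.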

3Cs Absolute Miss Rate (SPEC92)
[Chart: absolute miss rate (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with each miss rate broken down into conflict, capacity, and compulsory components. The compulsory component is increasingly small as cache size grows.]

2:1 Cache Rule
The miss rate of a 1-way (direct-mapped) associative cache of size X roughly equals the miss rate of a 2-way set-associative cache of size X/2.

Modern CPU Design
In-Order Execution
- Issue instructions in program order.
- Retire (complete) instructions in program order.
- Pipelined: instructions execute in stages.
- Superscalar: contains multiple functional units.
In-order execution may not give optimal performance because of instruction dependencies, created whenever there is a dependence between instructions and pipelining overlap changes the order of access: RAW, WAW, WAR.
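The 3Cs breakdown in the chart can be reproduced by simulation. A hypothetical sketch using the standard classification method (an assumption beyond these slides): a miss is compulsory if the block was never referenced before, capacity if a fully-associative LRU cache of the same total size would also miss, and conflict otherwise. The 4-block cache size is arbitrary.

```python
# Hypothetical sketch of the standard 3Cs classification, run against a
# direct-mapped cache with a same-size fully-associative LRU cache in
# parallel. Cache size is an assumed illustration parameter.
from collections import OrderedDict

NUM_BLOCKS = 4   # total blocks in each cache (assumed)

def classify_misses(block_trace):
    dm = {}              # set index -> resident block (direct-mapped)
    fa = OrderedDict()   # fully-associative LRU cache of the same size
    seen = set()
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for blk in block_trace:
        # fully-associative reference, maintained in parallel
        fa_hit = blk in fa
        if fa_hit:
            fa.move_to_end(blk)            # mark most recently used
        else:
            if len(fa) == NUM_BLOCKS:
                fa.popitem(last=False)     # evict the LRU block
            fa[blk] = True
        # direct-mapped reference and classification
        idx = blk % NUM_BLOCKS
        if dm.get(idx) == blk:
            counts["hit"] += 1
        elif blk not in seen:
            counts["compulsory"] += 1      # first-ever reference
        elif not fa_hit:
            counts["capacity"] += 1        # even full associativity misses
        else:
            counts["conflict"] += 1        # placement restriction alone
        dm[idx] = blk
        seen.add(blk)
    return counts

# Blocks 0 and 4 map to the same direct-mapped set and evict each other.
print(classify_misses([0, 4, 0, 4]))
```

In the example trace, the first two misses are compulsory and the last two are conflict misses: a fully-associative cache of the same size would have held both blocks.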

Data Hazards
- Read After Write (RAW): a true data dependence (the most common type). Program order must be preserved to ensure correct execution.
- Write After Read (WAR): an anti-dependence. Can occur when instructions are reordered.
- Write After Write (WAW): an output dependence. Only present in pipelines that write in more than one pipe stage or that allow an instruction to proceed after a previous instruction stalls.

Example: Pipelined, Superscalar CPU
- 2-way superscalar: can issue up to 2 instructions per cycle.
- For instructions decoded in cycle n: execution starts in cycle n + 1; ADD/SUB completes in cycle n + 2; MUL completes in cycle n + 3.
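The timing rules for this example machine can be sketched directly. A minimal sketch, assuming a dependence-free instruction stream (with dependencies, issue would stall as described on the scoreboarding slides):

```python
# Hypothetical sketch of the 2-way superscalar timing rules above:
# up to 2 instructions decoded per cycle, in order; execution starts the
# cycle after decode; ADD/SUB finish 2 cycles after decode, MUL 3.

COMPLETE_AFTER_DECODE = {"ADD": 2, "SUB": 2, "MUL": 3}

def completion_cycles(ops, first_decode_cycle=1):
    """Completion cycle of each op, assuming no data dependencies."""
    result = []
    for i, op in enumerate(ops):
        decode = first_decode_cycle + i // 2   # 2 instructions per cycle
        result.append(decode + COMPLETE_AFTER_DECODE[op])
    return result

print(completion_cycles(["ADD", "MUL", "SUB", "ADD"]))  # [3, 4, 4, 4]
```

Note that the ADD issued in cycle 2 finishes in the same cycle as the MUL issued in cycle 1, which is how out-of-order completion can arise even with in-order issue.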

Scoreboarding: In-Order Issue/Completion Scoreboarding: In-Order Issue/Completion Instruction 4 has a true dependence (RAW) from Instruction 2 Stall until R4 is available 25 Instruction 2 actually completes during cycle 3 Must retire instructions in order 26 Summary The average memory access time (AMAT) of a cache can be improved by reducing miss rate Scoreboarding is a technique to monitor the dynamic scheduling of instructions In-order execution and in-order completion can limit performance 27