Morgan Kaufmann Publishers, 26 February 2018. COMPUTER ORGANIZATION AND DESIGN: The Hardware/Software Interface, 5th Edition. Chapter 5: Set-Associative Cache Architecture


Set-Associative Cache Architecture

Performance Summary

When CPU performance increases, the miss penalty becomes more significant, and a greater proportion of time is spent on memory stalls. Increasing the clock rate likewise makes memory stalls account for more CPU cycles. Cache behavior can't be neglected when evaluating system performance.

Chapter 5, Large and Fast: Exploiting Memory Hierarchy

Review: Reducing Cache Miss Rates #1: Allow more flexible block placement

In a direct-mapped cache, a memory block maps to exactly one cache block. At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache. A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative). A memory block maps to a unique set, specified by the index field, and can be placed anywhere in that set.

Associative Caches

Fully associative cache: a given block can go in any cache entry. On a read, all entries must be searched in parallel, requiring one comparator per entry (expensive). n-way set associative: each set contains n entries, and the block number determines which set the requested item is located in. All n entries in that set are searched at once, requiring only n comparators (less expensive).
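The tag/index/offset arithmetic described above can be sketched in a few lines. This is a minimal illustration, not from the slides; the geometry (256 sets, 16-byte blocks) is an assumed example:

```python
def split_address(addr, num_sets=256, block_bytes=16):
    """Split a byte address into (tag, set index, block offset).

    num_sets and block_bytes are illustrative assumptions.
    """
    offset = addr % block_bytes            # selects the byte within the block
    block_number = addr // block_bytes
    set_index = block_number % num_sets    # selects the set (the index field)
    tag = block_number // num_sets         # compared against the stored tags
    return tag, set_index, offset

# 0x12345678 -> tag 0x12345, set 0x67, offset 0x8
print(split_address(0x12345678))
```

In an n-way cache, a lookup then compares the stored tags of all n ways in set `set_index` against `tag` in parallel.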

Spectrum of Associativity

For a cache with eight entries: [Figure: the same eight-block cache organized as direct mapped (eight one-way sets), two-way (four sets), four-way (two sets), and fully associative (one eight-way set).]

Four-Way Set Associative Cache

With 2^8 = 256 sets, each with four ways (each way holding one block): [Figure: the 32-bit address splits into a 22-bit tag (bits 31-10), an 8-bit index (bits 9-2) that selects the set, and a 2-bit byte offset; the four ways' tags are compared in parallel, and a 4-to-1 multiplexor selects the hit way's data.]

Another Direct-Mapped Cache Example

Consider the main-memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid):

0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss; after each reference the cache holds only Mem(0) or only Mem(4).

8 requests, 8 misses. This ping-pong effect is due to conflict misses: two memory locations that map into the same cache block keep evicting each other.

Set Associative Cache Example

[Figure: a two-way cache with two sets, next to a 16-word main memory; each word is 4 bytes, and the two low-order address bits select the byte in the word.]

Q1: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell whether the memory block is in the cache.

Q2: How do we find it? Use the next low-order memory address bit to determine which cache set the block maps to (i.e., the block number modulo the number of sets in the cache).
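The ping-pong behavior above can be replayed with a few lines of simulation. A sketch, not from the slides, assuming a direct-mapped cache of four one-word blocks so that word addresses 0 and 4 collide on index 0:

```python
def count_misses_direct(refs, num_blocks=4):
    """Direct-mapped cache of one-word blocks; returns the miss count."""
    tags = [None] * num_blocks             # None models an invalid block
    misses = 0
    for word in refs:
        idx = word % num_blocks            # the cache block the word maps to
        if tags[idx] != word:              # invalid or tag mismatch: miss
            misses += 1
            tags[idx] = word               # replace the resident block
    return misses

print(count_misses_direct([0, 4, 0, 4, 0, 4, 0, 4]))  # 8 requests, 8 misses
```

A string with no conflicts, such as 0 1 2 3 repeated, would miss only on the first pass through the four blocks.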

2-Way Set Associative Memory

Consider the same main-memory word reference string, 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid):

0 miss, 4 miss, 0 hit, 4 hit, 0 hit, 4 hit, 0 hit, 4 hit; the set now holds both Mem(0) and Mem(4).

8 requests, 2 misses. This solves the ping-pong effect caused by conflict misses in a direct-mapped cache, since two memory locations that map into the same cache set can now co-exist.

Range of Set Associative Caches

For a fixed-size cache, each factor-of-two increase in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets; it decreases the size of the index by one bit and increases the size of the tag by one bit.

Address layout: Tag (used for tag compare) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset.

Decreasing associativity ends at direct mapped (only one way): smaller tags, only a single comparator. Increasing associativity ends at fully associative (only one set): the tag is all the address bits except the block and byte offsets.
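The same reference string can be run against a simulated set-associative cache. A sketch under the same assumptions (one-word blocks, two sets here, LRU replacement within a set):

```python
def count_misses_set_assoc(refs, num_sets=2, ways=2):
    """Set-associative cache of one-word blocks with per-set LRU replacement."""
    sets = [[] for _ in range(num_sets)]   # each list is LRU-ordered, front = LRU
    misses = 0
    for word in refs:
        s = sets[word % num_sets]          # the set this word maps to
        if word in s:
            s.remove(word)                 # hit: re-insert at the MRU end
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least-recently used way
        s.append(word)
    return misses

print(count_misses_set_assoc([0, 4, 0, 4, 0, 4, 0, 4]))  # 8 requests, 2 misses
```

Three conflicting blocks still ping-pong in a two-way set: the string 0 8 16 0 8 16 misses every time, which is why further associativity keeps helping, though with diminishing returns.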

How Much Associativity Is Right?

Increased associativity decreases the miss rate, but with diminishing returns. Simulation of a system with a 64KB D-cache and 16-word blocks running SPEC2000:

1-way: 10.3%
2-way: 8.6%
4-way: 8.3%
8-way: 8.1%

Costs of Set Associative Caches

An N-way set associative cache costs N comparators (delay and area) plus a MUX delay for set selection before the data is available; the data arrives only after set selection and the hit/miss decision. In a direct-mapped cache, by contrast, the cache block is available before the hit/miss decision, so the processor can assume a hit and continue, recovering later if it was a miss; that is not possible in a set-associative cache. And when a miss occurs, which way's block do we pick for replacement?

Replacement Policy

Direct mapped: no choice. Set associative: prefer a non-valid entry, if there is one; otherwise, choose among the entries in the set. Least-recently used (LRU) is common: choose the entry unused for the longest time. This is simple for 2-way and manageable for 4-way, but too hard beyond that. Random replacement, oddly, gives about the same performance as LRU at high associativity.

Benefits of Set Associative Caches

The choice of direct-mapped or set-associative depends on the cost of a miss versus the benefit of a hit. The largest gains come from going from direct mapped to 2-way (a 20%+ reduction in miss rate).

Reducing Cache Miss Rates #2: Use multiple levels of caches

With advancing technology there is more than enough room on the die for bigger L1 caches or for a second level of cache, normally a unified L2 cache (holding both instructions and data) and, in some cases, even a unified L3 cache.

For our example, with a CPI_ideal of 2, a 100-cycle miss penalty (to main memory), a 25-cycle miss penalty (to the unified L2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, and a 0.5% UL2$ miss rate:

CPI_stalls = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100 = 3.54

as compared to 5.44 with no L2$.

Multilevel Cache Design Considerations

Design considerations for L1 and L2 caches differ. The primary cache should focus on minimizing hit time in support of a shorter clock cycle: smaller capacity with smaller block sizes. The secondary cache(s) should focus on reducing the miss rate, to reduce the penalty of long main-memory access times: larger capacity with larger block sizes and higher levels of associativity.

The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so L1 can be smaller (i.e., faster) even at the cost of a higher miss rate. For the L2 cache, hit time is less important than miss rate: the L2$ hit time determines the L1$'s miss penalty.
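The CPI_stalls arithmetic in the example above can be spelled out as code. The rates and penalties are the slide's; the function names are my own:

```python
def cpi_with_l2(cpi_ideal=2.0, ld_st=0.36,
                i_miss=0.02, d_miss=0.04,
                l2_penalty=25, l2_miss=0.005, mem_penalty=100):
    """CPI including stall cycles, with a unified L2 behind split L1 caches."""
    stalls = (i_miss * l2_penalty               # instruction fetches served by L2
              + ld_st * d_miss * l2_penalty     # loads/stores served by L2
              + l2_miss * mem_penalty           # instruction-side L2 misses
              + ld_st * l2_miss * mem_penalty)  # data-side L2 misses
    return cpi_ideal + stalls

def cpi_no_l2(cpi_ideal=2.0, ld_st=0.36, i_miss=0.02, d_miss=0.04,
              mem_penalty=100):
    """CPI when every L1 miss goes straight to main memory."""
    return cpi_ideal + i_miss * mem_penalty + ld_st * d_miss * mem_penalty

print(round(cpi_with_l2(), 2))  # 3.54
print(round(cpi_no_l2(), 2))    # 5.44
```

Note how the L2 turns most 100-cycle stalls into 25-cycle stalls, cutting the stall component from 3.44 cycles to 1.54.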

Using the Memory Hierarchy Well

FIGURE 5.18 Comparing quicksort and radix sort by (a) instructions executed per item sorted, (b) time per item sorted, and (c) cache misses per item sorted. This data is from a paper by LaMarca and Ladner [1996]. Although the numbers would change for newer computers, the idea still holds. Due to such results, new versions of radix sort have been invented that take the memory hierarchy into account, to regain its algorithmic advantages (see Section 5). The basic idea of cache optimizations is to use all the data in a block repeatedly before the block is replaced on a miss. Copyright 2009 Elsevier, Inc. All rights reserved.

Two Machines' Cache Parameters

[Table of the two machines' cache parameters not reproduced in the transcription.]
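The idea of using all the data in a block before it is replaced can be made concrete with a toy miss counter. This sketch is my own illustration, not from the text: it replays two traversal orders of a 32×32 word matrix against a tiny direct-mapped cache with 4-word blocks. Row-major order misses once per block; column order misses on every access because each column's words conflict on one cache frame:

```python
def block_misses(addresses, block_words=4, cache_blocks=8):
    """Count misses in a direct-mapped cache tracked at block granularity."""
    tags = [None] * cache_blocks
    misses = 0
    for a in addresses:
        block = a // block_words           # which memory block the word is in
        idx = block % cache_blocks         # which cache frame it maps to
        if tags[idx] != block:
            misses += 1
            tags[idx] = block
    return misses

N = 32  # word addresses of a row-major N x N matrix
row_order = [i * N + j for i in range(N) for j in range(N)]
col_order = [i * N + j for j in range(N) for i in range(N)]
print(block_misses(row_order))  # 256: one miss per 4-word block
print(block_misses(col_order))  # 1024: every access misses
```

Cache-aware radix sort variants exploit exactly this: they reorganize their passes so each fetched block is fully consumed before eviction.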

Cortex-A8 Data Cache Miss Rates

FIGURE 5.45 Data cache miss rates for the ARM Cortex-A8 when running Minnespec, a small version of SPEC2000. Applications with larger memory footprints tend to have higher miss rates in both L1 and L2. Note that the L2 rate is the global miss rate, that is, counting all references, including those that hit in L1. (See Elaboration in Section 5.4.) Mcf is known as a cache buster.

Intel Core i7 920 Data Cache Miss Rates

FIGURE 5.47 The L1, L2, and L3 data cache miss rates for the Intel Core i7 920 running the full integer SPEC CPU2006 benchmarks.

Summary: Improving Cache Performance

1. Reduce the time to hit in the cache: smaller cache; direct-mapped cache; smaller blocks. For writes: no-write-allocate (no hit on cache, just write to the write buffer), or write-allocate (to avoid two cycles, first check for a hit, then write; pipeline writes via a delayed write buffer to the cache).

2. Reduce the miss rate: bigger cache; more flexible placement (increase associativity); larger blocks (16 to 64 bytes is typical); a victim cache (a small buffer holding the most recently replaced blocks).

3. Reduce the miss penalty: smaller blocks; use a write buffer to hold dirty blocks being replaced, so the read need not wait for the write to complete; check the write buffer (and/or victim cache) on a read miss, since it may hold the requested data; for large blocks, fetch the critical word first; use multiple cache levels (the L2 cache is often not tied to the CPU clock rate).
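One way to weigh these three families of techniques against each other is average memory access time, AMAT = hit time + miss rate × miss penalty, applied recursively across levels. The formula is standard; the cycle counts below are illustrative assumptions, not numbers from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Assumed: 1-cycle L1 hit, 5% L1 miss rate; 10-cycle L2 hit,
# 25% L2 local miss rate, 100-cycle main-memory access.
l2_amat = amat(10, 0.25, 100)       # L2 as seen by an L1 miss: 35 cycles
l1_amat = amat(1, 0.05, l2_amat)    # overall average: 2.75 cycles per access
print(l1_amat)
```

Each summary item above shrinks one term of this sum: hit-time techniques lower `hit_time`, miss-rate techniques lower `miss_rate`, and miss-penalty techniques (including adding a level) lower `miss_penalty`.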

Summary: The Cache Design Space

Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation. The optimal choice is a compromise: it depends on access characteristics (hard to predict) and on technology and cost, and simplicity often wins. [Figure: the classic design-space sketch, marking "good" and "bad" regions as cache size, associativity, and block size (factors A and B) vary from less to more.]

Concluding Remarks

Fast memories are small, and large memories are slow; we want fast, large memories, and caching gives this illusion. Principle of locality: programs use a small part of their memory space frequently. The memory hierarchy: L1 cache « L2 cache « ... « DRAM memory « disk.