CMSC 22200 Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li, University of Chicago

Lecture Outline: Caches

Review
Memory hierarchy; cache basics
Locality principles: spatial and temporal
How to access a cache?
Fundamental parameters: sets, block size, associativity
Performance metric: AMAT

Cache Design Decisions and Tradeoffs

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

Cache Size
Cache size: total data (not including tag) capacity; a bigger cache can exploit temporal locality better
Too large: adversely affects hit and miss latency; smaller is faster => bigger is slower; access time may degrade the critical path
Too small: doesn't exploit temporal locality well; useful data gets replaced often
Working set: the whole set of data the executing application references within a time interval
[Figure: hit rate vs. cache size, with the knee around the working set size]

Block Size
Block size is the amount of data associated with an address tag
Too small: doesn't exploit spatial locality well; larger tag overhead; can't take advantage of fast burst transfers from memory
Too large: too few total blocks -> less temporal locality exploitation; wastes cache space, bandwidth, and energy if spatial locality is not high
[Figure: hit rate vs. block size, peaking at an intermediate block size]

Critical-Word First
Large cache blocks can take a long time to fill into the cache: fill the cache line critical word first; restart the cache access before the fill completes
Example: assume 8-byte words and an 8-word cache block; the application wants to access the 4th word and misses in the cache; fetch the 4th word first, then the rest

Associativity
How many blocks can map to the same index (or set)?
Higher associativity: ++ higher hit rate (reduces conflict misses); -- slower cache access time (hit latency and data access latency); -- more expensive hardware (more comparators); -- diminishing returns from higher associativity
Smaller associativity: lower cost; lower hit latency (especially important for L1 caches)
[Figure: hit rate vs. associativity, with diminishing returns]
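
As a concrete illustration (all parameters here are assumed for the example, not from the lecture): a 32 KB, 4-way set-associative cache with 64-byte blocks has 32768 / (64 * 4) = 128 sets, and an address decomposes into offset, index, and tag as in this C sketch. On a lookup, all 4 tags in the selected set are compared in parallel, which is why higher associativity costs more comparators.

#include <stdint.h>

/* Assumed example parameters: 32 KB, 4-way, 64 B blocks -> 128 sets. */
#define BLOCK_BYTES 64
#define NUM_SETS    128

uint64_t block_offset(uint64_t addr) { return addr % BLOCK_BYTES; }
uint64_t set_index(uint64_t addr)    { return (addr / BLOCK_BYTES) % NUM_SETS; }
uint64_t tag_bits(uint64_t addr)     { return addr / ((uint64_t)BLOCK_BYTES * NUM_SETS); }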

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

Eviction/Replacement Policy
Which block in the set do we replace on a cache miss? Any invalid block first; if all are valid, consult the replacement policy

LRU (Least Recently Used) Policy
Idea: evict the least recently accessed block
Problem: need to keep track of the access ordering of blocks
Question: 2-way set associative cache: what do you need to implement LRU perfectly?
Question: 4-way set associative cache: what do you need to implement LRU perfectly? How many different orderings are possible for the 4 blocks in the set? How many bits are needed to encode the LRU order of a block? What is the logic needed to determine the LRU victim?
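
One standard set of answers (textbook reasoning, not spelled out on the slide): a 2-way set needs a single bit per set marking the LRU way; a 4-way set has 4! = 24 possible access orderings, so perfect LRU needs at least ceil(log2(24)) = 5 bits per set, often implemented less compactly as a 2-bit recency rank per block. A minimal C sketch of the rank scheme for one set (names are hypothetical):

#include <stdint.h>

#define WAYS 4

/* Per-block 2-bit recency rank: 0 = MRU, WAYS-1 = LRU.
   The ranks always form a permutation of 0..WAYS-1. */
typedef struct { uint8_t rank[WAYS]; } lru_state_t;

/* On an access to `way`, promote it to MRU and age every block
   that was more recently used than it. */
void lru_touch(lru_state_t *s, int way) {
    uint8_t old = s->rank[way];
    for (int w = 0; w < WAYS; w++)
        if (s->rank[w] < old)
            s->rank[w]++;
    s->rank[way] = 0;
}

/* The LRU victim is the block whose rank is WAYS-1. */
int lru_victim(const lru_state_t *s) {
    for (int w = 0; w < WAYS; w++)
        if (s->rank[w] == WAYS - 1)
            return w;
    return 0; /* unreachable while ranks form a permutation */
}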

Approximations of LRU
Most modern processors do not implement true LRU (also called "perfect LRU") in highly-associative caches
Why? True LRU is complex, and LRU is only an approximation to predict locality anyway (i.e., not the best possible cache management policy)
Examples: Not MRU (not most recently used); many others

Random Replacement Policy
LRU vs. Random: which one is better?
Example: 4-way cache, cyclic references to A, B, C, D, E: 0% hit rate with the LRU policy (each miss evicts exactly the block that will be referenced next: E evicts A just before A is referenced again, and so on)
Set thrashing: when the program's working set in a set is larger than the set associativity; random replacement is better when thrashing occurs
In practice: depends on the workload; on average, the hit rates of LRU and Random are similar

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

Write-Back vs. Write-Through Caches
When do we write the modified data in a cache to the next level? Write-through: at the time the write happens. Write-back: when the block is evicted
Write-back: + can consolidate multiple writes to the same block before eviction, potentially saving bandwidth between cache levels and saving energy; -- needs a bit indicating the block is dirty/modified
Write-through: + simpler; + all levels are up to date; -- more bandwidth intensive; no coalescing of writes

Allocate vs. No-Allocate on Write Caches
Do we allocate a cache block on a write miss? Allocate on write miss: yes. No-allocate on write miss: no
Allocate on write miss: + can consolidate writes instead of writing each of them individually to the next level (assuming a write-back cache); + simpler, because write misses can be treated the same way as read misses; -- requires transfer of the whole cache block
No-allocate: + conserves cache space if the locality of writes is low (potentially better cache hit rate)
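
A compact C sketch of how these policy choices interact on a write (a hypothetical model for illustration, pairing write-back with allocate-on-write-miss and write-through with no-allocate, which are the common pairings; all helper functions are assumed):

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool valid, dirty; uint64_t block_addr; } line_t;

/* Assumed helpers: hit lookup, block fill (may evict), next-level write. */
extern line_t *lookup(uint64_t block_addr);
extern line_t *fill_from_next_level(uint64_t block_addr);
extern void store_into_line(line_t *line, uint64_t addr, uint64_t data);
extern void write_next_level(uint64_t addr, uint64_t data);

void handle_write(uint64_t addr, uint64_t data, bool write_back) {
    uint64_t block = addr & ~(uint64_t)63;   /* 64 B blocks assumed */
    line_t *line = lookup(block);

    if (!line && write_back)
        line = fill_from_next_level(block);  /* allocate on write miss */

    if (line) {
        store_into_line(line, addr, data);   /* update the cached copy */
        if (write_back)
            line->dirty = true;              /* written to next level on eviction */
        else
            write_next_level(addr, data);    /* write-through keeps levels up to date */
    } else {
        write_next_level(addr, data);        /* no-allocate: bypass the cache */
    }
}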

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

Instruction vs. Data Caches
Separate or unified?
Unified: + dynamic sharing of cache space: no overprovisioning that might happen with static partitioning (i.e., split I and D caches); -- instructions and data can thrash each other (i.e., no guaranteed space for either); -- I and D are accessed at different places in the pipeline: where do we place the unified cache for fast access?
First-level caches are almost always split, mainly for the last reason above; second and higher levels are almost always unified

A Word on Multi-level Caching
First-level (L1) caches (instruction and data): decisions very much affected by cycle time; small, lower associativity
Second-level (L2) caches: decisions need to balance hit rate and access latency; usually large and highly associative; latency not as important
Multi-level inclusion: data in L1 is always a subset of data in L2. + easier cache analysis; + easier coherence checks; -- additional logic to maintain inclusion; -- wasted space

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

How to Improve Cache Performance
Remember: average memory access time (AMAT) = (hit-rate * hit-latency) + (miss-rate * miss-latency)
Three fundamental goals: reducing miss rate; reducing miss latency/cost; reducing hit latency/cost
Tradeoffs! E.g., to reduce miss rate, hit/miss latency can increase
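
A quick worked example with assumed numbers: with a 95% hit rate, 1-cycle hit latency, and 100-cycle miss latency, AMAT = 0.95 * 1 + 0.05 * 100 = 5.95 cycles; improving the miss rate to 2% gives 0.98 * 1 + 0.02 * 100 = 2.98 cycles, nearly halving the average access time.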

Improving Cache Performance
Reducing miss rate: higher associativity; better replacement policies; ...
Reducing hit latency/cost: smaller caches / lower associativity
Reducing miss latency/cost: multi-level caches; critical-word first; non-blocking caches (multiple cache misses in parallel); high-bandwidth caches (multiple accesses per cycle); cache-friendly software approaches

Non-Blocking Caches

Handling Multiple Outstanding Accesses
Goal: enable cache access when there is a pending miss ("hit under miss", "miss under miss")
Idea: non-blocking or lockup-free caches

Benefits of Non-Blocking Caches
1 hit under miss: 9% and 12.5% reductions in cache access latency for the SPECint and SPECfp benchmarks, respectively
Slide credit: Prof. Christos Kozyrakis (EE282, Stanford University)

Handling Multiple Outstanding Accesses
Idea: keep track of the status/data of misses that are being handled in Miss Status Handling Registers (MSHRs)
A cache access checks the MSHRs to see if a miss to the same block is already pending. If pending, a new request is not generated; if pending and the needed data is available, the data is forwarded to the later load
Requires buffering of outstanding miss requests

Miss Status Handling Register
Also called the miss buffer. Keeps track of: outstanding cache misses; pending load/store accesses that refer to the missing cache block
Fields of a single MSHR entry: valid bit; cache block address (to match incoming accesses); control/status bits (prefetch, whether it's been issued to memory, which subblocks have arrived, etc.); data for each subblock
Multiple store/load entries, one for each pending load/store that accesses the cache block: valid, type (load or store), data size (how many bytes), which bytes in the block are needed, destination register or store buffer entry address

Miss Status Handling Register Entry [figure: MSHR entry layout]
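
A minimal C sketch of one MSHR entry with the fields listed above (field names, widths, and the subblock and pending-access counts are illustrative assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS   4    /* assumed subblocks per cache block */
#define SUBBLOCK_B  16   /* assumed bytes per subblock */
#define MAX_PENDING 4    /* assumed pending accesses per entry */

typedef struct {
    bool    valid;
    bool    is_store;    /* type: load or store */
    uint8_t size;        /* data size in bytes */
    uint8_t offset;      /* which bytes in the block are needed */
    uint8_t dest;        /* destination register or store buffer entry */
} mshr_access_t;

typedef struct {
    bool          valid;
    uint64_t      block_addr;                  /* matched against incoming accesses */
    bool          is_prefetch;                 /* control/status bits */
    bool          issued_to_memory;
    bool          arrived[SUBBLOCKS];          /* which subblocks have arrived */
    uint8_t       data[SUBBLOCKS][SUBBLOCK_B]; /* data for each subblock */
    mshr_access_t pending[MAX_PENDING];        /* pending loads/stores to this block */
} mshr_entry_t;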

Non-Blocking Cache Operation
On a cache miss: search the MSHRs for a pending access to the same block. Found: allocate a load/store entry in the same MSHR entry. Not found: allocate a new MSHR. No free entry: stall
When a subblock returns from the next level in memory: check which loads/stores are waiting for it; forward the data to the load/store unit; deallocate the load/store entry in the MSHR entry (mark it as invalid); write the subblock into the cache or the MSHR; if it is the last subblock, deallocate the MSHR (after writing the block into the cache)
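
In code, the lookup/allocate part of this flow might look roughly as follows, continuing the hypothetical mshr_entry_t sketch above (NUM_MSHRS and the NULL-means-stall convention are assumptions):

#define NUM_MSHRS 8
mshr_entry_t mshrs[NUM_MSHRS];

/* On a cache miss: return the MSHR entry handling this block,
   or NULL to signal that the access must stall. */
mshr_entry_t *mshr_lookup_or_allocate(uint64_t block_addr) {
    /* 1. A pending miss to the same block? Attach to it. */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return &mshrs[i];  /* caller adds a load/store entry here */

    /* 2. Not found: allocate a new MSHR entry. */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshrs[i].valid) {
            mshrs[i] = (mshr_entry_t){ .valid = true,
                                       .block_addr = block_addr };
            return &mshrs[i];  /* caller issues the request to memory */
        }

    /* 3. No free entry: stall. */
    return NULL;
}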

Enabling High-Bandwidth Caches

Multiple Memory Accesses per Cycle
Processors can generate multiple cache/memory accesses per cycle (e.g., superscalar)
A cache/memory can receive multiple access requests per cycle (e.g., when shared among multiple processors)
How do we ensure the cache/memory can handle multiple accesses in the same clock cycle? Solution: multiple banks

Cache Banks
Idea: rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
Bits in the address determine which bank an address maps to; the address space is partitioned across the banks
Accesses to different banks (e.g., blocks 0, 1, 2, 3) can proceed in parallel. This mapping is called interleaving
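
For example (a sketch of one simple interleaving, with an assumed bank count): with 64-byte blocks and 4 banks, consecutive block addresses map to consecutive banks, so blocks 0, 1, 2, and 3 fall in banks 0, 1, 2, and 3 and can be accessed in parallel.

#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS   4   /* assumed bank count */

/* Block-interleaved mapping: consecutive cache blocks hit consecutive banks. */
uint64_t bank_of(uint64_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BANKS;
}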

Cache Banks
Advantages: increased cache bandwidth; no increase in data store area; power benefits
Disadvantages: cannot satisfy multiple accesses to the same bank in the same cycle. This is the key issue, called bank conflicts (many techniques exist to avoid them); bank utilization; more complex logic (an interconnect network) to distribute/collect accesses

Software Optimization Techniques

General Approaches
Restructuring data access patterns; restructuring data layout
Focus: improving hit rate

Restructuring Data Access Patterns (I)
Array access example: if the layout is column-major, x[i+1,j] follows x[i,j] in memory while x[i,j+1] is far away from x[i,j]
Poor code:
for i = 1, rows
    for j = 1, columns
        sum = sum + x[i,j]
Better code:
for j = 1, columns
    for i = 1, rows
        sum = sum + x[i,j]
This is called loop interchange
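
As a runnable illustration: C arrays are row-major (x[i][j+1] is adjacent in memory), the opposite of the column-major layout assumed above, so in C the cache-friendly order puts j in the inner loop. A self-contained sketch:

#include <stdio.h>

#define ROWS 1024
#define COLS 1024
static double x[ROWS][COLS];

int main(void) {
    double sum = 0.0;
    /* Row-major C: the inner loop over j walks memory sequentially,
       one element at a time (good spatial locality). Interchanging
       the loops would stride by COLS doubles on every access. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += x[i][j];
    printf("sum = %f\n", sum);
    return 0;
}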

Restructuring Data Access Patterns (II)
Blocking: divide loops operating on arrays into computation chunks so that each chunk can hold its data in the cache
Avoids cache conflicts between different chunks of computation
Essentially: divide the working set so that each piece fits in the cache
Blocking limitations: 1. there can be conflicts among different arrays; 2. array sizes may be unknown at compile/programming time

Blocking Example
// 3 N-by-N matrices x, y, z
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r += y[i][k] * z[k][j];
        x[i][j] += r;
    }
}

Access Pattern [figure: memory access pattern of the unblocked code, distinguishing older accesses from new accesses]

Blocked Access Pattern [figure: access patterns of the unoptimized vs. blocked code]

Blocked Code
for (jj = 0; jj < N; jj += B) {
    for (kk = 0; kk < N; kk += B) {
        for (i = 0; i < N; i++) {
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++) {
                    r += y[i][k] * z[k][j];
                }
                x[i][j] += r;
            }
        }
    }
}
Here B is the blocking factor; a common rule of thumb is to pick B small enough that the B-by-B submatrix of z, plus the rows of x and y it interacts with, fits in the cache.

Restructuring Data Layout (I)
struct Node {
    struct Node* next;
    int key;
    char name[256];
    char school[256];
};

while (node) {
    if (node->key == input_key) {
        // access other fields of node
    }
    node = node->next;
}

Pointer-based traversal (e.g., of a linked list). Assume a huge linked list (1M nodes) and unique keys. Why does this code have a poor cache hit rate? The other fields occupy most of the cache line even though they are rarely accessed!

Restructuring Data Layout (II)
struct Node {
    struct Node* next;
    int key;
    struct Node_data* node_data;
};

struct Node_data {
    char name[256];
    char school[256];
};

while (node) {
    if (node->key == input_key) {
        // access node->node_data
    }
    node = node->next;
}

Idea: separate the frequently used fields of a data structure and pack them into a separate data structure
Who should do this? The programmer? The compiler (profiling vs. dynamic)? Hardware? Who can determine what is frequently used?

Restructuring Data Layout (III)
How about instruction layout?