Introduction to OpenMP. Lecture 10: Caches


Overview: why caches are needed; how caches work; cache design and performance.

The memory speed gap. Moore's Law: processor speed doubles every 18 months, and this has held for the last 35 years. Memory (DRAM) speeds are not keeping up: they double only every 5 years or so. In 1980, both CPU and memory cycle times were around 1 microsecond, so a floating-point add and a memory load took about the same time. By 2000, CPU cycle times were around 1 nanosecond while memory cycle times were around 100 nanoseconds: a memory load is now two orders of magnitude more expensive than a floating-point add.

Principle of locality. Almost every program exhibits some degree of locality: it tends to reuse recently accessed data and instructions. There are two types of data locality: 1. Temporal locality: a recently accessed item is likely to be reused in the near future, e.g. if x is read now, it is likely to be read again, or written, soon. 2. Spatial locality: items with nearby addresses tend to be accessed close together in time, e.g. if y[i] is read now, y[i+1] is likely to be read soon.
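A minimal sketch in C of both kinds of locality (illustrative code, not from the lecture): summing an array reuses the accumulator on every iteration (temporal locality) and walks through adjacent addresses in order (spatial locality), so each cache block fetched is fully used.

    #include <stddef.h>

    /* Summing an array exhibits both kinds of locality. */
    double sum_array(const double *y, size_t n)
    {
        double sum = 0.0;            /* reused every iteration: temporal locality */
        for (size_t i = 0; i < n; i++)
            sum += y[i];             /* sequential accesses: spatial locality */
        return sum;
    }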

What is cache memory? A small, fast memory placed between the processor and main memory. [Diagram: Processor -> Cache Memory -> Main Memory]

How does this help? The cache can hold copies of data from main memory locations, and can also hold copies of instructions. It keeps recently accessed data items available for fast re-access: fetching an item from cache is much quicker than fetching it from main memory, around 1 nanosecond instead of 100. For cost and speed reasons, cache is much smaller than main memory.

Blocks A cache block is the minimum unit of data which can be determined to be present in or absent from the cache. Normally a few words long: typically 32 to 128 bytes. See later for discussion of optimal block size. N.B. a block is sometimes also called a line.

Design decisions When should a copy of an item be made in the cache? Where is a block placed in the cache? How is a block found in the cache? Which block is replaced after a miss? What happens on writes? Methods must be simple (hence cheap and fast to implement in hardware).

When to cache? Always cache on reads, except in special circumstances: if a memory location is read and there isn't a copy in the cache (a read miss), then cache the data. What happens on writes depends on the write strategy: see later. N.B. for instruction caches, there are no writes.

Where to cache? The cache is organised in blocks, and each block has a number. [Diagram: a cache of 1024 blocks, numbered 0 to 1023, each 32 bytes long]

Bit selection. The simplest scheme is a direct mapped cache. To cache the contents of an address, we ignore the last n bits (the block offset), where 2^n is the block size in bytes. The block number (index) is given by the next m bits, where 2^m is the number of blocks: index = (remaining bits) MOD (no. of blocks in cache). [Diagram: full address split as 01110011101011101 (remaining bits) | 0110011100 (block index) | 10100 (block offset)]
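The bit selection above can be written out in C. This is a sketch using the geometry implied by the slides (32-byte blocks, so n = 5 offset bits, and 1024 blocks, so m = 10 index bits); the constants are illustrative, not a description of any particular machine.

    #include <stdint.h>

    #define BLOCK_SIZE 32u       /* bytes per block: 2^5, so 5 offset bits  */
    #define NUM_BLOCKS 1024u     /* blocks in cache: 2^10, so 10 index bits */

    /* The last n bits of the address select the byte within the block. */
    static uint32_t block_offset(uint32_t addr) { return addr % BLOCK_SIZE; }

    /* The next m bits give the block number: (remaining bits) MOD (no. of blocks). */
    static uint32_t block_index(uint32_t addr)  { return (addr / BLOCK_SIZE) % NUM_BLOCKS; }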

Set associativity. The cache is divided into sets, where a set is a group of blocks (typically 2 or 4). The set index is computed as (remaining bits) MOD (no. of sets in cache), and the data can go into any block in the selected set. [Diagram: full address split as 011100111010111010 (remaining bits) | 110011100 (set index) | 10100 (block offset)]

Set associativity. If there are k blocks in a set, the cache is said to be k-way set associative. [Diagram: a cache of sets numbered 0 to 511, each block 32 bytes long] If there is just one set, the cache is fully associative.

How to find a cache block. Whenever we load from an address, we have to check whether it is cached: for a given address, find the set where it might be cached. Each block has an address tag: the address with the set index and block offset stripped off. Each block also has a valid bit: if the bit is set, the block contains valid data. We need to check the tags of all valid blocks in the set against the target address. [Diagram: full address split as 011100111010111010 (tag) | 110011100 (set index) | 10100 (block offset)]
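As a sketch, the tag check can be modelled in C like this: compare the target tag against every valid block in the selected set. A real cache does these comparisons in parallel in hardware, and the 4-way associativity here is an illustrative assumption.

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4                /* blocks per set: illustrative */

    struct cache_block {
        bool     valid;           /* set once the block holds real data         */
        uint32_t tag;             /* address with set index and offset stripped */
        /* data bytes omitted */
    };

    /* Returns true on a hit: some valid block in the set matches the tag. */
    bool lookup(const struct cache_block set[WAYS], uint32_t target_tag)
    {
        for (int i = 0; i < WAYS; i++)
            if (set[i].valid && set[i].tag == target_tag)
                return true;      /* hit  */
        return false;             /* miss */
    }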

Which block to replace? In a direct mapped cache there is no choice: replace the selected block. In set associative caches, there are two common strategies. Random: replace a block in the selected set at random. Least recently used (LRU): replace the block in the set which has been unused for the longest time. LRU is better, but harder to implement.
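One simple software model of LRU (a sketch, not how hardware actually does it): stamp each block with the logical time of its last use and evict the block with the oldest stamp. Real hardware tends to use cheaper approximations, which is why LRU is described as harder to implement.

    #include <stdint.h>

    struct lru_block {
        uint64_t last_used;       /* logical time of most recent access */
        /* valid bit, tag, data omitted */
    };

    /* Choose the victim: the block unused for the longest time. */
    int victim_lru(const struct lru_block set[], int k)
    {
        int oldest = 0;
        for (int i = 1; i < k; i++)
            if (set[i].last_used < set[oldest].last_used)
                oldest = i;
        return oldest;
    }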

What happens on a write? Writes are less common than reads. There are two basic strategies. Write through: write data both to the cache block and to main memory; normally do not cache on a write miss. Write back: write data to the cache block only, and copy the data back to main memory only when the block is replaced; a dirty/clean bit is used to indicate when this is necessary; normally cache on a write miss.
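The write-back strategy can be sketched in C as follows (an illustrative model, not a real hardware interface): a write hit updates only the cached copy and sets the dirty bit, and data is copied back to main memory only if the block is dirty when it is evicted.

    #include <stdbool.h>
    #include <stdint.h>

    struct wb_block {
        bool     valid, dirty;
        uint32_t tag;
        uint8_t  data[32];        /* one 32-byte block */
    };

    /* Write hit: update the cache only, and mark the memory copy stale. */
    void write_hit(struct wb_block *b, int offset, uint8_t byte)
    {
        b->data[offset] = byte;
        b->dirty = true;
    }

    /* On replacement, copy back to memory only if the block was modified. */
    void evict(struct wb_block *b, void (*write_to_memory)(const struct wb_block *))
    {
        if (b->valid && b->dirty)
            write_to_memory(b);
        b->valid = b->dirty = false;
    }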

Write through vs. write back. With write back, not all writes go to main memory, which reduces the demand on memory bandwidth, but it is harder to implement than write through. With write through, main memory always has a valid copy, which is useful for I/O and for some implementations of multiprocessor cache coherency; a write buffer can avoid the CPU having to wait for writes to complete.

Cache performance. Average memory access cost = hit time + miss ratio x miss time, where the hit time is the time to load data from cache to CPU, the miss ratio is the proportion of accesses which cause a miss, and the miss time is the time to load data from main memory into cache. We can try to minimise all three components.
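A worked example, using the latencies quoted earlier in the lecture (1 ns cache hit, 100 ns memory access) and an assumed miss ratio of 2%: the average cost is 1 + 0.02 x 100 = 3 ns per access. The miss ratio is an assumption for illustration only.

    /* Average memory access cost = hit time + miss ratio * miss time. */
    double average_access_cost(double hit_time, double miss_ratio, double miss_time)
    {
        return hit_time + miss_ratio * miss_time;
    }

    /* average_access_cost(1.0, 0.02, 100.0) == 3.0 (nanoseconds) */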

Cache misses: the 3 Cs. Cache misses can be divided into 3 categories. Compulsory (or cold start) misses: the first ever access to a block causes a miss. Capacity misses: caused because the cache is not large enough to hold all the data. Conflict misses: caused by too many blocks mapping to the same set.

Block size. The choice of block size is a tradeoff. Large blocks result in fewer misses because they exploit spatial locality. However, if the blocks are too large, they can cause additional capacity/conflict misses (for the same total cache size). Larger blocks also have higher miss times (they take longer to load).

Set associativity. Higher associativity (more blocks per set) reduces the number of conflict misses: 8-way set associative is almost as good as fully associative. However, it also increases the hit time, since it takes longer to find the correct block. Conflict misses can also be reduced by using a victim cache: a small buffer which stores the most recently evicted blocks. This helps prevent thrashing, where successive accesses all map to the same set.

Prefetching. One way to reduce the miss rate is to load data into the cache before the load is issued; this is called prefetching. It requires modifications to the processor: it must be able to support multiple outstanding cache misses, and additional hardware is required to keep track of the outstanding prefetches. The number of outstanding misses is limited (e.g. 4 or 8): the extra benefit from allowing more does not justify the hardware cost.

Hardware prefetching is typically very simple: e.g. whenever a block is loaded, fetch the consecutive block as well. This is very effective for instruction caches; it is less so for data caches, though multiple streams can be supported, and it requires regular data access patterns. Alternatively, the compiler can place prefetch instructions ahead of loads (software prefetching). This requires extensions to the instruction set and costs additional instructions, and it is of no use if the prefetch is placed too far ahead: the prefetched block may be replaced before it is used.
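Software prefetching can be sketched with GCC/Clang's __builtin_prefetch, which compiles to a prefetch instruction where the ISA provides one. The prefetch distance of 8 elements is an illustrative guess: too small and the data arrives late, too large and the prefetched block may be replaced before it is used (the problem noted above).

    /* Prefetch ahead of the loads in a simple reduction loop. */
    double sum_with_prefetch(const double *y, long n)
    {
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&y[i + 8], 0, 3);  /* read, high temporal locality */
            sum += y[i];
        }
        return sum;
    }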

Multiple levels of cache. One way to reduce the miss time is to have more than one level of cache. [Diagram: Processor -> Level 1 Cache -> Level 2 Cache -> Main Memory]

Multiple levels of cache. The second level cache should be much larger than the first level; otherwise a level 1 miss will almost always be a level 2 miss as well. The second level cache will therefore be slower, but still much faster than main memory. Its block size can be bigger, too, since there is a lower risk of conflict misses. Typically, everything in level 1 must be in level 2 as well (inclusion); this is required for cache coherency in multiprocessor systems.

Multiple levels of cache. Three levels of cache are now commonplace, with all 3 levels on chip. It is common to have separate level 1 caches for instructions and data, and combined level 2 and 3 caches for both. This complicates the design: each level must be designed with knowledge of the others, inclusion must be maintained across differing block sizes, coherency must be preserved, and so on.

Memory hierarchy. Speed (and cost per byte) decrease, and capacity increases, as we move down the levels:

Level           Access time    Capacity
CPU registers   1 cycle        ~1 KB
L1 cache        2-3 cycles     ~100 KB
L2 cache        ~20 cycles     ~1-10 MB
L3 cache        ~50 cycles     ~10-50 MB
Main memory     ~300 cycles    ~1 GB