Architecture: Caching Issues in Performance


Mike Bailey
mjb@cs.oregonstate.edu

Problem: The Path Between a CPU Chip and Off-chip Memory is Slow

[Diagram: CPU chip connected to off-chip main memory over a slow path.]

This path is relatively slow, forcing the CPU to wait for up to 200 clock cycles just to do a store to, or a load from, memory. Depending on your CPU's ability to process instructions out of order, it might go idle during this time. This is a huge performance hit!

Solution: Hierarchical Memory Systems, or Cache

[Diagram: CPU chip, intermediate cache levels, main memory.]

The solution is to add intermediate memory systems. The one closest to the CPU is small and fast. The memory systems get slower and larger as they get farther away from the CPU.

How is Cache Memory Actually Defined?

"In computer science, a cache is a collection of data duplicating original values stored elsewhere or computed earlier, where the original data is expensive to fetch (due to longer access time) or to compute, compared to the cost of reading the cache. In other words, a cache is a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in the cache, future use can be made by accessing the cached copy rather than re-fetching or recomputing the original data, so that the average access time is shorter. Cache, therefore, helps expedite data access that the CPU would otherwise need to fetch from main memory."
-- Wikipedia

Cache and Memory are Named by Distance Level from the CPU

[Diagram: CPU chip with L1 and L2 caches, then main memory.]

Storage level:             Registers         L1 / L2 cache      Main memory      Disk
Type of storage:           On-chip           On-chip CMOS       Off-chip DRAM    Disk
Typical size:              < 100 KB          < 8 MB             < 10 GB          Many GBs
Typical access time (ns):  0.25 - 0.50       0.5 - 25.0         50 - 250         5,000,000
Bandwidth (MB/sec):        50,000 - 500,000  5,000 - 20,000     2,500 - 10,000   50 - 500
Managed by:                Compiler          Hardware           OS               OS

Adapted from: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, Morgan-Kaufmann, 2007.

Usually there are two L1 caches, one for Instructions and one for Data. You will often see this referred to in data sheets as "L1 cache: 64KB + 64KB" or "I and D cache".

Cache Hits and Misses

When the CPU asks for a value from memory, and that value is already in the cache, it can get it quickly. This is called a cache hit. When the CPU asks for a value from memory, and that value is not already in the cache, it will have to go off the chip to get it. This is called a cache miss.

Performance programming should strive to avoid as many cache misses as possible. That's why it is very helpful to know the cache structure of your CPU.

Memory is arranged in blocks (Block 0, Block 1, Block 2, Block 3, ...). The size of a block is typically 64 Kbytes. While cache might be multiple kilo- or megabytes, the bytes are transferred in much smaller quantities, each called a cache line. The size of a cache line is typically just 64 bytes.

Possible Cache Architectures

1. Fully Associative: cache lines from any block of memory can appear anywhere in cache.

2. Direct Mapped: a cache line from a particular block of memory has only one place it could appear in cache. A memory block's cache line is:

       Cache line # = Memory block # % (# lines the cache has)

3. N-way Set Associative: a cache line from a particular block of memory can appear in a limited number of places in cache. Each limited place is called a set of cache lines. A set contains N cache lines. A memory block's set number is:

       Set # = Memory block # % (# sets the cache has)

   The memory block can appear in any cache line in its set. N is typically 4 for L1 and 8 or 16 for L2.

[Diagram: a direct-mapped cache with lines 0-3, and a set-associative cache whose lines are grouped into Set 0 through Set 3.]

Note that Direct Mapped can be thought of as 1-Way Set Associative. Note that Fully Associative can be thought of as 1-set Set Associative.

What Happens When the Cache is Full and a New Piece of Memory Needs to Come In?

1. Random: randomly pick a cache line to remove.
2. Least Recently Used (LRU): remove the cache line which has gone unaccessed the longest.
3. Oldest (FIFO, First-In-First-Out): remove the cache line that has been there the longest.

Actual Caches Today are N-way Set Associative

This is like a good-news / bad-news joke.

Good news: this keeps one specific block of memory from hogging all the cache lines.

Bad news: a block that your program uses much more heavily than the other blocks is still limited to the N cache lines assigned to its set, so it over-uses those N lines while the cache lines assigned to the other blocks go under-used.

Actual Cache Architecture

Let's try some reasonable numbers. Assume there is 1 GB of main memory, 512 KB of L2 cache, and 64-byte cache lines in a 16-way set associative L2 cache. This means that there are

    512 KB / 64 bytes = 8192 cache lines

available all together in the cache. The 16-way means that there are 16 cache lines per set, so the cache must be divided into

    8192 / 16 = 512 sets,

which makes each memory block need to contain

    1 GB / 512 = 1,073,741,824 / 512 = 2,097,152 bytes = 2 MB

of main memory. What does this mean to you?

Actual Cache Architecture: Take-Home Message

A 2 MB memory block could fill

    2,097,152 / 64 = 32,768

cache lines, but it only has 16 available. If your program is using data from just one block, and they are coming from all over the block and going back and forth, you will have many cache misses, and those 16 cache lines will keep being re-loaded and re-loaded. Performance will suffer.

If your program is using data from just one block, and they are coming from just a (16 cache lines) * (64 bytes/cache line) = 1024-byte portion of the block, those 16 cache lines will never need to be re-loaded. Your cache hit rate will be 100%. Performance will be much better.

Expecting the 1024-byte situation is somewhat unrealistic. However, if you can arrange it so that most of the time you are walking through your data in order, then at least you will not have nearly as many cache misses. It is the jumping around in memory that kills you. What does this tell you about linked lists?

How Bad Is It? Demonstrating the Cache-Miss Problem

C and C++ store 2D arrays a row at a time, like this:

    [ i ][ j ]:
     0  1  2  3  4
     5  6  7  8  9
    10 11 12 13 14
    15 16 17 18 19
    20 21 22 23 24

    float f = Array[ i ][ j ];

For large arrays, would it be better to process the elements by row, or by column? Which will avoid the most cache misses?

Demonstrating the Cache-Miss Problem

    #include <stdio.h>
    #include <ctime>
    #include <cstdlib>

    #define N 10000

    float Array[N][N];

    double Time( );

    int main( int argc, char *argv[ ] )
    {
        float sum = 0.;
        double start = Time( );
        for( int i = 0; i < N; i++ )
            for( int j = 0; j < N; j++ )
                sum += Array[i][j];        // access across a row
        double finish = Time( );
        double row_secs = finish - start;

Demonstrating the Cache-Miss Problem (continued)

        sum = 0.;
        start = Time( );
        for( int j = 0; j < N; j++ )
            for( int i = 0; i < N; i++ )
                sum += Array[i][j];        // access down a column
        finish = Time( );
        double col_secs = finish - start;

        fprintf( stderr, "N = %5d ; By rows = %lf ; By cols = %lf\n", N, row_secs, col_secs );
        return 0;
    }

    double Time( )
    {
        return (double)clock( ) / CLOCKS_PER_SEC;
    }

Time, in seconds, to compute the array sums, based on row-first versus column-first order:

    N        Col Time   Row Time
    10000      0.87       0.36
    15000      1.94       0.84
    20000      3.47       1.49
    25000      5.45       2.32
    30000      7.85       3.34
    35000     10.71       4.56
    40000     14.69       5.94
    42500     17.12       6.70
    45000     20.16       7.51
    47500     20.99       8.36
    50000     25.61       9.27
    52500     29.99      10.22
    55000     52.12      11.23
    60000     80.95      13.33
    65000    156.40      15.71

[Graph: time in seconds versus N (thousands), 0-70; the Col Time curve rises much faster than the Row Time curve.]

Array-of-Structures vs. Structure-of-Arrays

Array of structures:

    struct xyz
    {
        float x, y, z;
    }
    Array[N];

    Memory layout: X0 Y0 Z0 X1 Y1 Z1 X2 Y2 Z2 X3 Y3 Z3 ...

Structure of arrays:

    float X[N], Y[N], Z[N];

    Memory layout: X0 X1 X2 X3 ... Y0 Y1 Y2 Y3 ... Z0 Z1 Z2 Z3 ...

1. Which is a better use of the cache if we are going to be using X-Y-Z triples a lot?
2. Which is a better use of the cache if we are going to be looking at all X's, then all Y's, then all Z's?

Sometimes Good Object-Oriented Programming Style is Inconsistent with Good Cache Use:

    class xyz
    {
      public:
        float x, y, z;
        xyz *next;
        xyz( );
    };

    static xyz *Head = NULL;

    xyz::xyz( )
    {
        next = Head;
        Head = this;
    }

This is good OO style: it encapsulates and isolates the data for this class. But once you have created a linked list whose elements could be all over memory, is it the best use of the cache?

It might be better to create a large array of xyz structures and then have the constructor pull new ones from that array. That would keep many of the elements close together while preserving the flexibility of the linked list. When you need more, allocate another large array and link to it.

What is Brian Apgar's Complaint About Cache?

    int ( *funcptr[ ] )( void ) = { func0, func1, func2 };

    for( int k = 0; k < MAX; k++ )
    {
        int i = ( funcptr[ j ] )( );
    }

A function jump table requires two instruction cache lines:
1. The for-loop statements
2. The function that really gets called

and one data cache line:
1. Your funcptr array

Because you create this table, it is in data memory.

The PS2, PSP, and Xbox 360 have 2-way associative caches. Brian talks about a tight loop of indirect function calls being a problem for these systems. But those systems have 2-way associative caches for both the instructions and the data, so there should be enough room. What, then, does Brian warn us about?

A Class That Inherits from a Virtual Class Uses a Jump Table ("vtable") that is Contained in Instruction Memory

    class Bird
    {
      public:
        virtual void f( );
    };

    class Raven : public Bird
    {
      public:
        void f( );        // overrides Bird::f( )
    };

    Bird *b = new Bird( );
    Raven *r = new Raven( );
    Bird *birds[2] = { b, r };

    for( int k = 0; k < MAX; k++ )
        for( int i = 0; i < 2; i++ )
            birds[i]->f( );

So that the correct f( ) can be called at runtime, each class has a virtual function table, or vtable, containing a pointer to its own version of f( ). Because this vtable is built by the compiler, it ends up living where the compiler wants it, not where you want it. So, the virtual function jump table requires three instruction cache lines:
1. The for-loop statements
2. The f( ) function that really gets called
3. The virtual function table

The PS2 and PSP have 2-way associative caches. If all three of these items are in the same memory block, they will attempt to use the same set in the cache. But there are only two cache lines in each set, so every pass through the k for-loop will cause a cache miss. (The Wii has an 8-way associative cache, so there is no similar problem there.)