Memory hierarchies: caches and their impact on the running time


Irene Finocchi, Dept. of Computer Science, Sapienza University of Rome

A happy coincidence

A fundamental property of hardware: different storage technologies have widely different access times. Faster technologies cost more per byte than slower ones and have less capacity. The gap between CPU and main memory speed is widening.

A fundamental property of software: well-written programs tend to exhibit good locality.

The two properties complement each other!

The memory hierarchy

From smaller, faster, and costlier (per byte) storage devices at the top to larger, slower, and cheaper (per byte) storage devices at the bottom:

L0: CPU registers (hold words retrieved from cache memory)
L1: L1 cache, SRAM (holds cache lines retrieved from the L2 cache)
L2: L2 cache, SRAM (holds cache lines retrieved from the L3 cache)
L3: L3 cache, SRAM (holds cache lines retrieved from main memory)
L4: Main memory, DRAM (holds disk blocks retrieved from local disks)
L5: Local secondary storage, local disks (hold files retrieved from disks on remote network servers)
L6: Remote secondary storage (distributed file systems, Web servers)

Caching in the memory hierarchy

A cache is a small, fast storage device that acts as a staging area for the data objects stored in a larger, slower device. For each k, the faster and smaller storage at level k serves as a cache for the larger and slower storage device at level k+1.

Basic principle of caching

The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1; data is copied between levels in block-sized transfer units. The larger, slower, cheaper storage device at level k+1 is partitioned into blocks. In the figure, level k+1 holds blocks 0-15, and level k currently caches copies of blocks 4, 9, 14, and 3.

Who is responsible for caching?

Cache type       What is cached       Managed by
CPU registers    4- or 8-byte words   Compiler
L1 cache         64-byte blocks       Hardware
L2 cache         64-byte blocks       Hardware
L3 cache         64-byte blocks       Hardware
Virtual memory   4-KB pages           Hardware + OS

Can we exploit caches?

The same principles seen in the external memory model apply: increase spatial/temporal locality. But caches in the memory hierarchy are far more complex:
1. How are the parameters chosen?
2. Management is automatic, in hardware: how can we control which data are cached?

Still, programmers can write cache-friendly code, and caches can have a big impact.

A simple experiment

Copy a 2048 x 2048 array of 32-bit integers:
copyij: copy by rows
copyji: copy by columns

Array copy: row-major order

Access by rows:

void copyij(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    for (i = 0; i < 2048; i++)
        for (j = 0; j < 2048; j++)
            dst[i][j] = src[i][j];
}

Array copy: column-major order

Access by columns:

void copyji(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    for (j = 0; j < 2048; j++)
        for (i = 0; i < 2048; i++)
            dst[i][j] = src[i][j];
}

Performance analysis

copyij and copyji differ only in their data access patterns: copyij accesses the arrays by rows, copyji by columns. On an Intel Core i7 at 2.7 GHz, copyij takes 5.2 msec while copyji takes 162 msec (about 30x slower!). copyij makes better use of (spatial) locality.
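
To make the experiment concrete, here is a minimal timing harness (a sketch, not the original measurement code: it assumes a POSIX system with clock_gettime, and the exact numbers will vary with compiler, flags, and hardware):

#include <stdio.h>
#include <time.h>

#define N 2048
static int src[N][N], dst[N][N];

/* The two copy routines from the previous slides. */
void copyij(int src[N][N], int dst[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[i][j] = src[i][j];
}

void copyji(int src[N][N], int dst[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            dst[i][j] = src[i][j];
}

/* Elapsed wall-clock time between a and b, in milliseconds. */
static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    copyij(src, dst);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("copyij: %6.1f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    copyji(src, dst);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("copyji: %6.1f ms\n", elapsed_ms(t0, t1));

    printf("%d\n", dst[N-1][N-1]);  /* keep the copies observable */
    return 0;
}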

Is this due to memory access times?

General cache organization

Caches are automatically managed by hardware, so we need to understand how they work in order to exploit them.

General cache organization (S, E, B)

A cache is an array of S = 2^s sets. Each set contains E = 2^e lines. Each line holds a block of B = 2^b data bytes, together with a valid bit and a tag.

Cache size: C = S x E x B data bytes.

Cache read

The address of a word is split into three fields: | t tag bits | s set index bits | b block offset bits |. To read:
1. Locate the set using the s set index bits.
2. Check whether any of the E lines in the set has a matching tag.
3. If a line matches and its valid bit is set: hit.
4. Locate the data within the block: it begins at the offset given by the b block offset bits.
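
As a sketch, this decomposition can be written directly in C. The parameters below (64-byte blocks, 128 sets) are assumed example values, not those of any specific machine:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

#define B_BITS 6   /* b: block size B = 2^6 = 64 bytes (assumed) */
#define S_BITS 7   /* s: S = 2^7 = 128 sets (assumed)            */

int main(void) {
    uint64_t addr = 0x7fffdeadbeefULL;  /* an arbitrary example address */

    uint64_t offset = addr & ((1ULL << B_BITS) - 1);              /* low b bits       */
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);  /* next s bits      */
    uint64_t tag    = addr >> (B_BITS + S_BITS);                  /* remaining t bits */

    printf("tag=%" PRIx64 " set=%" PRIu64 " offset=%" PRIu64 "\n", tag, set, offset);
    return 0;
}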

Example: direct mapped cache (E = 1)

Direct mapped means one line per set. Assume a cache block size of 8 bytes and consider the address of an int whose bits split as | t tag bits | set index 01 | block offset 100 |:
1. The set index bits (01) locate the set.
2. The valid bit and the tag of that set's single line are checked; assume both match: hit.
3. The block offset (100) locates the int (4 bytes) within the 8-byte block.
If the tag does not match, the old line is evicted and replaced.

Direct-mapped cache simulation

M = 16 byte addresses, split as | t = 1 tag bit | s = 2 set index bits | b = 1 offset bit |; B = 2 bytes/block, S = 4 sets, E = 1 block/set.

Address trace (reads, one byte per read):
0 [0000₂]: miss
1 [0001₂]: hit
7 [0111₂]: miss
8 [1000₂]: miss
0 [0000₂]: miss
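
This trace can be replayed in a few lines of C (a sketch mirroring the slide's parameters: bit 0 is the offset, bits 1-2 the set index, bit 3 the tag):

#include <stdio.h>

#define S 4   /* sets (s = 2 bits); E = 1 line/set, B = 2 bytes (b = 1 bit) */

int main(void) {
    int valid[S] = {0}, tag[S] = {0};
    int trace[] = {0, 1, 7, 8, 0};   /* byte addresses from the slide */

    for (int i = 0; i < 5; i++) {
        int a   = trace[i];
        int set = (a >> 1) & 0x3;    /* middle s = 2 bits  */
        int t   = a >> 3;            /* high t = 1 tag bit */
        if (valid[set] && tag[set] == t)
            printf("%2d: hit\n", a);
        else {
            printf("%2d: miss\n", a);
            valid[set] = 1;          /* load block, possibly evicting the old line */
            tag[set]   = t;
        }
    }
    return 0;
}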

Why indexing with the middle bits?

Consider a 4-set cache and 16 block addresses, 0000-1111. With high-order bit indexing, the set index comes from the two high bits, so contiguous blocks all map to the same set: scanning a contiguous array exercises one set at a time and leaves the rest of the cache idle. With middle-order bit indexing, the set index comes from the middle bits, so consecutive blocks map to different sets and a contiguous region can occupy the entire cache at once.

E-way set associative cache (E = 2: two lines per set)

Assume a cache block size of 8 bytes and consider the address of a short int whose bits split as | t tag bits | set index 01 | block offset 100 |:
1. The set index bits locate the set, which now contains two lines.
2. The valid bits and tags of both lines are compared; if one line is valid and its tag matches: hit.
3. The block offset locates the short int (2 bytes) within the block.
If no line matches, one line in the set is selected for eviction and replacement. Replacement policies: random, least recently used (LRU), ...

2-way set associative cache simulation

M = 16 byte addresses, split as | t = 2 tag bits | s = 1 set index bit | b = 1 offset bit |; B = 2 bytes/block, S = 2 sets, E = 2 blocks/set.

Address trace (reads, one byte per read):
0 [0000₂]: miss
1 [0001₂]: hit
7 [0111₂]: miss
8 [1000₂]: miss
0 [0000₂]: hit
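
A sketch of the same trace with E = 2 and LRU replacement (parameters as on the slide; lru[s] records which way to evict next). Note how associativity turns the final access to address 0 into a hit:

#include <stdio.h>

#define S 2   /* sets (s = 1 bit) */
#define E 2   /* lines per set; B = 2 bytes (b = 1 bit), t = 2 tag bits */

int main(void) {
    int valid[S][E] = {{0}}, tag[S][E] = {{0}};
    int lru[S] = {0};                /* lru[s]: way to evict next in set s */
    int trace[] = {0, 1, 7, 8, 0};

    for (int i = 0; i < 5; i++) {
        int a   = trace[i];
        int set = (a >> 1) & 0x1;
        int t   = a >> 2;
        int hit = 0;
        for (int w = 0; w < E; w++)
            if (valid[set][w] && tag[set][w] == t) {
                hit = 1;
                lru[set] = 1 - w;    /* the other way becomes LRU */
            }
        if (!hit) {                  /* miss: evict the LRU way   */
            int w = lru[set];
            valid[set][w] = 1;
            tag[set][w]   = t;
            lru[set] = 1 - w;
        }
        printf("%2d: %s\n", a, hit ? "hit" : "miss");
    }
    return 0;
}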

Fully associative caches

Recall C = S x E x B. Here S = 1 and thus E = C/B: all lines are in the same set. Larger associativity implies longer access times, but fewer misses.

Intel Core i7 cache hierarchy

Each core in the processor package has its own registers, L1 caches, and L2 cache; the L3 cache is shared by all cores:
- L1 i-cache and d-cache: 32 KB, 8-way. Access: 4 cycles.
- L2 unified cache: 256 KB, 8-way. Access: 11 cycles.
- L3 unified cache (shared by all cores): 8 MB, 16-way. Access: 30-40 cycles.
Block size: 64 bytes for all caches.
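
On Linux with glibc, these parameters can be queried at run time (a sketch; the _SC_LEVEL* sysconf names are glibc extensions, not portable, and may report 0 or -1 where unavailable):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1 d-cache: %ld bytes, line %ld bytes\n",
           sysconf(_SC_LEVEL1_DCACHE_SIZE),
           sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 cache:   %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:   %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}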

Kinds of cache misses

Cold (or compulsory) misses: happen when the cache is empty (cold); they do not occur once the cache has been warmed up.
Conflict misses: the cache is large enough, but multiple addresses map to the same set. More frequent for smaller associativity.
Capacity misses: the working set is larger than the cache size.

Conflict misses

float dotprod(float x[8], float y[8])
{
    int i;
    float sum = 0.0;
    for (i = 0; i < 8; i++)
        sum += x[i]*y[i];
    return sum;
}

Assume S = 2, direct-mapped cache (E = 1), B = 16 bytes (4 floats). With x stored at addresses 0-28 and y immediately after it at addresses 32-60:

Element  Address       Set index    Element  Address       Set index
x[0]     0  (000000₂)     0         y[0]     32 (100000₂)     0
x[1]     4  (000100₂)     0         y[1]     36 (100100₂)     0
x[2]     8  (001000₂)     0         y[2]     40 (101000₂)     0
x[3]     12 (001100₂)     0         y[3]     44 (101100₂)     0
x[4]     16 (010000₂)     1         y[4]     48 (110000₂)     1
x[5]     20 (010100₂)     1         y[5]     52 (110100₂)     1
x[6]     24 (011000₂)     1         y[6]     56 (111000₂)     1
x[7]     28 (011100₂)     1         y[7]     60 (111100₂)     1

x[i] and y[i] always map to the same set: thrashing! Adding B bytes of padding at the end of each array fixes the problem; the problem also disappears if E > 1.
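
A sketch of the padding fix: with x and y laid out contiguously as in the table, one block (B = 16 bytes = 4 floats) of padding after x shifts every y[i] into the other set relative to x[i]. PAD and the struct layout below are illustrative assumptions:

#include <stdio.h>

#define PAD 4  /* one block = 16 bytes = 4 floats of padding (assumed B) */

/* The pad shifts y by one block, so y[i] and x[i] now map to
 * different sets: y starts at byte 48 (set 1) instead of 32 (set 0). */
static struct {
    float x[8 + PAD];
    float y[8];
} a;

float dotprod_padded(void) {
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += a.x[i] * a.y[i];  /* x[i] and y[i] in different sets: no thrashing */
    return sum;
}

int main(void) {
    for (int i = 0; i < 8; i++) { a.x[i] = 1.0f; a.y[i] = 2.0f; }
    printf("%f\n", dotprod_padded());
    return 0;
}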

Credits

Computer Systems: A Programmer's Perspective. Randal Bryant and David O'Hallaron. Pearson, second edition, 2010.