Memory Hierarchy and Cache Ch 4-5. Memory Hierarchy Main Memory Cache Implementation


Teemu's Cheesecake (Fig. 4.1): register, on-chip cache, memory, disk, and tape speeds, scaled to the time it takes to locate cheese for the cheesecake you are baking:
- hand: 0.5 sec (register)
- table: 1 sec (cache)
- refrigerator: 10 sec (memory)
- moon: 12 days (disk)
- Europa, a moon of Jupiter: 4 years (tape)

Goal: I want my memory lightning fast, and I want my memory to be gigantic in size.
Register access viewpoint: data access as fast as a HW register, data size as large as memory. HW solution: cache.
Memory access viewpoint: data access as fast as memory, data size as large as disk. HW help for a SW solution: virtual memory.

Memory Hierarchy (Fig. 4.1): the most often needed data is kept close to the processor. Access to small data sets can be made fast with simpler circuits, but faster is more expensive; a large memory can be bigger and cheaper (per byte).
Moving up the memory hierarchy: smaller, faster, more expensive, more frequent access.
Moving down: bigger, slower, less expensive, less frequent access.

Principle of locality (paikallisuus): in any given time period, memory references occur only to a small subset of the whole address space. This is the reason why memory hierarchies work (Fig. 4.2).
Example: Prob(small data set) = 99%, Prob(the rest) = 1%; Cost(small data set) = 2 µs, Cost(the rest) = 20 µs.
Average cost = 99% * 2 µs + 1% * 20 µs = 2.18 µs, i.e., about 2.2 µs - close to the cost of the small data set alone.
How do we determine which data belongs to that small set, and how do we keep track of it?
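The average-cost arithmetic above can be sketched as a one-line weighted average; the probabilities and costs are the slide's example numbers, not measurements:

```python
def average_cost(p_small, cost_small_us, cost_rest_us):
    """Weighted average access cost: p_small of references hit the
    frequently used small data set, the rest go to slower storage."""
    return p_small * cost_small_us + (1.0 - p_small) * cost_rest_us

# The slide's example: 99% of references cost 2 us, 1% cost 20 us.
avg = average_cost(0.99, 2.0, 20.0)
print(f"{avg:.2f} us")  # 2.18 us, i.e. ~2.2 us as on the slide
```

The same formula is the basis of average-memory-access-time analysis for caches: hit rate, hit cost, and miss cost take the roles of the three parameters.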

Principle of locality (paikallisuus): in any given time period, memory references occur only to a small subset of the whole address space.
Temporal locality (ajallinen paikallisuus): a data item referenced a short time ago is likely to be referenced again soon.
Spatial locality (alueellinen paikallisuus): data items close to the one referenced a short time ago are likely to be referenced soon.

Memory. Random access semiconductor memory: give address and control, then read/write data.
ROM, PROMs, Flash: system startup memory, BIOS (Basic Input/Output System); loads and executes the OS at boot; also random access.
RAM: normal memory accessible by the CPU. Table 5.1 (Table 4.2 [Stal99]).

RAM. Dynamic RAM (DRAM): simpler, slower, denser, bigger (more bytes per chip); periodic refreshing required, and a refresh is also required after each read; used as main memory. E.g., 60 ns access, $0.12 / MB (year 2001).
Static RAM (SRAM): more complex (more chip area per byte), faster, smaller (fewer bytes per chip); no periodic refreshing needed, data remains until power is lost; used as cache. E.g., 5 ns access, $0.70 / MB (year 2001).

16 Mb DRAM access (Fig. 5.3, Fig. 5.4(b); Fig. 4.4, Fig. 4.5(b) [Stal99]): 4-bit data items, 4M data elements arranged in a 2K * 2K square. The 22-bit address is interleaved on 11 address pins: first the row access select (RAS), then the column access select (CAS).
Simultaneous access to many 16 Mb memory chips yields larger data items: to access 8-bit words in parallel, 8 chips are needed. Fig. 5.5 (Fig. 4.6 [Stal99]).
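The RAS/CAS multiplexing above amounts to splitting the 22-bit cell address into two 11-bit halves that share the same pins; a minimal sketch (the example address is arbitrary):

```python
def split_dram_address(addr):
    """Split a 22-bit DRAM cell address into the 11-bit row and 11-bit
    column halves that are multiplexed over 11 address pins:
    the row half is latched on RAS, the column half on CAS."""
    assert 0 <= addr < 1 << 22
    row = addr >> 11      # upper 11 bits, sent first with RAS
    col = addr & 0x7FF    # lower 11 bits, sent next with CAS
    return row, col

row, col = split_dram_address(0x2AB123)
print(row, col)  # 1366 291
```

Halving the pin count this way is why DRAM packages are small and cheap, at the cost of a two-phase (row, then column) access.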

SDRAM (Synchronous DRAM): 16 bits accessed in parallel from 4 DRAMs (4 bits each). The CPU clock also synchronizes the bus; there is no separate bus clock. The CPU knows how long a reference takes, so it can do other work while waiting. Faster than plain DRAM; the current main memory technology (year 2001). E.g., $0.11 / MB (year 2001).

RDRAM (Rambus DRAM): a new technology that works with a fast memory bus, but is expensive - e.g., $0.40 / MB (year 2001)? Faster transfer rate than SDRAM (e.g., 1.6 GB/s vs. 200 MB/s?) and faster access (e.g., 38 ns vs. 44 ns).
A fast internal Rambus channel (800 MHz); a Rambus memory controller connects it to the bus. Speed slows down when many memory modules are serially connected on the Rambus channel, so it is not good for servers with 1 GB of memory (for now!). About 5% of memory chips (year 2000), 12% by 2005?

Flash memory. Original invention: Fujio Masuoka, Toshiba Corp., 1984. Non-volatile: data remains with power off; slow to write ("program").
NAND Flash, 1987, also by Fujio Masuoka: lowers the wiring per bit to one-eighth that of the original flash memory.

Intel ETOX Flash (Intel, 1997): a single transistor with the addition of an electrically isolated polysilicon floating gate capable of storing charge (electrons). The negatively charged electrons act as a barrier between the control gate and the floating gate; depending on the charge on the floating gate (more or less than 50%), the cell reads as 1 or 0. Data is read/written in small blocks: a high voltage is used to write, and Fowler-Nordheim tunneling to clear. http://developer.intel.com/technology/itj/q41997/articles/art_1.htm

Intel StrataFlash: a flash cell is analog, not digital, storage. Different charge levels are used to store 2 bits (or more!) of data in each flash cell. http://developer.intel.com/technology/itj/q41997/articles/art_1.htm

Flash implementations: BIOS (PCs, phones, other hand-held devices...), Toshiba SmartMedia (2-256 MB), Sony Memory Stick (2-1024 MB), CompactFlash (8-512 MB), PlayStation 2 Memory Card (8 MB), MMC - MultiMedia Card (32-128 MB), Fuji xD Picture Card (32-256 MB), hand-held phone memories.


Cache Memory (välimuisti). Problem: how can I make my (main) memory as fast as my registers? Answer: a (processor) cache keeps the most probably referenced data in a fast cache close to the processor, and the rest of it in memory. The cache is much smaller than main memory and (much) more expensive per byte, but most data accesses - 90%... 99%? - go only to the cache. Fig. 4.3 & 4.6 (Fig. 4.13 & 4.16 [Stal99]).

Memory references with cache (Fig. 4.4 & 4.5; Fig. 4.14 & 4.15 [Stal99]): if the data is in the cache, the access is a hit. If the data is only in memory, it is a miss: the block is read into the cache while the CPU waits until the data is available.
Many blocks (cache lines) help temporal locality: many different data items fit in the cache. Large blocks help spatial locality: lots of nearby data is available. With a fixed cache size, you must choose many or large - you cannot have both!

Cache features:
- Size
- Mapping function (kuvausfunktio): how to find data in the cache?
- Replacement algorithm (poistoalgoritmi): which block to remove to make room for a new block?
- Write policy (kirjoituspolitiikka): how to handle writes?
- Line size, i.e., block size (rivin tai lohkon koko)
- Number of caches? Types of caches?

Cache size: bigger is better in general, but bigger may also be slower - lots of gates means cumulative gate delay, so too big might be too slow! The remedy is 2- or 3-level caches: CPU and registers, L1 on the CPU chip, then L2, L3, and main memory. 1 KW (4 KB)? 128 MW (512 MB)?

Mapping: memory address. The Alpha AXP issues 34-bit (byte) memory addresses, split into a block address (29 bits) and a block offset (5 bits). The block address is used to locate the block in the cache; on a cache hit, the block offset controls a multiplexer that selects the right word.
Cache line size = block size = 2^5 = 32 bytes = 4 words. Max physical address space = 2^34 = 16 GB; number of possible blocks in the physical address space = 2^29 = 512M blocks.

Mapping: given a memory block address, is that block in the cache, and where in the cache is it? Three solution methods: direct mapping, fully associative mapping, and set associative mapping.

Direct mapping (suora kuvaus): every block has only one possible location (cache line number) in the cache, determined by the index field bits. The 34-bit byte address splits into tag (21 bits), index (8 bits), and block offset (5 bits). The tag holds the unique bits that differ between blocks mapping to the same line, and it is stored in each cache line. Cache line size = block size = 2^5 = 32 bytes; cache size = 2^8 = 256 blocks = 8 KB. Fig. 4.7 (s = 29, r = 8, w = 5) (Fig. 4.17 [Stal99]); Fig. 7.10 [PaHe98].
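The field split above is pure bit manipulation; a minimal sketch using the slide's 21/8/5-bit layout (the example address is arbitrary):

```python
def decompose(addr):
    """Split a 34-bit byte address into (tag, index, offset) using the
    slide's direct-mapped layout: 21-bit tag, 8-bit index, 5-bit offset."""
    assert 0 <= addr < 1 << 34
    offset = addr & 0x1F           # 5 bits: byte within the 32-byte block
    index  = (addr >> 5) & 0xFF    # 8 bits: which of the 256 cache lines
    tag    = addr >> 13            # remaining 21 bits, stored in the line
    return tag, index, offset

print(decompose(0x12345))  # (9, 26, 5)
```

The hardware does the same thing with wires: the index bits address the cache RAM directly, and the stored tag is compared with the address's tag bits.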

Direct mapping example: word = 4 bytes here, 8-bit byte addresses, cache line size = block size = 2^3 = 8 bytes = 64 bits, fields tag | index | offset = 2 | 3 | 3 bits.
ReadW I2, 0xA4: 0xA4 = 1010 0100, so tag = 10, index = 100, offset = 100. Cache line 100 holds tag 11: no match. Read the new memory block from memory address 0xA0 = 1010 0000 into cache line 100, update the tag, and then continue with the data access.

Direct mapping example 2: ReadW I2, 0xB4: 0xB4 = 1011 0100, so tag = 10, index = 110, offset = 100. Cache line 110 holds tag 10: match! Start with the 4th byte (offset 100) of the line's data, 00 11 22 33 44 55 66 77.
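The two walk-throughs can be replayed with a tiny simulator. This is a sketch, not the slides' hardware: it tracks only tags (no data bytes), and the initial tag contents below are set to mirror the examples.

```python
class DirectMappedCache:
    """Toy direct-mapped cache for 8-bit addresses with the slides'
    field layout tag | index | offset = 2 | 3 | 3 bits (8 lines)."""

    def __init__(self):
        self.tags = [None] * 8        # one stored tag per cache line

    def access(self, addr):
        """Return True on a hit; on a miss, refill the line and return False."""
        index = (addr >> 3) & 0b111   # middle 3 bits pick the line
        tag = addr >> 6               # top 2 bits must match the stored tag
        if self.tags[index] == tag:
            return True               # hit: the offset bits select the byte
        self.tags[index] = tag        # miss: fetch the block, update the tag
        return False

cache = DirectMappedCache()
cache.tags[0b100] = 0b11    # example 1: line 100 holds some other block
cache.tags[0b110] = 0b10    # example 2: line 110 holds the block for 0xB4
print(cache.access(0xA4))   # False: tag 10 vs stored 11 -> miss, refill
print(cache.access(0xB4))   # True: tag 10 matches line 110 -> hit
```

Repeating the 0xA4 access after the miss hits, which is temporal locality paying off.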

Fully associative mapping (täysin assosiatiivinen kuvaus): every block can be in any cache line, so the tag must be the complete block address - the bits unique to each block. The 34-bit byte address splits into tag (29 bits) and block offset (5 bits). Cache line size = block size = 2^5 = 32 bytes. A cache line can be anywhere, and the cache size can be any number of blocks. Fig. 4.9 (s = 29, w = 5) (Fig. 4.19 [Stal99]).

Fully associative example: ReadW I2, 0xB4: 0xB4 = 1011 0100, so tag = 10110, offset = 100 (fields tag | offset = 5 | 3 bits). Every cache line's tag is compared with 10110 simultaneously, and the results are combined with an OR ("match or..."); the line with stored tag 10110 matches, and the access proceeds from its data.

Fully associative mapping needs lots of circuits: the tag fields are long (wasted space!), and each cache line tag must be compared simultaneously with the memory address tag - lots of wires and lots of comparison circuits, a large surface area on the chip. The final OR over the comparison results ("did any of these 64 comparisons match?") also has a large gate delay: log2(64) = 6 levels of binary OR gates, and with 262144 comparisons, 18 levels? It can be used only for small caches.

Set associative mapping (joukkoassosiatiivinen kuvaus): with set size k = 2, every block has 2 possible locations in the cache. The possible location of a block is determined by the set (index) field bits. The 34-bit byte address splits into tag (22 bits), set index (7 bits), and block offset (5 bits); the tag bits are unique for each block and are stored in each cache line. Cache line size = block size = 2^5 = 32 bytes; number of sets v = 2^7 = 128, i.e., 128 blocks = 4 KB per way; total cache size vk = 2 * 4 KB = 8 KB. Fig. 5.8 [HePa96]; Fig. 7.19 [PaHe98]; Fig. 4.11 (confusing, complex?) (e.g., k = 2, s = 29, d = 7, w = 5) (Fig. 4.21 [Stal99]).

Two definitions for "set" in set associative mapping:
- [Stal03], [Stal99]: a set is the set of all possible locations where the referenced memory block can be; the set field of the memory address determines this set.
- [HePa96], [PaHe98]: the cache memory is split into multiple sets, and the referenced memory block can be in only one location in each set; the index field of the memory address determines that possible location in each set.

[Figure: Stallings sets vs. Hennessy-Patterson sets. In the Stallings view there are many small sets (set 0, set 1, ..., set 15, here 4 blocks each), and memory block i can be in any location of its set i. In the Hennessy-Patterson view there are a few large sets (set 0 ... set 3, here 16 blocks each), and index i picks one possible location in each set.]

2-way set associative cache (Stallings sets, shown in Hennessy-Patterson order): set size 2, i.e., 2 cache lines per set; 4 sets, so 2 bits of set index; 8-byte cache lines, so 3 bits for the byte address within a line; 3-bit tag. Each cache line stores a 3-bit tag plus 64 bits of data, and each set has a 1st and a 2nd line.

Set associative example: ReadW I2, 0xB4: 0xB4 = 1011 0100, so tag = 101, set = 10, offset = 100 (fields tag | set | offset = 3 | 2 | 3 bits). Both lines of set 10 are compared in parallel and the results ORed; the 2nd line of set 10 holds tag 101: match, data 00 11 22 33 44 55 66 77.
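The parallel tag comparison within a set can also be sketched in a few lines. As before this is a toy model with the slides' 3/2/3-bit field layout; the replacement choice on a miss (fill an empty way, else evict way 0) is a simplification I have assumed, not something the slides specify.

```python
class TwoWayCache:
    """Toy 2-way set associative cache for 8-bit addresses with the
    slides' field layout tag | set | offset = 3 | 2 | 3 bits (4 sets)."""

    def __init__(self):
        self.sets = [[None, None] for _ in range(4)]  # two tag slots per set

    def access(self, addr):
        """Return True on a hit; on a miss, refill one way and return False."""
        set_index = (addr >> 3) & 0b11
        tag = addr >> 5
        ways = self.sets[set_index]
        if tag in ways:
            return True                   # hit in one of the two ways
        # Miss: fill an empty way if there is one, else evict way 0
        # (a trivial stand-in for a real replacement algorithm).
        victim = ways.index(None) if None in ways else 0
        ways[victim] = tag
        return False

cache = TwoWayCache()
cache.sets[0b10][1] = 0b101   # slide example: 2nd line of set 10 holds tag 101
print(cache.access(0xB4))     # True: 0xB4 -> tag 101, set 10 -> hit
```

Compared with the direct-mapped sketch, two blocks with the same set index can now coexist, which is exactly the collision relief set associativity buys.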

Set associative mapping: a set associative cache with set size 2 is called a 2-way cache. The degree of associativity v is usually 2 - should v be large? (Fig. 7.16 [PaHe98]) More data items (v) in one set means fewer collisions within a set, but the final comparison (do the tags match?) has a larger gate delay. With v at its maximum (the number of cache lines) this becomes fully associative mapping; with v at its minimum (1) it becomes direct mapping.

Replacement algorithm: which cache block (line) should be removed to make room for a new block from memory? In the direct mapping case the choice is trivial. Candidates: First-In-First-Out (FIFO), Least-Frequently-Used (LFU), Random. Which one is best? Consider chip area, speed, and ease of implementation.

Write policy: how to handle writes to memory?
Write through (läpikirjoittava): each write always goes to memory, so every write costs as much as a cache miss!
Write back (lopuksi kirjoittava, takaisin kirjoittava?): a cache block is written to memory only when it is replaced in the cache, so memory may hold stale (old) data - the cache coherence problem (välimuistin yhteneväisyysongelma).
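The traffic difference between the two policies can be made concrete with a deliberately simple model I have assumed for illustration: a stream of writes that all hit the same cached block, counting how many of them reach memory.

```python
def write_through_traffic(n_writes):
    """Write through: every write is propagated to memory immediately,
    so memory sees one write per store instruction."""
    return n_writes

def write_back_traffic(n_writes):
    """Write back: the writes dirty the cached block, and the block is
    written to memory once, when it is eventually evicted."""
    return 1  # assumes all n_writes hit the same block before eviction

print(write_through_traffic(100), write_back_traffic(100))  # 100 1
```

The model overstates write back's advantage (real workloads touch many blocks, and evictions of clean blocks cost nothing either way), but it shows why write back reduces memory-bus traffic and why it creates the stale-data problem: memory lags the cache until the eviction.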

Line size: how big should a cache line be? Do we optimise for temporal or spatial locality? A bigger cache line is better for spatial locality; more cache lines are better for temporal locality. Data references and code references behave in different ways, and the best size varies with the program or program phase. 2-8 words? Word = 1 float??

Number and types of caches: one cache is too large for best results. Unified vs. split cache (yhdistetty, erilliset): the same cache for data and code, or not? A split cache can optimise its structure separately for data and for code. Multiple levels of caches: L1 on the same chip as the CPU; L2 in the same package or on the same chip as the CPU (in older systems, on the same board); L3 on the same board as the CPU. Fig. 4.13 (Fig. 4.23 [Stal99]).

-- End of Ch. 4-5: Cache Memory --
http://www.intel.com/procs/servers/feature/cache/unique.htm
"The Pentium Pro processor's unique multi-cavity chip package brings L2 cache memory closer to the CPU, delivering higher performance for business-critical computing needs."