UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568/668


Part 11: Memory Hierarchy - I
Israel Koren
ECE568/Koren Part.11.1

Ideal Memory
- Infinitely large and very short access time. However, this is technologically infeasible.
- Solution: a memory hierarchy combining large, slow units with small, fast units.
- Levels in the hierarchy are characterized by:
  - Total capacity
  - Access time
  - Transfer rate (bandwidth)
  - Unit of transfer
  - Cost per byte
(Figure: a CPU with a big memory implies longer wires; with small memories, higher fan-out.)

Levels of the Memory Hierarchy

Level            Capacity          Access Time            Cost           Transfer unit (managed by)
CPU Registers    100s of bytes     < 1 ns                                4-8 bytes, operands (program/compiler)
Cache            10s-1000s KB      1-10 ns                $10/MByte      16-128 bytes, blocks (HW cache control)
Main Memory      10s-1000s MB      60-200 ns              $1/MByte       512-4K bytes, pages (OS)
Disk (HDD/SSD)   100s-1000s GB     10 ms (10,000,000 ns)  $0.0031/MByte

Upper levels are faster; lower levels are larger.

Power 7 On-Chip Caches
- 32KB L1 I$ per core and 32KB L1 D$ per core, 3-cycle latency
- 256KB unified L2$ per core, 8-cycle latency
- 32MB shared L3$ in embedded DRAM, 25-cycle latency

Major Properties
- Inclusion property: M_i ⊆ M_{i+1}, where M_i is the upper level, closer to the CPU.
- Coherence property: copies of an item in successive memory levels must be kept consistent. Two update methods:
  (i) Write-through - immediate update of M_{i+1}
  (ii) Write-back - delay the update of M_{i+1} until the item in M_i is replaced
- Replacement policies determine which block to evict.
(Figure: M_i, the upper level, holds a copy of Block X from M_{i+1}, the lower level; Block Y resides only in M_{i+1}.)

The Principle of Locality
- A program accesses a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array accesses).
(Figure: probability of reference plotted over the address space 0..2^n.)

Memory Hierarchy Technology
- Random Access Memory (RAM): the same access time for all locations.
  - DRAM (Dynamic Random Access Memory): high density, low power, cheap, but slow; "dynamic" because cells must be refreshed regularly. (Figure: a 1-T DRAM cell - a storage capacitor on the bit line, gated by the word line, sensed against V_REF.)
  - SRAM (Static Random Access Memory): low density, high power, expensive, fast; "static" because the content lasts as long as power is applied.
  - Main memory uses DRAMs; caches use SRAMs.
- Not-so-random access technology: access time varies from location to location and from time to time. Examples: disk, CD-ROM.
(Figure: layout of an SRAM cell vs. a DRAM cell.)

Memory Hierarchy - General Principles
- The hierarchy addresses two problems: the speed gap and memory size.
  - Cache <-> main memory: speed; the unit of transfer is a block.
  - Main memory <-> disk: capacity; the unit of transfer is a page.
- A computer system may have one of these or both.

Processor-Memory Gap (latency)
A four-issue 3GHz superscalar accessing 100ns DRAM could execute 1,200 instructions during the time of one memory access!
(Figure: clocks of latency to L1, L2, FSBus, DRAM, and HDD, normalized per core, as the number of cores grows from 1 to 32.)
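The 1,200-instruction figure is simple arithmetic on the slide's numbers; a quick sanity check in Python:

```python
# How many instructions a 4-issue, 3 GHz core could have executed
# while waiting out a single 100 ns DRAM access.
clock_hz = 3e9
issue_width = 4
dram_access_s = 100e-9

cycles_per_access = dram_access_s * clock_hz          # 300 cycles
lost_instructions = issue_width * cycles_per_access

print(round(lost_instructions))  # -> 1200
```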

Memory Hierarchy: Terminology
- Hit: the data appears in some block in the upper level (example: Block X).
- Hit_Rate: the fraction of memory accesses found in the upper level.
- Miss: the data must be retrieved from a block in the lower level (Block Y).
- Miss_Rate = 1 - Hit_Rate.
(Figure: the processor exchanges Block X with the upper-level memory; Block Y resides only in the lower-level memory.)

Average Memory Access Time (AMAT)
- Simplified expression:
  AMAT = Hit_Rate x T_upper + (1 - Hit_Rate) x T_lower
- A more precise expression:
  AMAT = Hit_Rate x Hit_Time + Miss_Rate x Miss_Penalty
  where
  Hit_Time = T_upper + time to determine hit/miss
  Miss_Penalty = time to determine hit/miss + time to replace a block in the upper level + time to deliver the data to the processor
- Hit_Time << Miss_Penalty (50-100 cycles on a miss in the cache; tens of thousands of cycles on a miss in main memory).
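The precise AMAT expression translates directly into code. A minimal sketch; the 95% hit rate, 1-cycle hit time, and 60-cycle miss penalty below are illustrative numbers, not from the slides:

```python
def amat(hit_rate: float, hit_time: float, miss_penalty: float) -> float:
    """Average memory access time in cycles (the slide's precise expression)."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * hit_time + miss_rate * miss_penalty

# Illustrative parameters: 95% hit rate, 1-cycle hit, 60-cycle miss penalty.
print(round(amat(0.95, 1, 60), 2))  # -> 3.95
```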

Block (Page) Size
AMAT = Hit_Rate x Hit_Time + Miss_Rate x Miss_Penalty
(Figure: Miss_Penalty and the other AMAT components plotted against Block_size.)

Impact on Processor Performance
Given a processor with:
- Base CPI = 1.1, with a 1-cycle on-chip cache
- Instruction mix: 50% arith/logic, 30% load/store, 20% control
- 10% of memory data operations incur a 50-cycle miss penalty
- 1% of instruction fetches incur the same 50-cycle miss penalty

CPI = Base CPI + average stalls per instruction
    = 1.1 (cycles/ins)
      + 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)   [data miss: 1.5]
      + 1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)       [inst miss: 0.5]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1

64% of the time the processor is stalled waiting for memory!
AvgMemAccessTime = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
Copyright Koren UMass 2016
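The slide's CPI arithmetic can be reproduced mechanically; a sketch using the slide's own parameters:

```python
# Reproduce the slide's CPI and AMAT calculation.
base_cpi       = 1.1
ld_st_frac     = 0.30   # loads/stores per instruction
data_miss_rate = 0.10   # misses per data memory operation
inst_miss_rate = 0.01   # misses per instruction fetch
miss_penalty   = 50     # cycles

data_stalls = ld_st_frac * data_miss_rate * miss_penalty   # 1.5 cycles/ins
inst_stalls = 1.0 * inst_miss_rate * miss_penalty          # 0.5 cycles/ins
cpi = base_cpi + data_stalls + inst_stalls                 # 3.1

stall_frac = (data_stalls + inst_stalls) / cpi             # ~64% stalled
accesses = 1.0 + ld_st_frac                                # 1.3 accesses/ins
avg_mem_access = (1.0 / accesses) * (1 + inst_miss_rate * miss_penalty) \
               + (ld_st_frac / accesses) * (1 + data_miss_rate * miss_penalty)

print(round(cpi, 2), round(stall_frac, 2), round(avg_mem_access, 2))  # -> 3.1 0.65 2.54
```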

Four Questions for a Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

Q1&2: Mapping of Blocks (Pages)
(Figure: M_{i+1} holds 16 blocks, numbered 0-F; M_i has 4 block frames. A CPU address of the form (Block Number, Block Offset) is mapped to (Frame Number, Block Offset).)

Virtual Memory
The main memory <-> secondary storage (hard drive) hierarchy.
- Objective: provide a very large memory space.
- Example: virtual memory space 2^32 = 4 GByte; physical memory 2^27 = 128 MByte.
- The virtual and physical address spaces are partitioned into parts of equal size: pages of 2 KByte = 2^11 bytes.
- A virtual address (Page Number, Page Offset) is mapped through the page table to a physical address (Page Frame Number, Page Offset).
- Typical parameters: Hit_time = 50-100 CPU cycles; Miss_penalty = 10^6 cycles; Miss_rate = 1%.

Address Mapping
- The processor issues a virtual address (VA); the address-translation mechanism produces the physical address (PA) of a location in main memory.
- If the page is missing, a page fault is raised and the fault handler takes over: the OS transfers the page from secondary memory (disk) into main memory.
Adapted from UCB and other sources
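The slide's sizes follow directly from the address widths; a quick check:

```python
# Virtual-memory parameters from the slide, derived from the address widths.
va_bits, pa_bits, offset_bits = 32, 27, 11   # 4 GB virtual, 128 MB physical, 2 KB pages

page_size     = 2 ** offset_bits             # bytes per page
virtual_pages = 2 ** (va_bits - offset_bits) # page-table entries needed
page_frames   = 2 ** (pa_bits - offset_bits) # physical frames available

print(page_size, virtual_pages, page_frames)  # -> 2048 2097152 65536
```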

Paging Organization (Protection)
- A virtual address = (Virtual Page No. (VPN), offset); with 32-bit addresses and 2 KByte pages, the VPN is 21 bits and the offset 11 bits.
- Address translation checks Kernel/User mode and Read/Write protection; a violation raises an exception.
- The Page Table Base Address plus the VPN index into the page table; each entry holds a valid bit (V), access rights, and a frame number. The table itself is located in physical memory.
- The physical address = (Physical Page No. (PPN), offset).

Page Tables in Physical Memory
(Figure: each user process has its own page table in physical memory - PT User 1 maps User 1's virtual address space (e.g., VA1), and PT User 2 maps User 2's, onto physical memory.)

Page Table
A Page Table Entry (PTE) contains:
- A bit (V) indicating whether the page is memory-resident
- The PPN (physical page number) for a memory-resident page
- The DPN (address on disk) for a page on the disk
- Access-rights bits for protection
The OS sets the Page Table Base Register whenever the active user process changes.
(Figure: the PT Base Register plus the VPN (e.g., VPN = 7) selects a PTE holding a PPN or DPN; the PPN plus the offset addresses the data word in the data pages.)

Address Translation Overhead
- With virtual memory 2^32 = 4 GByte, physical memory 2^27 = 128 MByte, and a 2 KByte page size, the page table is a large data structure in memory: 2^21 entries!
- Two memory accesses for every load, store, or instruction fetch!
- Remedy: cache the address translations in a content-addressable memory (CAM, a.k.a. associative memory): if the external tag (the virtual page number) matches a stored tag, the entry supplies the page frame number.
(Figure: CAM organization - each entry holds a virtual page #, a page frame #, and a V bit.)
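A toy translation sketch, assuming a dict-based page table with hypothetical contents, shows how the PTE's valid bit selects between a PPN and a page fault:

```python
OFFSET_BITS = 11  # 2 KB pages, as in the slides

# Hypothetical page table: VPN -> (V, PPN if resident else DPN).
page_table = {0: (1, 0x3A), 1: (0, 0x1234), 7: (1, 0x05)}

def translate(va: int) -> int:
    """Return the physical address for va, or raise on a page fault."""
    vpn = va >> OFFSET_BITS
    offset = va & ((1 << OFFSET_BITS) - 1)
    valid, num = page_table[vpn]          # in reality: one extra memory access
    if not valid:
        raise RuntimeError(f"page fault: VPN {vpn} is on disk (DPN {num:#x})")
    return (num << OFFSET_BITS) | offset  # PPN concatenated with the offset

print(hex(translate(0x0004)))  # VPN 0 -> frame 0x3A -> 0x1d004
```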

Translation Lookaside Buffer (TLB)
- Relies on temporal locality: keep only the most recently used page translations in the TLB.
- Typically 32 to 512 entries; TLB_Hit_time = 1 cycle; Miss_penalty = 20-100 cycles; Hit_rate = 99%.
- With X cycles to access memory:
  Memory_Hit_time = 0.99 (X+1) + 0.01 (2X) = 1.01X + 0.99 ≈ 41.4 for X = 40
- Separate I-TLB & D-TLB are used for instructions and data.
(Figure: the TLB caches page-table entries that map virtual page numbers to page frames 0..N-1 of main memory. Adapted from H. Stone, High Performance Computer Architecture, AW 1993, UCB and other sources.)

Page Table Size Problem
- Increasing the page size shrinks the page table, and transfers of pages to/from disk become more efficient.
- However: a lower Hit_rate and internal fragmentation.
- Solution 1: Hashing (reduce from 2^21 to 2^16 entries).
  - The hashing function takes 21 bits and outputs 16 bits: it performs some arithmetic operation (ignoring overflows) and then a mod k operation (k = 2^16); the result is an integer in [0, k-1].
  - The resulting table (called an inverted page table) of 64K entries is sometimes divided into several smaller tables.
  - Aliasing is possible, so each table entry must include the virtual page number.
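The effective-access-time formula above codes up directly; note the exact value of 0.99(X+1) + 0.01(2X) is 1.01X + 0.99, which the slide rounds to 41.4 at X = 40:

```python
def effective_access(x_cycles, tlb_hit_rate=0.99, tlb_time=1.0):
    """Memory_Hit_time from the slide: a TLB hit costs the 1-cycle TLB lookup
    plus one memory access; a TLB miss costs a page-table access plus the
    memory access itself (2X)."""
    return (tlb_hit_rate * (x_cycles + tlb_time)
            + (1 - tlb_hit_rate) * (2 * x_cycles))

print(round(effective_access(40), 2))  # -> 41.39, i.e. the slide's ~41.4
```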

Solution 2: Segmented Memory
- A virtual address = (Segment Number, Page Number, Displacement within Page).
- The base of the segment table plus the Segment Number selects a SEGMENT TABLE entry, which holds the base address of that segment's PAGE TABLE; the page-table entry holds the base address of the page in memory.
- Paged segments:
  - No table has more than 2048 entries.
  - Only active page tables reside in memory.
  - Penalty? Extra memory accesses to walk the segment table and page table on a translation.
(After H. Stone, High Performance Computer Architecture, AW 1993.)

Q3: Page Replacement Policies
Goal: minimize the number of page faults. Effectiveness depends on: (1) page size, (2) number of available frames, (3) program behavior.
Example policies:
- Optimal algorithm (upper bound)
- Random replacement (lower bound)
- First-In First-Out (FIFO) - replace the page resident for the longest time, e.g., 12131415
- Least Recently Used (LRU) - replace the page that was not used recently for the longest time
- Least Frequently Used (LFU) - replace the least-referenced page
Example reference histories: page x: 1111 0000; page y: 0000 1000; page z: 0000 0111.

LRU vs. FIFO
Page size: 4 words; 3 page frames (a, b, c); 10 pages: 0, 1, 2, ..., 9.
Word trace: 0,1,2,3, 4,5,6,7, 8, 16,17, 9,10,11, 12, 28,29,30, 8,9,10, 4,5, 12, 4,5
Page trace: 0 1 2 4 2 3 7 2 1 3 1

       ref:    0 1 2 4 2 3 7 2 1 3 1
LRU    a:      0 0 0 4 4 4 7 7 7 3 3
       b:        1 1 1 1 3 3 3 1 1 1
       c:          2 2 2 2 2 2 2 2 2
       fault:  * * * * . * * . * * .    (8 page faults)
OPT    a:      0 0 0 4 4 3 7 7 7 3 3
       b:        1 1 1 1 1 1 1 1 1 1
       c:          2 2 2 2 2 2 2 2 2
       fault:  * * * * . * * . . * .    (7 page faults)
FIFO   a:      0 0 0 4 4 4 4 2 2 2 2
       b:        1 1 1 1 3 3 3 1 1 1
       c:          2 2 2 2 7 7 7 3 3
       fault:  * * * * . * * * * * .    (9 page faults)

Implementing FIFO & LRU
- FIFO: a linked list - add a new page at the tail, remove from the head.
  - Improvement: put evicted pages on a free-page list; upon a page fault, scan it and re-establish the page if it is still there (a soft page fault - no disk access).
- LRU: the most commonly used policy.
  - Use (or Reference) bit: set when the page is accessed; the OS periodically sorts, moves the referenced pages to the top, and resets all Use bits.
  - A more accurate (software) implementation: an Age Counter per page - incremented if the page was not referenced, reset if it was.
  - A timer interrupt must be provided to update the LRU bits.
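The trace above can be replayed in a few lines; this sketch reproduces the slide's fault counts for LRU and FIFO:

```python
from collections import OrderedDict, deque

def count_faults(trace, n_frames, policy):
    """Page faults for 'lru' or 'fifo' replacement with n_frames frames."""
    faults = 0
    if policy == "lru":
        resident = OrderedDict()              # insertion order = recency of use
        for page in trace:
            if page in resident:
                resident.move_to_end(page)    # hit: mark most recently used
            else:
                faults += 1
                if len(resident) == n_frames:
                    resident.popitem(last=False)       # evict least recently used
                resident[page] = True
    else:                                     # FIFO
        resident, order = set(), deque()
        for page in trace:
            if page not in resident:
                faults += 1
                if len(resident) == n_frames:
                    resident.discard(order.popleft())  # evict oldest arrival
                resident.add(page)
                order.append(page)
    return faults

trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 3, 1]     # the slide's page trace
print(count_faults(trace, 3, "lru"))   # -> 8
print(count_faults(trace, 3, "fifo"))  # -> 9
```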

The Four Questions (virtual storage)
1) Where can a page be placed? Anywhere - fully associative.
2) How is a page found? TLB + page tables.
   - Page tables map virtual addresses to physical addresses.
   - TLBs make virtual memory practical by exploiting temporal and spatial locality.
3) Which page is replaced on a miss? Decided by the replacement policy.
4) How are writes handled? Always write-back; the penalty is minimized by using a Dirty bit.

TLB Summary
A Translation Lookaside Buffer (TLB) entry holds: Virtual Address (tag), Physical Address, Dirty, Ref, Valid, and Access bits.
Access bits: No_access, Read_only, Execute_only, RW&Execute.

AMD's Bobcat Core
(Figure: AMD's Bobcat core. Source: AMD, 2010.)