ECE4750/CS4420 Computer Architecture, L6: Advanced Memory Hierarchy. Edward Suh, Computer Systems Laboratory


Edward Suh, Computer Systems Laboratory (suh@csl.cornell.edu)

Announcements
Lab 1 due today
Reading: Chapter 5.1-5.3

Overview
How to improve cache performance
Recent research: Flash cache

Improving Cache Performance
Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty
To improve performance: reduce the hit time, the miss rate, or the miss penalty

Small, Simple Caches
On many machines today the cache access sets the cycle time
Hit time is therefore important beyond its effect on AMAT
[Diagram: address split into tag / index / byte fields; tag compare, data mux, and hit signal for direct-mapped and set-associative organizations]

Way-Predicting Caches (MIPS R10000 L2)
Use the processor address to index into a way-prediction table
Look in the predicted way at the given index, then: a correct prediction gives a fast hit, while a misprediction means checking the remaining ways, adding latency

Improving Cache Performance
Decrease hit time
Decrease miss rate
Decrease miss penalty

Causes for Cache Misses
Compulsory: first reference to a block, a.k.a. cold-start misses - misses that would occur even with an infinite cache
Capacity: cache is too small to hold all data needed by the program - misses that would occur even under a perfect replacement policy
Conflict: misses that occur because of collisions due to the block-placement strategy - misses that would not occur with full associativity

Effect of Cache Parameters
Larger cache size: fewer capacity misses, but potentially longer hit time
Higher associativity: fewer conflict misses, but potentially longer hit time
Larger block size: fewer compulsory misses, but potentially higher miss penalty

Victim Cache (HP 7200)
[Diagram: CPU and register file, L1 data cache with victim cache alongside, unified L2 cache]
A victim cache is a small associative back-up cache, added to a direct-mapped cache, which holds recently evicted blocks

Prefetching
Speculate on future instruction and data accesses and fetch them into cache(s)
Varieties of prefetching: hardware prefetching, software prefetching, mixed schemes
What types of misses does prefetching affect?

Issues in Prefetching
Usefulness, timeliness, cache and bandwidth pollution
[Diagram: CPU and register file, L1 instruction and data caches, unified L2 cache holding prefetched data]

Hardware Instruction Prefetch (Alpha 21064)
[Diagram: CPU requests a block; the L1 instruction cache is backed by a stream buffer holding the prefetched instruction block, both fed by the unified L2 cache]

Hardware Data Prefetching
Prefetch-on-miss: One Block Lookahead (OBL) scheme - on a miss to block b, also fetch block b+1
Strided prefetch - detect a constant stride between misses and prefetch ahead along it

Software Prefetching
Compiler-directed prefetching: the compiler can analyze code and know where misses occur
    for(i=0; i < N; i++) {
        prefetch( &a[i + 1] );
        prefetch( &b[i + 1] );
        SUM = SUM + a[i] * b[i];
    }
Issues? What property do we require of the cache for prefetching to work?

Compiler Optimizations
Restructuring code affects the data block access sequence
Group data accesses together to improve spatial locality
Re-order data accesses to improve temporal locality
Prevent data from entering the cache
Useful for variables that will only be accessed once before being replaced
Needs a mechanism for software to tell hardware not to cache data (instruction hints or page table bits)
Kill data that will never be used again
Streaming data exploits spatial locality but not temporal locality
Replace into dead cache locations

Array Merging
Some weak programmers may produce code like:
    int val[size];
    int key[size];
and proceed to reference val and key in lockstep
Problem?

Loop Interchange
    for(j=0; j < N; j++) {
        for(i=0; i < M; i++) {
            x[i][j] = 2 * x[i][j];
        }
    }
What type of locality does this improve?

Loop Fusion
    for(i=0; i < N; i++)
        for(j=0; j < M; j++)
            a[i][j] = b[i][j] * c[i][j];
    for(i=0; i < N; i++)
        for(j=0; j < M; j++)
            d[i][j] = a[i][j] * c[i][j];
What type of locality does this improve?

Blocking
    for(i=0; i < N; i++)
        for(j=0; j < N; j++) {
            r = 0;
            for(k=0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }
[Diagram: access patterns over x, y, and z, marking untouched, old, and new accesses]

Blocking
    for(jj=0; jj < N; jj=jj+b)
        for(kk=0; kk < N; kk=kk+b)
            for(i=0; i < N; i++)
                for(j=jj; j < min(jj+b,N); j++) {
                    r = 0;
                    for(k=kk; k < min(kk+b,N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }
[Diagram: blocked access patterns over x, y, and z]

Improving Cache Performance
Decrease hit time
Decrease miss rate
Decrease miss penalty
Some cache misses are inevitable; when they do happen, we want to service them as quickly as possible

Multilevel Caches
A memory cannot be large and fast
Increasing sizes of cache at each level: CPU -> L1 -> L2 -> DRAM
AMAT = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)
What is the 2nd-level miss rate?
Local miss rate: number of cache misses / cache accesses
Global miss rate: number of cache misses / CPU memory references

L2 Cache Hit Time vs. Miss Rate

Reduce Read Miss Penalty
[Diagram: CPU and register file, data cache, write buffer, unified L2 cache]
Let read misses bypass writes in the write buffer
Problem? Solution?

Early Restart
Decrease miss penalty with no new hardware (well, okay, with some more complicated control)
Strategy: impatience! There is no need to wait for the entire line to be fetched
Early restart: as soon as the requested word (or double word) of the cache block arrives, let the CPU continue execution
If the CPU references another cache line, or a later word in the same line: stall
Early restart is often combined with the next technique

Critical Word First
Improvement over early restart: request the missed word first from the memory system and send it to the CPU as soon as it arrives; the CPU consumes the word while the rest of the line arrives
Example: 32B block (8 words, at offsets 0 4 8 12 16 20 24 28), miss on address 20
Words return from the memory system in wrap-around order: 20, 24, 28, 0, 4, 8, 12, 16

Sub-blocking
Tags are too large, i.e., too much overhead
Simple solution: larger blocks, but the miss penalty could be large
Sub-block placement (a.k.a. sector cache): a valid bit is added to units smaller than the full block, called sub-blocks; only read a sub-block on a miss
If a tag matches, is the word in the cache?
Example: tag 100 has sub-block valid bits 1 1 1 1; tag 300 has 1 1 0 0; tag 204 has 0 1 0 1

Recent Research: Flash Cache
Slides from Taeho Kgil and Trevor Mudge (U of Michigan)

Roadmap for Memory - Flash (from ITRS 2005 Roadmap)
Cell density (MBytes/cm^2)       2005        2007         2009         2011           2013
SRAM                               11          17           28           46             74
DRAM                              153         243          458          728          1,154
NAND Flash (SLC / MLC)      339 / 713  604 / 1,155  864 / 1,696  1,343 / 5,204  2,163 / 8,653

Power/Performance of DRAM and Flash (1Gb parts, from Samsung datasheets, 2003)
       Density (Gb/cm^2)  $/Gb  Active power  Idle power  Read latency  Write latency  Erase latency
DRAM   0.7                48    878 mW        80 mW       55 ns         55 ns          N/A
NAND   1.42               21    27 mW         6 μW        25 μs         200 μs         1.5 ms

Flash is good for idle-power optimization: 1000x less power than DRAM
Flash is not so good for low-access-latency usage models; DRAM is still required for decent access latencies

Cost of Power and Cooling
[Chart: worldwide cost of purchasing and operating servers; the figure marks 25% and 50% shares. Source: IDC]

System-Level Power Consumption
SunFire T2000 power running SpecJBB (total power 271 W):
Processor 26%, 16GB memory 22%, I/O 22%, AC/DC conversion 15%, fans 10%, disk 4%
From a Sun talk given by James Laudon

Case for Flash as a 2nd Disk Cache
Many server workloads use a large working set (100s of MBs to 10s of GBs, and even more)
The large working set is cached in main memory to maintain high throughput, so a large portion of DRAM goes to the disk cache
Many server applications are read-intensive rather than write-intensive
32GB of DRAM on a SunFire T2000 consumes 45W of idle power; Flash memory consumes 1000x less idle power than DRAM
Use DRAM for recent and frequently accessed content, and Flash for less recent and infrequently accessed content
Client requests follow a Zipf-like distribution spatially and temporally, e.g. 90% of client requests are to 20% of the files

Latency vs. Throughput
[Chart: network bandwidth in Mbps (throughput) for MP4, MP8, and MP12 vs. disk cache access latency (12 μs to 1600 μs) to 80% of files; SPECweb99]

Overall Architecture
[Diagram: processors, 1GB DRAM main memory, HDD controller, and hard disk; the baseline drive without FlashCache]

Flash Lifetime
[Chart: lifetime in years (log scale) vs. Flash memory size as a percentage of working-set size, for SURGE, SPECWeb99, Financial1, and WebSearch1]

Programmable Flash Controller
[Block diagram: programmable Flash memory controller with a GF-field LUT; BCH-configuration, density, and P/E descriptors; BCH and CRC encode/decode; Flash density control; Flash program/erase time control; an external interface; and the NAND Flash memory with its address, read/write data, and bit-error signals]

Impact of ECC
[Chart: maximum tolerable P/E cycles (100,000 to 7,100,000) vs. number of correctable errors (code strength, 0 to 12), for stdev = 0, 5%, 10%, and 20% of the mean]

Flash Lifetime w/ Programmable Controller
[Chart: lifetime in years (log scale) vs. Flash memory size as a percentage of working-set size, for SURGE, SPECWeb99, Financial1, and WebSearch1]

Overall Performance
[Chart: network bandwidth in Mbps for MP4, MP8, and MP12 across memory configurations, from DRAM 32MB + Flash 1GB up through DRAM 512MB + Flash 1GB and DRAM 1GB; SPECweb99]

Overall Main Memory Power
[Chart: overall main memory power in W, broken into read, write, and idle power, on SPECweb99: DDR2 1GB active 2.5 W, DDR2 1GB powerdown 1.6 W, DDR2 128MB + Flash 1GB 0.6 W]