
Computer Architecture Spring 2016
Lecture 09: Prefetching
Shuai Wang
Department of Computer Science and Technology, Nanjing University

Prefetching (1/3)
Fetch blocks ahead of demand. Targets compulsory, capacity, (and coherence) misses. Why not conflict misses? Prefetching into an already-conflicted set just displaces other useful blocks. Big challenges:
- Knowing what to fetch: fetching useless blocks wastes resources
- Knowing when to fetch: too early clutters storage (or the block is evicted before use); fetching too late defeats the purpose of prefetching

Prefetching (2/3)
[Timeline figure: without prefetching, a load must walk L1, L2, and DRAM, giving a long total load-to-use latency. With prefetching, an early-enough prefetch gives a much improved load-to-use latency; a late prefetch still gives a somewhat improved latency.]
Prefetching must be accurate and timely.

Prefetching (3/3)
[Timeline figure: alternating run and load phases without prefetching, versus loads overlapped with execution when prefetched.]
Prefetching removes loads from the critical path.

Common Types of Prefetching
- Software
- Next-Line, Adjacent-Line
- Next-N-Line
- Stream Buffers
- Stride
- Localized (e.g., PC-based)
- Pointer
- Correlation

Software Prefetching (1/4)
Compiler/programmer places prefetch instructions. The prefetched value can go into:
- A register (binding, also called "hoisting"): may prevent instructions from committing
- The cache (non-binding): requires ISA support; the block may get evicted from the cache before the demand access

Software Prefetching (2/4)
Hoisting must be aware of dependencies. Three versions of the same code (the cache-missing load is shown in red on the slide):

    Original:        Hoisted load:    Prefetch:
    A                R1 = [R2]        PREFETCH [R2]
    R1 = R1 - 1      A                A
    R1 = [R2]        R1 = R1 - 1      R1 = R1 - 1
    R3 = R1 + 4      R3 = R1 + 4      R1 = [R2]
                                      R3 = R1 + 4

Hoisting the binding load above R1 = R1 - 1 violates the dependence on R1 (the decrement now clobbers the loaded value before R3 = R1 + 4 reads it). Hopefully the miss is serviced by the time we get to the consumer. Using a non-binding prefetch instruction avoids these problems with data dependencies.

Software Prefetching (3/4)

    for (i = 1; i < rows; i++) {
        for (j = 1; j < columns; j++) {
            prefetch(&x[i+1][j]);   /* fetch next row's element ahead of time */
            sum = sum + x[i][j];
        }
    }
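
In GCC and Clang, the generic prefetch(...) above would map onto the __builtin_prefetch intrinsic. A minimal compilable sketch of the same idea, assuming a flat row-major array and the same one-row-ahead distance (the function name sum_matrix and the bounds check are illustrative additions):

    #include <stddef.h>

    /* Sum a rows x columns matrix, issuing a non-binding prefetch
     * one row ahead of the element currently being read. */
    double sum_matrix(const double *x, size_t rows, size_t columns)
    {
        double sum = 0.0;
        for (size_t i = 0; i < rows; i++) {
            for (size_t j = 0; j < columns; j++) {
                if (i + 1 < rows)   /* don't prefetch past the array */
                    __builtin_prefetch(&x[(i + 1) * columns + j], 0 /*read*/, 3);
                sum += x[i * columns + j];
            }
        }
        return sum;
    }

In practice one prefetch per cache line, not per element, is enough; issuing one per element mostly wastes instruction slots.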

Software Prefetching (4/4)
Pros:
- Gives the programmer control and flexibility
- Allows time for complex (compiler) analysis
- No (major) hardware modifications needed
Cons:
- Hard to perform timely prefetches: at IPC = 2 and a 100-cycle memory latency, the load must move about 200 instructions earlier, and the current function might not even contain 200 instructions
- Prefetching earlier and more often leads to low accuracy: the program may go down a different path
- Prefetch instructions increase the code footprint: may cause more I$ misses and code-alignment issues

Hardware Prefetching (1/3)
Hardware monitors memory accesses, looking for common patterns. Guessed addresses are placed into a prefetch queue; the queue is checked when no demand accesses are waiting. Prefetches look like READ requests to the hierarchy, although they may get a special "prefetched" flag in the state bits. Prefetchers trade bandwidth for latency:
- Extra bandwidth is used only when guessing incorrectly
- Latency is reduced only when guessing correctly
No need to change software.

Hardware Prefetching (2/3)
[Figure: processor with registers, I-TLB, L1 I-Cache, D-TLB, and L1 D-Cache, backed by the L2 Cache, L3 Cache (LLC), and main memory (DRAM); prefetchers can potentially sit at any of these levels.]

Hardware Prefetching (3/3)
[Figure: the same hierarchy, marking the Intel Core 2 prefetcher locations.]
Real CPUs have multiple prefetchers, usually closer to the core (it is easier to detect patterns there). Prefetching at the LLC is hard (the cache is banked and hashed).

Next-Line (or Adjacent-Line) Prefetching
On a request for line X, prefetch X+1 (or X^0x1):
- Assumes spatial locality, which is often a good assumption
- Should stop at physical (OS) page boundaries
Can often be done efficiently:
- Adjacent-line is convenient when the next-level block is bigger
- Prefetches from DRAM can use bursts and row-buffer hits
Works for both I$ and D$:
- Instructions execute sequentially
- Large data structures often span multiple blocks
Simple, but usually not timely. (See the address-computation sketch below.)
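
A minimal sketch of the two address computations, assuming 64-byte blocks and 4 KB pages (both sizes and the function names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_SIZE 64u      /* assumed cache-line size */
    #define PAGE_SIZE  4096u    /* assumed OS page size */

    /* Next-line: prefetch line X+1 unless it crosses a page boundary. */
    static bool next_line(uint64_t miss_addr, uint64_t *prefetch_addr)
    {
        uint64_t next = (miss_addr / BLOCK_SIZE + 1) * BLOCK_SIZE;
        if (next / PAGE_SIZE != miss_addr / PAGE_SIZE)
            return false;       /* stop at the physical page boundary */
        *prefetch_addr = next;
        return true;
    }

    /* Adjacent-line: prefetch the other half of an aligned two-line pair
     * (line number XOR 1), which can never leave the page. */
    static uint64_t adjacent_line(uint64_t miss_addr)
    {
        return ((miss_addr / BLOCK_SIZE) ^ 1) * BLOCK_SIZE;
    }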

Next-N-Line Prefetching
On a request for line X, prefetch X+1, X+2, ..., X+N. N is called the prefetch depth or prefetch degree and must be tuned carefully. A large N is:
- More likely to be useful (correct and timely)
- More aggressive, so more likely to make a mistake: might evict something useful
- More expensive: needs storage for the prefetched lines, and might delay a useful demand request on an interconnect or port
Still simple, but more timely than Next-Line. (A depth-N sketch follows.)
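
Extending the previous sketch to depth N, again clamped at the page boundary (the DEPTH constant and the enqueue callback are illustrative; a real prefetch queue would also drop duplicates):

    #include <stdint.h>

    #define BLOCK_SIZE 64u      /* assumed cache-line size */
    #define PAGE_SIZE  4096u    /* assumed OS page size */
    #define DEPTH      4        /* prefetch degree N, assumed */

    /* Next-N-line: on a request for line X, enqueue X+1 .. X+N. */
    static void next_n_line(uint64_t miss_addr, void (*enqueue)(uint64_t))
    {
        uint64_t line = miss_addr / BLOCK_SIZE;
        for (int d = 1; d <= DEPTH; d++) {
            uint64_t addr = (line + d) * BLOCK_SIZE;
            if (addr / PAGE_SIZE != miss_addr / PAGE_SIZE)
                break;          /* never cross the OS page boundary */
            enqueue(addr);      /* drained when no demand access waits */
        }
    }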

Stream Buffers (1/3)
What if we have multiple intertwined streams? E.g., A, B, A+1, B+1, A+2, B+2, ... Use multiple stream buffers, one per tracked stream:
- Keep the next N lines of each stream available in its buffer
- On a request for line X, shift the buffer and fetch X+N+1 into it
Can extend to a quasi-sequential stream buffer:
- On a request for Y in [X ... X+N], advance by Y-X+1
- Allows the buffer to keep working when items are skipped
- Requires an expensive (associative) comparison
(A single-buffer sketch follows.)
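
A minimal sketch of one sequential stream buffer (the depth, names, and fetch callback are illustrative; a real design would add allocation logic and the quasi-sequential search):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define SB_DEPTH 4          /* assumed entries per buffer */

    /* One stream buffer: the next SB_DEPTH line addresses of a stream. */
    struct stream_buffer {
        uint64_t line[SB_DEPTH];    /* head is line[0] */
        bool     valid;
    };

    /* On a demand request for line x: if it hits the head, shift the
     * buffer and fetch one new line at the tail. */
    static bool sb_lookup(struct stream_buffer *sb, uint64_t x,
                          void (*fetch)(uint64_t))
    {
        if (!sb->valid || sb->line[0] != x)
            return false;           /* not this stream */
        memmove(&sb->line[0], &sb->line[1],
                (SB_DEPTH - 1) * sizeof sb->line[0]);
        sb->line[SB_DEPTH - 1] = sb->line[SB_DEPTH - 2] + 1;
        fetch(sb->line[SB_DEPTH - 1]);  /* refill the freed slot */
        return true;
    }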

Stream Buffers (2/3)
[Figures from Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA '90.]

Stream Buffers (3/3)
Can support multiple streams in parallel.

Stride Prefetching (1/2)
Access patterns often follow a stride:
- Accessing a column of elements in a matrix
- Accessing elements in an array of structs
Detect the stride S, choose a prefetch depth N, and prefetch X+1·S, X+2·S, ..., X+N·S.

Stride Prefetching (2/2)
Must carefully select the depth N (same constraints as the Next-N-Line prefetcher). How do we know whether two accesses are a strided pair (A[i], A[i+1]) or unrelated (X, Y)? Wait until A[i+2] (or more) confirms the stride. Can vary the prefetch depth based on confidence: more consecutive strided accesses means higher confidence.
Example: the table holds Last Addr = A+2N, Stride = N, Count = 2. A new access to A+3N matches Last Addr + Stride, pushing the count above 2, so we prefetch A+4N and update the count.

Localized Stride Prefetchers (1/2)
What if multiple strides are interleaved? Globally there is no clearly discernible stride. We could track multiple strides as stream buffers do, but that is expensive (must detect/compare many strides on each access). Observation: accesses to a given structure are usually localized to one instruction. The code

    Load  R1 = [R2]
    Load  R3 = [R4]
    Add   R5, R1, R3
    Store [R6] = R5

produces the miss pattern A, X, Y, A+N, X+N, Y+N, A+2N, X+2N, Y+2N, ...: the global deltas (X-A), (Y-X), (A+N-Y) repeat without forming a stride, but each memory instruction has its own fixed stride N. So use an array of strides, indexed by PC.

Localized Stride Prefetchers (2/2)
Store the PC tag, last address, last stride, and a count in the RPT (Reference Prediction Table). On an access, check the RPT: same stride as last time? count++ if yes; count-- or count = 0 if no. If the count is high, prefetch (last address + stride*N).
Example RPT state for PCa: 0x409A34 (Load R1 = [R2]), PCb: 0x409A38 (Load R3 = [R4]), PCc: 0x409A40 (Store [R6] = R5):

    Tag    Last Addr  Stride  Count
    0x409  A+3N       N       2
    0x409  X+3N       N       2
    0x409  Y+2N       N       1

If confident about the stride (count > min), prefetch A+4N. (A sketch follows.)
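
A minimal RPT sketch, assuming a direct-mapped table; the table size, prefetch distance, and confidence threshold are illustrative constants:

    #include <stdint.h>

    #define RPT_ENTRIES 256      /* assumed table size */
    #define PF_DEPTH    4        /* prefetch distance N, assumed */
    #define MIN_CONF    2        /* "count > min" threshold, assumed */

    struct rpt_entry {
        uint64_t tag;            /* PC of the memory instruction */
        uint64_t last_addr;
        int64_t  stride;
        int      count;          /* confidence */
    };

    static struct rpt_entry rpt[RPT_ENTRIES];

    /* Called on each load/store; returns a prefetch address or 0. */
    static uint64_t rpt_access(uint64_t pc, uint64_t addr)
    {
        struct rpt_entry *e = &rpt[pc % RPT_ENTRIES];
        if (e->tag != pc) {                      /* allocate a new entry */
            e->tag = pc; e->last_addr = addr; e->stride = 0; e->count = 0;
            return 0;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride == e->stride && stride != 0) {
            e->count++;                          /* same stride: gain confidence */
        } else {
            e->stride = stride;                  /* new stride: reset confidence */
            e->count = 0;
        }
        e->last_addr = addr;
        if (e->count >= MIN_CONF)                /* confident: prefetch ahead */
            return addr + (uint64_t)(e->stride * PF_DEPTH);
        return 0;
    }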

Other Patterns
Sometimes accesses are regular, but there is no stride: linked data structures (e.g., lists or trees).
[Figure: linked-list traversal A -> B -> C -> D -> E -> F versus the scattered actual memory layout; there is no chance to detect a stride.]

Pointer Prefetching (1/2)
Idea: when a block is filled on a cache miss (512 bits of data), scan its contents for values that look like pointers. Small integers such as 8029, 0, 1, 4128, 14 do not look like addresses (nope); values such as 90120230 and 90120758, which fall near the block's own address, might be pointers (maybe!).

    struct bintree_node_t {
        int data1;
        int data2;
        struct bintree_node_t *left;
        struct bintree_node_t *right;
    };

Go ahead and prefetch the candidate pointers (this needs some help from the TLB, since the stored pointers are virtual addresses). This allows you to walk the tree (or other pointer-based data structures, which are typically hard to prefetch). Pointers usually look different from ordinary data. (A sketch of the scan follows.)
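
A minimal content-directed sketch of the scan, assuming 64-byte blocks and a crude "near the block's own address" heuristic; the window size and all names are illustrative, not the slide's exact mechanism:

    #include <stdint.h>

    #define BLOCK_BYTES 64u                   /* 512-bit fill, as on the slide */

    /* Heuristic (assumed): a word "looks like a pointer" if it falls within
     * 1 MB of the block's own address, i.e., the same heap neighborhood. */
    static int looks_like_pointer(uint64_t word, uint64_t block_addr)
    {
        uint64_t lo = block_addr > (1u << 20) ? block_addr - (1u << 20) : 0;
        uint64_t hi = block_addr + (1u << 20);
        return word >= lo && word <= hi;
    }

    /* On a demand fill, scan the block's words and prefetch candidates
     * (virtual addresses, so this needs TLB help on a real machine). */
    static void scan_fill(const uint64_t *block_data, uint64_t block_addr,
                          void (*prefetch)(uint64_t))
    {
        for (unsigned i = 0; i < BLOCK_BYTES / 8; i++)
            if (looks_like_pointer(block_data[i], block_addr))
                prefetch(block_data[i]);
    }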

Pointer Prefetching (2/2)
Relatively cheap to implement: no extra hardware needed to store patterns. But limited lookahead makes timely prefetches hard: you can't get the next pointer until the current data block is fetched.
[Figure: a stride prefetcher overlaps the access latencies of X, X+N, X+2N, while a pointer prefetcher must serialize the accesses along the chain A -> B -> C.]

Pair-wise Temporal Correlation (1/2)
Accesses exhibit temporal correlation: if E followed D in the past, then when we see D, prefetch E.
[Figure: linked-list traversal A -> B -> C -> D -> E -> F over a scattered memory layout, next to a correlation table mapping each block to its observed successor with a 2-bit confidence counter, e.g., D -> E (10), E -> F (01).]
Can be used recursively to get more lookahead. (A sketch follows.)
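
A minimal pair-wise correlation sketch, training on consecutive miss addresses; the table size, 2-bit counter policy, and names are illustrative:

    #include <stdint.h>

    #define CT_ENTRIES 4096      /* assumed table size */

    /* One successor per block plus a 2-bit saturating confidence counter. */
    struct corr_entry {
        uint64_t tag;            /* block address of the earlier miss */
        uint64_t next;           /* block observed to follow it */
        uint8_t  conf;           /* 0..3 */
    };

    static struct corr_entry ct[CT_ENTRIES];
    static uint64_t last_miss;

    /* Call on every demand miss: train on (last_miss -> addr), then
     * predict the successor of the current miss. */
    static void corr_miss(uint64_t addr, void (*prefetch)(uint64_t))
    {
        struct corr_entry *t = &ct[last_miss % CT_ENTRIES];
        if (t->tag == last_miss && t->next == addr) {
            if (t->conf < 3) t->conf++;     /* pair seen again: strengthen */
        } else {
            t->tag = last_miss; t->next = addr; t->conf = 1;
        }
        last_miss = addr;

        struct corr_entry *p = &ct[addr % CT_ENTRIES];
        if (p->tag == addr && p->conf >= 2)
            prefetch(p->next);              /* predicted successor */
    }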

Pair-wise Temporal Correlation (2/2)
Many patterns are more complex than linked lists and can be represented by a Markov model. This requires tracking multiple potential successors per block; the number of candidates tracked is called the breadth.
[Figure: a Markov model over blocks A..F with transition probabilities (e.g., 1.0, 0.67, 0.5, 0.33, 0.2), next to a correlation table holding several candidate successors per block with confidence bits.]
Recursive breadth and depth grow exponentially.

Increasing Correlation History Length
A longer history enables more complex patterns: use a hash of the history for the table lookup. This increases training time.
Example, the DFS traversal of a small tree: A B D B E B A C F C G C A. After block A alone, the successor is ambiguous (B or C); with a two-block history, the next block is predictable.
Much better accuracy, but exponential storage cost.

Spatial Correlation (1/2)
[Figure: an 8 KB database page in memory, with a page header, tuple data, and a tuple slot index; accesses touch scattered parts of the page.]
- Irregular layout: non-strided
- Sparse: can't capture it with larger cache blocks
- But repetitive: predict it to improve MLP (memory-level parallelism)
Large-scale repetitive spatial access patterns.

Spatial Correlation (2/2)
Logically divide memory into regions, identified by base address. Store the spatial pattern (a bit vector of which blocks in the region were touched) in a correlation table, indexed by the triggering access (e.g., PCx accessing region A, PCy accessing region B). On a later trigger, read the stored bit vector and push the corresponding block addresses into a prefetch FIFO.
[Figure: two regions and a correlation table holding one bit vector per trigger, e.g., PCx -> 110 1010001 111 and PCy -> 110 0001101 111; the pattern plus the region base feeds the prefetch FIFO.]
(A sketch follows.)
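
A minimal sketch of the pattern table, assuming 4 KB regions of 64-byte blocks so each pattern fits in one 64-bit vector; the region size, hash, and names are illustrative (a real design such as Spatial Memory Streaming also tracks pattern generations):

    #include <stdint.h>

    #define REGION_BYTES 4096u   /* assumed region size */
    #define BLOCK_BYTES  64u
    #define BLOCKS_PER_REGION (REGION_BYTES / BLOCK_BYTES)   /* 64 bits */
    #define PATTERNS 1024        /* assumed table size */

    /* Correlation table: (trigger PC, region offset) -> blocks touched. */
    static uint64_t pattern[PATTERNS];

    static unsigned pat_index(uint64_t pc, uint64_t addr)
    {
        unsigned offset = (unsigned)((addr % REGION_BYTES) / BLOCK_BYTES);
        return (unsigned)((pc ^ offset) % PATTERNS);     /* simple hash */
    }

    /* Training: record which blocks of the region were touched. */
    static void sc_train(uint64_t trigger_pc, uint64_t trigger_addr,
                         uint64_t touched_addr)
    {
        unsigned blk = (unsigned)((touched_addr % REGION_BYTES) / BLOCK_BYTES);
        pattern[pat_index(trigger_pc, trigger_addr)] |= 1ull << blk;
    }

    /* Prediction: on a trigger access, replay the pattern as prefetches. */
    static void sc_predict(uint64_t pc, uint64_t addr,
                           void (*prefetch)(uint64_t))
    {
        uint64_t base = addr - (addr % REGION_BYTES);
        uint64_t bits = pattern[pat_index(pc, addr)];
        for (unsigned b = 0; b < BLOCKS_PER_REGION; b++)
            if (bits & (1ull << b))
                prefetch(base + (uint64_t)b * BLOCK_BYTES);
    }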

Evaluating Prefetchers
Compare against larger caches: a complex prefetcher vs. a simple prefetcher with a larger cache.
Primary metrics:
- Coverage: prefetched hits / base misses
- Accuracy: prefetched hits / total prefetches
- Timeliness: latency of prefetched blocks / hit latency
Secondary metrics:
- Pollution: misses / (prefetched hits + base misses)
- Bandwidth: (total prefetches + misses) / base misses
- Power, energy, area, ...
(A small computation sketch follows.)
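
A small sketch computing the ratio metrics from raw simulation counters; the struct and counter names are illustrative:

    #include <stdio.h>

    /* Raw counters gathered from a simulation run. */
    struct pf_stats {
        double prefetched_hits;   /* demand hits on prefetched blocks */
        double total_prefetches;  /* prefetches issued */
        double base_misses;       /* misses with no prefetcher */
        double misses;            /* misses with the prefetcher enabled */
    };

    static void report(const struct pf_stats *s)
    {
        printf("coverage  = %.3f\n", s->prefetched_hits / s->base_misses);
        printf("accuracy  = %.3f\n", s->prefetched_hits / s->total_prefetches);
        printf("pollution = %.3f\n",
               s->misses / (s->prefetched_hits + s->base_misses));
        printf("bandwidth = %.3f\n",
               (s->total_prefetches + s->misses) / s->base_misses);
    }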

Hardware Prefetcher Design Space
What to prefetch?
- Predict regular patterns (X, X+8, X+16, ...)
- Predict correlated patterns (A -> B -> ... -> J, A -> ... -> K, ...)
When to prefetch?
- On every reference: lots of lookup/prefetcher overhead
- On every miss: patterns are filtered by the caches
- On hits to prefetched data (positive feedback)
Where to put prefetched data?
- Prefetch buffers
- Caches

What's Inside Today's Chips
- Data L1: PC-localized stride predictors; short-stride predictors within a block that prefetch the next block
- Instruction L1: predict the future PC and prefetch
- L2: stream buffers; adjacent-line prefetch