Far Fetched Prefetching? Tomofumi Yuki INRIA Rennes Antoine Morvan ENS Cachan Bretagne Steven Derrien University of Rennes 1


Memory Optimizations
- Memory wall
  - Memory improves more slowly than processors
  - Very little improvement in latency
- Making sure processors have data to consume
  - Software: tiling, prefetching, array contraction
  - Hardware: caches, prefetching
- Important for both speed and power
  - Farther memory takes more power

Prefetching
- Anticipate future memory accesses and start the data transfer in advance
- Both HW and SW versions exist
- Hides the latency of memory accesses
  - Cannot help if bandwidth is the issue
- Inaccurate prefetch is harmful
  - Consumes unnecessary bandwidth/energy
  - Pressure on caches

This talk
- A failure experience based on our attempt to improve prefetching
- We target cases where the trip count is small
  - Prefetch instructions must be placed in previous iterations of the outer loop
- Use polyhedral techniques to find where to insert prefetches

Outline
- Software Prefetching
  - When it doesn't work
  - Improving prefetch placement
- Code Generation
- Simple Example
- Summary and Next Steps

Software Prefetching [Mowry 96]
- Shift the iterations by the prefetching distance
- Simple yet effective method for regular computations

Original loop:

  for (i=0; i<n; i++)
    ... = foo(a[i], ...);

After prefetching with distance 4:

  for (i=-4; i<0; i++)        // prologue
    prefetch(a[i+4]);
  for (i=0; i<n-4; i++) {     // steady state
    ... = foo(a[i], ...);
    prefetch(a[i+4]);
  }
  for (i=n-4; i<n; i++)       // epilogue
    ... = foo(a[i], ...);
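A minimal runnable sketch of the scheme above, assuming a GCC/Clang-style `__builtin_prefetch`; the placeholder `foo()`, the array name, and the distance of 4 are illustrative, not from the talk:

```c
/* Mowry-style software prefetching, distance 4 (illustrative). */
#define DIST 4

static double foo(double x) { return x * 2.0; }   /* placeholder body */

double sum_with_prefetch(const double *A, int n) {
    double s = 0.0;
    int i;
    /* Prologue: issue the first DIST prefetches with no computation. */
    for (i = 0; i < DIST && i < n; i++)
        __builtin_prefetch(&A[i]);
    /* Steady state: compute iteration i, prefetch iteration i+DIST. */
    for (i = 0; i + DIST < n; i++) {
        s += foo(A[i]);
        __builtin_prefetch(&A[i + DIST]);
    }
    /* Epilogue: the last DIST iterations have nothing left to prefetch. */
    for (; i < n; i++)
        s += foo(A[i]);
    return s;
}
```

The three loops correspond directly to the prologue, steady state, and epilogue of the slide.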

When Prefetching Works, When It Doesn't, and Why [Lee, Kim, and Vuduc, HiPEAC 2013]
- Difficult to statically determine the prefetching distance
  - They suggest the use of tuning
  - Not in the scope of our work
- Interference with HW prefetchers
- Cannot handle short streams of accesses
  - A limitation for both software and hardware prefetching
  - We try to handle short streams

Problem with Short Streams
- When N=5, most prefetches are issued too early
  - Not enough computation to hide the latency
- Simply translating iterations in the innermost loop is not sufficient

  for (i=-4; i<0; i++)        // prologue
    prefetch(a[i+4]);
  for (i=0; i<1; i++) {       // steady state
    ... = foo(a[i], ...);
    prefetch(a[i+4]);
  }
  for (i=1; i<5; i++)         // epilogue
    ... = foo(a[i], ...);

  (prefetch distance = 4)

2D Illustration
- Large number of useless prefetches

[figure: 2D iteration space over i and j showing the wasted prefetches]

Lexicographical Shift
- The main idea to improve prefetch placement

[figure: 2D iteration space over i and j showing the lexicographic shift]

Outline
- Code Generation
  - Polyhedral representation
  - Avoiding redundant prefetch
- Simple Example
- Summary and Next Steps

Prefetch Code Generation
- Main problem: placement
  - Prefetching distance: a manually given parameter in our work
  - Lexicographically shifting the iteration spaces
- Avoid reading the same array element twice
- Avoid prefetching the same line multiple times
- We use the polyhedral representation to handle these difficulties

Polyhedral Representation
- Represent the iteration space (of polyhedral programs) as a set of points (polyhedra)
- Simply a different view of the program

  for (i=1; i<=N; i++)
    for (j=1; j<=i; j++)
      S0;

  Domain(S0) = {[i,j] : 1 <= j <= i <= N}

[figure: triangular iteration domain bounded by j=1, i=j, and i=N]

- Manipulate using the convenient polyhedral representation, then generate loops

Transforming Iteration Spaces
- Expressed as affine transformations
- Shifting by a constant is even simpler
- Example: shift the iterations by 4 along j
  - (i,j) -> (i, j-4)

  Domain(S0)  = {[i,j] : 1 <= j <= i <= N}
  Domain(S0') = {[i,j] : 1 <= i <= N && -3 <= j <= i-4}

[figure: original and shifted triangular domains]
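The shifted domain can be checked mechanically; the helpers below are a hypothetical brute-force check (names are ours, not from the talk) that the image of a point of Domain(S0) under (i,j) -> (i, j-4) satisfies the constraints of Domain(S0'):

```c
/* Membership test for the shifted domain
   Domain(S0') = {[i,j] : 1 <= i <= N && -3 <= j <= i-4}. */
int in_shifted_domain(int i, int j, int N) {
    return 1 <= i && i <= N && -3 <= j && j <= i - 4;
}

/* The constant shift along j from the slide: (i,j) -> (i, j-4). */
void apply_shift(int *i, int *j) {
    (void)i;      /* i is unchanged */
    *j -= 4;
}
```

For example, the first iteration (1,1) of the original domain maps to (1,-3), which lies exactly on the new lower bound -3 <= j.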

Lex. Shift as Affine Transform
- Piecewise affine transformation
  - (i,j) -> (i, j-1)   if j > 1 or i = 1
  - (i,j) -> (i-1, i-1) if j = 1 and i > 1
- Apply n times for prefetch distance n

  Domain(S0)  = {[i,j] : 1 <= j <= i <= N}
  Domain(S0') = <complicated>

[figure: original and lexicographically shifted domains]
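One step of the piecewise map above can be sketched directly; this is an illustrative helper (the function name is ours) that steps to the lexicographic predecessor inside the triangular domain {[i,j] : 1 <= j <= i}:

```c
/* One application of the piecewise affine map from the slide:
   (i,j) -> (i, j-1)   if j > 1 or i == 1
   (i,j) -> (i-1, i-1) if j == 1 and i > 1  */
void lex_shift_step(int *i, int *j) {
    if (*j == 1 && *i > 1) {
        *i = *i - 1;   /* move to the previous row ... */
        *j = *i;       /* ... at its last iteration, j = i */
    } else {
        *j = *j - 1;   /* previous iteration in the same row */
    }
}
```

For instance, (3,1) steps to (2,2), its lexicographic predecessor in the triangular domain; applying the step n times realizes a prefetch distance of n.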

Avoiding Redundant Prefetch: Same Element
- Given:
  - Target array: A
  - Set of statement instances that read from A
  - Array access functions for each read access
- Find the set of first readers:
  - Statement instances that first read each element of A
  - i.e., the lex. min among the set of points accessing the same array element
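For a concrete access A[j] in the triangular loop of the earlier slide, the first readers can be found by a brute-force lexicographic scan; a hypothetical sketch (names and bounds are ours; a real polyhedral implementation would compute the lex. min symbolically rather than by enumeration):

```c
/* Scan the domain {[i,j] : 1 <= j <= i <= N} in lexicographic order
   and record, for each element e of A (accessed as A[j]), the first
   statement instance (fi[e], fj[e]) that reads it.  fi[e] == 0 means
   element e is never read.  Arrays fi and fj must hold N+1 ints. */
void first_readers(int N, int fi[], int fj[]) {
    for (int e = 0; e <= N; e++) fi[e] = 0;
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= i; j++)
            if (fi[j] == 0) { fi[j] = i; fj[j] = j; }
}
```

Here the first reader of A[e] comes out as (e,e), since (i,j) = (e,e) is the lex. min of all instances with j = e.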

Avoiding Redundant Prefetch: Same Cache Line
- Let an element of array A be ¼ of a cache line; the following prefetches the same line 4 times:

  for i {
    prefetch(a[i+4]);
    ... = foo(a[i], ...);
  }

- We apply unrolling to avoid the redundancy:

  for i (step 4) {
    prefetch(a[i+4]);
    ... = foo(a[i], ...);
    ... = foo(a[i+1], ...);
    ... = foo(a[i+2], ...);
    ... = foo(a[i+3], ...);
  }
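A compilable version of the unrolled loop, assuming 4 elements per cache line, a trip count that is a multiple of 4, and a GCC/Clang-style prefetch builtin (all assumptions of this sketch, not stated in the talk):

```c
/* One prefetch per cache line: with 4 doubles per line, unrolling by 4
   issues prefetch(&A[i+4]) once per line instead of once per element. */
#define LINE_ELEMS 4

double sum_unrolled(const double *A, int n) {
    double s = 0.0;
    /* assumes n is a multiple of LINE_ELEMS */
    for (int i = 0; i + LINE_ELEMS <= n; i += LINE_ELEMS) {
        __builtin_prefetch(&A[i + LINE_ELEMS]);   /* next line */
        s += A[i] + A[i + 1] + A[i + 2] + A[i + 3];
    }
    return s;
}
```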

Outline
- Simple Example
  - Simulation results
- Summary and Next Steps

Simple Example
- A contrived example that should favor our approach
- Expectation: when M is small and N is large, the lexicographic shift should work better

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      sum = sum + A[i][j];

- Compare:
  - unroll only
  - Mowry prefetching (shift in the innermost loop)
  - Proposed (lexicographic shift)

Search for Simulators
- Need simulators to experiment with:
  - Memory latency
  - Number of outstanding prefetches
  - Line size, and so on
- Tried many simulators:
  - XEEMU: Intel XScale
  - SoCLib: SoC
  - gem5: Alpha/ARM/SPARC/x86
  - VEX: VLIW

Simulation with VEX (HP Labs)
- Miss penalty: 72 cycles

              cycles  misses  prefetches  effective  speedup
  original      658k    2015           -          -        -
  unroll        610k    2015           -          -     1.08
  mowry         551k    1020        2000        992     1.19
  lex. shift    480k      25        2001       1985     1.37

- We see what we expect
  - "Effective" = number of useful prefetches
- But for more complex examples, the benefit diminishes (<3%)

Lex. Shift was Overkill
- Precision in placement does not necessarily translate into benefit
- Computationally very expensive
  - Raising the piecewise affine function to the power of the prefetch distance
  - Takes a lot of memory and time
- High control overhead
  - Code size more than doubles compared to translation in the innermost loop

Summary
- Prefetching doesn't work with short streams
  - The epilogue dominates
  - Can we refine prefetch placement?
- Lexicographic shift
  - Increases overlap of prefetch and computation
  - Can be done with the polyhedral representation
- But it didn't work
  - Overkill

Another Possibility
- Translation in outer dimensions

[figure: 2D iteration space over i and j with prefetches shifted along the outer dimension]

- Keeps things simple by selecting one appropriate dimension to shift
- Works well for rectangular spaces
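A sketch of this variant for a rectangular N×M space: while computing row i, prefetch the corresponding element of row i+1 (distance 1 in the outer dimension). The function name, row-major flat layout, and the GCC/Clang prefetch builtin are assumptions of this sketch:

```c
/* Translate the prefetch in the outer dimension: iteration (i,j)
   prefetches A[i+1][j], stored row-major in a flat array of n*m doubles. */
double sum_rows_prefetch_outer(const double *A, int n, int m) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            if (i + 1 < n)                            /* no row beyond the last */
                __builtin_prefetch(&A[(i + 1) * m + j]);
            s += A[i * m + j];
        }
    return s;
}
```

Unlike the innermost-loop shift, every prefetch here has a full row of computation (M iterations) to overlap with, which is the appeal for short inner trip counts.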

Coarser Granularity
- Lexicographic shift at the statement-instance level seems too fine grained
- Can it work with tiles?
  - Prefetch for the next tile as you execute the current one

Questions?