
COMP 506, Rice University, Spring 2017
Loops and Locality, with an introduction to the memory hierarchy
source code → Front End → IR → Optimizer → IR → Back End → target code
Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in COMP 506 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. Most of this material is not in EaC2e.

Optimization (From Lecture 14)
Compilers operate at multiple granularities, or scopes:
- Local techniques work on a single basic block: a maximal-length sequence of straight-line code.
- Regional techniques consider multiple blocks, but less than a whole procedure: a single loop, a loop nest, a dominator region, ...
- Intraprocedural, or global, techniques operate on an entire procedure (but just one), the common unit of compilation.
- Interprocedural, or whole-program, techniques operate on more than one procedure, up to the whole program. They raise logistical issues related to accessing the code (optimize in the linker?).

The Opportunities: Loop Optimization
Compilers have always focused on loops:
- Loops have higher execution counts than the code outside them.
- They contain repeated operations and related operations.
- Much of the real work of computing takes place inside loops.
There are several effects to attack:
- Loop overhead: decrease the control-structure cost of each iteration.
- Locality: spatial locality is the use of co-resident data; temporal locality is reuse of the same data at different times.
- Parallelism: move loops with independent operations to the inner or outer position.¹
¹ The innermost loop makes sense for vector machines; the outermost loop makes sense for multiprocessors. (See COMP 515.)

Eliminating Overhead: Loop Unrolling (the oldest trick in the book)
To reduce overhead, replicate the body. The overhead is the increment, test, and branch.

  do i = 1 to 100 by 1
    a(i) = a(i) + b(i)

becomes (unroll by 4)

  do i = 1 to 100 by 4
    a(i) = a(i) + b(i)
    a(i+1) = a(i+1) + b(i+1)
    a(i+2) = a(i+2) + b(i+2)
    a(i+3) = a(i+3) + b(i+3)

Sources of improvement:
- Less overhead per useful operation.
- Longer basic blocks for local optimization.

Eliminating Overhead: Loop Unrolling With Unknown Bounds
Generate extra loops to handle the cases smaller than the unroll factor.

  do i = 1 to n by 1
    a(i) = a(i) + b(i)

becomes (unroll by 4)

  i = 1
  while (i+3 <= n) do
    a(i) = a(i) + b(i)
    a(i+1) = a(i+1) + b(i+1)
    a(i+2) = a(i+2) + b(i+2)
    a(i+3) = a(i+3) + b(i+3)
    i = i + 4
  while (i <= n) do
    a(i) = a(i) + b(i)
    i = i + 1

The while loop needs an explicit update for the variable i. You will find code like this in the BLAS and in BitBlt.
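The same pattern in C, as a minimal sketch (0-based indexing; the function name and parameters are illustrative, not from the slides):

  /* Unroll by 4 with a cleanup loop. */
  void add_arrays(double *a, const double *b, int n)
  {
      int i = 0;
      /* Main loop: four elements per iteration, one increment/test/branch. */
      for (; i + 3 < n; i += 4) {
          a[i]   += b[i];
          a[i+1] += b[i+1];
          a[i+2] += b[i+2];
          a[i+3] += b[i+3];
      }
      /* Cleanup loop: handles the final n mod 4 elements. */
      for (; i < n; i++)
          a[i] += b[i];
  }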

Eliminating Overhead: One Other Use For Unrolling
Unrolling can eliminate copies at the end of a loop.

  t1 = b(0)
  do i = 1 to 100
    t2 = b(i)
    a(i) = a(i) + t1 + t2
    t1 = t2

becomes (unroll by 2 and rename)

  t1 = b(0)
  do i = 1 to 100 by 2
    t2 = b(i)
    a(i) = a(i) + t1 + t2
    t1 = b(i+1)
    a(i+1) = a(i+1) + t2 + t1

More complex cases:
- Multiple cycles of cross-iteration copies.
- Use the LCM of the cycle lengths as the unroll factor.
- The result has been rediscovered many times [214].
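In C, the same unroll-by-2 renaming looks like this; a minimal sketch (the function name and parameters are illustrative, and a general version would need a cleanup iteration):

  /* Unroll by 2 and rename so the copy t1 = t2 disappears. */
  void smooth(double *a, const double *b, int n)
  {
      double t1 = b[0], t2;
      /* Assumes the trip count is even; a cleanup iteration would
         handle the general case. */
      for (int i = 1; i + 1 < n; i += 2) {
          t2 = b[i];
          a[i]   += t1 + t2;   /* t1 holds b[i-1], t2 holds b[i]  */
          t1 = b[i+1];
          a[i+1] += t2 + t1;   /* roles swap; no copy at loop end */
      }
  }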

Locality-Driven Improvement: Loop Fusion
Two loops iterate over the same iteration space; convert them into a single loop.

  do i = 1 to n
    c(i) = a(i) + b(i)
  do j = 1 to n
    d(j) = a(j) * e(j)

becomes (fuse)

  do i = 1 to n
    c(i) = a(i) + b(i)
    d(i) = a(i) * e(i)

Advantages:
- Fewer total operations (lower overhead).
- Longer basic blocks for local optimization & scheduling.
- Can convert reuse between loops into reuse within a loop: in the original code, for large enough arrays, a(x) will not be in the cache by the time the second loop tries to reuse it; in the fused loop, a(x) will almost certainly be in the cache at the second use.

This transformation is safe if and only if the fused loop does not change the values used or defined by any statement in either loop. Safety is expressed in terms of dependences: essentially, the same values must flow to the same places.

Locality-Driven Improvement: Loop Distribution (or Fission)
A single loop with multiple independent statements can be transformed into multiple independent loops. The original loop reads b, c, e, f, h, & k and writes a, d, & g:

  do i = 1 to n
    a(i) = b(i) + c(i)
    d(i) = e(i) * f(i)
    g(i) = h(i) - k(i)

becomes (fission)

  do i = 1 to n
    a(i) = b(i) + c(i)
  do i = 1 to n
    d(i) = e(i) * f(i)
  do i = 1 to n
    g(i) = h(i) - k(i)

After distribution, each loop touches fewer arrays: the first reads b & c and writes a; the second reads e & f and writes d; the third reads h & k and writes g.

Advantages:
- The loops in the transformed code can have a smaller cache footprint.
- More reuse in the cache leads to faster execution.
- Distribution enables other transformations, such as vectorization.

Distribution is safe if all the statements that form a cycle in the dependence graph end up in the same loop (see COMP 515).

Locality-Driven Improvement: Loop Interchange
Interchange reorders loops to improve locality: swap the inner and outer loops to rearrange the iteration space.

  do i = 1 to 50
    do j = 1 to 100
      a(i,j) = b(i,j) * c(i,j)

becomes (interchange)

  do j = 1 to 100
    do i = 1 to 50
      a(i,j) = b(i,j) * c(i,j)

In Fortran's column-major order, a 4x4 array a lays out as (1,1), (2,1), (3,1), (4,1), (1,2), (2,2), (3,2), (4,2), (1,3), (2,3), (3,3), (4,3), (1,4), (2,4), (3,4), (4,4), so consecutive elements of a column share a cache line. With the original loop order, the inner loop walks across rows and may use as little as one element per cache line. After interchange, the direction of iteration changes: the inner loop runs down the cache lines. This is the root cause of the speed difference in the array example from the first COMP 506 lecture.

If arrays are stored in row-major order, the same effects occur with the opposite order of loops and subscripts.

Effect:
- Improves spatial reuse by using more elements per cache line.
- The goal is to get as much reuse into the inner loop as possible.
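Since C stores arrays in row-major order, the cache-friendly version in C puts the column subscript innermost; a minimal sketch (the array dimensions are illustrative):

  #define M 50
  #define N 100

  /* Row-major C: a[i][j] and a[i][j+1] are adjacent in memory, so
     keeping j in the inner loop uses every element of each line. */
  void scale(double a[M][N], double b[M][N], double c[M][N])
  {
      for (int i = 0; i < M; i++)
          for (int j = 0; j < N; j++)
              a[i][j] = b[i][j] * c[i][j];
      /* Interchanging these loops (j outer, i inner) would stride by
         N * sizeof(double) bytes between accesses and could use as
         little as one element per cache line. */
  }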

Locality-Driven Improvement: Loop Permutation
Permutation generalizes interchange to multiple loops. Interchange of 2 loops is the degenerate case, applied to two perfectly nested loops; in more general settings, the transformation is called permutation.

Safety:
- Permutation is safe iff no dependences are reversed.
- That is, the flow of values from definitions to uses is preserved.

Effects:
- Changes the order of access and the order of computation.
- Moves accesses closer together in time: increased temporal locality.
- Moves computations further apart in time: covers pipeline latencies.

The Big Picture
Loop optimizations can radically change locality. For programs that are memory bound, loop optimization is the primary way to find improvements: changing the order of iteration changes the patterns of memory accesses.

Safety conditions and opportunities:
- The formal statements of the safety conditions typically involve dependence analysis (see COMP 515). Safety is expressed in terms of dependences: essentially, the same values must flow to the same places.
- There are many formulations of the transformations: polyhedral analysis, unimodular transformations, and ad-hoc, one-off techniques.

Improving memory-bound programs is possible, but it takes some knowledge. Most run-of-the-mill compilers do not perform optimizations this complex.

Address Space Layout
We have seen this drawing several times in COMP 506. Most language runtimes lay out the address space in a similar way (as in the Java memory layout): code, globals & statics, heap, and stacks, with growth space between the heap and the stacks.
- The pieces (stack, heap, code, & globals) may move, but all will be there.
- The stack and heap grow toward each other (if the heap grows).
- Arrays live on one of the stacks, in the global area, or in the heap.
The picture shows one virtual address space; the hardware supports one virtual address space per process. How does a virtual address space map into physical memory?

How Does Address Space Mapping Work? The Big Picture
[Figure: the compiler's view is many virtual address spaces, one per process, each with a stack, heap, code, and globals & statics; the OS view maps all of them, through the TLB, onto a single physical address space running from 0 to high. This is the hardware view, circa 1980.]
The TLB is an address cache used by the OS to speed virtual-to-physical address translation. A processor may have more than one level of TLB.

More Address Space Mapping
Of course, the hardware view is no longer that simple. Cache structure matters for performance, not correctness.
[Figure: main memory (data & code, addresses 0 to high) feeds a shared L2 cache, which feeds split L1 data and code caches; each processor core has its own registers; the TLB sits beside the caches.]
Many processors now include L3 caches; L4 caches are on their way.

Cache Memory
[Figure: the hierarchy from registers in the core out through split L1 data and code caches, a unified L2 (data & code, typically shared among 2 cores), a unified L3, and the TLB.]
Modern hardware features multiple levels of cache & of TLB:
- L1 is typically private to a core.
- L2 (and beyond) is typically shared between cores and between code (I) and data (D).
- Most caches are inclusive (an item in L1 is also in L2 and in L3); some are exclusive (L1 is not in L2).
- Most caches are set associative: 2-, 4-, or 8-way. TLBs are also associative.
- There is little documentation, and these parameters are difficult to detect or measure.

Cache Memory
The primary function of a cache is to provide fast memory near the core:
- L1 is a couple of cycles away, and small.
- L2 is slower than L1 and larger; L3 is slower and larger still.

This laptop (Core i7):
  L1:  5 cycles,    32 KB
  L2: 13 cycles,   256 KB
  L3: 36 cycles, 4,096 KB
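Latency figures like these can be approximated with a pointer-chasing microbenchmark; a minimal sketch, assuming POSIX clock_gettime (the working-set size and iteration count are illustrative; vary n to target L1, L2, or L3):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void)
  {
      size_t n = (1u << 20) / sizeof(void *);   /* 1 MB working set */
      void **buf = malloc(n * sizeof(void *));
      size_t *perm = malloc(n * sizeof(size_t));

      for (size_t i = 0; i < n; i++) perm[i] = i;
      for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
          size_t j = (size_t)rand() % (i + 1);
          size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
      }
      for (size_t i = 0; i < n; i++)            /* link one random cycle */
          buf[perm[i]] = &buf[perm[(i + 1) % n]];

      struct timespec t0, t1;
      void **p = &buf[perm[0]];
      long iters = 100000000L;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < iters; i++)
          p = *p;                 /* each load depends on the previous one */
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
      printf("%.2f ns per load (p = %p)\n", ns / iters, (void *)p);
      free(buf); free(perm);
      return 0;
  }

Because every load depends on the previous one, prefetching and overlap cannot hide the latency; the average time per load approximates the latency of the level that holds the working set.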

Cache Memory
The other function of a cache is to map addresses:
- The cache is organized into blocks, or lines. Each line consists of a tag and a set of words.
- A full cache is a set of lines.
- An address maps into 3 parts: a tag, an index, and an offset. The index is a many-to-one map.
To make good use of cache memory, the code must reuse values. Spatial reuse refers to the use of more than one word in a line. Temporal reuse refers to reuse of the same word over time.

Cache Memory
Caches differ in how they apportion the tag and index bits. A direct-mapped cache has one line per index:
- Cache lookup is simple: the index bits are an ordinal index into the set of lines.
- The address splits into a tag (t bits), an index (s bits), and an offset (o bits); the index selects a line, and the lookup asks: do the tags match?
[Figure: a direct-mapped cache with lines 0 through 2^s - 1; the index selects a line, and the rest of the address is compared against that line's tag.]
A direct-mapped cache has 2^s lines. Capacity is the sum of the sizes of the lines.
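The tag/index/offset split is just shifting and masking; a minimal sketch in C (the geometry constants and the example address are illustrative):

  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative geometry: 64-byte lines (o = 6), 512 lines (s = 9). */
  enum { OFFSET_BITS = 6, INDEX_BITS = 9 };

  int main(void)
  {
      uint64_t addr = 0x7ffd1234abcdULL;   /* example address */
      uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
      uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
      uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
      /* Lookup: read line[index]; hit iff line[index].tag == tag. */
      printf("tag %#llx  index %#llx  offset %#llx\n",
             (unsigned long long)tag, (unsigned long long)index,
             (unsigned long long)offset);
      return 0;
  }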

Cache Memory
Caches differ in how they apportion the tag and index bits. A set-associative cache has multiple lines per index:
- The index maps to a set; the lookup matches tags within the set.
- Each set is a small content-addressable memory.¹
[Figure: a 2-way set-associative cache with sets 0 through 2^s - 1, each containing Way 0 and Way 1; the address splits into tag, index, and offset.]
A set-associative cache has 2^s sets. For a given total size, s is smaller than in a direct-mapped cache; the tag is longer and the index is shorter.
¹ Sometimes called associative memory.
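The lookup is easy to express in code; a sketch of a 2-way set-associative probe (the struct, names, and geometry are illustrative; real hardware compares the tags in all ways in parallel):

  #include <stdbool.h>
  #include <stdint.h>

  enum { OFFSET_BITS = 6, INDEX_BITS = 6, NWAYS = 2, NSETS = 1 << INDEX_BITS };

  struct line { bool valid; uint64_t tag; /* plus the line's data */ };
  static struct line cache[NSETS][NWAYS];

  /* The index selects the set; the tag is compared against every way. */
  bool lookup(uint64_t addr)
  {
      uint64_t index = (addr >> OFFSET_BITS) & (NSETS - 1);
      uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
      for (int way = 0; way < NWAYS; way++)
          if (cache[index][way].valid && cache[index][way].tag == tag)
              return true;     /* hit */
      return false;            /* miss: probe the next level */
  }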

What Happens on a Load?
The hardware must find the value in this complex hierarchy. Assume that the address is in a register, e.g. load r0 => r1; assume a set-associative cache; assume the cache tags are virtual addresses.

Sequence of events for a load:
1. The processor looks in the L1 cache: the index maps to a set, then an associative search runs on the tags in the set. If found (a cache hit), return the value; otherwise
2. The processor looks in the L2 cache: the index maps to a set, then an associative search runs on the tags in the set. If found (a cache hit), return the value; otherwise
3. And so on.

What Happens on a Load?
What about virtual-to-physical address translation? The address in the load is a virtual address; if the load misses in all the caches, we need a physical address. Caches can be designed to operate on virtual or physical addresses:
- L1 is typically indexed by virtual addresses.
- L2 and above are typically indexed by physical addresses.
A physically-addressed cache requires virtual address translation during lookup:
- Translation involves understanding the map from virtual pages to physical pages.
- It involves cooperation between the hardware and the operating system.
- Worst-case behavior involves walking the page tables (often locked in L2 or L3).
The design of virtual memory systems is covered in a good OS course.

Cache Memory
[Figure: the cache hierarchy again: registers, split L1 data and code caches, a shared L2 (typically shared among 2 cores), L3, and the TLB.]
The TLB plays a key role in virtual-to-physical address mapping:
- It is a small cache that maps virtual addresses to physical addresses.
- It holds a subset of the (active) pages that are in virtual memory.
- The tag is a virtual address; the content is a physical address.
A physically-tagged cache must translate the virtual address to a physical address:
- On a TLB hit, the access can continue.
- On a TLB miss, a search brings the translation into the TLB, then the access continues (or is reissued).
- A page fault on the way to an L1 lookup is a lot of delay.
Most processors use a virtually tagged L1 cache, with physical tags in the upper-level caches:
- That removes the TLB's role in the L1 lookup.
- The TLB can be as fast as L1, so it is not a problem for L2 and beyond.
- Physical tags are smaller than virtual tags: fewer gates, less area, lower power consumption.

What Happens on a Load?
Careful design can let the TLB lookup and the set-index lookup run in parallel. By playing with the sizes of t, s, and o, the cache designer can separate the index lookup from virtual-to-physical translation:
- If s + o ≤ log2(pagesize), then the index and offset bits are the same in the physical & virtual addresses.
- In that case, the processor can start the L1 lookup to find the set and the TLB lookup to translate the address at the same time. By the time it has found the set, it should have the tag from the physical address (unless the lookup misses in the TLB).
In effect, associativity lets cache capacity grow without increasing the number of bits in the index field of the address.

Do manufacturers play this game? Absolutely. My laptop has a 32,768-byte L1 cache with 64-byte lines, for 512 lines. It is 8-way set associative, which means 64 sets. Thus s = 6, o = 6, and s + o = 12 bits; 2^12 = 4,096, which is the pagesize. With my laptop's cache parameters, a 4-way associative cache would need 32-byte lines (512 lines in 128 sets, for 16 KB) to keep s + o ≤ 12.
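The arithmetic is easy to check in code; a tiny sketch using the parameters quoted above:

  #include <stdio.h>

  /* log2 for powers of two */
  static int lg(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

  int main(void)
  {
      unsigned capacity = 32768, line = 64, ways = 8, page = 4096;
      unsigned sets = capacity / (line * ways);   /* 64 sets */
      int s = lg(sets), o = lg(line);             /* s = 6, o = 6 */
      printf("s=%d o=%d s+o=%d log2(page)=%d: %s\n",
             s, o, s + o, lg(page),
             s + o <= lg(page) ? "parallel lookup works" : "no overlap");
      return 0;
  }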