CS377P Programming for Performance Single Thread Performance Caches I

CS377P Programming for Performance Single Thread Performance Caches I Sreepathi Pai UTCS September 21, 2015

Outline 1 Introduction 2 Caches 3 Performance of Caches

Outline 1 Introduction 2 Caches 3 Performance of Caches

Matrix Multiply IJK
Multiplying two matrices: A (m × n), B (n × k), C (m × k) [result]. Here: m = n = k.

    for (ii = 0; ii < m; ii++)
        for (jj = 0; jj < n; jj++)
            for (kk = 0; kk < k; kk++)
                C[ii * k + kk] += A[ii * n + jj] * B[jj * k + kk];

Matrix Multiply IKJ

    for (ii = 0; ii < m; ii++)
        for (kk = 0; kk < k; kk++)
            for (jj = 0; jj < n; jj++)
                C[ii * k + kk] += A[ii * n + jj] * B[jj * k + kk];

Performance of the two versions? On 1024x1024 matrices of ints: which is faster, and by how much?

Performance of the two versions on 1024x1024 matrices: Time for IJK: 0.554 s ± 0.003 s (95% CI). Time for IKJ: 6.618 s ± 0.032 s (95% CI).
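As a rough illustration (not part of the original slides), timings like these could be collected with a simple wall-clock harness around each variant. The matrix size, the use of ints, and the function name multiply_ijk below are assumptions for the sketch:

    /* Minimal timing sketch (assumed setup): times the IJK variant with
     * clock_gettime. The IKJ variant would be timed the same way. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1024

    void multiply_ijk(int *C, const int *A, const int *B, int m, int n, int k) {
        for (int ii = 0; ii < m; ii++)
            for (int jj = 0; jj < n; jj++)
                for (int kk = 0; kk < k; kk++)
                    C[ii * k + kk] += A[ii * n + jj] * B[jj * k + kk];
    }

    int main(void) {
        int *A = calloc(N * N, sizeof(int));
        int *B = calloc(N * N, sizeof(int));
        int *C = calloc(N * N, sizeof(int));

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        multiply_ijk(C, A, B, N, N, N);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("IJK: %.3f s\n", secs);

        free(A); free(B); free(C);
        return 0;
    }

In practice one would repeat the measurement several times to obtain the confidence intervals quoted above.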

What caused the nearly 12X slowdown? Matrix multiply performs a large number of arithmetic operations, but the number of operations did not change. Matrix multiply also references a large number of array elements, and the order in which those elements are accessed did change. But why should this matter?

[Die shot of a processor (IBM Power 8); image credit: ExtremeTech]

Outline 1 Introduction 2 Caches 3 Performance of Caches

The question: What are caches?

Answer? Caches are a kind of fast(er) memory

Obligatory question: Why don't we build entire memory systems out of cache memory? (See also: the joke about black boxes in aeroplanes.)

Physical Issues
Not all memory types are equal. Consider SRAM, DRAM and magnetic storage.
Speed of access: depends on the size and type of memory; SRAM > DRAM > magnetic storage.
Storage density (bits per square millimeter): SRAM < DRAM < magnetic storage.

The Memory Hierarchy
Registers: managed by the compiler; built from logic.
L1 cache: small (10s of KB), usually 1-cycle access; SRAM (also "logic").
L2 cache: largish (100s of KB), 10s of cycles; SRAM.
L3 cache: usually on multicores; much larger (MBs), 100s of cycles; SRAM or (recently) embedded DRAM.
RAM: off-chip, large (GBs); DRAM.
HDD: magnetic/rotating storage (TBs), or flash memory (GBs).

Performance of the hierarchy? Why structure memory in a hierarchy? Each level of the hierarchy adds a delay, so the time to access memory increases! Or does it?

Performance of the hierarchy
Structures in the memory hierarchy duplicate data stored further away (the original meaning of the word "cache"). If the data is found at a level closer to the processor (i.e. a hit), read it from there. Otherwise (i.e. a miss), pass the request one level up the hierarchy.

Why the hierarchy works in practice
Data reuse (or "locality"):
Temporal: the same data will be referenced again.
Spatial: data close to each other in space will be referenced close to each other in time.
Speed differences:
Time to access L1: 1 ns. Branch mispredict: 3 ns. Time to access L2: 4 ns. Main memory access time: 100 ns. SSD access time: 16 µs. Rotating media access time: < 5 ms.
From "Latency Numbers Every Programmer Should Know".
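As a small illustration (not from the slides; the array size is an assumption), spatial locality is why traversal order over a row-major C array matters:

    /* Sketch illustrating spatial locality: the row-wise loop touches
     * consecutive addresses and uses each cache line fully; the
     * column-wise loop jumps N ints between accesses. */
    #include <stdio.h>

    #define N 2048
    static int a[N][N];   /* stored row-major in C */

    long sum_row_wise(void) {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];   /* consecutive addresses: good spatial locality */
        return s;
    }

    long sum_col_wise(void) {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];   /* stride of N ints: poor spatial locality */
        return s;
    }

    int main(void) {
        printf("%ld %ld\n", sum_row_wise(), sum_col_wise());
        return 0;
    }

Both functions compute the same sum; only the order of memory references differs, which is exactly the situation in the IJK vs IKJ matrix multiply.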

The cache equation (informal)
Assume a one-level cache (i.e. cache + RAM): latency = latency_hit (on a hit) or latency = latency_miss (on a miss).

The cache equation for one level of caches
latency_avg = fraction_hit × latency_hit + (1 − fraction_hit) × latency_miss
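A hedged worked example (the 95% hit fraction is an assumption; the latencies are the L1 and main-memory numbers quoted above):

    /* Sketch of the cache equation as code. With f_hit = 0.95,
     * latency_hit = 1 ns and latency_miss = 100 ns, the average is
     * 0.95 * 1 + 0.05 * 100 = 5.95 ns: even a 5% miss rate dominates. */
    #include <stdio.h>

    static double avg_latency(double f_hit, double t_hit_ns, double t_miss_ns) {
        return f_hit * t_hit_ns + (1.0 - f_hit) * t_miss_ns;
    }

    int main(void) {
        printf("%.2f ns\n", avg_latency(0.95, 1.0, 100.0));  /* prints 5.95 ns */
        return 0;
    }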

Outline 1 Introduction 2 Caches 3 Performance of Caches

Cache Organization
RAM is directly addressable. Caches duplicate RAM contents, accept the same addresses, and translate each address internally. The translation is a many-to-one function (obviously, since caches are much smaller than RAM). Therefore caches also store a tag (the original address, or part of the original address), which is used to verify the address on a lookup.
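A minimal sketch of what a cache might store per line, assuming a 64-byte line and 64-bit addresses (both assumptions, not from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64   /* assumed line size in bytes */

    /* One cache line: the tag records (part of) the original address so a
     * lookup can verify that the duplicated data really belongs to it. */
    struct cache_line {
        bool     valid;            /* does this line hold anything yet? */
        uint64_t tag;              /* upper bits of the original address */
        uint8_t  data[LINE_SIZE];  /* copy of the RAM contents */
    };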

Cache Lookup
Building blocks of translation functions: direct-mapped lookup; associative lookup.

A Direct Mapped Cache
Converts an address to a cache location: index = fn(address). [Diagram: cache lines indexed 0, 1, 2, ..., n]

A Fully Associative Cache
Searches for the address in the cache; also known as content-addressable memory. [Diagram: the address is matched against every entry]

Set Associative Caches
e.g. a 4-way set-associative cache: look up a set based on the address (direct-mapped), then look up the address only within that set (associative). [Diagram: index = fn(address) selects one of sets 0, 1, 2, ..., n]
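A minimal lookup sketch under assumed parameters (64-byte lines, 128 sets, 4 ways; none of these constants come from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64      /* assumed */
    #define NUM_SETS  128     /* assumed; a power of two keeps the index cheap */
    #define WAYS      4       /* 4-way set-associative */

    static uint64_t tags[NUM_SETS][WAYS];
    static bool     valid[NUM_SETS][WAYS];

    /* Returns true on a hit: the set index comes from the address
     * (direct-mapped step), then the tag is compared against every way
     * in that set (associative step). */
    bool lookup(uint64_t addr) {
        uint64_t line = addr / LINE_SIZE;   /* which line-sized block of RAM   */
        uint64_t set  = line % NUM_SETS;    /* index = fn(address)             */
        uint64_t tag  = line / NUM_SETS;    /* remaining bits verify the address */
        for (int way = 0; way < WAYS; way++)
            if (valid[set][way] && tags[set][way] == tag)
                return true;                /* hit */
        return false;                       /* miss */
    }

A direct-mapped cache is the special case WAYS = 1; a fully associative cache is the special case NUM_SETS = 1.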

The 3C model of cache misses
Compulsory (or cold) misses: the first reference to a piece of data; these always occur (?).
Capacity misses: data in the cache is evicted once the cache is full; a miss due to data having been evicted from the cache; these are absent in an infinite cache.
Conflict misses: a miss due to a many-to-one conflict, i.e. two different addresses map to the same cache location; these do not occur in a fully associative cache.
Due to Mark Hill.

Next class Reducing misses using programming techniques

Summary
Memory accesses are key to program performance; they are nearly always the bottleneck.
The memory hierarchy lowers memory access latency: it exploits locality of reference and trades off size against speed.
Caches are organized differently than RAM and are smaller, which has implications for performance: they are transparent to the programmer, but not transparent to performance!