CSCI 402: Computer Architectures. Performance of Multilevel Cache


CSCI 402: Computer Architectures
Memory Hierarchy (5)
Fengguang Song
Department of Computer & Information Science, IUPUI

Performance of Multilevel Cache

CPU --> L1 cache --> L2 cache --> Main Memory

Given: CPU base CPI = 1, clock rate = 4 GHz
Miss rate per instruction = 2% (an average)
Main memory access time = 100 ns

Q: What is the effective CPI?
Effective CPI = Base CPI + Miss Cycles Per Instruction
             = Base CPI + Miss Rate Per Instruction x Miss Penalty

For now, suppose there is only an L1 cache (i.e., no L2):
Miss penalty = 100 ns / 0.25 ns = 400 cycles
Effective CPI = 1 + 2% x 400 = 9 cycles

Multilevel Cache Example

CPU --> L1 cache (2% miss) --> L2 cache (0.5% global miss) --> Main Memory

Now, suppose we add an L2 cache:
L2 cache access time = 5 ns
L2 global miss rate = 0.5%

L2 hit time, and how many cycles? 5 ns --> 20 cycles (4 GHz CPU)
L2 miss time, and how many cycles? 100 ns --> 400 cycles

Effective CPI = 1 + 2% x 20 + 0.5% x 400 = 3.4 cycles
Performance ratio = 9 / 3.4 = 2.6x

In general:
Effective CPI = Base CPI + L1 Miss Rate x L2 Hit Cycles + L2 Global Miss Rate x L2 Miss Cycles
AMAT = L1 Hit Time + L1 Miss Rate x L2 Hit Time + L2 Global Miss Rate x L2 Miss Time

How Do We Get the AMAT Formula?

The original version:
AMAT = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
and
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
Therefore:
AMAT = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))

Definitions:
Local miss rate: misses in a cache divided by the total number of accesses to that cache (i.e., Miss Rate(L2)). This is what is usually meant by a cache's "miss rate," with "local" left implicit.
Global miss rate: misses in a cache divided by the total number of accesses generated by the CPU (i.e., Miss Rate(L1) x Miss Rate(L2)).
Note: the global miss rate is DIFFERENT from the local miss rate!

Multilevel Cache Design Considerations

L1 cache: focus on minimal hit time (as fast as possible).
L2 cache: focus on a low miss rate (to avoid main memory accesses); hit time has less overall impact in the L2 cache.
Result: the L1 cache is usually much smaller than the L2 cache.

How Do Caches Affect Software Performance?

Wallclock time depends not only on time complexity but also on cache misses. Cache misses depend on your code's memory access patterns, which in turn depend on:
your algorithm design
compiler optimizations for memory accesses

Example: radix sort vs. quicksort (see next slide).

Radix sort on two-digit keys: two digits --> two rounds.

Another Example: DGEMM

Assume the cache block size is 32 B (i.e., big enough for 4 doubles: 4 x 8 = 32).
Suppose n is very large, so that 1/n can be approximated as 0.0 and the cache is not even big enough to hold 2 full rows.

Analysis method: look at the memory access pattern of the inner loop (inner loop variable = k) across the matrices A, B, and C.

Matrix multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];

Inner loop: a scans row i row-wise (i,*), b scans column j column-wise (*,j), c stays fixed at (i,j).

Approx. cache miss rates (assuming n is very large):
a = 0.25 (row-wise: 1 miss every 4th access)
b = 1.0 (column-wise: 1 miss every access)
c = 0.0 (fixed)

Matrix multiplication (jik)

/* jik */
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];

Same inner loop as ijk, so the same approximate miss rates:
a = 0.25 (row-wise), b = 1.0 (column-wise), c = 0.0 (fixed)

Matrix multiplication (kij)

/* kij */
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = a[i][k]; /* keep in a register */
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

Inner loop: a stays fixed at (i,k), b scans row k row-wise (k,*), c scans row i row-wise (i,*). This ordering generates partial sums for C.

Approx. miss rates:
a = 0.0 (fixed), b = 0.25 (row-wise), c = 0.25 (row-wise)

Matrix multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++)
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

Same inner loop as kij, so the same approximate miss rates:
a = 0.0 (fixed), b = 0.25 (row-wise), c = 0.25 (row-wise)

Matrix multiplication (jki)

/* jki */
for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

Inner loop: a scans column k column-wise (*,k), b stays fixed at (k,j), c scans column j column-wise (*,j).

Approx. miss rates:
a = 1.0 (column-wise), b = 0.0 (fixed), c = 1.0 (column-wise)

Matrix multiplication (kji)

/* kji */
for (k = 0; k < n; k++)
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

Same inner loop as jki, so the same approximate miss rates:
a = 1.0 (column-wise), b = 0.0 (fixed), c = 1.0 (column-wise)

Summary of Matrix Multiplication

ijk and jik:
    sum = 0.0;
    for (k = 0; k < n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;

kij and ikj (best):
    r = a[i][k];
    for (j = 0; j < n; j++)
        c[i][j] += r * b[k][j];

jki and kji:
    r = b[k][j];
    for (i = 0; i < n; i++)
        c[i][j] += a[i][k] * r;

Software Optimization via the Technique of Blocking

Goal: maximize the number of accesses to data before it is replaced in the cache.

Consider the inner loops of DGEMM (the ijk version, column-major layout):
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
        double cij = C[i+j*n];
        for (int k = 0; k < n; k++)
            cij += A[i+k*n] * B[k+j*n];
        C[i+j*n] = cij;
    }

Cache-Blocked DGEMM

#define BLOCKSIZE 32

void do_block(int n, int si, int sj, int sk,
              double *A, double *B, double *C)
{
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];             /* cij = C[i][j] */
            for (int k = sk; k < sk + BLOCKSIZE; k++)
                cij += A[i+k*n] * B[k+j*n];    /* cij += A[i][k] * B[k][j] */
            C[i+j*n] = cij;                    /* C[i][j] = cij */
        }
}

void dgemm(int n, double *A, double *B, double *C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

Blocked DGEMM Access Pattern

How it works: each BLOCKSIZE x BLOCKSIZE block of C is computed using only BLOCKSIZE-wide strips of A and B, so the data for one block fits in the cache and is reused many times before being evicted. Result: about 2x faster than the unoptimized version.

Dependability and Memory

A system alternates between two states of delivered service:
1. Service accomplishment: service delivered as specified.
2. Service interruption: delivered service different from the specified service.
A failure moves the system from accomplishment to interruption; a restoration moves it back.

Dependability Measures

How do we measure how dependable a system is? Two related terms:
Reliability: mean time to failure (MTTF), a measurement of service accomplishment.
Service interruption time: mean time to repair (MTTR), a measurement of service interruption.
Mean time between failures: MTBF = MTTF + MTTR.
Availability = MTTF / (MTTF + MTTR).

To improve availability:
Increase MTTF: fault avoidance, fault tolerance, fault forecasting.
Reduce MTTR: improved tools and processes for diagnosis and repair.

Memory Errors and the Hamming SEC Code

Soft error (or transient error): one or more bits may flip.
Why? In a modern chip, devices are so small that cosmic rays or alpha particles can change the values of bits stored in registers or caches, or of bits simply moving across the chip. Low-voltage, low-power CPUs are even more susceptible, because of the small voltage difference between a 0 and a 1.
Hence: SEC (Single Error Correcting) and DED (Double Error Detecting) codes.
Hamming distance: the minimum number of bit positions in which two bit patterns differ. If the minimum distance of a code is 3, it can provide 1-bit error correction.

What Is the Memory Fault Rate Today?

(Figure: relative monthly fault rates, permanent vs. transient, for the L2 and L3 cache data arrays of the Jaguar and Cielo supercomputers.) Most SRAM faults are transient, especially in mature process technologies.
Source: Sridharan et al., "Feng Shui of Supercomputer Memory," SC 2013; "Memory Errors in Modern Systems," October 2, 2014.

Encoding for SEC

To calculate a Hamming code:
Number the bits from 1, starting on the left: 1, 2, 3, ..., 16, ..., 32.
All positions that are a power of 2 are parity bits.
E.g., 12 bits encode 8 data bits (4 parity bits).

Each parity bit checks certain bit positions:
Parity bit 1 covers all positions whose number has the rightmost bit set: bits 1 (the parity bit itself), 3, 5, 7, 9, etc.
Parity bit 2 covers all positions whose number has the 2nd bit from the right set: bits 2 (the parity bit itself), 3, 6, 7, 10, 11, etc.
Parity bit 4 covers all positions whose number has the 3rd bit from the right set: bits 4-7, 12-15, 20-23, etc.
Parity bit 8 covers all positions whose number has the 4th bit from the right set: bits 8-15, 24-31, 40-47, etc.

Note: syndrome bits = 0000 (binary) indicate no error.

DED Code

DED: Double Error Detecting.
Coding: add one additional parity bit covering the whole word (call it pn), so that the Hamming distance becomes 4.
Decoding: let H be the original group of SEC parity bits.
H even, pn even: no error.
H odd, pn odd: correctable single-bit error.
H even, pn odd: error in the pn bit itself.
H odd, pn even: a double error occurred (detected, but cannot be corrected).
Assumption: 1-bit errors are the important case; 2-bit errors are rare; 3-bit errors are so rare that we can ignore them.
Note: current ECC DRAM uses SEC/DED, with 8 check bits protecting each 64 data bits (therefore, DIMMs are 72 bits wide).

We have finished the cache subject!
Remark: the cache may be the most important topic in computer architecture.
Next topic: Virtual Memory

A New, Deeper Memory Hierarchy

Registers
  | instructions and operands
Cache
  | cache blocks
Memory
  | pages
Disk

The cache acts as the cache for main memory; with virtual memory (VM), main memory acts as the cache for disk.