Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy (most slides are borrowed). Advanced Architectures, MiEI, UMinho, 2017/18.


Introduction

Programmers want unlimited amounts of memory with low latency, but fast memory technology is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy:
- The entire addressable memory space is available in the largest, slowest memory.
- Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories, giving the illusion of a large, fast memory being presented to the processor.

Copyright 2012, Elsevier Inc. All rights reserved.

Memory Performance Gap

Memory Hierarchy Design

Memory hierarchy design becomes more crucial with recent multi-core processors, because aggregate peak bandwidth grows with the number of cores. An Intel Core i7 can generate two references per core per clock; with four cores and a 3.2 GHz clock:
- 25.6 billion 64-bit data references/second
- + 12.8 billion 128-bit instruction references/second
- = 409.6 GB/s!
DRAM bandwidth is only 6% of this (25 GB/s). This requires:
- Multi-port, pipelined caches
- Two levels of cache per core
- A shared third-level cache on chip
(billion = 10^9)

The Memory Hierarchy: The BIG Picture

Common principles apply at all levels of the memory hierarchy, based on notions of caching. At each level in the hierarchy:
- Block placement
- Finding a block
- Replacement on a miss
- Write policy
(Section 5.5, "A Common Framework for Memory Hierarchies", Chapter 5, Large and Fast: Exploiting Memory Hierarchy)

Direct Mapped Cache

Location is determined by the address. Direct mapped: only one choice,
  (Block address) modulo (#Blocks in cache)
Since #Blocks is a power of 2, this uses the low-order address bits.
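Because #Blocks is a power of 2, the modulo above reduces to selecting low-order bits. A sketch with an illustrative geometry (256 blocks of 64 bytes; the function names are ours, not from the slides):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BLOCKS 256   /* power of 2, so modulo = low-order bits */
#define BLOCK_BYTES 64

/* The single cache slot a block can occupy in a direct-mapped cache */
unsigned dm_index(uint32_t addr) {
    uint32_t block_addr = addr / BLOCK_BYTES;
    return block_addr % NUM_BLOCKS;   /* same as block_addr & (NUM_BLOCKS - 1) */
}

/* The remaining high-order bits, stored as the tag to identify the block */
uint32_t dm_tag(uint32_t addr) {
    return (addr / BLOCK_BYTES) / NUM_BLOCKS;
}
```

Two addresses exactly one cache-span apart map to the same slot and can only be told apart by their tags, which is the source of conflict misses discussed later.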

Associative Caches

Fully associative:
- Allows a given block to go in any cache entry
- Requires all entries to be searched at once
- One comparator per entry (expensive)
n-way set associative:
- Each set contains n entries
- The block number determines the set: (Block number) modulo (#Sets in cache)
- Search all entries in a given set at once
- n comparators (less expensive)

How Much Associativity?

Increased associativity decreases miss rate, but with diminishing returns. Simulation of a system with a 64KB D-cache and 16-word blocks on SPEC2000:
- 1-way: 10.3%
- 2-way: 8.6%
- 4-way: 8.3%
- 8-way: 8.1%

Block Placement

Determined by associativity:
- Direct mapped (1-way associative): one choice for placement
- n-way set associative: n choices within a set
- Fully associative: any location
Higher associativity reduces miss rate, but increases complexity, cost, and access time.

Replacement Policy

Direct mapped: no choice. Set associative:
- Prefer a non-valid entry, if there is one
- Otherwise, choose among the entries in the set
Least-recently used (LRU): choose the one unused for the longest time. Simple for 2-way, manageable for 4-way, too hard beyond that.
Random: gives approximately the same performance as LRU for high associativity.
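For a 2-way set, "simple" is literal: one LRU bit per set suffices. A toy sketch of the victim choice described above (the struct and function names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* One 2-way set: per-way valid bit and tag, plus a single LRU bit */
typedef struct {
    bool valid[2];
    unsigned tag[2];
    int lru;            /* index of the least-recently-used way */
} Set2;

/* The way that should receive an incoming block */
int victim_way(const Set2 *s) {
    for (int w = 0; w < 2; w++)
        if (!s->valid[w]) return w;   /* prefer a non-valid entry */
    return s->lru;                    /* otherwise evict the LRU way */
}

/* On an access to `way`, the other way becomes least-recently used */
void touch(Set2 *s, int way) { s->lru = 1 - way; }
```

Beyond 2-way, true LRU needs an ordering over all n ways per set, which is why real 4-way and wider caches use approximations (pseudo-LRU) or random replacement.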

Write Policy

Write-through:
- Update both the upper and lower levels
- Simplifies replacement, but may require a write buffer
Write-back:
- Update the upper level only
- Update the lower level when the block is replaced
- Needs to keep more state
Virtual memory: only write-back is feasible, given disk write latency.

Memory Hierarchy Basics

n sets => n-way set associative:
- Direct-mapped cache => one block per set
- Fully associative => one set
Writing to the cache: two strategies:
- Write-through: immediately update lower levels of the hierarchy
- Write-back: only update lower levels of the hierarchy when an updated block is replaced
Both strategies use a write buffer to make writes asynchronous.

Memory Hierarchy Basics

CPU exec-time = (CPU clock-cycles + Mem stall-cycles) × Clock cycle time
CPU exec-time = (IC × CPI_CPU + Mem stall-cycles) × Clock cycle time
Mem stall-cycles = IC × (Mem accesses / instruction) × Miss rate × Miss penalty
               = IC × (Misses / instruction) × Miss penalty

Note 1: miss rate and miss penalty are often different for reads and writes.
Note 2: speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses.

Cache Performance Example

Given:
- I-cache miss rate = 2%
- D-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (ideal cache) = 2
- Loads & stores are 36% of instructions
Miss cycles per instruction:
- I-cache: ??
- D-cache: ??
Actual CPI = 2 + ?? + ?? = ??

Cache Performance Example (solution)

Miss cycles per instruction:
- I-cache: 0.02 × 100 = 2
- D-cache: 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44

Memory Hierarchy Basics

Miss rate: the fraction of cache accesses that result in a miss.
Causes of misses (the 3 C's + 1):
- Compulsory: first reference to a block
- Capacity: blocks discarded and later retrieved
- Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
- Coherency: different processors should see the same value in the same location

The 3 C's in different cache sizes
(figure: miss-rate breakdown by cache size; conflict misses highlighted)

The cache coherence problem

Processors may see different values through their caches.

Cache Coherence

Coherence:
- All reads by any processor must return the most recently written value
- Writes to the same location by any two processors are seen in the same order by all processors
(Coherence defines the behaviour of reads & writes to the same memory location.)
Consistency:
- When a written value will be returned by a read
- If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
(Consistency defines the behaviour of reads & writes with respect to accesses to other memory locations.)

Enforcing Coherence

Coherent caches provide:
- Migration: movement of data
- Replication: multiple copies of data
Cache coherence protocols:
- Directory based: the sharing status of each block is kept in one location
- Snooping: each core tracks the sharing status of each block

Memory Hierarchy Basics

Six basic cache optimizations:
1. Larger block size: reduces compulsory misses; increases capacity and conflict misses, increases miss penalty
2. Larger total cache capacity to reduce miss rate: increases hit time, increases power consumption
3. Higher associativity: reduces conflict misses; increases hit time, increases power consumption
4. Multilevel caches to reduce miss penalty: reduces overall memory access time
5. Giving priority to read misses over writes: reduces miss penalty
6. Avoiding address translation in cache indexing: reduces hit time

Multilevel Caches

- Primary cache attached to the CPU: small, but fast
- Level-2 cache services misses from the primary cache: larger, slower, but still faster than main memory
- Main memory services L2 cache misses
- Some high-end systems include an L3 cache

Multilevel Cache Example

Given:
- CPU base CPI = 1, clock rate = 4 GHz
- Miss rate/instruction = 2%
- Main memory access time = 100 ns
With just the primary cache:
- Miss penalty = ??? = 400 cycles
- Effective CPI = 1 + ??? = 9
Now add an L2 cache...

Multilevel Cache Example (solution)

With just the primary cache (clock cycle = 0.25 ns at 4 GHz):
- Miss penalty = 100 ns / 0.25 ns = 400 cycles
- Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)

Now add an L2 cache:
- Access time = 5 ns
- Global miss rate to main memory = 0.5%
Primary miss with L2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
Primary miss with L2 miss: extra penalty = 400 cycles
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9 / 3.4 = 2.6

Multilevel On-Chip Caches

Intel Nehalem 4-core processor. Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache.

3-Level Cache Organization (n/a: data not available)

Intel Nehalem:
- L1 caches (per core): I-cache 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a; D-cache 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
- L2 unified cache (per core): 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
- L3 unified cache (shared): 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a

AMD Opteron X4:
- L1 caches (per core): I-cache 32KB, 64-byte blocks, 2-way, approx LRU replacement, hit time 3 cycles; D-cache 32KB, 64-byte blocks, 2-way, approx LRU replacement, write-back/allocate, hit time 9 cycles
- L2 unified cache (per core): 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a
- L3 unified cache (shared): 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles

Intel's new cache approach with Skylake

(figures from https://www.servethehome.com/intel-xeon-scalable-processor-family-microarchitecture-overview/)

Ten Advanced Optimizations

Reducing the hit time:
1. Small & simple first-level caches
2. Way-prediction
Increasing cache bandwidth:
3. Pipelined cache access
4. Nonblocking caches
5. Multibanked caches
Reducing the miss penalty:
6. Critical word first
7. Merging write buffers
Reducing the miss rate:
8. Compiler optimizations
Reducing the miss penalty or miss rate via parallelism:
9. Hardware prefetching of instructions and data
10. Compiler-controlled prefetching

1. Small and simple first-level caches

Critical timing path: addressing the tag memory, then comparing tags, then selecting the correct set. Direct-mapped caches can overlap tag compare and transmission of data. Lower associativity reduces power because fewer cache lines are accessed.

L1 Size and Associativity

(figure: access time vs. size and associativity)
(figure: energy per read vs. size and associativity)

2. Way Prediction

To improve hit time, predict the way to pre-set the mux; a mis-prediction gives a longer hit time. Prediction accuracy:
- > 90% for two-way
- > 80% for four-way
- The I-cache has better accuracy than the D-cache
First used on the MIPS R10000 in the mid-90s; used on the ARM Cortex-A8. Extended to predict the block as well ("way selection"), which increases the mis-prediction penalty.

3. Pipelining Cache

Pipeline cache access to improve bandwidth. Examples:
- Pentium: 1 cycle
- Pentium Pro to Pentium III: 2 cycles
- Pentium 4 to Core i7: 4 cycles
Increases the branch mis-prediction penalty, but makes it easier to increase associativity.

4. Nonblocking Caches

Allow hits before previous misses complete:
- Hit under miss
- Hit under multiple miss
The L2 must support this. In general, processors can hide an L1 miss penalty but not an L2 miss penalty.

5. Multibanked Caches

Organize the cache as independent banks to support simultaneous access:
- The ARM Cortex-A8 supports 1-4 banks for L2
- The Intel i7 supports 4 banks for L1 and 8 banks for L2
Interleave banks according to block address.
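Interleaving by block address is again simple bit selection. A sketch assuming 64-byte blocks and the i7's 4 L1 banks (the geometry here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS 4

/* Consecutive blocks go to consecutive banks, so a sequential access
   stream keeps all banks busy in parallel */
unsigned bank_of(uint32_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BANKS;
}
```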

6. Critical Word First, Early Restart

Critical word first:
- Request the missed word from memory first
- Send it to the processor as soon as it arrives
Early restart:
- Request the words in normal order
- Send the missed word to the processor as soon as it arrives
The effectiveness of these strategies depends on block size and on the likelihood of another access to the portion of the block that has not yet been fetched.

7. Merging Write Buffer

When storing to a block that is already pending in the write buffer, update the write buffer entry instead of allocating a new one. Reduces stalls due to a full write buffer. Do not apply this to I/O addresses.
(figure: no write buffering vs. write buffering)

8. Compiler Optimizations

Loop interchange: swap nested loops to access memory in sequential order.
Blocking: instead of accessing entire rows or columns, subdivide the matrices into blocks. Requires more memory accesses but improves the locality of those accesses.

9. Hardware Prefetching

Fetch two blocks on a miss (include the next sequential block).
(figure: Pentium 4 pre-fetching)
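Loop interchange is purely a source-level change. A sketch on a row-major C array (blocking applies the same idea at the scale of cache-sized tiles):

```c
#include <assert.h>
#define N 64

/* Column-major traversal of a row-major array: consecutive accesses
   jump N*sizeof(double) bytes apart -- poor spatial locality */
double sum_strided(double a[N][N]) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Interchanged loops touch memory sequentially: consecutive accesses
   fall within the same cache block */
double sum_sequential(double a[N][N]) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

Both versions compute the same result; only the order of memory accesses, and hence the miss rate, differs.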

10. Compiler Prefetching

Insert prefetch instructions before the data is needed. Non-faulting: the prefetch doesn't cause exceptions.
- Register prefetch: loads data into a register
- Cache prefetch: loads data into the cache
Combine with loop unrolling and software pipelining.

Summary
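GCC and Clang expose cache prefetching through the `__builtin_prefetch` builtin, a non-faulting hint as described above. A sketch; the prefetch distance of 16 elements is an illustrative guess, since real tuning depends on the miss latency and loop body:

```c
#include <assert.h>

/* Sum an array while hinting the cache a few blocks ahead.
   __builtin_prefetch is only a hint and never faults, so the
   out-of-bounds guard is about wasted hints, not correctness. */
double sum_prefetched(const double *a, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* read prefetch */
        s += a[i];
    }
    return s;
}
```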