COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Anastasia Phillips
6 years ago
ARM Edition
Chapter 5: Large and Fast: Exploiting Memory Hierarchy

5.1 Introduction

Principle of Locality
- Programs access a small proportion of their address space at any time
- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - E.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - E.g., sequential instruction access, array data

Taking Advantage of Locality
- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU

Memory Hierarchy Levels
- Block (aka line): unit of copying
  - May be multiple words
- If accessed data is present in upper level
  - Hit: access satisfied by upper level
  - Hit ratio: hits/accesses
- If accessed data is absent
  - Miss: block copied from lower level
  - Time taken: miss penalty
  - Miss ratio: misses/accesses = 1 - hit ratio
  - Then accessed data supplied from upper level

5.2 Memory Technologies

Memory Technology
- Static RAM (SRAM): 0.5ns - 2.5ns, $2000 - $5000 per GB
- Dynamic RAM (DRAM): 50ns - 70ns, $20 - $75 per GB
- Magnetic disk: 5ms - 20ms, $0.20 - $2 per GB
- Ideal memory
  - Access time of SRAM
  - Capacity and cost/GB of disk

DRAM Technology
- Data stored as a charge in a capacitor
  - Single transistor used to access the charge
  - Must periodically be refreshed
    - Read contents and write back
    - Performed on a DRAM row

Advanced DRAM Organization
- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Separate DDR inputs and outputs

DRAM Generations
[Table: Year, Capacity, $/GB for ten DRAM generations ('80, '83, '85, '89, '92, '96, '98, '00, '04, '07), with capacities growing from Kbit to Gbit and the last entry at $50/GB]
[Chart: Trac and Tcac by year, '80 to '07]

DRAM Performance Factors
- Row buffer
  - Allows several words to be read and refreshed in parallel
- Synchronous DRAM
  - Allows for consecutive accesses in bursts without needing to send each address
  - Improves bandwidth
- DRAM banking
  - Allows simultaneous access to multiple DRAMs
  - Improves bandwidth

Increasing Memory Bandwidth
(Costs: 1 bus cycle for the address transfer, 15 bus cycles per DRAM access, 1 bus cycle per data transfer.)
- 4-word wide memory
  - Miss penalty = 1 + 15 + 1 = 17 bus cycles
  - Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
- 4-bank interleaved memory
  - Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  - Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
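The two miss penalties above follow one pattern: address cycles, plus DRAM access cycles, plus however many bus transfers the organization needs. A minimal sketch in C, assuming the slide's cycle costs; the function and constant names are illustrative and not taken from the slides:

```c
#include <assert.h>

/* Bus-cycle costs used in the slide's example. */
enum { ADDR_CYCLES = 1, DRAM_CYCLES = 15, XFER_CYCLES = 1 };

/* Wide memory: the whole block is read in one DRAM access and
   moved in one wide bus transfer. */
int wide_miss_penalty(void) {
    return ADDR_CYCLES + DRAM_CYCLES + XFER_CYCLES;
}

/* Interleaved memory: `words` banks are accessed in parallel, but
   the words come back one per bus cycle. */
int interleaved_miss_penalty(int words) {
    assert(words > 0);
    return ADDR_CYCLES + DRAM_CYCLES + words * XFER_CYCLES;
}

/* Bandwidth in bytes per bus cycle for a block of block_bytes. */
double bandwidth(int block_bytes, int penalty_cycles) {
    assert(penalty_cycles > 0);
    return (double)block_bytes / penalty_cycles;
}
```

Calling `interleaved_miss_penalty(4)` gives the 20-cycle figure, and `bandwidth(16, 20)` the 0.8 B/cycle figure.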

6.4 Flash Storage

Flash Storage
- Nonvolatile semiconductor storage
  - 100× - 1000× faster than disk
  - Smaller, lower power, more robust
  - But more $/GB (between disk and DRAM)

Flash Types
- NOR flash: bit cell like a NOR gate
  - Random read/write access
  - Used for instruction memory in embedded systems
- NAND flash: bit cell like a NAND gate
  - Denser (bits/area), but block-at-a-time access
  - Cheaper per GB
  - Used for USB keys, media storage, ...
- Flash bits wear out after 1000s of accesses
  - Not suitable for direct RAM or disk replacement
  - Wear leveling: remap data to less used blocks

6.3 Disk Storage

Disk Storage
- Nonvolatile, rotating magnetic storage

Disk Sectors and Access
- Each sector records
  - Sector ID
  - Data (512 bytes, 4096 bytes proposed)
  - Error correcting code (ECC)
    - Used to hide defects and recording errors
  - Synchronization fields and gaps
- Access to a sector involves
  - Queuing delay if other accesses are pending
  - Seek: move the heads
  - Rotational latency
  - Data transfer
  - Controller overhead

Disk Access Example
- Given
  - 512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
- Average read time
  - 4ms seek time
  - + ½ / (15,000/60) = 2ms rotational latency
  - + 512 / 100MB/s = 0.005ms transfer time
  - + 0.2ms controller delay
  - = 6.2ms
- If actual average seek time is 1ms
  - Average read time = 3.2ms

Disk Performance Issues
- Manufacturers quote average seek time
  - Based on all possible seeks
  - Locality and OS scheduling lead to smaller actual average seek times
- Smart disk controllers allocate physical sectors on disk
  - Present logical sector interface to host
  - SCSI, ATA, SATA
- Disk drives include caches
  - Prefetch sectors in anticipation of access
  - Avoid seek and rotational delay
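The disk access example adds four terms. A small C sketch makes each one explicit; the function name and parameter names are illustrative, and it assumes 1 MB = 10^6 bytes, which is what the slide's 0.005ms transfer time implies:

```c
#include <assert.h>

/* Average time in milliseconds to read one sector from an idle disk. */
double disk_read_ms(double seek_ms, double rpm, double sector_bytes,
                    double transfer_mb_per_s, double controller_ms) {
    assert(rpm > 0 && transfer_mb_per_s > 0);
    /* On average the head waits half a rotation. */
    double rotational_ms = 0.5 / (rpm / 60.0) * 1000.0;
    /* Assumption: 1 MB = 10^6 bytes. */
    double transfer_ms = sector_bytes / (transfer_mb_per_s * 1e6) * 1000.0;
    return seek_ms + rotational_ms + transfer_ms + controller_ms;
}
```

With the slide's inputs, `disk_read_ms(4, 15000, 512, 100, 0.2)` comes to about 6.2ms; seek time and rotational latency dominate, and the transfer itself is negligible.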

5.3 The Basics of Caches

Cache Memory
- Cache memory
  - The level of the memory hierarchy closest to the CPU
- Given accesses X1, ..., Xn-1, Xn
  - How do we know if the data is present?
  - Where do we look?

Direct Mapped Cache
- Location determined by address
- Direct mapped: only one choice
  - (Block address) modulo (#Blocks in cache)
- #Blocks is a power of 2
- Use low-order address bits

Tags and Valid Bits
- How do we know which particular block is stored in a cache location?
  - Store block address as well as the data
  - Actually, only need the high-order bits
  - Called the tag
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present
  - Initially 0

Cache Example
- 8 blocks, 1 word/block, direct mapped
- Initial state

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    N
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Miss      110

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  26         11 010       Miss      010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Hit       110
  26         11 010       Hit       010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  16         10 000       Miss      000
  3          00 011       Miss      011
  16         10 000       Hit       000

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  11   Mem[11010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Cache Example

  Word addr  Binary addr  Hit/miss  Cache block
  18         10 010       Miss      010

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  10   Mem[10010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Address Subdivision
[Figure: Address Subdivision]
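The cache example can be replayed in a few lines of C. This is a sketch of the 8-block, 1-word-per-block direct-mapped cache from the tables above; the struct and function names are illustrative and not from the slides:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_BLOCKS 8   /* 8 blocks, 1 word per block */

struct line { bool valid; unsigned tag; };

/* Accesses one word address. Returns true on a hit; on a miss the
   indexed entry is replaced by the new block. */
bool dm_access(struct line cache[NUM_BLOCKS], unsigned word_addr) {
    unsigned index = word_addr % NUM_BLOCKS;  /* low-order 3 bits */
    unsigned tag   = word_addr / NUM_BLOCKS;  /* remaining high-order bits */
    if (cache[index].valid && cache[index].tag == tag)
        return true;
    cache[index].valid = true;
    cache[index].tag = tag;
    return false;
}

/* Runs a trace against an initially empty cache and counts the hits. */
int dm_count_hits(const unsigned *trace, int n) {
    struct line cache[NUM_BLOCKS] = {0};
    int hits = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        hits += dm_access(cache, trace[i]);
    return hits;
}
```

Feeding it the word addresses 22, 26, 22, 26, 16, 3, 16, 18 reproduces the hit/miss column of the example: the final access to 18 misses because it shares index 010 with 26 but has a different tag.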

Example: Larger Block Size
- 64 blocks, 16 bytes/block
  - To what block number does address 1200 map?
- Block address = floor(1200/16) = 75
- Block number = 75 modulo 64 = 11
- Address fields: Tag = bits 31-10 (22 bits), Index = bits 9-4 (6 bits), Offset = bits 3-0 (4 bits)

Block Size Considerations
- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-sized cache
  - Larger blocks => fewer of them
    - More competition => increased miss rate
  - Larger blocks => pollution
- Larger miss penalty
  - Can override benefit of reduced miss rate
  - Early restart and critical-word-first can help
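The mapping of address 1200 can be checked by splitting the address with shifts and masks. A minimal sketch for the 64-block, 16-bytes-per-block cache, assuming 32-bit byte addresses; the names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 4   /* 16 bytes per block */
#define INDEX_BITS  6   /* 64 blocks */

struct fields { uint32_t tag, index, offset; };

/* Splits a 32-bit byte address into tag, index, and block offset. */
struct fields split_address(uint32_t addr) {
    struct fields f;
    assert(OFFSET_BITS + INDEX_BITS < 32);
    f.offset = addr & ((1u << OFFSET_BITS) - 1);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}
```

For address 1200 this gives index 11, matching 75 modulo 64, with offset 0 and tag 1.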

Cache Misses
- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access

Write-Through
- On data-write hit, could just update the block in cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But makes writes take longer
  - E.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
    - Effective CPI = 1 + 0.1×100 = 11
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately
    - Only stalls on write if write buffer is already full

Write-Back
- Alternative: On data-write hit, just update the block in cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow replacing block to be read first

Write Allocation
- What should happen on a write miss?
- Alternatives for write-through
  - Allocate on miss: fetch the block
  - Write around: don't fetch the block
    - Since programs often write a whole block before reading it (e.g., initialization)
- For write-back
  - Usually fetch the block

Example: Intrinsity FastMATH
- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%
  - D-cache: 11.4%
  - Weighted average: 3.2%

[Figure: Example: Intrinsity FastMATH]

Main Memory Supporting Caches
- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
    - Bus clock is typically slower than CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
  - Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

5.4 Measuring and Improving Cache Performance

Measuring Cache Performance
- Components of CPU time
  - Program execution cycles
    - Includes cache hit time
  - Memory stall cycles
    - Mainly from cache misses
- With simplifying assumptions:

  Memory stall cycles
    = (Memory accesses / Program) × Miss rate × Miss penalty
    = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example
- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Loads & stores are 36% of instructions
- Miss cycles per instruction
  - I-cache: 0.02 × 100 = 2
  - D-cache: 0.36 × 0.04 × 100 = 1.44
- Actual CPI = 2 + 2 + 1.44 = 5.44
  - Ideal CPU is 5.44/2 = 2.72 times faster

Average Access Time
- Hit time is also important for performance
- Average memory access time (AMAT)
  - AMAT = Hit time + Miss rate × Miss penalty
- Example
  - CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  - AMAT = 1 + 0.05 × 20 = 2ns
    - 2 cycles per instruction
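Both calculations above are short enough to write as functions, which makes it easy to try other miss rates. A sketch in C; the function names are illustrative and not from the slides:

```c
#include <assert.h>

/* Memory-stall cycles per instruction for a split I/D cache.
   mem_ops_per_instr is the fraction of instructions that are
   loads or stores. */
double stall_cpi(double icache_miss_rate, double dcache_miss_rate,
                 double mem_ops_per_instr, double miss_penalty) {
    assert(miss_penalty >= 0);
    double i_stalls = icache_miss_rate * miss_penalty;
    double d_stalls = mem_ops_per_instr * dcache_miss_rate * miss_penalty;
    return i_stalls + d_stalls;
}

/* Average memory access time, in the same unit as hit_time. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0 && miss_rate <= 1);
    return hit_time + miss_rate * miss_penalty;
}
```

With the example's numbers, `2 + stall_cpi(0.02, 0.04, 0.36, 100)` gives the 5.44 CPI, and `amat(1, 0.05, 20)` gives 2 cycles.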

Performance Summary
- When CPU performance increases
  - Miss penalty becomes more significant
- Decreasing base CPI
  - Greater proportion of time spent on memory stalls
- Increasing clock rate
  - Memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance

Associative Caches
- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- n-way set associative
  - Each set contains n entries
  - Block number determines which set
    - (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)

[Figure: Associative Cache Example]

Spectrum of Associativity
- For a cache with 8 entries
[Figure: Spectrum of Associativity]

Associativity Example
- Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
  - Block access sequence: 0, 8, 0, 6, 8
- Direct mapped

  Block    Cache  Hit/miss  Cache content after access
  address  index            0       1       2       3
  0        0      miss      Mem[0]
  8        0      miss      Mem[8]
  0        0      miss      Mem[0]
  6        2      miss      Mem[0]          Mem[6]
  8        0      miss      Mem[8]          Mem[6]

- 2-way set associative

  Block    Cache  Hit/miss  Cache content after access
  address  index            Set 0             Set 1
  0        0      miss      Mem[0]
  8        0      miss      Mem[0]  Mem[8]
  0        0      hit       Mem[0]  Mem[8]
  6        0      miss      Mem[0]  Mem[6]
  8        0      miss      Mem[8]  Mem[6]

- Fully associative

  Block    Hit/miss  Cache content after access
  address
  0        miss      Mem[0]
  8        miss      Mem[0]  Mem[8]
  0        hit       Mem[0]  Mem[8]
  6        miss      Mem[0]  Mem[8]  Mem[6]
  8        hit       Mem[0]  Mem[8]  Mem[6]
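The three tables can be regenerated with one small simulator. The sketch below models a 4-block cache with 1, 2, or 4 ways and LRU replacement, filling invalid entries first; the function and variable names are illustrative and not from the slides:

```c
#include <assert.h>

#define TOTAL_BLOCKS 4

/* Counts misses for a block-address trace on a 4-block cache with
   the given associativity (1, 2, or 4 ways) and LRU replacement. */
int count_misses(const int *trace, int n, int ways) {
    assert(ways == 1 || ways == 2 || ways == 4);
    int sets = TOTAL_BLOCKS / ways;
    int block[TOTAL_BLOCKS];      /* block address held by each entry */
    int last_use[TOTAL_BLOCKS];   /* time of last access; -1 = invalid */
    int misses = 0;
    for (int e = 0; e < TOTAL_BLOCKS; e++) {
        block[e] = 0;
        last_use[e] = -1;
    }
    for (int t = 0; t < n; t++) {
        int base = (trace[t] % sets) * ways;  /* first entry of the set */
        int victim = base, hit = 0;
        for (int e = base; e < base + ways; e++) {
            if (last_use[e] >= 0 && block[e] == trace[t]) {
                victim = e;
                hit = 1;
                break;
            }
            if (last_use[e] < last_use[victim])
                victim = e;       /* invalid or less recently used */
        }
        if (!hit) {
            misses++;
            block[victim] = trace[t];
        }
        last_use[victim] = t;
    }
    return misses;
}
```

For the sequence 0, 8, 0, 6, 8 it reports 5 misses direct mapped, 4 misses 2-way, and 3 misses fully associative, the same counts as the tables.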

How Much Associativity
- Increased associativity decreases miss rate
  - But with diminishing returns
- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%

[Figure: Set Associative Cache Organization]

Replacement Policy
- Direct mapped: no choice
- Set associative
  - Prefer non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- Least-recently used (LRU)
  - Choose the one unused for the longest time
    - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity

Multilevel Caches
- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

Multilevel Cache Example
- Given
  - CPU base CPI = 1, clock rate = 4GHz
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns
- With just primary cache
  - Miss penalty = 100ns/0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)
- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit
  - Penalty = 5ns/0.25ns = 20 cycles
- Primary miss with L-2 miss
  - Extra penalty = 400 cycles
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9/3.4 = 2.6
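The multilevel example reduces to two formulas, sketched here in C so the inputs can be varied. The function names are illustrative; the main-memory penalty is the 400 cycles that 100ns costs at 4GHz:

```c
#include <assert.h>

/* Converts an access time in nanoseconds to cycles at clock_ghz. */
double ns_to_cycles(double ns, double clock_ghz) {
    assert(clock_ghz > 0);
    return ns * clock_ghz;
}

/* CPI when every primary miss goes to main memory. */
double cpi_l1_only(double base_cpi, double l1_miss_rate, double mem_cycles) {
    return base_cpi + l1_miss_rate * mem_cycles;
}

/* CPI with an L-2 cache: every primary miss pays the L-2 access,
   and the global misses pay the main-memory access on top. */
double cpi_with_l2(double base_cpi, double l1_miss_rate, double l2_cycles,
                   double global_miss_rate, double mem_cycles) {
    return base_cpi + l1_miss_rate * l2_cycles
                    + global_miss_rate * mem_cycles;
}
```

`cpi_l1_only(1, 0.02, 400)` gives 9 and `cpi_with_l2(1, 0.02, 20, 0.005, 400)` gives 3.4, so the L-2 cache makes the processor about 2.6 times faster in this example.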

Multilevel Cache Considerations
- Primary cache
  - Focus on minimal hit time
- L-2 cache
  - Focus on low miss rate to avoid main memory access
  - Hit time has less overall impact
- Results
  - L-1 cache usually smaller than a single cache
  - L-1 block size smaller than L-2 block size

Interactions with Advanced CPUs
- Out-of-order CPUs can execute instructions during cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
    - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyse
  - Use system simulation

Interactions with Software
- Misses depend on memory access patterns
  - Algorithm behavior
  - Compiler optimization for memory access

Software Optimization via Blocking
- Goal: maximize accesses to data before it is replaced
- Consider inner loops of DGEMM:

    for (int j = 0; j < n; ++j) {
        double cij = C[i+j*n];
        for (int k = 0; k < n; k++)
            cij += A[i+k*n] * B[k+j*n];
        C[i+j*n] = cij;
    }

DGEMM Access Pattern
- C, A, and B arrays
[Figure: older accesses vs. new accesses]

Cache Blocked DGEMM

    #define BLOCKSIZE 32

    /* Multiplies one BLOCKSIZE x BLOCKSIZE sub-block; n is the matrix dimension. */
    void do_block(int n, int si, int sj, int sk,
                  double *A, double *B, double *C)
    {
        for (int i = si; i < si + BLOCKSIZE; ++i)
            for (int j = sj; j < sj + BLOCKSIZE; ++j) {
                double cij = C[i+j*n];           /* cij = C[i][j] */
                for (int k = sk; k < sk + BLOCKSIZE; k++)
                    cij += A[i+k*n] * B[k+j*n];  /* cij += A[i][k]*B[k][j] */
                C[i+j*n] = cij;                  /* C[i][j] = cij */
            }
    }

    /* Assumes n is a multiple of BLOCKSIZE. */
    void dgemm(int n, double *A, double *B, double *C)
    {
        for (int sj = 0; sj < n; sj += BLOCKSIZE)
            for (int si = 0; si < n; si += BLOCKSIZE)
                for (int sk = 0; sk < n; sk += BLOCKSIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }

[Figure: Blocked DGEMM Access Pattern, unoptimized vs. blocked]

5.5 Dependable Memory Hierarchy

Dependability
- Service accomplishment
  - Service delivered as specified
- Service interruption
  - Deviation from specified service
- Failure leads from service accomplishment to service interruption; restoration leads back
- Fault: failure of a component
  - May or may not lead to system failure

Dependability Measures
- Reliability: mean time to failure (MTTF)
- Service interruption: mean time to repair (MTTR)
- Mean time between failures
  - MTBF = MTTF + MTTR
- Availability = MTTF / (MTTF + MTTR)
- Improving Availability
  - Increase MTTF: fault avoidance, fault tolerance, fault forecasting
  - Reduce MTTR: improved tools and processes for diagnosis and repair

The Hamming SEC Code
- Hamming distance
  - Number of bits that are different between two bit patterns
- Minimum distance = 2 provides single bit error detection
  - E.g. parity code
- Minimum distance = 3 provides single error correction, 2 bit error detection

Encoding SEC
- To calculate Hamming code:
  - Number bits from 1 on the left
  - All bit positions that are a power of 2 are parity bits
  - Each parity bit checks certain data bits:
[Table: bit positions checked by each parity bit]

Decoding SEC
- Value of parity bits indicates which bits are in error
  - Use numbering from encoding procedure
  - E.g.
    - Parity bits = 0000 indicates no error
    - Parity bits = 1010 indicates bit 10 was flipped
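The encoding and decoding steps can be sketched compactly by noting that, for a valid code word, the XOR of the position numbers of all 1 bits is zero. The C sketch below assumes 8 data bits in a 12-bit code word with even parity, with parity bits at positions 1, 2, 4, and 8; the word size and all names are illustrative choices, not taken from the slides:

```c
#include <assert.h>
#include <stdint.h>

#define CODE_BITS 12   /* assumption: 8 data bits + 4 parity bits */

/* XOR of the position numbers of all set bits. Zero for a valid
   code word; after one bit flip it equals the flipped position. */
static int syndrome(const uint8_t code[CODE_BITS + 1]) {
    int s = 0;
    for (int pos = 1; pos <= CODE_BITS; pos++)
        if (code[pos])
            s ^= pos;
    return s;
}

/* Encodes 8 data bits (most significant first) into code[1..12],
   numbering positions from 1 on the left; code[0] is unused. */
void sec_encode(uint8_t data, uint8_t code[CODE_BITS + 1]) {
    int bit = 7;
    for (int pos = 0; pos <= CODE_BITS; pos++)
        code[pos] = 0;
    for (int pos = 1; pos <= CODE_BITS; pos++)
        if (pos & (pos - 1))          /* not a power of 2: data bit */
            code[pos] = (data >> bit--) & 1;
    assert(bit == -1);                /* all 8 data bits were placed */
    int s = syndrome(code);
    for (int p = 1; p <= CODE_BITS; p <<= 1)
        code[p] = (s & p) ? 1 : 0;    /* parity bits zero the syndrome */
}

/* Corrects at most one flipped bit in place. Returns the corrected
   position, 0 if no error was seen, or -1 if the syndrome points
   outside the word (more than one error). */
int sec_correct(uint8_t code[CODE_BITS + 1]) {
    int s = syndrome(code);
    if (s > CODE_BITS)
        return -1;
    if (s)
        code[s] ^= 1;
    return s;
}
```

Flipping bit 10 of an encoded word makes `sec_correct` return 10, which is the parity pattern 1010 from the decoding slide read as a binary number.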

SEC/DED Code
- Add an additional parity bit for the whole word (pn)
- Make Hamming distance = 4
- Decoding:
  - Let H = SEC parity bits
    - H even, pn even: no error
    - H odd, pn odd: correctable single bit error
    - H even, pn odd: error in pn bit
    - H odd, pn even: double error occurred
- Note: ECC DRAM uses SEC/DED with 8 bits protecting each 64 bits

5.6 Virtual Machines

Virtual Machines
- Host computer emulates guest operating system and machine resources
  - Improved isolation of multiple guests
  - Avoids security and reliability problems
  - Aids sharing of resources
- Virtualization has some performance impact
  - Feasible with modern high-performance computers
- Examples
  - IBM VM/370 (1970s technology!)
  - VMWare
  - Microsoft Virtual PC

Virtual Machine Monitor
- Maps virtual resources to physical resources
  - Memory, I/O devices, CPUs
- Guest code runs on native machine in user mode
  - Traps to VMM on privileged instructions and access to protected resources
- Guest OS may be different from host OS
- VMM handles real I/O devices
  - Emulates generic virtual I/O devices for guest

Example: Timer Virtualization
- In native machine, on timer interrupt
  - OS suspends current process, handles interrupt, selects and resumes next process
- With Virtual Machine Monitor
  - VMM suspends current VM, handles interrupt, selects and resumes next VM
- If a VM requires timer interrupts
  - VMM emulates a virtual timer
  - Emulates interrupt for VM when physical timer interrupt occurs

Instruction Set Support
- User and System modes
- Privileged instructions only available in system mode
  - Trap to system if executed in user mode
- All physical resources only accessible using privileged instructions
  - Including page tables, interrupt controls, I/O registers
- Renaissance of virtualization support
  - Current ISAs (e.g., x86) adapting

5.7 Virtual Memory

Virtual Memory
- Use main memory as a "cache" for secondary (disk) storage
  - Managed jointly by CPU hardware and the operating system (OS)
- Programs share main memory
  - Each gets a private virtual address space holding its frequently used code and data
  - Protected from other programs
- CPU and OS translate virtual addresses to physical addresses
  - VM "block" is called a page
  - VM translation "miss" is called a page fault

Address Translation
- Fixed-size pages (e.g., 4K)
[Figure: Address Translation]

Page Fault Penalty
- On page fault, the page must be fetched from disk
  - Takes millions of clock cycles
  - Handled by OS code
- Try to minimize page fault rate
  - Fully associative placement
  - Smart replacement algorithms

Page Tables
- Stores placement information
  - Array of page table entries, indexed by virtual page number
  - Page table register in CPU points to page table in physical memory
- If page is present in memory
  - PTE stores the physical page number
  - Plus other status bits (referenced, dirty, ...)
- If page is not present
  - PTE can refer to location in swap space on disk

[Figure: Translation Using a Page Table]
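Translation through a page table is an array lookup followed by a bit splice. The sketch below assumes 4 KB pages, 32-bit addresses, and a flat single-level table; the struct layout and names are illustrative, and real PTEs carry more status bits than shown:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS 12                  /* 4 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)

struct pte { bool valid; uint32_t ppn; };   /* status bits omitted */

/* Translates vaddr through a flat page table of num_pages entries.
   Returns true and stores the physical address on success; returns
   false when the page is not present (a page fault). */
bool translate(const struct pte *table, uint32_t num_pages,
               uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;          /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* unchanged by translation */
    assert(vpn < num_pages);
    if (!table[vpn].valid)
        return false;
    *paddr = (table[vpn].ppn << PAGE_BITS) | offset;
    return true;
}
```

The page offset passes through untouched; only the virtual page number is replaced by the physical page number stored in the PTE.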

[Figure: Mapping Pages to Storage]

Replacement and Writes
- To reduce page fault rate, prefer least-recently used (LRU) replacement
  - Reference bit (aka use bit) in PTE set to 1 on access to page
  - Periodically cleared to 0 by OS
  - A page with reference bit = 0 has not been used recently
- Disk writes take millions of cycles
  - Block at once, not individual locations
  - Write through is impractical
  - Use write-back
  - Dirty bit in PTE set when page is written

Fast Translation Using a TLB
- Address translation would appear to require extra memory references
  - One to access the PTE
  - Then the actual memory access
- But access to page tables has good locality
  - So use a fast cache of PTEs within the CPU
  - Called a Translation Look-aside Buffer (TLB)
  - Typical: 16 - 512 PTEs, 0.5 - 1 cycle for hit, 10 - 100 cycles for miss, 0.01% - 1% miss rate
  - Misses could be handled by hardware or software

[Figure: Fast Translation Using a TLB]
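A TLB is a small cache whose entries pair a virtual page number with a physical page number. A minimal sketch of a fully associative TLB in C; the 16-entry size is the low end of the typical range above, while the round-robin replacement and all names are assumptions made for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16

struct tlb_entry { bool valid; uint32_t vpn, ppn; };

/* Fully associative lookup. Returns true on a hit and stores the
   physical page number; hardware compares all entries in parallel,
   the loop here only models the result. */
bool tlb_lookup(const struct tlb_entry tlb[TLB_ENTRIES],
                uint32_t vpn, uint32_t *ppn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;
        }
    return false;
}

/* Installs a translation after a miss. Replacement is round-robin
   (an assumption; real TLBs vary). */
void tlb_fill(struct tlb_entry tlb[TLB_ENTRIES], int *next,
              uint32_t vpn, uint32_t ppn) {
    assert(*next >= 0 && *next < TLB_ENTRIES);
    tlb[*next] = (struct tlb_entry){ true, vpn, ppn };
    *next = (*next + 1) % TLB_ENTRIES;
}
```

On a miss, the handler would read the PTE from the page table, call `tlb_fill`, and retry the access.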

TLB Misses
- If page is in memory
  - Load the PTE from memory and retry
  - Could be handled in hardware
    - Can get complex for more complicated page table structures
  - Or in software
    - Raise a special exception, with optimized handler
- If page is not in memory (page fault)
  - OS handles fetching the page and updating the page table
  - Then restart the faulting instruction

TLB Miss Handler
- TLB miss indicates
  - Page present, but PTE not in TLB
  - Page not present
- Must recognize TLB miss before destination register overwritten
  - Raise exception
- Handler copies PTE from memory to TLB
  - Then restarts instruction
  - If page not present, page fault will occur

Page Fault Handler
- Use faulting virtual address to find PTE
- Locate page on disk
- Choose page to replace
  - If dirty, write to disk first
- Read page into memory and update page table
- Make process runnable again
  - Restart from faulting instruction

TLB and Cache Interaction
- If cache tag uses physical address
  - Need to translate before cache lookup
- Alternative: use virtual address tag
  - Complications due to aliasing
    - Different virtual addresses for shared physical address

Memory Protection
- Different tasks can share parts of their virtual address spaces
  - But need to protect against errant access
  - Requires OS assistance
- Hardware support for OS protection
  - Privileged supervisor mode (aka kernel mode)
  - Privileged instructions
  - Page tables and other state information only accessible in supervisor mode
  - System call exception (e.g., syscall in MIPS)

5.8 A Common Framework for Memory Hierarchies

The Memory Hierarchy: The BIG Picture
- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy

Block Placement
- Determined by associativity
  - Direct mapped (1-way associative)
    - One choice for placement
  - n-way set associative
    - n choices within a set
  - Fully associative
    - Any location
- Higher associativity reduces miss rate
  - Increases complexity, cost, and access time

Finding a Block

  Associativity          Location method                                Tag comparisons
  Direct mapped          Index                                          1
  n-way set associative  Set index, then search entries within the set  n
  Fully associative      Search all entries                             #entries
                         Full lookup table                              0

- Hardware caches
  - Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
  - Benefit in reduced miss rate

Replacement
- Choice of entry to replace on a miss
  - Least recently used (LRU)
    - Complex and costly hardware for high associativity
  - Random
    - Close to LRU, easier to implement
- Virtual memory
  - LRU approximation with hardware support

Write Policy
- Write-through
  - Update both upper and lower levels
  - Simplifies replacement, but may require write buffer
- Write-back
  - Update upper level only
  - Update lower level when block is replaced
  - Need to keep more state
- Virtual memory
  - Only write-back is feasible, given disk write latency

Sources of Misses
- Compulsory misses (aka cold start misses)
  - First access to a block
- Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
- Conflict misses (aka collision misses)
  - In a non-fully associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size

Cache Design Trade-offs

  Design change           Effect on miss rate         Negative performance effect
  Increase cache size     Decrease capacity misses    May increase access time
  Increase associativity  Decrease conflict misses    May increase access time
  Increase block size     Decrease compulsory misses  Increases miss penalty. For very large block size, may increase miss rate due to pollution.

5.9 Using a Finite State Machine to Control A Simple Cache

Cache Control
- Example cache characteristics
  - Direct-mapped, write-back, write allocate
  - Block size: 4 words (16 bytes)
  - Cache size: 16 KB (1024 blocks)
  - 32-bit byte addresses
  - Valid bit and dirty bit per block
  - Blocking cache
    - CPU waits until access is complete
- Address fields: Tag = bits 31-14 (18 bits), Index = bits 13-4 (10 bits), Offset = bits 3-0 (4 bits)

Interface Signals

  Signal      CPU <-> Cache  Cache <-> Memory
  Read/Write  1              1
  Valid       1              1
  Address     32             32
  Write Data  32             128
  Read Data   32             128
  Ready       1              1

- Multiple cycles per access

Finite State Machines
- Use an FSM to sequence control steps
- Set of states, transition on each clock edge
  - State values are binary encoded
  - Current state stored in a register
  - Next state = fn(current state, current inputs)
- Control output signals = fo(current state)

Cache Controller FSM
[Figure: Cache Controller FSM]
- Could partition into separate states to reduce clock cycle time
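The next-state function of such a controller can be written as a `switch` over the current state. The state diagram itself did not survive in this text, so the four states below (idle, compare tag, write-back, allocate) and the input names are assumptions: they are one plausible controller for the direct-mapped, write-back, write-allocate cache described earlier, not a transcription of the figure.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed states for a blocking write-back, write-allocate cache. */
enum state { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE };

/* Assumed inputs sampled on each clock edge. */
struct inputs {
    bool cpu_valid;   /* CPU has a request */
    bool hit;         /* valid bit set and tag matches */
    bool dirty;       /* indexed block is dirty */
    bool mem_ready;   /* memory finished the current transfer */
};

/* Next state = fn(current state, current inputs). */
enum state next_state(enum state s, struct inputs in) {
    assert(s >= IDLE && s <= ALLOCATE);
    switch (s) {
    case IDLE:
        return in.cpu_valid ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:
        if (in.hit)
            return IDLE;                      /* access completes */
        return in.dirty ? WRITE_BACK : ALLOCATE;
    case WRITE_BACK:                          /* write old block to memory */
        return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:                            /* fetch new block */
        return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}
```

A miss on a dirty block therefore walks IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE, COMPARE_TAG, IDLE, waiting in the two memory states until `mem_ready` is asserted.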

Parallelism and Memory Hierarchies: Cache Coherence

Cache Coherence Problem
- Suppose two CPU cores share a physical address space
  - Write-through caches

  Time step  Event                CPU A's cache  CPU B's cache  Memory
  0                                                             0
  1          CPU A reads X        0                             0
  2          CPU B reads X        0              0              0
  3          CPU A writes 1 to X  1              0              1

Coherence Defined
- Informally: Reads return most recently written value
- Formally:
  - P writes X; P reads X (no intervening writes) => read returns written value
  - P1 writes X; P2 reads X (sufficiently later) => read returns written value
    - c.f. CPU B reading X after step 3 in example
  - P1 writes X, P2 writes X => all processors see writes in the same order
    - End up with the same final value for X

Cache Coherence Protocols
- Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches
    - Reduces bandwidth for shared memory
  - Replication of read-shared data
    - Reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record sharing status of blocks in a directory

Invalidating Snooping Protocols
- Cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - Subsequent read in another cache misses
    - Owning cache supplies updated value

  CPU activity         Bus activity      CPU A's cache  CPU B's cache  Memory
                                                                       0
  CPU A reads X        Cache miss for X  0                             0
  CPU B reads X        Cache miss for X  0              0              0
  CPU A writes 1 to X  Invalidate for X  1                             0
  CPU B reads X        Cache miss for X  1              1              1
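The sequence in the invalidation table can be traced with a toy model: one shared variable X, two write-back caches, and a bus on which writes invalidate the other copy. This is a sketch under those simplifying assumptions, with illustrative names; it is not a full coherence protocol.

```c
#include <assert.h>
#include <stdbool.h>

struct cache { bool valid; bool dirty; int value; };
struct smp   { struct cache c[2]; int memory; };   /* CPU A = 0, CPU B = 1 */

/* Read X on the given CPU. A miss goes on the bus; if the other
   cache owns a modified copy, it supplies the value and memory
   is updated. */
int cpu_read(struct smp *s, int cpu) {
    assert(cpu == 0 || cpu == 1);
    struct cache *me = &s->c[cpu], *other = &s->c[1 - cpu];
    if (!me->valid) {
        if (other->valid && other->dirty) {
            s->memory = other->value;
            other->dirty = false;
        }
        me->valid = true;
        me->dirty = false;
        me->value = s->memory;
    }
    return me->value;
}

/* Write X on the given CPU: broadcast an invalidate, then update
   only the local copy (write-back). */
void cpu_write(struct smp *s, int cpu, int value) {
    assert(cpu == 0 || cpu == 1);
    struct cache *me = &s->c[cpu], *other = &s->c[1 - cpu];
    other->valid = false;
    other->dirty = false;
    me->valid = true;
    me->dirty = true;
    me->value = value;
}
```

After A writes 1, B's copy is invalid and memory still holds 0; B's next read misses, A supplies the value, and memory is brought up to date.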

Memory Consistency
- When are writes seen by other processors
  - "Seen" means a read returns the written value
  - Can't be instantaneously
- Assumptions
  - A write completes only when all processors have seen it
  - A processor does not reorder writes with other accesses
- Consequence
  - P writes X then writes Y => all processors that see new Y also see new X
  - Processors can reorder reads, but not writes

The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

[Figure: Multilevel On-Chip Caches]

[Figure: 2-Level TLB Organization]

Supporting Multiple Issue
- Both have multi-banked caches that allow multiple accesses per cycle assuming no bank conflicts
- Core i7 cache optimizations
  - Return requested word first
  - Non-blocking cache
    - Hit under miss
    - Miss under miss
  - Data prefetching

5.14 Going Faster: Cache Blocking and Matrix Multiply

DGEMM
- Combine cache blocking and subword parallelism
[Figure: DGEMM performance]

5.15 Fallacies and Pitfalls

Pitfalls
- Byte vs. word addressing
  - Example: 32-byte direct-mapped cache, 4-byte blocks
    - Byte 36 maps to block 1
    - Word 36 maps to block 4
- Ignoring memory system effects when writing or generating code
  - Example: iterating over rows vs. columns of arrays
  - Large strides result in poor locality
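The byte-versus-word pitfall comes down to whether the address is divided by the block size before taking the modulo. A short C sketch of both interpretations, assuming 4-byte words; the function names are illustrative:

```c
#include <assert.h>

/* Block index in a direct-mapped cache, given a byte address. */
unsigned block_of_byte(unsigned byte_addr, unsigned block_bytes,
                       unsigned num_blocks) {
    assert(block_bytes > 0 && num_blocks > 0);
    return (byte_addr / block_bytes) % num_blocks;
}

/* Same cache, but the address counts words of word_bytes each. */
unsigned block_of_word(unsigned word_addr, unsigned word_bytes,
                       unsigned block_bytes, unsigned num_blocks) {
    return block_of_byte(word_addr * word_bytes, block_bytes, num_blocks);
}
```

A 32-byte cache with 4-byte blocks has 8 blocks, so `block_of_byte(36, 4, 8)` is 1 while `block_of_word(36, 4, 4, 8)` is 4, matching the two cases in the pitfall.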

Pitfalls
- In multiprocessor with shared L2 or L3 cache
  - Less associativity than cores results in conflict misses
  - More cores => need to increase associativity
- Using AMAT to evaluate performance of out-of-order processors
  - Ignores effect of non-blocked accesses
  - Instead, evaluate performance by simulation

Pitfalls
- Extending address range using segments
  - E.g., Intel
  - But a segment is not always big enough
  - Makes address arithmetic complicated
- Implementing a VMM on an ISA not designed for virtualization
  - E.g., non-privileged instructions accessing hardware resources
  - Either extend ISA, or require guest OS not to use problematic instructions

5.16 Concluding Remarks

Concluding Remarks
- Fast memories are small, large memories are slow
  - We really want fast, large memories :(
  - Caching gives this illusion :)
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache <-> L2 cache <-> ... <-> DRAM memory <-> disk
- Memory system design is critical for multiprocessors
