Memory Technology. Erik Hagersten Uppsala University, Sweden


1 Memory Technology Erik Hagersten Uppsala University, Sweden

2 Main memory characteristics
DRAM: Main memory is built from DRAM: Dynamic RAM
- 1 transistor/bit ==> dense, but more error prone and slow
- Refresh and precharge: crazy stuff
SRAM: Cache memory is built from SRAM: Static RAM
- about 4-6 transistors/bit
- fast, but less capacity

3 DRAM organization
Diagram: a 4Mbit memory array -- row decoder, cell matrix, column latch and column decoder, driven by RAS/CAS strobes; a one-bit memory cell is a capacitance on a bit line, selected by a word line.
- The address is multiplexed: Row/Column Address Strobe (RAS/CAS)
- Thin organizations (between x16 and x1) to decrease pin load
- Refresh of memory cells decreases bandwidth
- Bit-error rate creates a need for error correction (ECC)

4 SRAM organization
Diagram: address pins A0-A17 feed a row decoder and column decoder around the cell matrix, with differential amplifiers on the I/O lines and CE/WE/OE control pins.
- Address is typically not multiplexed
- Each cell consists of about 4-6 transistors
- Wider organization (x18 or x36), typically few chips
- Often parity protected (ECC becoming more common)

5 Error Detection and Correction
Error correction and detection (ECC)
- E.g., 64-bit data protected by 8 bits of ECC
- Protects DRAM and high-availability SRAM applications
- Double-bit error detection ("crash and burn")
- Chip-kill detection (all bits of one chip stuck at all-1 or all-0)
- Single-bit correction
- Needs memory scrubbing in order to get good coverage
Parity
- E.g., 8-bit data protected by 1 bit of parity
- Protects SRAM and data paths
- Single-bit "crash and burn" detection
- Not sufficient for large SRAMs today!!
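To make the parity half of the slide concrete, a minimal C sketch (my illustration, not from the course) of even parity over one byte: a single flipped bit is detected but cannot be located or corrected.

    #include <stdio.h>
    #include <stdint.h>

    /* Even parity over 8 data bits: the parity bit makes the total
       number of 1-bits even. */
    static int parity8(uint8_t x) {
        x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
        return x & 1;
    }

    int main(void) {
        uint8_t data = 0x5A;
        int p = parity8(data);           /* stored alongside the data */
        uint8_t corrupted = data ^ 0x08; /* simulate a single-bit error */
        printf("stored parity %d, check now %d -> %s\n",
               p, parity8(corrupted),
               p == parity8(corrupted) ? "looks OK" : "error detected");
        return 0;
    }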

6 Correcting the Error
Correction on the fly by hardware
- no performance glitch
- great for cycle-level redundancy
- fixes the problem for now
Trap to software
- correct the data value and write back to memory
Memory scrubber
- kernel process that periodically touches all of memory

7 Improving main memory performance
- Page mode => faster access within a small distance; improves bandwidth per pin -- not time to critical word
- A single wide bank improves access time to the complete cache line (CL)
- Multiple banks improve bandwidth

8 Newer kinds of DRAM...
- SDRAM (... MHz): the mem controller provides a strobe for the next sequential access
- DDRx DRAM (e.g., 5-½-½-½): transfers data on both edges of the clock
- Research: CPU and DRAM on the same chip?? (IMEM) ...

9 Newer DRAMs (several DRAM arrays on a die)
Table: name, clock rate (MHz), and bandwidth (GB/s per DIMM) for successive DDR generations.

10 Modern DRAM (1) From AnandTech: Everything You Always Wanted to Know About SDRAM: But Were Afraid to Ask

11 Timing page hit Figure 6. Page-hit timing (with precharge and subsequent bank access) From AnandTech: Everything You Always Wanted to Know About SDRAM: But Were Afraid to Ask

12 Timing page miss Figure 8. Page-miss timing From AnandTech: Everything You Always Wanted to Know About SDRAM: But Were Afraid to Ask

13 The Endian Mess -- numbering the bytes
Example: store the value 0x5F; store the string "Hello".
Big Endian: byte 0 of a word holds the most significant byte, so 0x5F ends up in the last (highest-numbered) byte of the word, and "Hello" reads H e l l o in byte order.
Little Endian: byte 0 holds the least significant byte, so 0x5F ends up in byte 0; the string is still stored byte by byte in address order, but viewed word by word (msb...lsb) it appears byte-reversed.
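A small runnable check of the byte order described above (a standard trick; the variable names are mine):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t v = 0x5F;           /* the value from the slide */
        uint8_t bytes[4];
        memcpy(bytes, &v, 4);        /* inspect the in-memory byte order */
        printf("byte 0..3: %02x %02x %02x %02x -> %s endian\n",
               bytes[0], bytes[1], bytes[2], bytes[3],
               bytes[0] == 0x5F ? "little" : "big");

        char s[8] = "Hello";         /* strings are byte arrays: same layout */
        printf("string bytes: %c %c %c %c %c (independent of endianness)\n",
               s[0], s[1], s[2], s[3], s[4]);
        return 0;
    }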

14 Virtual Memory System Erik Hagersten Uppsala University, Sweden

15 Physical Memory
Diagram: a PROGRAM backed by physical memory (0-64MB) and disk.

16 Virtual and Physical Memory
Diagram: two contexts (A and B), each with its own 0-4GB virtual address space of segments (text, data, heap, stack), mapped via caches ($1, $2) onto 64MB of physical memory and disk.

17 Translation & Protection
Diagram: the virtual pages of contexts A and B (text, data, heap, stack) are translated to physical memory (0-64MB) or disk, with per-segment protection attributes (R, RW).

18 Virtual memory parameters
Compared to first-level cache parameters:
- Replacement in cache handled by HW; replacement in VM handled by SW
- VM hit latency very low (often zero cycles)
- VM miss latency huge (several kinds of misses)
- Allocation size is one page (4kB and up)

Parameter         | First-level cache   | Virtual memory
Block (page) size | ... bytes           | 4K-64K bytes
Hit time          | 1-2 clock cycles    | ... clock cycles
Miss penalty      | ... clock cycles    | 700K-6000K clock cycles
  (Access time)   | (6-60 clock cycles) | (500K-4000K clock cycles)
  (Transfer time) | (2-40 clock cycles) | (200K-2000K clock cycles)
Miss rate         | 0.5%-10%            | ...%-0.001%
Data memory size  | 16 Kbyte - 1 Mbyte  | 16 Mbyte - 8 Gbyte

19 VM: Block placement
Where can a block (page) be placed in main memory? What is the organization of the VM?
- The high miss penalty makes it feasible to implement a fully associative address mapping in SW at page faults
- A page from disk may occupy any page frame in PA
- Some restrictions can be helpful (page coloring)

20 VM: Block identification
Use a page table stored in main memory:
- Suppose 8 Kbyte pages and a 48-bit virtual address
- The page table occupies 2^48 / 2^13 * 4B = 2^37 B = 128GB!!!
Solutions:
- Only one entry per physical page is needed
- Multi-level page table (dynamic)
- Inverted page table (~hashing)

21 Address translation
Multi-level table: the Alpha. The segment is selected by bits 62 & 63 in the address:
- kseg: kernel segment, used by the OS. Does not use virtual memory.
- seg1: user segment 1, used for the stack.
- seg0: user segment 0, used for instructions, static data & the heap.
Page Table Entry (PTE): holds translation & protection.
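To illustrate the multi-level idea (not the Alpha's actual format; the field widths and types here are hypothetical), a minimal two-level page-table walk in C:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical split of a 32-bit virtual address: 10-bit level-1
       index, 10-bit level-2 index, 12-bit page offset (4kB pages). */
    typedef struct { uint64_t pfn; int valid; } pte_t;
    typedef struct { pte_t *tables[1024]; } pgdir_t;

    /* Walk the two levels; returns the physical address, or -1 on a
       page fault (the OS would then allocate/fetch the page). */
    int64_t translate(pgdir_t *dir, uint32_t va) {
        uint32_t l1  = (va >> 22) & 0x3FF;
        uint32_t l2  = (va >> 12) & 0x3FF;
        uint32_t off = va & 0xFFF;
        pte_t *tab = dir->tables[l1];
        if (!tab || !tab[l2].valid)
            return -1;                     /* page fault: trap to SW */
        return (int64_t)(tab[l2].pfn << 12 | off);
    }

    int main(void) {
        static pte_t l2[1024];
        static pgdir_t dir;
        l2[0x345].pfn = 0xABCDE; l2[0x345].valid = 1;
        dir.tables[0x048] = l2;            /* covers VA 0x12345678 */
        printf("PA = 0x%llx\n", (long long)translate(&dir, 0x12345678));
        return 0;
    }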

22 Protection mechanisms
The address translation mechanism can be used to provide memory protection:
- Use protection attribute bits for each page, stored in the page table entry (PTE) (and TLB)
- Each physical page gets its own per-process protection
- Violations detected during the address translation cause exceptions (i.e., a SW trap)
- Supervisor/user modes are necessary to prevent user processes from changing, e.g., PTEs
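As a hedged POSIX illustration of per-page protection (mmap and mprotect are real POSIX calls; the scenario is mine): dropping the write permission on a page makes a later store trap, just as the slide describes.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t pg = (size_t)sysconf(_SC_PAGESIZE);
        /* Get one page of read/write memory. */
        char *p = mmap(NULL, pg, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        p[0] = 42;                      /* OK: page is writable */

        /* Drop write permission: the page's protection bits change. */
        mprotect(p, pg, PROT_READ);
        printf("read still works: %d\n", p[0]);
        /* p[0] = 43;  <- would now trap (SIGSEGV) during translation */
        munmap(p, pg);
        return 0;
    }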

23 Fast address translation
How can we avoid three extra memory references for each original memory reference?
- Store the most commonly used address translations in a cache: the Translation Look-aside Buffer (TLB)
- ==> The caches rear their ugly faces again!
Diagram: CPU -> VA -> TLB lookup -> PA -> cache -> main memory; the translations themselves live in memory.

24 Do we need a fast TLB?
- Why do a TLB lookup for every L1 access? Why not cache virtual addresses instead?
- Move the TLB to the other side of the cache: it is only needed for finding stuff in memory anyhow
- The TLB can be made larger and slower -- or can it?
Diagram: CPU -> VA -> cache; the TLB lookup happens only on the way to main memory.

25 Aliasing Problem
- The same physical page may be accessed using different virtual addresses
- A virtual cache will cause confusion -- a write by one process may not be observed
- Flushing the cache on each process switch is slow (and may only help partly)
=> VIPT (Virtually Indexed, Physically Tagged) is the answer
- Direct-mapped cache no larger than a page
- No more sets than there are cache lines on a page + logic
- Page coloring can be used to guarantee correspondence between more PA and VA bits (e.g., Sun Microsystems)

26 Virtually Indexed Physically Tagged = VIPT
Diagram: the VA index selects the cache set while the TLB translates the page number in parallel; the PA tag from the TLB is then compared against the cache's address tags for a hit.
- Have to guarantee that all aliases have the same index
- L1_cache_size < (page_size * associativity)
- Page coloring can help further

27 Putting it all together: VIPT
Cache: 8kB, 2-way, CL=32B, word=4B, page=4kB. TLB: 32 entries, 2-way.
Diagram: the VA splits into a VA tag (16 bits) for the TLB, a 7-bit cache index (same in PA & VA), a field identifying the word within a cache line, and a field identifying the byte within a word. The TLB's PTEs deliver the 20-bit PA page frame (PA address bits [31-12]), which is compared against the cache's 20-bit PA tags to signal a cache hit, while a matching PTE signals a TLB hit; a 16:1 multiplexer finally selects the 4B word.
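A small sketch that decomposes an address for this cache; the field widths follow from the slide's configuration (32B lines => 5 offset bits; 8kB / 2 ways / 32B = 128 sets => 7 index bits; offset+index = 12 bits = the 4kB page offset, which is why the index is identical in VA and PA):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t va = 0x12345678;
        uint32_t offset = va & 0x1F;         /* bits [4:0]: byte in line  */
        uint32_t index  = (va >> 5) & 0x7F;  /* bits [11:5]: set index    */
        uint32_t vpn    = va >> 12;          /* bits [31:12]: to the TLB  */
        printf("VA 0x%08x: offset %u, set index %u, VPN 0x%05x\n",
               va, offset, index, vpn);
        return 0;
    }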

28 What is the capacity of the TLB?
- Typical TLB size = 0.5-2kB
- Each translation entry is 4-8B ==> ~64-512 entries
- Typical page size = 4kB - 16kB
- TLB reach = 0.1MB - 8MB
FIX:
- Multiple page sizes, e.g., 8kB and 8MB
- TSB -- a direct-mapped translation table in memory used as a second-level TLB

29 VM: Page replacement
Most important: minimize the number of page faults. Page replacement strategies:
- FIFO: First-In-First-Out
- LRU: Least Recently Used
Approximation to LRU:
- Each page has a reference bit that is set on a reference
- The OS periodically resets the reference bits
- When a page is replaced, a page with a reference bit that is not set is chosen
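A minimal sketch of the reference-bit approximation as a clock-style sweep (one common realization of the idea; not necessarily the exact OS policy the slide has in mind):

    #include <stdio.h>

    #define NPAGES 8

    int refbit[NPAGES];            /* set by "hardware" on each access */

    /* Sweep over the pages: clear set reference bits (second chance),
       evict the first page whose bit is already clear. */
    int pick_victim(void) {
        static int hand = 0;
        for (;;) {
            int p = hand;
            hand = (hand + 1) % NPAGES;
            if (refbit[p] == 0) return p;
            refbit[p] = 0;         /* periodically reset, as on the slide */
        }
    }

    int main(void) {
        refbit[0] = refbit[1] = 1; /* pages 0 and 1 recently used */
        printf("evict page %d\n", pick_victim());  /* -> page 2 */
        return 0;
    }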

30 So far
Diagram: the CPU accesses the L1$ (data) and the TLB; a TLB miss triggers a fill from the page tables (the transl$) via the unified L2$ and memory; a page fault invokes the PF handler, which also fills the TLB. Instructions (I), data (D) and page-table entries (PT) compete for space all the way down to disk.

31 Adding TSB (software TLB cache)
Diagram: as above, but a TSB (translation storage buffer) in memory caches translations, so a TLB miss can often be filled from the TSB instead of walking the page tables.

32 VM: Write strategy
Write back or write through? Write back! Write through is impossible to use:
- Too long access time to disk
- The write buffer would need to be prohibitively large
- The I/O system would need an extremely high bandwidth

33 VM dictionary
Virtual memory system -- in the cache language:
- Virtual address ~ cache address
- Physical address ~ cache location
- Page ~ huge cache block
- Page fault ~ extremely painful $miss
- Page-fault handler ~ the software filling the $
- Page-out ~ write-back if dirty

34 Caches Everywhere
D cache, I cache, L2 cache, L3 cache, ITLB, DTLB, TSB, virtual memory system, branch predictors, directory cache

35 Exploring the Memory of a Computer System Erik Hagersten Uppsala University, Sweden eh@it.uu.se

36 Micro Benchmark Signature
for (times = 0; times < Max; times++)        /* many times */
  for (i = 0; i < ArraySize; i = i + Stride)
    dummy = A[i];                            /* touch an item in the array */
Measuring the average access time to memory while varying ArraySize and Stride will allow us to reverse-engineer the memory system. (Need to turn off HW prefetching...)
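A self-contained version of the loop with timing added so it can actually be run (POSIX clock_gettime; volatile keeps the compiler from deleting the loads; this is my reconstruction, not the original benchmark code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Stride through an array many times and report ns/access.
       Vary the two arguments (bytes) and disable HW prefetching
       to map out the memory hierarchy as in the slides. */
    int main(int argc, char **argv) {
        long size   = argc > 1 ? atol(argv[1]) : 4 << 20;
        long stride = argc > 2 ? atol(argv[2]) : 64;
        long n = size / (long)sizeof(int), s = stride / (long)sizeof(int);
        volatile int *A = calloc(1, (size_t)size), dummy = 0;
        long accesses = 0;
        if (!A) return 1;
        if (s <= 0) s = 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int times = 0; times < 100; times++)
            for (long i = 0; i < n; i += s) {
                dummy = A[i];          /* touch an item in the array */
                accesses++;
            }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/access\n", ns / accesses);
        return 0;
    }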

37 Micro Benchmark Signature
(same loop as above)
Graph: average access time (ns, up to ~700) vs. stride (bytes, 4K-4M) for array sizes from 16K to 8M.

38 Stepping through the array
(same loop as above)
Diagrams: which elements are touched for ArraySize=16, Stride=4; ArraySize=16, Stride=8; ArraySize=32, Stride=4; ArraySize=32, Stride=8.

39 Micro Benchmark Signature
(same loop as above)
Graph: the same measurement with the curves labeled by array size: ArraySize=8MB, 512kB, 32-256kB, 16kB.

40 Micro Benchmark Signature
(same loop as above)
Annotated graph: L1$ hit plateau; L1$ block size = 16B; L2$ hit = 40ns; L2$ block size = 64B; Mem = 300ns; Mem+TLBmiss and L2$+TLBmiss regions; page size = 8k ==> #TLB entries = 64 (56 normal + 8 large).

41 Twice as large L2 cache???
(same loop as above)
Graph: the same measurement -- how would the curves (ArraySize=1M highlighted) change with a twice-as-large L2 cache?

42 Twice as large TLB
(same loop as above)
Graph: the same measurement -- how would the curves (ArraySize=1MB highlighted) change with a twice-as-large TLB?

43 Optimizing for Cache/Memory Erik Hagersten Uppsala University, Sweden

44 Optimizing for the memory system: What is the potential gain?
- Latency difference between L1$ and mem: ~50x
- Bandwidth difference between L1$ and mem: ~20x
- Execute from L1$ instead of from mem ==> ...x improvement
- At least a factor of 2-4x is within reach

45 Optimizing for cache performance
- Keep the active footprint small
- Use the entire cache line once it has been brought into the cache
- Fetch a cache line prior to its usage
- Let the CPU that already has the data in its cache do the job...

46 Final cache lingo slide
- Miss ratio: what is the likelihood that a memory access will miss in a cache?
- Miss rate: ditto per time unit, e.g., per second, per ... instructions
- Fetch ratio/rate*): what is the likelihood that a memory access will cause a fetch to the cache [including HW prefetching]?
- Fetch utilization*): what fraction of a cache line was used before it got evicted?
- Writeback utilization*): what fraction of a cache line written back to memory contains dirty data?
- Communication utilization*): what fraction of a communicated cache line is ever used?
*) This is Acumem-ish language

47 What can go Wrong? A Simple Example
Perform a diagonal copy 10 times over an N x N array.

48 Example: Loop order
//Optimized Example A
for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    A[i][j] = A[i-1][j-1];
  }
}
//Unoptimized Example A
for (j=1; j<n; j++) {
  for (i=1; i<n; i++) {
    A[i][j] = A[i-1][j-1];
  }
}

49 Performance Difference: Loop order
Graph: speedup vs. unoptimized (up to ~20x) as a function of array side, on Athlon64 x2, Pentium D and Core 2 Duo. Demo Time! (ThreadSpotter)

50 Example 1: The Same Application Optimized
Graph: performance vs. #cores for LBM, original vs. optimized code: 2.7x.
- App: LBM
- Optimization can be rewarding, but costly
- Requires expert knowledge about MC and architecture
- Weeks of wading through performance data
- This fix required one line of code to change
Demo Time!

51 Example: Sparse data usage
//Optimized Example A
for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    A_d[i][j] = A_d[i-1][j-1];
  }
}
//Unoptimized Example A
for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    A[i][j].d = A[i-1][j-1].d;
  }
}
struct vec_type {
  char a;
  char b;
  char c;
  char d;
};
Diagram: the unoptimized version touches only the d byte of each struct, so only every fourth byte (d d d d) of a fetched cache line is useful.

52 Performance Difference: Sparse Data
Graph: speedup vs. unoptimized as a function of array side, on Athlon64 x2, Pentium D and Core 2 Duo.

53 Example 2: The Same Application Optimized
Graph: performance vs. #cores for Cigar, original vs. optimized: 7.3x.
- App: Cigar
- Looks like a perfectly scalable application! Are we done?
- Fix: duplicate one data structure
Demo Time!

54 Example: Sparse data allocation
sparse_rec sparse[HUGE];
for (int j = 0; j < HUGE; j++) {
  sparse[j].a = 'a';
  sparse[j].b = 'b';
  sparse[j].c = 'c';
  sparse[j].d = 'd';
  sparse[j].e = 'e';
  sparse[j].f1 = 1.0;
  sparse[j].f2 = 1.0;
  sparse[j].f3 = 1.0;
  sparse[j].f4 = 1.0;
  sparse[j].f5 = 1.0;
}
struct sparse_rec {  // size 80B
  char a; double f1;
  char b; double f2;
  char c; double f3;
  char d; double f4;
  char e; double f5;
};
struct dense_rec {   // size 48B
  double f1; double f2; double f3; double f4; double f5;
  char a; char b; char c; char d; char e;
};

55 Loop Merging
/* Unoptimized */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1)
    a[i][j] = 2 * b[i][j];
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1)
    c[i][j] = K * b[i][j] + d[i][j]/2;

/* Optimized */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1) {
    a[i][j] = 2 * b[i][j];
    c[i][j] = K * b[i][j] + d[i][j]/2;
  }

56 Padding of data structures
Diagram: a generic cache (SRAM cell matrix, 256 sets, 64B cache lines, per-way tag compare and way-select mux on address bits [63..0]). Walking column j of a matrix whose rows start at A, A+256*8, A+256*2*8, ... maps every access to the same cache index, so the accesses conflict.

57 Padding of data structures
Diagram: the same cache, but the rows now start at A, A+256*8+padding, A+256*2*8+2*padding, ... (row stride 256+padding). Allocating more memory than needed spreads the column walk over different cache indices.
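A hedged sketch of the padding idea in C (sizes are illustrative; the right pad depends on the cache geometry of the target machine):

    #include <stdlib.h>

    /* Rows of a 256-int matrix are 1kB apart, so a column-wise walk
       can keep hitting the same cache sets. Padding each row by one
       cache line (16 ints = 64B) skews the set mapping. */
    #define N   256
    #define PAD 16

    int main(void) {
        int *a = calloc((size_t)N * (N + PAD), sizeof(int));
        long sum = 0;
        if (!a) return 1;
        for (int j = 0; j < N; j++)                  /* column-wise walk */
            for (int i = 0; i < N; i++)
                sum += a[(size_t)i * (N + PAD) + j]; /* row stride N+PAD */
        free(a);
        return (int)(sum & 1);
    }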

58 Blocking
/* Unoptimized ARRAY: x = y * z */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1) {
    r = 0;
    for (k = 0; k < N; k = k + 1)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }
Diagram: access patterns in X (row i by j), Y (row i by k) and Z (k by j).

59 Blocking
/* Optimized ARRAY: X = Y * Z */
for (jj = 0; jj < N; jj = jj + B)
  for (kk = 0; kk < N; kk = kk + B)
    for (i = 0; i < N; i = i + 1)
      for (j = jj; j < min(jj+B,N); j = j + 1) {
        r = 0;
        for (k = kk; k < min(kk+B,N); k = k + 1)
          r = r + y[i][k] * z[k][j];
        x[i][j] += r;
      }
Diagram: X accumulates a partial solution while the first and second B x B blocks of Y and Z are reused.

60 Blocking: the Movie!
/* Optimized ARRAY: X = Y * Z */
for (jj = 0; jj < N; jj = jj + B)                 /* Loop 5 */
  for (kk = 0; kk < N; kk = kk + B)               /* Loop 4 */
    for (i = 0; i < N; i = i + 1)                 /* Loop 3 */
      for (j = jj; j < min(jj+B,N); j = j + 1) {  /* Loop 2 */
        r = 0;
        for (k = kk; k < min(kk+B,N); k = k + 1)  /* Loop 1 */
          r = r + y[i][k] * z[k][j];
        x[i][j] += r;
      }
Diagram: the blocks [jj..jj+B] x [kk..kk+B] sweep across X (partial solution), Y and Z as the loops advance from the first block to the second.

61 SW Prefetching
/* Unoptimized */
for (j = 0; j < N; j++)
  for (i = 0; i < N; i++)
    x[j][i] = 2 * x[j][i];

/* Optimized */
for (j = 0; j < N; j++)
  for (i = 0; i < N; i++) {
    PREFETCH x[j+1][i];
    x[j][i] = 2 * x[j][i];
  }
(Typically, the HW prefetcher will successfully prefetch sequential streams.)
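With GCC or Clang, the slide's PREFETCH pseudo-op can be written with the real builtin __builtin_prefetch; the one-row distance and the locality hint below are illustrative choices, not tuned values:

    /* Prefetch the next row while working on the current one. */
    void scale_rows(int n, double x[n][n]) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++) {
                if (j + 1 < n)
                    /* read prefetch (0), high temporal locality (3) */
                    __builtin_prefetch(&x[j + 1][i], 0, 3);
                x[j][i] = 2 * x[j][i];
            }
    }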

62 Cache Waste
/* Unoptimized */
for (s = 0; s < ITERATIONS; s++) {
  for (j = 0; j < HUGE; j++)
    x[j] = x[j+1];   /* will hog the cache but not benefit */
  for (i = 0; i < SMALLER_THAN_CACHE; i++)
    y[i] = y[i+1];   /* will be evicted between usages */
}

/* Optimized */
for (s = 0; s < ITERATIONS; s++) {
  for (j = 0; j < HUGE; j++) {
    PREFETCH_NT x[j+1];  /* will be installed in L1, but not L3 (AMD) */
    x[j] = x[j+1];
  }
  for (i = 0; i < SMALLER_THAN_CACHE; i++)
    y[i] = y[i+1];   /* will always hit in the cache */
}
Also important for single-threaded applications if they are co-scheduled and share a cache with other applications.
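On x86, PREFETCH_NT can be expressed with the _mm_prefetch intrinsic and the _MM_HINT_NTA hint from <xmmintrin.h>; whether the line actually bypasses L3 is microarchitecture-specific, as the slide's AMD note suggests. A sketch:

    #include <xmmintrin.h>

    /* Stream over the huge array with a non-temporal hint so it does
       not evict the small, reused array from the shared cache. */
    void streams(float *x, long huge, float *y, long small_n) {
        for (long j = 0; j + 1 < huge; j++) {
            _mm_prefetch((const char *)&x[j + 1], _MM_HINT_NTA);
            x[j] = x[j + 1];
        }
        for (long i = 0; i + 1 < small_n; i++)
            y[i] = y[i + 1];           /* expected to hit in the cache */
    }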

63 Categorizing and avoiding cache waste
Diagrams: per-instruction miss rate vs. cache size (no point in caching data that never benefits); cache hogging between cores sharing an L2; application classification (slowed by others / don't care / slows & slowed / slows others / hogging) shown individually, in a mix, and in a mix with the hoggers tamed automatically; throughput of bzip2, libquantum and LBM on AMD Opteron, with a 25% geometric-mean improvement.
Andreas Sandberg, David Eklov and Erik Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. In Proceedings of Supercomputing (SC), New Orleans, LA, USA, November 2010.

64 Example: Hints to avoid cache pollution (non-temporal prefetches)
Graphs: cache misses vs. cache size (the larger the cache, the better); with the hint "don't allocate!" the original's miss rate at the actual cache size is matched at ~actual/4. Throughput of one instance vs. four instances: the version limited to 1.7MB (hint: lim = actual/4) is ...% faster than the original.

65 Some performance tools
Free licenses:
- Oprofile
- GNU: gprof
- AMD: CodeAnalyst
- Google performance tools
- Virtual Institute: High Productivity Supercomputing
Not free:
- Intel: VTune and many more
- ThreadSpotter (of course)
- HP: Multicore toolkit (some free, some not)

66 Commercial Break: ThreadSpotter Erik Hagersten Uppsala University, Sweden

67 ThreadSpotter
Source: C, C++, Fortran, OpenMP
/* Unoptimized Array Multiplication: x = y * z, N = 1024 */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1) {
    r = 0;
    for (k = 0; k < N; k = k + 1)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }
Mission:
- Find the SlowSpots
- Assess their importance
- Enable non-experts to fix them
- Improve the productivity of performance experts
Any compiler -> binary -> sampler on the host system -> fingerprint (~4MB)

68 Source: C, C++, Fortran...
(same code and mission as above)
The sampler turns the binary into a fingerprint (~4MB) on the host system; the analysis then combines it with the target system parameters to produce advice: what, how and where.

69 A One-Click Report Generation
Fill in the following fields:
- Application to run
- Input arguments
- Working dir (where to run the app)
- (Limit, if you like, the data gathered, e.g., start gathering after 10 sec. and stop after 10 sec.)
- Cache size of the target system to optimize for (e.g., L1 or L2 size)
Click the button to create a report.

70 Fetch rate
Screenshot: fetch rate and miss rate vs. cache size; cache utilization (fraction of cache data utilized); predicted fetch rate (if utilization were 100%).

71 Cache size to optimize for

72 Loop Focus Tab
Screenshot: spotting the crime -- a list of bad loops, with explanations of what to do.

73 Bandwidth Focus Tab
Screenshot: spotting the crime -- a list of bandwidth SlowSpots, with explanations of what to do.

74 Resource Sharing Example: Libquantum
- A quantum computer simulation
- Widely used in research (download from: )
- ... lines of C, fairly complex code. Runs an experiment in ~30 min
Graph: relative throughput (0.5-2) vs. number of cores used -- throughput improvement up to ~2x.

75 Utilization Analysis: Libquantum
Graph (original code): fetch rate vs. cache size, cache utilization 1.3% (fraction of cache data utilized), and the predicted fetch rate if utilization were 100%. Would need 32 MB per thread!
The record layout interleaves data and status (data 0, status 0, data 1, status 1, ...), but the main loop only accesses the status data.
SlowSpotter's first advice: improve utilization
- Change one data structure
- Involves ~20 lines of code
- Takes a non-expert 30 min

76 Utilization Analysis: Libquantum
Graph: fetch rate of the original code vs. the utilization-optimized code; cache utilization; predicted fetch rate if utilization = 100%.
Before:
for (i = 0; i < max; i++) {
  ... = huge_data[i].status + ...
}
After:
for (i = 0; i < max; i++) {
  ... = huge_data_status[i] + ...
}
SlowSpotter's first advice: improve utilization -- change one data structure, ~20 lines of code, takes a non-expert 30 min.

77 After Utilization Optimization: Libquantum
Graph: old vs. new fetch rate vs. cache size, original code vs. utilization optimization; cache utilization now 95%.

78 Utilization Optimization
Graph: old vs. new fetch rate vs. cache size; cache utilization 95%.
Two positive effects from better utilization:
1. Each fetch brings in more useful data -> lower fetch rate
2. The same amount of useful data can fit in a smaller cache -> the curve shifts left

79 Reuse Analysis: Libquantum
Graph: fetch rate of the utilization optimization vs. utilization + fusion optimization.
Before: toffoli(huge_data, ...); cnot(huge_data, ...);  After: fused_toffoli_cnot(huge_data, ...);
Second-fifth SlowSpotter advice: improve reuse of data
- Fuse functions traversing the same data
- Here: four fused functions created
- Takes a non-expert < 2h

80 Effect: Reuse Optimization (SPEC CPU libquantum)
Graph: old vs. new fetch rate, utilization optimization vs. utilization + fusion optimization.
1. The miss in the second loop goes away
2. Still need the same amount of cache to fit all data

81 Utilization + Reuse Optimization: Libquantum
Graph: old vs. new fetch rate, utilization optimization vs. utilization + fusion optimization.
- Fetch rate down to 1.3% for 2MB
- Same as a 32 MB cache originally

82 Summary: Libquantum
Graph: throughput vs. # cores used (1-8) for the original, the utilization optimization, and utilization + fusion: 2.7x.

83 Uppsala Programming for Multicore Architecture Center
62 MSEK grant / 10 years [$9M/10y] + related additional grants at UU = 130 MSEK
Research areas:
- Erik: performance modeling
- New parallel algorithms
- Scheduling of threads and resources
- Testing & verification
- Language technology
- MC in wireless and sensors

84 Underneath the ThreadSpotter Hood

85 Great but Slow Insight: Simulation
Slowdown: ...x
Diagram: a simulated CPU runs the code
  set A,%r1
  ld  [%r1],%r0
  st  %r0,[%r1+8]
  add %r1,1,%r1
  ld  [%r1+16],%r0
  add %r0,%r5,%r5
  st  %r5,[%r1+8]
  [...]
producing a memory-reference stream (1: read A, 2: write B, 3: read C, 4: write B, [...]) that drives a simulated memory system (level-1 cache ... level-n cache, memory).

86 Limited Insight: Hardware Counters
Slowdown: 0%
Diagram: an ordinary computer with HW counters at the CPU, the level-1 cache, the level-n cache and memory.
- No flexibility
- Limited insight: "instruction X misses Y% of the time in the cache"
- Architecturally dependent (!!)

87 Need Efficiency and Insight: Our Approach
Machine-independent runtime information + efficient modeling -> draw conclusions, build tools.
Gather runtime info:
1. Capture data locality information
2. Measure the impact of resource allocations
3. Capture code usage information
Solve equations and add heuristics (clustering, K-means, ...) to predict (for many options): cache statistics, bandwidth requirement, performance, power consumption, phase behavior...
Find the best: core type, cache size, thread scheduling, frequency, code optimizations.

88 StatCache: Insight and Efficiency
Slowdown: 10% (for long-running applications)
Online sampling on the host computer: a sparse sampler randomly selects accesses to monitor from the address stream (1: read A, 2: read B, 3: read C, 4: write C, 5: read B, 6: read D, 7: read A, 8: read E, 9: read B) and records their reuse distances (A...A: reuse distance 5; B...B: reuse distance 3), forming an application fingerprint (5, 3, ...).
Offline insight technology: a probabilistic cache model combines the fingerprint with the target architecture's parameters (cores, L1s, L2, memory) to produce modeled behavior -- Acumem advice.

89 UART: Efficient sparse sampling
1. Use HW counter overflow to randomly select accesses to sample (e.g., on average every ...th access)
2. Set a watchpoint for the data cache line they touch
3. Use HW counters to count #memory accesses until the watchpoint traps
Sampling overhead ~17% (10% at Acumem for long-running apps). (Modeling with math < 100ms.)

90 Fingerprint
Graph: sparse reuse-distance histogram h(d) vs. reuse distance d.

91 Modeling random caches with math (assumption: constant MissRatio)
Example stream: A B D B E B A F D B ... -- the reuse of A has rd_i = 5.
Number of replacements during the reuse: #repl = 5 * MissRatio.
Miss probability: p_miss = m(#repl), where m is the miss equation as a function of the number of replacements.

92 Assuming a fully associative cache
The cache line A is in a cache with L cache lines.
After 1 replacement: (1 - 1/L) chance that A survives.
After R replacements: (1 - 1/L)^R chance that A survives.

93 Modeling random caches with math (assumption: constant MissRatio)
m(repl) = 1 - (1 - 1/L)^repl
For a reuse with rd_i = 5: #repl = 5 * MissRatio, so p_miss = m(5 * MissRatio); for rd_i = 3, p_miss = m(3 * MissRatio).
With n samples:
  MissRatio * n = Σ_{i=0..n} m(rd(i) * MissRatio)
Can be solved in a fraction of a second for different L.
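As a hedged illustration (the reuse distances below are made up, not measured data), a C sketch that solves the fixed-point equation above by simple iteration, with m(r) = 1 - (1 - 1/L)^r:

    #include <stdio.h>
    #include <math.h>

    /* Fixed-point iteration for MissRatio * n = sum_i m(rd_i * MissRatio),
       where m(r) = 1 - (1 - 1/L)^r models a random-replacement cache
       of L lines. Compile with -lm. */
    int main(void) {
        double rd[] = { 5, 3, 40, 7, 1000, 2, 12, 90 };
        int n = sizeof rd / sizeof rd[0];
        double L = 64;                     /* cache size in lines */
        double miss = 0.1;                 /* initial guess */
        for (int it = 0; it < 100; it++) {
            double sum = 0;
            for (int i = 0; i < n; i++)
                sum += 1.0 - pow(1.0 - 1.0 / L, rd[i] * miss);
            miss = sum / n;                /* next iterate */
        }
        printf("estimated miss ratio: %.4f\n", miss);
        return 0;
    }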

94 Accuracy: Simulation vs. math (random replacement)
Graph: miss ratio (%) vs. cache size (bytes) for gzip, ammp and vpr, comparing simulation (slowdown ~100x) with the math (fractions of a second).

95 Modeling LRU Caches: Stack distance
Sampled reuse pair A-A with rd_i = 5: A B C C B D A E B C C ... (Start = 1, End = 7)
Stack distance: how many unique data objects are touched in between? Answer: 3.
If we knew all reuses: how many of the reuses at positions 2-6 go beyond End? Answer: 3.
  Stack_distance = Σ_{k=Start..End} [d(k) > (End - k)]
For each sample: if (Stack_distance > L) miss++; else hit++;

96 But we only know a few reuse distances...
Stream: A B C C B D A E B C C, with sampled reuse distances d(1), d(2), d(3), ... collected into a histogram h(d).
Estimate: how many of the reuses at positions 2-6 go beyond End? Answer: Est_SD.
Assume that the distribution (aka histogram) of sampled reuses is representative for all accesses in that time window:
  Est_SD = Σ_{k=Start..End} p[d(k) > (End - k)]
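A minimal C sketch of this estimate, assuming the reconstructed formula above and a made-up sampled histogram (hypothetical values, not StatStack's actual implementation):

    #include <stdio.h>

    #define MAXD 64

    /* P[d > t] from a sparse reuse-distance histogram h(d):
       the fraction of sampled reuses longer than t. */
    double p_longer(const double h[MAXD], double total, int t) {
        double s = 0;
        for (int d = t + 1; d < MAXD; d++) s += h[d];
        return total > 0 ? s / total : 0;
    }

    /* Est_SD = sum over the window of the probability that each
       intervening reuse reaches past the end of the window. */
    double est_stack_distance(const double h[MAXD], double total,
                              int start, int end) {
        double sd = 0;
        for (int k = start; k <= end; k++)
            sd += p_longer(h, total, end - k);
        return sd;
    }

    int main(void) {
        double h[MAXD] = {0};
        h[1] = 3; h[2] = 2; h[5] = 1;   /* made-up sampled reuses */
        printf("Est_SD = %.2f\n", est_stack_distance(h, 6, 1, 6));
        return 0;
    }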

97 All SPEC 2006

98 Architecturally independent!
- The fingerprint does not depend on the caches of the host architecture
- Solve the equation for different target architectures: cache sizes, cache line sizes, replacement algorithms {LRU, RND}, cache topology

99 In a nutshell
1. Sampling: randomly select windows over the run (time axis in memory accesses) and collect sparse reuse histograms d(1), d(2), d(3), ... from each window
2. Use the histogram h(d) as input to the model to predict the behavior of the target architecture, e.g., miss ratio vs. cache size

100 Example 2: Modeling Shared Resources
Offline modeling: application fingerprints A, B and C (5, 3, ...) plus the architectural parameters of a target multicore (private L1s, shared L2, memory) feed a shared-cache model that predicts the multicore cache-sharing behavior.
Publications: Parallel Architecture and Compilation Techniques (PACT), Sept. ...; International Conference on High-Performance Architectures and Compilers, Jan 2011 (best paper award).

101 Example 3: Sensitivity Measurements
Online measurement on the host: a "thief" (Cache Pirate, Bandwidth Bandit) steals cache or bandwidth on a core next to the application while its speed, $misses, $hits and bandwidth are measured, quantifying the application's sensitivity.
Offline modeling: the model combines these measurements with the target's architectural parameters to produce modeled behavior.
Quantifying applications: International Symposium on Performance Analysis of Systems and Software; International Conference on Parallel Processing 2011 (best paper award).
Advanced modeling: predicting performance and bandwidth requirements when applications share a cache (PACT, Sept. ...); bandwidth-limited performance prediction (just submitted).

102 3. Efficient Runtime Capturing
- State-of-the-art simulator running gcc: 36h to simulate 50s of execution (OH = ...%)
- On-line phase detection (ScarPhase): OH = 2%
- Phase-guided sampling + modeling: 60s (OH = 20% avg)
Publications: International Conference on Code Generation and Optimization; IEEE International Symposium on Workload Characterization 2011; MASCOTS; IEEE International Symposium on Workload Characterization.

103 Summing up fast modeling
The world's fastest:
1. Cache locality sampler (OH ~20%)
   - cache hit-rate model for data and instructions (~10ms)
   - multi-threading model [a.k.a. coherence model] (~10ms)
   - cache sharing model (~10ms)
2. Cache/BW quantitative measurements (OH ~5%)
   - cache sharing model (~10ms)
   - performance prediction & BW requirement (~10ms)
   - performance prediction model (~10ms)
3. On-line phase detection tool (OH ~2%)
   - phase-guided sampling
   - phase-guided power management


More information

Advanced Computer Architecture- 06CS81-Memory Hierarchy Design

Advanced Computer Architecture- 06CS81-Memory Hierarchy Design Advanced Computer Architecture- 06CS81-Memory Hierarchy Design AMAT and Processor Performance AMAT = Average Memory Access Time Miss-oriented Approach to Memory Access CPIExec includes ALU and Memory instructions

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Memory Cache. Memory Locality. Cache Organization -- Overview L1 Data Cache

Memory Cache. Memory Locality. Cache Organization -- Overview L1 Data Cache Memory Cache Memory Locality cpu cache memory Memory hierarchies take advantage of memory locality. Memory locality is the principle that future memory accesses are near past accesses. Memory hierarchies

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 13

ECE 571 Advanced Microprocessor-Based Design Lecture 13 ECE 571 Advanced Microprocessor-Based Design Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements More on HW#6 When ask for reasons why cache

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

Memory hierarchy Outline

Memory hierarchy Outline Memory hierarchy Outline Performance impact Principles of memory hierarchy Memory technology and basics 2 Page 1 Performance impact Memory references of a program typically determine the ultimate performance

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Who Cares About the Memory Hierarchy? Processor Only Thus

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Main Memory (Fig. 7.13) Main Memory

Main Memory (Fig. 7.13) Main Memory Main Memory (Fig. 7.13) CPU CPU CPU Cache Multiplexor Cache Cache Bus Bus Bus Memory Memory bank 0 Memory bank 1 Memory bank 2 Memory bank 3 Memory b. Wide memory organization c. Interleaved memory organization

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Virtual Memory. Motivation:

Virtual Memory. Motivation: Virtual Memory Motivation:! Each process would like to see its own, full, address space! Clearly impossible to provide full physical memory for all processes! Processes may define a large address space

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems

Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems Instructor: Dr. Turgay Korkmaz Department Computer Science The University of Texas at San Antonio Office: NPB

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory

More information

Page 1. Review: Address Segmentation " Review: Address Segmentation " Review: Address Segmentation "

Page 1. Review: Address Segmentation  Review: Address Segmentation  Review: Address Segmentation Review Address Segmentation " CS162 Operating Systems and Systems Programming Lecture 10 Caches and TLBs" February 23, 2011! Ion Stoica! http//inst.eecs.berkeley.edu/~cs162! 1111 0000" 1110 000" Seg #"

More information

Virtual Memory - Objectives

Virtual Memory - Objectives ECE232: Hardware Organization and Design Part 16: Virtual Memory Chapter 7 http://www.ecs.umass.edu/ece/ece232/ Adapted from Computer Organization and Design, Patterson & Hennessy Virtual Memory - Objectives

More information

Lec 11 How to improve cache performance

Lec 11 How to improve cache performance Lec 11 How to improve cache performance How to Improve Cache Performance? AMAT = HitTime + MissRate MissPenalty 1. Reduce the time to hit in the cache.--4 small and simple caches, avoiding address translation,

More information