1 15-740/18-740 Computer Architecture Lecture 5: Project Example Justin Meza Yoongu Kim Fall 2011, 9/21/2011

2 Reminder: Project Proposals Project proposals due NOON on Monday 9/26 Two to three pages consisting of Problem Novelty Idea Hypothesis Methodology Plan All the details are in the project handout 2

3 Agenda for Today's Class Brief background on hybrid main memories Project example from Fall 2010 Project pitches and feedback Q & A 3

4 Main Memory in Today's Systems CPU DRAM HDD/SSD 4

5 Main Memory in Today's Systems CPU Main memory DRAM HDD/SSD 5

6 DRAM Pros Low latency Low cost Cons Low capacity High power Some new and important applications require HUGE capacity (in the terabytes) 6

7 Main Memory in Today's Systems CPU Main memory DRAM HDD/SSD 7

8 Hybrid Memory (Future Systems) Hybrid main memory CPU DRAM (cache) New memories (high capacity) HDD/SSD 8

9 Row Buffer Locality-Aware Hybrid Memory Caching Policies Justin Meza HanBin Yoon Rachata Ausavarungnirun Rachael Harding Onur Mutlu

10 Motivation Two conflicting trends: 1. ITRS predicts the end of DRAM scalability 2. Workloads continue to demand more memory Want future memories to have Large capacity High performance Energy efficiency Need scalable DRAM alternatives 10

11 Motivation Emerging memories can offer more scalability Phase change memory (PCM) Projected to be 3–12× denser than DRAM However, cannot simply replace DRAM Longer access latencies (4–12× DRAM) Higher access energies (2–40× DRAM) Use DRAM as a cache to large PCM memory [Mohan, HPTS 09; Lee+, ISCA 09] 11

12 Phase Change Memory (PCM) Data stored in form of resistance High current melts cell material Rate of cooling determines stored resistance Low current used to read cell contents 12

13 Projected PCM Characteristics (~2013) at 32 nm: DRAM vs. PCM (relative to DRAM) [Mohan, HPTS 09; Lee+, ISCA 09]
   Cell size: 6 F² vs. ? F², denser
   Read latency: 60 ns vs. ? ns, 6–13× slower
   Write latency: 60 ns vs. 1400 ns, 24× slower
   Read energy: 1.2 pJ/bit vs. 2.5 pJ/bit, 2× more energy
   Write energy: 0.39 pJ/bit vs. 16.8 pJ/bit, 40× more energy
   Durability: N/A vs. ? writes, limited lifetime 13

14 Row Buffers and Locality Memory array organized in columns and rows Row buffers store contents of accessed row Row buffers are important for mem. devices Device slower than bus: need to buffer data Fast accesses for data with spatial locality DRAM: Destructive reads PCM: Writes are costly: want to coalesce 14

15 Row Buffers and Locality [Diagram: the address selects a row whose data is latched in the row buffer; LOAD X is a row buffer miss, and the following LOAD X+1 to the same row is a row buffer hit] 15
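
To make the hit/miss behavior above concrete, here is a minimal sketch (not from the lecture's simulator) of a single bank's row buffer: an access to the currently open row is a fast hit, and any other row forces an activation, i.e., a miss. The 40 ns / 80 ns values mirror the DRAM timings listed in the backup methodology slide; the structure and names are illustrative assumptions.

/* Minimal row buffer model for one bank (illustrative only). */
#include <stdio.h>

#define ROW_HIT_NS   40   /* DRAM row buffer hit latency (backup slide)  */
#define ROW_MISS_NS  80   /* DRAM row buffer miss latency (backup slide) */

typedef struct { long open_row; } bank_t;

static int access_row(bank_t *bank, long row)
{
    if (bank->open_row == row)
        return ROW_HIT_NS;      /* data already latched in the row buffer */
    bank->open_row = row;       /* activate the new row                   */
    return ROW_MISS_NS;
}

int main(void)
{
    bank_t bank = { .open_row = -1 };
    long row_of_x = 7;          /* hypothetical row holding X and X+1     */

    /* LOAD X misses (row not open); LOAD X+1 falls in the same 2 KB row
     * and therefore hits, as in the slide's example.                     */
    printf("LOAD X   : %d ns\n", access_row(&bank, row_of_x));
    printf("LOAD X+1 : %d ns\n", access_row(&bank, row_of_x));
    return 0;
}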

16 Key Idea Since DRAM and PCM both use row buffers, Row buffer hit latency is the same in DRAM and PCM Row buffer miss latency is small in DRAM Row buffer miss latency is large in PCM Cache data in DRAM which Frequently misses in the row buffer Is reused many times → because the miss penalty is smaller in DRAM 16

17 Hybrid Memory Architecture CPU Memory Controller DRAM Cache (Low density) PCM (High density) Memory channel 17

18 Hybrid Memory Architecture DRAM Ctlr CPU PCM Ctlr DRAM Cache (Low density) PCM (High density) 18

19 Hybrid Memory Architecture Tag store: 2 KB rows CPU Memory Controller DRAM Cache (Low density) PCM (High density) 19

20 Hybrid Memory Architecture Tag store: X → DRAM LOAD X CPU Memory Controller DRAM Cache (Low density) PCM (High density) 20

21 Hybrid Memory Architecture Tag store: Y → PCM LOAD Y CPU Memory Controller DRAM Cache (Low density) PCM (High density) How does data get migrated to DRAM? Caching Policy 21

22 Methodology Simulated our system configurations Collected program traces using a tool called Pin Fed instruction trace information to a timing simulator modeling an OoO core and DDR3 memory Migrated data at the row (2 KB) granularity Collected memory traces from a standard computer architecture benchmark suite SPEC CPU2006 Used an in-house simulator written in C# 22
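
The trace-driven flow on this slide can be sketched roughly as below. This is an assumption-laden illustration in C, not the authors' in-house C# simulator: the record layout, file name, and latency numbers are placeholders, and a real timing model would track banks, row buffers, and the DRAM/PCM placement decision rather than charging a flat cost.

#include <stdio.h>

/* Hypothetical trace record; a Pin tool can emit whatever layout the
 * simulator expects, this one exists only for illustration.            */
typedef struct {
    unsigned long addr;     /* physical address of the memory access    */
    unsigned char is_write; /* 0 = read, 1 = write                      */
} trace_rec_t;

int main(void)
{
    FILE *f = fopen("memtrace.bin", "rb");   /* hypothetical trace file */
    if (!f) { perror("memtrace.bin"); return 1; }

    trace_rec_t rec;
    unsigned long long cycles = 0, accesses = 0, row_changes = 0;
    unsigned long last_row = (unsigned long)-1;

    while (fread(&rec, sizeof rec, 1, f) == 1) {
        unsigned long row = rec.addr / 2048;  /* 2 KB row granularity   */
        if (row != last_row) {                /* crude row-change count */
            row_changes++;
            last_row = row;
        }
        cycles += rec.is_write ? 200 : 100;   /* flat placeholder cost  */
        accesses++;
    }
    fclose(f);
    printf("%llu accesses, %llu row changes, %llu cycles\n",
           accesses, row_changes, cycles);
    return 0;
}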

23 Conventional Caching Data is migrated when first accessed Simple, used for many caches 23

24 Conventional Caching Data is migrated when first accessed Simple, used for many caches [Diagram: tag store maps Z → PCM; a load triggers migration of rows Rw1 and Rw2 from PCM to the DRAM cache over the memory channel, causing bus contention] How does conventional caching perform in a hybrid main memory? 24
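
As a point of reference for the results that follow, conventional caching reduces to a migrate-on-first-touch rule. A sketch under assumed structure sizes (the table size and function names are illustrative):

#include <stdbool.h>

#define NUM_ROWS 4096            /* assumed number of PCM rows tracked  */

static bool in_dram[NUM_ROWS];   /* tag store: is this row cached?      */

/* Returns true if the access triggered a row migration over the channel. */
static bool access_conventional(unsigned row)
{
    if (in_dram[row])
        return false;            /* already cached: serve from DRAM      */
    in_dram[row] = true;         /* first touch: migrate the 2 KB row    */
    return true;                 /* migration consumes channel bandwidth */
}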

25 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching 25

26 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching Beneficial for some benchmarks 26

27 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching Performance degrades due to bus contention 27

28 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Many row buffer hits: don't need to migrate data Conventional Caching 28

29 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Want to identify data which misses in row buffer and is reused Conventional Caching 29

30 Problems with Conventional Caching Performs useless migrations Migrates data which are not reused Migrates data which hit in the row buffer Causes bus contention and DRAM pollution Want to cache rows which are reused Want to cache rows which miss in row buffer 30

31 Problems with Conventional Caching Performs useless migrations Migrates data which are not reused Migrates data which hit in the row buffer Causes bus contention and DRAM pollution Want to cache rows which are reused Want to cache rows which miss in row buffer 31

32 A Reuse-Aware Policy Keep track of the number of accesses to a row Cache row in DRAM when accesses ≥ A Reset accesses every Q cycles Similar to CHOP [Jiang+, HPCA 10] Cached hot (reused) pages in on-chip DRAM To reduce off-chip bandwidth requirements We call this policy A-COUNT 32
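
A sketch of how A-COUNT might be maintained with a small per-row statistics store; the threshold, table size, and quantum length here are illustrative assumptions, not the tuned settings from the study:

#include <string.h>

#define NUM_ROWS  4096
#define THRESH_A  4                      /* accesses before migrating    */
#define QUANTUM   10000000ULL            /* cycles between counter resets */

static unsigned access_count[NUM_ROWS];  /* statistics store              */
static unsigned char in_dram[NUM_ROWS];  /* tag store                     */

/* Called on every access to a PCM-resident row; returns 1 on migration. */
static int acount_access(unsigned row)
{
    if (in_dram[row])
        return 0;
    if (++access_count[row] >= THRESH_A) {  /* enough reuse observed      */
        in_dram[row] = 1;                   /* cache the row in DRAM      */
        return 1;
    }
    return 0;
}

/* Called once per quantum to age out stale reuse information. */
static void acount_new_quantum(void)
{
    memset(access_count, 0, sizeof access_count);
}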

33 A Reuse-Aware Policy IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) 33

34 A Reuse-Aware Policy IPC Normalized to All DRAM Performs fewer migrations: reduces channel contention No Caching (All PCM) Conventional Caching A-COUNT (A = 4) 34

35 A Reuse-Aware Policy IPC Normalized to All DRAM Too few migrations: too many accesses go to PCM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) 35

36 A Reuse-Aware Policy IPC Normalized to All DRAM Rows with many hits still needlessly migrated No Caching (All PCM) Conventional Caching A-COUNT (A = 4) 36

37 Problems with Reuse-Aware Policy Agnostic of DRAM/PCM access latencies May keep data that misses in the row buffer in PCM Missed opportunity: could save cycles in DRAM 37

38 Problems with Reuse-Aware Policy Agnostic of DRAM/PCM access latencies [Timelines comparing service in PCM vs. DRAM: data with frequent row buffer hits saves few cycles if placed in DRAM (hits cost the same in both), while data with frequent row buffer misses saves many cycles if placed in DRAM (misses are far slower in PCM)] 38

39 Row Buffer Locality-Aware Policy Cache rows which benefit from being in DRAM I.e., those with frequent row buffer misses Keep track of number of misses to a row Cache row in DRAM when misses ≥ M Reset misses every Q cycles We call this policy M-COUNT 39

40 Row Buffer Locality-Aware Policy IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) 40

41 Row Buffer Locality-Aware Policy IPC Normalized to All DRAM Recognizes rows with many hits and does not migrate them No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) 41

42 Row Buffer Locality-Aware Policy IPC Normalized to All DRAM Lots of data with just enough misses to get cached but little reuse after being cached → need to also track reuse No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) 42

43 Combined Reuse/Locality Approach Cache rows with reuse and which frequently miss in the row buffer Use A-COUNT as predictor of future reuse and M-COUNT as predictor of future row buffer misses Cache row if accesses ≥ A and misses ≥ M We call this policy AM-COUNT 43
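
The threshold policies differ only in which counters they consult. A compact sketch of the AM-COUNT decision (M-COUNT is the same check with the access condition dropped), using the A = 4, M = 2 values that appear as static settings in the plots; the struct and function names are illustrative:

#define THRESH_A 4
#define THRESH_M 2

struct row_stats {
    unsigned accesses;   /* total accesses to this row in the quantum     */
    unsigned rb_misses;  /* row buffer misses to this row in the quantum  */
};

/* Migrate only if the row has shown both reuse and row buffer misses. */
static int am_count_should_migrate(const struct row_stats *s)
{
    return s->accesses >= THRESH_A && s->rb_misses >= THRESH_M;
}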

44 Combined Reuse/Locality Approach Normalized to All DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) AM-COUNT

45 Combined Reuse/Locality Approach Normalized to All DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) AM-COUNT (A = 4, M = 2) Reduces useless migrations 45

46 Combined Reuse/Locality Approach Normalized to All DRAM And data with little reuse kept out of DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) AM-COUNT

47 Dynamic Reuse/Locality Approach Previously mentioned policies require profiling To determine the best A and M thresholds We propose a dynamic threshold policy Performs a cost-benefit analysis every Q cycles Simple hill-climbing algorithm to maximize benefit (Side note: we simplify the problem slightly by just finding the best A threshold, because we observe that M = 2 performs the best for a given A.) 47

48 Cost-Benefit Analysis Each quantum, we measure the first-order costs and benefits of the current A threshold Cost = cycles of bus contention due to migrations Benefit = cycles saved at the banks by servicing a request in DRAM versus PCM Cost = Migrations × t_migration Benefit = Reads_DRAM × (t_read,PCM − t_read,DRAM) + Writes_DRAM × (t_write,PCM − t_write,DRAM) 48

49 Cost-Benefit Maximization Algorithm Each quantum (10 million cycles):
    Net = Benefit - Cost            // net benefit
    if Net < 0 then                 // too many migrations?
        A++                         // increase threshold
    else                            // last A beneficial
        if Net > PreviousNet then   // increasing benefit?
            A++                     // try next A
        else                        // decreasing benefit
            A--                     // too strict, reduce
        end
    end
    PreviousNet = Net 49
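
A hedged C rendering of the per-quantum controller, combining slide 48's cost/benefit formulas with the hill-climbing rule above; the counter names, latency constants, and the lower clamp on A are assumptions rather than details from the slides:

typedef struct {
    unsigned long migrations;    /* rows migrated this quantum          */
    unsigned long dram_reads;    /* reads served by the DRAM cache      */
    unsigned long dram_writes;   /* writes served by the DRAM cache     */
} quantum_stats_t;

enum { T_MIGRATION = 1000, T_READ_PCM = 300, T_READ_DRAM = 80,
       T_WRITE_PCM = 1000, T_WRITE_DRAM = 80 };  /* placeholder cycles  */

static long prev_net = 0;
static int  threshold_a = 1;

static void end_of_quantum(const quantum_stats_t *s)
{
    long cost    = (long)(s->migrations * T_MIGRATION);
    long benefit = (long)(s->dram_reads  * (T_READ_PCM  - T_READ_DRAM) +
                          s->dram_writes * (T_WRITE_PCM - T_WRITE_DRAM));
    long net     = benefit - cost;

    if (net < 0)                      /* too many migrations            */
        threshold_a++;                /* be stricter about caching      */
    else if (net > prev_net)          /* benefit still increasing       */
        threshold_a++;                /* try the next threshold         */
    else                              /* benefit decreasing             */
        threshold_a--;                /* last threshold was too strict  */

    if (threshold_a < 1)
        threshold_a = 1;              /* keep the threshold sensible    */
    prev_net = net;
}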

50 Dynamic Policy Performance IPC Normalized to All DRAM No Caching (All PCM) Best Static Conventional Caching Dynamic 50

51 Dynamic Policy Performance IPC Normalized to All DRAM No Caching (All PCM) Best Static Conventional Caching Dynamic 29% improvement over All PCM, Within 18% of All DRAM 51

52 Evaluation Methodology/Metrics 16-core system Averaged across 100 randomly-generated workloads of varying working set size LARGE = working set size > main memory size Weighted speedup (performance) = Σ_i IPC_i,together / IPC_i,alone Maximum slowdown (fairness) = max_i IPC_i,alone / IPC_i,together 52
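
A small sketch of the two metrics as they are conventionally defined (assumed to match the slide's intent): weighted speedup sums each core's IPC running together over its IPC running alone, and maximum slowdown takes the worst alone-to-together ratio across cores.

#define NCORES 16

static double weighted_speedup(const double ipc_together[],
                               const double ipc_alone[])
{
    double ws = 0.0;
    for (int i = 0; i < NCORES; i++)
        ws += ipc_together[i] / ipc_alone[i];   /* per-core speedup share */
    return ws;
}

static double maximum_slowdown(const double ipc_together[],
                               const double ipc_alone[])
{
    double ms = 0.0;
    for (int i = 0; i < NCORES; i++) {
        double slowdown = ipc_alone[i] / ipc_together[i];
        if (slowdown > ms)
            ms = slowdown;                      /* worst-case slowdown    */
    }
    return ms;
}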

53 16-core Performance & Fairness [Charts: (a) weighted speedup and (b) maximum slowdown for Conventional Caching, A-COUNT, AM-COUNT, and DAM-COUNT versus the fraction of LARGE benchmarks (0% to 100%)] 53

54 16-core Performance & Fairness More contention → more benefit [same charts: (a) weighted speedup, (b) maximum slowdown] 54

55 16-core Performance & Fairness Dynamic policy can adjust to different workloads [same charts: (a) weighted speedup, (b) maximum slowdown] 55

56 Versus All PCM and All DRAM Compared to an All PCM main memory 17% performance improvement 21% fairness improvement Compared to an All DRAM main memory Within 21% of performance Within 53% of fairness 56

57 Robustness to System Configuration [Chart: Normalized Weighted Speedup versus Sorted Workload Number for 2-core, 4-core, 8-core, and larger configurations] 57

58 Implementation/Hardware Cost Requires a tag store in memory controller We currently assume 36 KB of storage per 16 MB of DRAM We are investigating ways to mitigate this overhead Requires a statistics store To keep track of accesses and misses 58

59 Conclusions DRAM scalability is nearing its limit Emerging memories (e.g. PCM) offer scalability Problem: must address high latency and energy We propose a dynamic, row buffer locality-aware caching policy for hybrid memories Cache rows which miss frequently in row buffer Cache rows which are reused many times 17/21% perf/fairness improvement vs. all PCM Within 21/53% perf/fairness of all DRAM system 59

60 Thank you! Questions? 60

61 Backup Slides 61

62 Related Work [Chart: Weighted Speedup for DIP, Probabilistic, Probabilistic+RBL, and DAM-COUNT] 62

63 PCM Latency [Chart: Weighted Speedup versus PCM Latency Scaling Factor for All PCM and DAM-COUNT] 63

64 DRAM Cache Size [Chart: Weighted Speedup for Conventional Caching and DAM-COUNT at DRAM cache sizes of 64 MB, 128 MB, 256 MB, and 512 MB] 64

65 Versus All DRAM and All PCM [Charts: Weighted Speedup, Maximum Slowdown, Harmonic Speedup, and Power (W) for All PCM, Conventional Caching, DAM-COUNT, and All DRAM] 65

66 Performance vs. Statistics Store Size [Chart: IPC normalized to All DRAM for statistics store sizes from 512 entries (0.2 KB) through 0.4 KB, 0.8 KB, and 1.6 KB stores to an unlimited-entry store; 8 ways, LRU] 66

67 Performance vs. Statistics Store Size [Chart: same configurations as above; 8 ways, LRU] Within ~1% of infinite storage with 200 B of storage 67

68 All DRAM 8 Banks IPC Normalized to All DRAM with 8 Banks No Caching (All PCM) Conventional Caching Best Static Dynamic

69 All DRAM 16 Banks IPC Normalized to All DRAM with 16 Banks No Caching (All PCM) Conventional Caching Best Static Dynamic

70 Simulation Parameters 70

71 Overview DRAM is reaching its scalability limits Yet, memory capacity requirements are increasing Emerging memory devices offer scalability Phase-change, resistive, ferroelectric, etc. But, have worse latency/energy than DRAM We propose a scalable hybrid memory arch. Use DRAM as a cache to phase change memory Cache data based on row buffer locality and reuse 71

72 Methodology Core model 3-wide issue with 128-entry instruction window 32 KB L1 D-cache per core 512 KB shared L2 cache per core Memory model 16 MB DRAM / 512 MB PCM per core Scaled based on workload trace size and access patterns to be smaller than working set DDR3 800 MHz, single channel, 8 banks per device Row buffer hit: 40 ns Row buffer miss: 80 ns (DRAM); 128, 368 ns (PCM) Migrate data at 2 KB row granularity 72

73 Outline Overview Motivation/Background Methodology Caching Policies Multicore Evaluation Conclusions 73

74 TODO: change diagram to two channels so that this can be explained 16-core Performance & Fairness [Charts: (a) weighted speedup and (b) maximum slowdown for Conventional Caching, A-COUNT, AM-COUNT, and DAM-COUNT versus the fraction of LARGE benchmarks] Distributing data benefits small working sets, too 74
