1 15-740/18-740 Computer Architecture Lecture 5: Project Example Justin Meza Yoongu Kim Fall 2011, 9/21/2011

2 Reminder: Project Proposals Project proposals due NOON on Monday 9/26 Two to three pages consisting of Problem Novelty Idea Hypothesis Methodology Plan All the details are in the project handout 2

3 Agenda for Today's Class Brief background on hybrid main memories Project example from Fall 2010 Project pitches and feedback Q & A 3

4 Main Memory in Today's Systems CPU DRAM HDD/SSD 4

5 Main Memory in Today's Systems CPU Main memory DRAM HDD/SSD 5

6 DRAM Pros Low latency Low cost Cons Low capacity High power Some new and important applications require HUGE capacity (in the terabytes) 6

7 Main Memory in Today's Systems CPU Main memory DRAM HDD/SSD 7

8 Hybrid Memory (Future Systems) Hybrid main memory CPU DRAM (cache) New memories (high capacity) HDD/SSD 8

9 Row Buffer Locality-Aware Hybrid Memory Caching Policies Justin Meza HanBin Yoon Rachata Ausavarungnirun Rachael Harding Onur Mutlu

10 Motivation Two conflicting trends: 1. ITRS predicts the end of DRAM scalability 2. Workloads continue to demand more memory Want future memories to have Large capacity High performance Energy efficiency Need scalable DRAM alternatives 10

11 Motivation Emerging memories can offer more scalability Phase change memory (PCM) Projected to be 3–12× denser than DRAM However, cannot simply replace DRAM Longer access latencies (4–12× DRAM) Higher access energies (2–40× DRAM) Use DRAM as a cache to large PCM memory [Mohan, HPTS 09; Lee+, ISCA 09] 11

12 Phase Change Memory (PCM) Data stored in form of resistance High current melts cell material Rate of cooling determines stored resistance Low current used to read cell contents 12

13 Projected PCM Characteristics (~2013) at 32 nm: DRAM vs. PCM (relative to DRAM) [Mohan, HPTS 09; Lee+, ISCA 09]
   Cell size: 6 F² vs. ? F², denser
   Read latency: 60 ns vs. ? ns, 6–13× slower
   Write latency: 60 ns vs. 1400 ns, 24× slower
   Read energy: 1.2 pJ/bit vs. 2.5 pJ/bit, 2× more energy
   Write energy: 0.39 pJ/bit vs. 16.8 pJ/bit, 40× more energy
   Durability: N/A vs. ? writes, limited lifetime 13

14 Row Buffers and Locality Memory array organized in columns and rows Row buffers store contents of accessed row Row buffers are important for mem. devices Device slower than bus: need to buffer data Fast accesses for data with spatial locality DRAM: Destructive reads PCM: Writes are costly: want to coalesce 14

15 Row Buffers and Locality [Diagram: the address selects a row whose data is latched in the row buffer; LOAD X is a row buffer miss, and the following LOAD X+1 to the same row is a row buffer hit] 15
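
To make the hit/miss behavior above concrete, here is a minimal sketch (not from the lecture's simulator) of a single bank's row buffer: an access to the currently open row is a fast hit, and any other row forces an activation, i.e., a miss. The 40 ns / 80 ns values mirror the DRAM timings listed in the backup methodology slide; the structure and names are illustrative assumptions.

/* Minimal row buffer model for one bank (illustrative only). */
#include <stdio.h>

#define ROW_HIT_NS   40   /* DRAM row buffer hit latency (backup slide)  */
#define ROW_MISS_NS  80   /* DRAM row buffer miss latency (backup slide) */

typedef struct { long open_row; } bank_t;

static int access_row(bank_t *bank, long row)
{
    if (bank->open_row == row)
        return ROW_HIT_NS;      /* data already latched in the row buffer */
    bank->open_row = row;       /* activate the new row                   */
    return ROW_MISS_NS;
}

int main(void)
{
    bank_t bank = { .open_row = -1 };
    long row_of_x = 7;          /* hypothetical row holding X and X+1     */

    /* LOAD X misses (row not open); LOAD X+1 falls in the same 2 KB row
     * and therefore hits, as in the slide's example.                     */
    printf("LOAD X   : %d ns\n", access_row(&bank, row_of_x));
    printf("LOAD X+1 : %d ns\n", access_row(&bank, row_of_x));
    return 0;
}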

16 Key Idea Since DRAM and PCM both use row buffers, Row buffer hit latency is the same in DRAM and PCM Row buffer miss latency is small in DRAM Row buffer miss latency is large in PCM Cache data in DRAM which Frequently misses in the row buffer Is reused many times → because the miss penalty is smaller in DRAM 16

17 Hybrid Memory Architecture CPU Memory Controller DRAM Cache (Low density) PCM (High density) Memory channel 17

18 Hybrid Memory Architecture DRAM Ctlr CPU PCM Ctlr DRAM Cache (Low density) PCM (High density) 18

19 Hybrid Memory Architecture Tag store: 2 KB rows CPU Memory Controller DRAM Cache (Low density) PCM (High density) 19

20 Hybrid Memory Architecture Tag store: X → DRAM LOAD X CPU Memory Controller DRAM Cache (Low density) PCM (High density) 20

21 Hybrid Memory Architecture Tag store: Y → PCM LOAD Y CPU Memory Controller DRAM Cache (Low density) PCM (High density) How does data get migrated to DRAM? Caching Policy 21

22 Methodology Simulated our system configurations Collected program traces using a tool called Pin Fed instruction trace information to a timing simulator modeling an OoO core and DDR3 memory Migrated data at the row (2 KB) granularity Collected memory traces from a standard computer architecture benchmark suite SPEC CPU2006 Used an in-house simulator written in C# 22
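
The trace-driven flow on this slide can be sketched roughly as below. This is an assumption-laden illustration in C, not the authors' in-house C# simulator: the record layout, file name, and latency numbers are placeholders, and a real timing model would track banks, row buffers, and the DRAM/PCM placement decision rather than charging a flat cost.

#include <stdio.h>

/* Hypothetical trace record; a Pin tool can emit whatever layout the
 * simulator expects, this one exists only for illustration.            */
typedef struct {
    unsigned long addr;     /* physical address of the memory access    */
    unsigned char is_write; /* 0 = read, 1 = write                      */
} trace_rec_t;

int main(void)
{
    FILE *f = fopen("memtrace.bin", "rb");   /* hypothetical trace file */
    if (!f) { perror("memtrace.bin"); return 1; }

    trace_rec_t rec;
    unsigned long long cycles = 0, accesses = 0, row_changes = 0;
    unsigned long last_row = (unsigned long)-1;

    while (fread(&rec, sizeof rec, 1, f) == 1) {
        unsigned long row = rec.addr / 2048;  /* 2 KB row granularity   */
        if (row != last_row) {                /* crude row-change count */
            row_changes++;
            last_row = row;
        }
        cycles += rec.is_write ? 200 : 100;   /* flat placeholder cost  */
        accesses++;
    }
    fclose(f);
    printf("%llu accesses, %llu row changes, %llu cycles\n",
           accesses, row_changes, cycles);
    return 0;
}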

23 Conventional Caching Data is migrated when first accessed Simple, used for many caches 23

24 Conventional Caching Data is migrated when first accessed Simple, used for many caches [Diagram: tag store maps Z → PCM; a load triggers migration of rows Rw1 and Rw2 from PCM to the DRAM cache over the memory channel, causing bus contention] How does conventional caching perform in a hybrid main memory? 24
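
As a point of reference for the results that follow, conventional caching reduces to a migrate-on-first-touch rule. A sketch under assumed structure sizes (the table size and function names are illustrative):

#include <stdbool.h>

#define NUM_ROWS 4096            /* assumed number of PCM rows tracked  */

static bool in_dram[NUM_ROWS];   /* tag store: is this row cached?      */

/* Returns true if the access triggered a row migration over the channel. */
static bool access_conventional(unsigned row)
{
    if (in_dram[row])
        return false;            /* already cached: serve from DRAM      */
    in_dram[row] = true;         /* first touch: migrate the 2 KB row    */
    return true;                 /* migration consumes channel bandwidth */
}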

25 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching 25

26 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching Beneficial for some benchmarks 26

27 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching Performance degrades due to bus contention 27

28 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Many row buffer hits: don't need to migrate data Conventional Caching 28

29 Conventional Caching IPC Normalized to All DRAM No Caching (All PCM) Want to identify data which misses in row buffer and is reused Conventional Caching 29

30 Problems with Conventional Caching Performs useless migrations Migrates data which are not reused Migrates data which hit in the row buffer Causes bus contention and DRAM pollution Want to cache rows which are reused Want to cache rows which miss in row buffer 30

31 Problems with Conventional Caching Performs useless migrations Migrates data which are not reused Migrates data which hit in the row buffer Causes bus contention and DRAM pollution Want to cache rows which are reused Want to cache rows which miss in row buffer 31

32 A Reuse-Aware Policy Keep track of the number of accesses to a row Cache row in DRAM when accesses ≥ A Reset accesses every Q cycles Similar to CHOP [Jiang+, HPCA 10] Cached hot (reused) pages in on-chip DRAM To reduce off-chip bandwidth requirements We call this policy A-COUNT 32
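
A sketch of how A-COUNT might be maintained with a small per-row statistics store; the threshold, table size, and quantum length here are illustrative assumptions, not the tuned settings from the study:

#include <string.h>

#define NUM_ROWS  4096
#define THRESH_A  4                      /* accesses before migrating    */
#define QUANTUM   10000000ULL            /* cycles between counter resets */

static unsigned access_count[NUM_ROWS];  /* statistics store              */
static unsigned char in_dram[NUM_ROWS];  /* tag store                     */

/* Called on every access to a PCM-resident row; returns 1 on migration. */
static int acount_access(unsigned row)
{
    if (in_dram[row])
        return 0;
    if (++access_count[row] >= THRESH_A) {  /* enough reuse observed      */
        in_dram[row] = 1;                   /* cache the row in DRAM      */
        return 1;
    }
    return 0;
}

/* Called once per quantum to age out stale reuse information. */
static void acount_new_quantum(void)
{
    memset(access_count, 0, sizeof access_count);
}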

33 A Reuse-Aware Policy IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) 33

34 A Reuse-Aware Policy IPC Normalized to All DRAM Performs fewer migrations: reduces channel contention No Caching (All PCM) Conventional Caching A-COUNT (A = 4) 34

35 A Reuse-Aware Policy IPC Normalized to All DRAM Too few migrations: too many accesses go to PCM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) 35

36 A Reuse-Aware Policy IPC Normalized to All DRAM Rows with many hits still needlessly migrated No Caching (All PCM) Conventional Caching A-COUNT (A = 4) 36

37 Problems with Reuse-Aware Policy Agnostic of DRAM/PCM access latencies May keep data that misses in the row buffer in PCM Missed opportunity: could save cycles in DRAM 37

38 Problems with Reuse-Aware Policy Agnostic of DRAM/PCM access latencies [Timelines comparing service in PCM vs. DRAM: data with frequent row buffer hits saves few cycles if placed in DRAM (hits cost the same in both), while data with frequent row buffer misses saves many cycles if placed in DRAM (misses are far slower in PCM)] 38

39 Row Buffer Locality-Aware Policy Cache rows which benefit from being in DRAM I.e., those with frequent row buffer misses Keep track of number of misses to a row Cache row in DRAM when misses ≥ M Reset misses every Q cycles We call this policy M-COUNT 39

40 Row Buffer Locality-Aware Policy IPC Normalized to All DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) 40

41 Row Buffer Locality-Aware Policy IPC Normalized to All DRAM Recognizes rows with many hits and does not migrate them No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) 41

42 Row Buffer Locality-Aware Policy IPC Normalized to All DRAM Lots of data with just enough misses to get cached but little reuse after being cached → need to also track reuse No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) 42

43 Combined Reuse/Locality Approach Cache rows with reuse and which frequently miss in the row buffer Use A-COUNT as predictor of future reuse and M-COUNT as predictor of future row buffer misses Cache row if accesses ≥ A and misses ≥ M We call this policy AM-COUNT 43
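
The threshold policies differ only in which counters they consult. A compact sketch of the AM-COUNT decision (M-COUNT is the same check with the access condition dropped), using the A = 4, M = 2 values that appear as static settings in the plots; the struct and function names are illustrative:

#define THRESH_A 4
#define THRESH_M 2

struct row_stats {
    unsigned accesses;   /* total accesses to this row in the quantum     */
    unsigned rb_misses;  /* row buffer misses to this row in the quantum  */
};

/* Migrate only if the row has shown both reuse and row buffer misses. */
static int am_count_should_migrate(const struct row_stats *s)
{
    return s->accesses >= THRESH_A && s->rb_misses >= THRESH_M;
}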

44 Combined Reuse/Locality Approach Normalized to All DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) AM-COUNT

45 Combined Reuse/Locality Approach Normalized to All DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) AM-COUNT (A = 4, M = 2) Reduces useless migrations 45

46 Combined Reuse/Locality Approach Normalized to All DRAM And data with little reuse kept out of DRAM No Caching (All PCM) Conventional Caching A-COUNT (A = 4) M-COUNT (M = 2) AM-COUNT

47 Dynamic Reuse/Locality Approach Previously mentioned policies require profiling To determine the best A and M thresholds We propose a dynamic threshold policy Performs a cost-benefit analysis every Q cycles Simple hill-climbing algorithm to maximize benefit (Side note: we simplify the problem slightly by just finding the best A threshold, because we observe that M = 2 performs the best for a given A.) 47

48 Cost-Benefit Analysis Each quantum, we measure the first-order costs and benefits of the current A threshold Cost = cycles of bus contention due to migrations Benefit = cycles saved at the banks by servicing a request in DRAM versus PCM Cost = Migrations × t_migration Benefit = Reads_DRAM × (t_read,PCM − t_read,DRAM) + Writes_DRAM × (t_write,PCM − t_write,DRAM) 48

49 Cost-Benefit Maximization Algorithm Each quantum (10 million cycles):
    Net = Benefit - Cost            // net benefit
    if Net < 0 then                 // too many migrations?
        A++                         // increase threshold
    else                            // last A beneficial
        if Net > PreviousNet then   // increasing benefit?
            A++                     // try next A
        else                        // decreasing benefit
            A--                     // too strict, reduce
        end
    end
    PreviousNet = Net 49
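
A hedged C rendering of the per-quantum controller, combining slide 48's cost/benefit formulas with the hill-climbing rule above; the counter names, latency constants, and the lower clamp on A are assumptions rather than details from the slides:

typedef struct {
    unsigned long migrations;    /* rows migrated this quantum          */
    unsigned long dram_reads;    /* reads served by the DRAM cache      */
    unsigned long dram_writes;   /* writes served by the DRAM cache     */
} quantum_stats_t;

enum { T_MIGRATION = 1000, T_READ_PCM = 300, T_READ_DRAM = 80,
       T_WRITE_PCM = 1000, T_WRITE_DRAM = 80 };  /* placeholder cycles  */

static long prev_net = 0;
static int  threshold_a = 1;

static void end_of_quantum(const quantum_stats_t *s)
{
    long cost    = (long)(s->migrations * T_MIGRATION);
    long benefit = (long)(s->dram_reads  * (T_READ_PCM  - T_READ_DRAM) +
                          s->dram_writes * (T_WRITE_PCM - T_WRITE_DRAM));
    long net     = benefit - cost;

    if (net < 0)                      /* too many migrations            */
        threshold_a++;                /* be stricter about caching      */
    else if (net > prev_net)          /* benefit still increasing       */
        threshold_a++;                /* try the next threshold         */
    else                              /* benefit decreasing             */
        threshold_a--;                /* last threshold was too strict  */

    if (threshold_a < 1)
        threshold_a = 1;              /* keep the threshold sensible    */
    prev_net = net;
}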

50 Dynamic Policy Performance IPC Normalized to All DRAM No Caching (All PCM) Best Static Conventional Caching Dynamic 50

51 Dynamic Policy Performance IPC Normalized to All DRAM No Caching (All PCM) Best Static Conventional Caching Dynamic 29% improvement over All PCM, Within 18% of All DRAM 51

52 Evaluation Methodology/Metrics 16-core system Averaged across 100 randomly-generated workloads of varying working set size LARGE = working set size > main memory size Weighted speedup (performance) = Σ_i IPC_i,together / IPC_i,alone Maximum slowdown (fairness) = max_i IPC_i,alone / IPC_i,together 52
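
A small sketch of the two metrics as they are conventionally defined (assumed to match the slide's intent): weighted speedup sums each core's IPC running together over its IPC running alone, and maximum slowdown takes the worst alone-to-together ratio across cores.

#define NCORES 16

static double weighted_speedup(const double ipc_together[],
                               const double ipc_alone[])
{
    double ws = 0.0;
    for (int i = 0; i < NCORES; i++)
        ws += ipc_together[i] / ipc_alone[i];   /* per-core speedup share */
    return ws;
}

static double maximum_slowdown(const double ipc_together[],
                               const double ipc_alone[])
{
    double ms = 0.0;
    for (int i = 0; i < NCORES; i++) {
        double slowdown = ipc_alone[i] / ipc_together[i];
        if (slowdown > ms)
            ms = slowdown;                      /* worst-case slowdown    */
    }
    return ms;
}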

53 16-core Performance & Fairness [Charts: (a) weighted speedup and (b) maximum slowdown for Conventional Caching, A-COUNT, AM-COUNT, and DAM-COUNT versus the fraction of LARGE benchmarks (0% to 100%)] 53

54 16-core Performance & Fairness More contention → more benefit [same charts: (a) weighted speedup, (b) maximum slowdown] 54

55 16-core Performance & Fairness Dynamic policy can adjust to different workloads [same charts: (a) weighted speedup, (b) maximum slowdown] 55

56 Versus All PCM and All DRAM Compared to an All PCM main memory 17% performance improvement 21% fairness improvement Compared to an All DRAM main memory Within 21% of performance Within 53% of fairness 56

57 Robustness to System Configuration [Chart: Normalized Weighted Speedup versus Sorted Workload Number for 2-core, 4-core, 8-core, and larger configurations] 57

58 Implementation/Hardware Cost Requires a tag store in memory controller We currently assume 36 KB of storage per 16 MB of DRAM We are investigating ways to mitigate this overhead Requires a statistics store To keep track of accesses and misses 58

59 Conclusions DRAM scalability is nearing its limit Emerging memories (e.g. PCM) offer scalability Problem: must address high latency and energy We propose a dynamic, row buffer locality-aware caching policy for hybrid memories Cache rows which miss frequently in row buffer Cache rows which are reused many times 17/21% perf/fairness improvement vs. all PCM Within 21/53% perf/fairness of all DRAM system 59

60 Thank you! Questions? 60

61 Backup Slides 61

62 Related Work [Chart: Weighted Speedup for DIP, Probabilistic, Probabilistic+RBL, and DAM-COUNT] 62

63 PCM Latency [Chart: Weighted Speedup versus PCM Latency Scaling Factor for All PCM and DAM-COUNT] 63

64 DRAM Cache Size [Chart: Weighted Speedup for Conventional Caching and DAM-COUNT at DRAM cache sizes of 64 MB, 128 MB, 256 MB, and 512 MB] 64

65 Versus All DRAM and All PCM [Charts: Weighted Speedup, Maximum Slowdown, Harmonic Speedup, and Power (W) for All PCM, Conventional Caching, DAM-COUNT, and All DRAM] 65

66 Performance vs. Statistics Store Size [Chart: IPC normalized to All DRAM for statistics store sizes from 512 entries (0.2 KB) through 0.4 KB, 0.8 KB, and 1.6 KB stores to an unlimited-entry store; 8 ways, LRU] 66

67 Performance vs. Statistics Store Size [Chart: same configurations as above; 8 ways, LRU] Within ~1% of infinite storage with 200 B of storage 67

68 All DRAM 8 Banks IPC Normalized to All DRAM with 8 Banks No Caching (All PCM) Conventional Caching Best Static Dynamic

69 All DRAM 16 Banks IPC Normalized to All DRAM with 16 Banks No Caching (All PCM) Conventional Caching Best Static Dynamic

70 Simulation Parameters 70

71 Overview DRAM is reaching its scalability limits Yet, memory capacity requirements are increasing Emerging memory devices offer scalability Phase-change, resistive, ferroelectric, etc. But, have worse latency/energy than DRAM We propose a scalable hybrid memory arch. Use DRAM as a cache to phase change memory Cache data based on row buffer locality and reuse 71

72 Methodology Core model 3-wide issue with 128-entry instruction window 32 KB L1 D-cache per core 512 KB shared L2 cache per core Memory model 16 MB DRAM / 512 MB PCM per core Scaled based on workload trace size and access patterns to be smaller than working set DDR3 800 MHz, single channel, 8 banks per device Row buffer hit: 40 ns Row buffer miss: 80 ns (DRAM); 128, 368 ns (PCM) Migrate data at 2 KB row granularity 72

73 Outline Overview Motivation/Background Methodology Caching Policies Multicore Evaluation Conclusions 73

74 TODO: change diagram to two channels so that this can be explained 16-core Performance & Fairness [Charts: (a) weighted speedup and (b) maximum slowdown for Conventional Caching, A-COUNT, AM-COUNT, and DAM-COUNT versus the fraction of LARGE benchmarks] Distributing data benefits small working sets, too 74
