Managing Memory for Timing Predictability. Rodolfo Pellizzoni

Size: px
Start display at page:

Download "Managing Memory for Timing Predictability. Rodolfo Pellizzoni"

Transcription

1 Managing Memory for Timing Predictability Rodolfo Pellizzoni

2 Thanks This work would not have been possible without the following students and collaborators Zheng Pei Wu*, Yogen Krish Heechul Yun* Renato Mancuso, Marco Caccamo, Lui Sha * additional thanks for contributing slides! 2 / 67

3 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 3 / 67

4 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 4 / 67

5 Multicore Server Desktop Mobile RT/Embedded 5 / 67

6 Challenges: Shared Memory Hierarchy T1 T2 T 1 T 2 T 3 T 4 T 5 T 6 T 7 T 8 CPU Core 1 Core 2 Core 3 Core 4 Memory Hierarchy Memory Hierarchy Unicore Multicore Performance Impact 6 / 67

7 DO-178C Certification Certification Authorities Software Team (CAST) Position Paper 32, May _software/cast/cast_papers/media/cast-32.pdf All effects due to shared resources in Multi-core Processors must be: Identified Analyzed Mitigated 7 / 67

8 CAST Identifies: DO-178C Certification Shared Cache (L3) Main Memory (DDR2/3) Interconnection 8 but other resources might create interference (more on this later). 8 / 67

9 How Bad Can it Be? 7,0 Slowdown ratio 6,0 5,0 4,0 3,0 2,0 1,0 0,0 Setup: Core0: X-axis, Core1-3: 470.lbm x 3 (interference) Slowdown ratio = Solo IPC / CorunIPC (*) Measured on Intel Xeon 3530 (4 cores), 8GB 1ch DDR3 DRAM 9 9 / 67

10 This Talk Core1 Core2 Core3 Core4 Memory Controller DRAM Focus on main memory implemented as Double Date Rate Dynamic Synchronous RAM (DDR SDRAM DRAM for short). We show how to mitigate and bound the effects of memory contention. 10 / 67

11 Single Core Equivalence (SCE) End goal: Single Core Equivalent (SCE) execution [6] Prove that each core in a M-core platform behaves as a (possibly slower) single-processor system Certify ARINC-653 partition on the (slower) singleprocessor system Hardware Selection Resource Partitioning Software Partition to Core Allocat. Schedulability Analysis I/O Channel Partitioning 11 / 67

12 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 12 / 67

13 DRAM Background Storage Array contains Data Can only Read/Write to Row Buffer 13 / 67

14 DRAM Background Front End generates the needed commands, Back End issues commands on command bus 14 / 67

15 DRAM Background READ Targeting Data in this Row Row Buffer contain data from a different row 15 / 67

16 DRAM Background READ P, A, R 16 / 67

17 DRAM Background P, A, R PRE ACT ACT: Load the data from array Pre-Charge: into buffer store the data back into array P A Pre-charge command issued on command bus Timing Constraint 17 / 67

18 DRAM Background P, A, R CAS (R/W): reads/writes using buffer READ P A R Data 18 / 67

19 DRAM Background READ R Targeting Data Already in Row Buffer Only Need Read Command Can be issued immediately P A R Data R Data 19 / 67

20 DRAM Background READ Latency of a close request is much longer than the R latency of an open request Latency of a close request Latency of a open request P A R Data R Data 20 / 67

21 Bank Parallelism Bank 1 Bank 2 Bank 3 Bank 4 Accessing data in multiple banks A R Data Multiple data can be pipelined A R Data A R Data A R Data 21 / 67

22 Bank Parallelism Requests should be spread over multiple banks Bank 1 Bank 2 Bank 3 Bank 4 Accessing data in same bank, different rows A R Data P A R Data No pipelining much larger delay 22 / 67

23 Write-to-Read Penalty Transactions of the same type can be pipelined R Data W Data Huge Latency Penalty! R Data W Data but a read after a write cannot. W Data Write-to-Read (WtR) timing constraint R Data 23 / 67

24 Summary To extract maximum bandwidth, we need: Successive requests to the same bank should hit the same row (spatial locality) Successive requests to the same bank should be of the same type (reads or writes) Concurrent requests should hit different banks Problem: the way banks are accessed depends on: The memory mapping (how are physical memory locations mapped to banks?) The memory controller schedule 24 / 67

25 Memory Arbitration COTS memory controllers use First-Ready First-Come- First-Serve (FR-FCFS) arbitration. Each bank has a separate request queue Requests targeting an open row are prioritized over requests targeting a close row Reads are prioritized over writes FCFS arbitration among banks Maximum throughput but very bad worse case delay Front R1:W% W Data W Data R2:R% t W L t B U S t W T R t W L t B U S R t W T R R3:W% W Data t = 0 t = 0 t W L t B U S t W T R t = 4 t = 16 t = 20 t = / 67

26 Our Solutions COTS Solution: PALLOC + Memguard OS-level solution No hardware modifications Solves bank parallelism, bandwidth allocation Custom Solution: ROC Controller redesign As before + write-read optimization Supports differentiated service (predictable latency vs optimized throughput) 26 / 67

27 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 27 / 67

28 Problem SMP OS Core1 Core2 Core3 Core4 L3 OS does NOTknow DRAM banks OS memory pages are spread all over multiple banks Memory Controller (MC)???? DRAM DIMM Bank 1 Bank 2 Bank 3 Bank 4 Unpredictable memory performance

29 Best-case Core1 Core2 Core3 Core4 L3 Memory Controller (MC) DRAM DIMM Fast Peak = 10.6 GB/s Bank 1 Bank 2 Bank 3 Bank 4 DDR3 1333Mhz Out-of-order processors

30 Best-case Core1 Core2 Core3 Core4 L3 Memory Controller (MC) DRAM DIMM Fast Peak = 10.6 GB/s Bank 1 Bank 2 Bank 3 Bank 4 DDR3 1333Mhz

31 Most-cases Core1 Core2 Core3 Core4 L3 Memory Controller (MC) DRAM DIMM Mess Bank 1 Bank 2 Bank 3 Bank 4 Performance =??

32 Worst-case Core 1 Core 2 Core 3 Core 4 Request order Bank 1 L3 Memory Controller (MC) Bank 2 Bank 3 DRAM DIMM Bank 4 time Row 1 Row 2 Row 3 Row 4 Row 1 Row 2 1 bank & always row misses ~1/10 peak b/w

33 PALLOC SMP OS Core1 Core2 Core3 Core4 CPC Memory Controller (MC) OS is aware of DRAM mapping Each pagecan be allocated to a desired DRAM bank DRAM DIMM Bank 1 Bank 2 Bank 3 Bank 4 Flexible allocation policy

34 PALLOC Core1 Core2 Core3 Core4 Private banking L3 Memory Controller (MC) Allocatepages on certain exclusively assigned banks DRAM DIMM Bank 1 Bank 2 Bank 3 Bank 4 Eliminate Inter-core bank conflicts

35 PALLOC Implementation Modified Linux kernel s buddy allocator DRAM bank-aware page frame allocation at each page fault Example: per-core private banking (PB) # cd /sys/fs/cgroup # mkdir core0 core1 core2 core3 create 4 cgroup partitions # echo 0-3 > core0/palloc.dram_bank assign bank 0 ~ 3 for the core0 partition. # echo 4-7 > core1/palloc.dram_bank # echo 8-11 > core2/palloc.dram_bank # echo > core3/palloc.dram_bank 35

36 Performance Impact on Unicore 1,20 4banks 8banks 16banks Buddy Normalized IPC 1,00 0,80 0,60 0,40 0,20 0,00 #of bank vs. performance on a single core Finding: bank partitioning negatively affect performance due to reduced MLP, but not significant for most benchmarks 36

37 Performance Isolation on 4 Cores 7,00 buddy PB PB+PC Slowdown ratio 6,00 5,00 4,00 3,00 2,00 1,00 0,00 Setup: Core0: X-axis, Core1-3: 470.lbm x 3 (interference) PB: DRAM bank partitioning only; PB+PC: DRAM bank and Cache partitioning Finding: bank (and cache) partitioning improves isolation, but far from ideal 37

38 Bus Bandwidth Management Why PB+PC not ideal? Cores still contend for access to memory data bus Bandwidth allocation can be highly unfair -> unpredictable behavior FR FCFS scheduling gives higher priority to earlier requests An out-of-order core performing multiple concurrent requests can receive an inordinate amount of bandwidth Solution: bandwidth reservation Similar to a real-time server, but guarantees memory bandwidth rather than CPU time 38

39 MemGuard: Managing Memory MemGuard Bandwidth Reclaim Manager Operating System BW BW Regulator 0.9GB/s BW BW Regulator 0.1GB/s Regulator 0.1GB/s Regulator 0.1GB/s PMC PMC PMC PMC Core1 Core2 Core3 Core4 Memory Controller DRAM DIMM Multicore Processor Reservation: minimum memory bandwidth guarantee Reclaiming/sharing: additional memory bandwidth 39

40 Reservation Idea Reserve per-core memory bandwidth via the OS scheduler Use h/w PMC to monitor memory request rate Budget Core activity 2 1 Suspend the RT idle task Guarantee: as long as sum budgets <= guaranteed memory bandwidth, queuing delay in the memory controller is small (more details in [3]) 0 1ms Schedule a RT idle task 2ms computation memory fetch 40

41 Reclaiming/Sharing Idea Redistribute unused bandwidth to demanding cores Divide core into protected and unprotected cores Protected cores have guaranteed bandwidth; do not require additional bandwidth Unprotected cores use budget prediction Unneeded budget is donated to the system Reclaim on-demand basis 41

42 Reclaiming/Sharing 4 units of budgets donated at time 10, 2 at time 20 Global budget reclaimed when core budget is depleted 42

43 Effects of MemGuard T1's exec. Time (ms) Perf. guarantee No deadline violation 60% solo corun solo corun solo corun T1 Core1 L2 Core2 L2 DRAM Original 94% T2 T1 900M/s Core1 L2 DRAM MemGuard (Reserve only) 14% T2 T1 69% T2 200M/s Core2 L2 900M/s Core1 L2 DRAM 200M/s Core2 L2 MemGuard (Reclaim + Share) 43

44 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 44 / 67

45 ROC Key idea: design new architectural optimizations targeted at reducing worst case, not average case latency ROC: Rank-switching, Open-row Controller We discuss: Design Latency Analysis Implementation Results

46 Multi-Rank Device Memory device with up to 4 ranks, each comprising separate bank set

47 Multi-Rank Device Long Write-to-Read constraint replaced by much shorter Rank-to-Rank one W Data RtR R Data

48 Example: Write-Read-Write-Read 1 Rank: Latency 52 cycles Challenge: Design a command scheduling algorithm to minimize worst case latency by forcing alternation among ranks 2Ranks: Latency 35 cycles 4 Ranks: Latency 29 cycles

49 Worst Case Analysis Total # of Requestors and Ranks Memory Device Parameters Worst Case Single Request Latency Analysis Latency for different types of request Open Read Close Read Open Write Close Write Task Under Analysis # of open reads # of close reads # of open writes # of close writes Cumulative Worst Case Execution Time WCET

50 Back End Model Front End adds constant delay focus on Back End Three levels arbitration Rank 1 Command Queues Level 3 Arbitration P/A Arbiter Level 2 Arbitration Controller Back End CAS Arbiter P/A Arbiter CAS Arbiter Level 1 Command Arbitration To Devic e Rank R

51 Back End Model L1: CAS (R/W) commands have higher priority than PRE/ACT commands priority to data bus contention Rank 1 Command Queues Level 3 Arbitration P/A Arbiter Level 2 Arbitration Controller Back End CAS Arbiter P/A Arbiter CAS Arbiter Level 1 Command Arbitration To Devic e Rank R

52 Back End Model L2 alternates among ranks L3 alternates among requestors within a rank Rank 1 Command Queues Level 3 Arbitration P/A Arbiter Level 2 Arbitration Controller Back End CAS Arbiter P/A Arbiter CAS Arbiter Level 1 Command Arbitration To Devic e Rank R

53 Back End Model Let: : number of ranks : number of requestors for rank Then given the alternation in L2/L3, the latency of each command is a function of Isolation property: the latency of a requestor does not depend on the # of requestors or scheduling policy used in other ranks We can dedicate some ranks to hard real-time requestors, and others to soft real-time requestors Optimize hard requestors for latency, soft requestors for bandwidth

54 CAS Rank Switching Rule Key idea: Schedule ranks in Round-Robin order if transfers on the data bus are separated We by prove: at most Rank-to-Rank constraint When this is not possible (due to Write-to-Read constraint), reorder ranks W Rank 2 can read twice while Rank 1 is waiting for WtR 1. Reordering does not increase worstcase latency if R = 4 2. If the system is backlogged, no more Data than RtRbetween data transfers => WtR Data does not degrade memory RtR throughput R R Data R Rank 1 Data RtR Rank 2

55 Results Comparison against Analyzable Memory Controller (AMC) [1] Predictable memory controller with WCET guarantees for hard real-time tasks Fair arbitration (Round Robin) similar to our approach Synthetic Benchmarks Used to show how worst case latency bound varies as parameters are changed CHStone Benchmarks Memory traces are obtained from gem5 simulator and used as inputs to both analysis and simulators Core under analysis is in-order Interfering requestors are out-of-order running lbm

56 Results DDR3-1333H memory device 64 and 32 bits data bus width Simulations Python simulators base controller [5] and AMC [1] ROC Implementation Three stages pipelined implementation in Verilog RTL Synthesizes to Xilinx FPGA at 340Mhz (original soft memory controller: 400Mhz) ASIC implementation could likely be significantly faster Code available at

57 Results Synthetic Benchmarks, 20% write Synthetic: 8 Requestors-64bits 400 Avg Worst Case Latency (ns) % 20% 40% 60% 80% 100% Row Hit % This improvement due to ROC-4 private bank has parallelism > 33% lower latency than [5] for all cases..this due to open-row policy..this due to rank-swithing AMC Analysis [5] ROC-2Rank ROC-4Rank 20/24

58 Results CHStone, 16 requestors 64bits data bus ROC-4 has between 5 and 35% lower WCET than [5] T-bars: worst case bound Solid bars: simulation 2.00 AMC Analysis[5] ROC-2 Rank ROC-4 Rank Normalized Execution Time /26 adpcm aes bf gsm jpeg mips motion sha dfadd dfdiv dfmul dfsin

59 Summary Bank Contention Bandwidth Partitioning Open Row Write-to- Read Opt Differentiated Service ROC Other pred. mem. cont. PALLOC MemGuard PALLOC + MemGuard 59 / 67

60 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 60 / 67

61 Other Sources Of Interference Help is needed in mitigating and analyzing other sources of interference Cache Allocation interference: multiple tasks/cores contending for cache space. Can be solved with cache partitioning/locking. Timing interference: simultaneous accesses by multiple cores. Generally low in modern architectures. Interconnects How does this work? We rarely have detailed info What about cache coherency? 61 / 67

62 Other Sources Of Interference Other sources of interference might not be well understood Example: Miss Status Holding Register (MSHR) in Intel platforms Caches are non-blocking allow multiple concurrent requests by same/different cores Pending requests are enqueued in global MSHR register A core can be stalled due to contention for MSHR access Based on preliminary measures, we believe this can account for a 2x delay factor 62 / 67

63 Hardware Profiles The described OS techniques rely on basic hw support How do we know if a platform can support them? I strongly believe the community should join to define a set of SCE hardware profiles Techniques can then be specified based on required profile Profile Mem-Predict A hardware platform satisfies this profile if it supports: DRAM controller with per-bank queues Virtual memory Per-core performance counters with interrupt functionality counting number of memory accesses This is sufficient for PALLOC + MemGuard 63 / 67

64 Academic Involvement in Multi-Core Certification The community needs to come together to: Identify and review the inter-core interference channels Share and review solution approaches Develop metrics, analysis and testing procedures for the quality of solution approaches Foster the development of a shared process for the certification of multicore avionics Basic timeline agreed upon with FAA and IEEE TCRTS leaderships. Academic workshops to be followed by workshops with FAA and industry First workshop at next CPSWeek in April 64 / 67

65 Conclusions DO-178C certification on multi-core: a huge challenge Main memory is a major interference channel Intra-bank interference Inter-bank interference We discussed solution to mitigate the effect of the memory channel COTS solution relies on available hardware support, mitigates interference but at the cost of performance Custom hardware requires controller re-design A lot more remains to be done consider getting involved! 65 / 67

66 Bibliography [1] M. Paolieri, E. Quinones, F. Cazorla, and M. Valero, An Analyzable Memory Controller for Hard Real-Time CMPs, Embedded Systems Letters, IEEE, vol. 1, no. 4, pp , [2] Y. Krish, Z. P. Wu and R. Pellizzoni, ROC: A Rank-switching, Open-row DRAM Controller for Time-predictable Systems, IEEE ECRTS, July 2014 [3] H. Yun, G. Yao, R. Pellizzoni, M. Caccamoand L. Sha, MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms, 19th IEEE RTAS, April 2013 [4] H. Yun, R. Mancuso, Z. P. Wu, R. Pellizzoni, PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms, 20th IEEE RTAS, April 2014 [5] Z. P. Wu, Y. Krish, and R. Pellizzoni, Worst Case Analysis of DRAM Latency in Multi-Requestor Systems, 34 th IEEE RTSS, December 2013 [6] LuiShaet al.: Single Core Equivalent Virtual Machines for Hard Real-Time Computing on Multicore Processors. Technical Report, 66 / 67

67 Questions? 67 / 67

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Single Core Equivalence Framework (SCE) For Certifiable Multicore Avionics

Single Core Equivalence Framework (SCE) For Certifiable Multicore Avionics Single Core Equivalence Framework (SCE) For Certifiable Multicore Avionics http://rtsl-edge.cs.illinois.edu/sce/ a collaboration of: Presenters: Lui Sha, Marco Caccamo, Heechul Yun, Renato Mancuso, Jung-Eun

More information

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems

More information

Worst Case Analysis of DRAM Latency in Multi-Requestor Systems. Zheng Pei Wu Yogen Krish Rodolfo Pellizzoni

Worst Case Analysis of DRAM Latency in Multi-Requestor Systems. Zheng Pei Wu Yogen Krish Rodolfo Pellizzoni orst Case Analysis of DAM Latency in Multi-equestor Systems Zheng Pei u Yogen Krish odolfo Pellizzoni Multi-equestor Systems CPU CPU CPU Inter-connect DAM DMA I/O 1/26 Multi-equestor Systems CPU CPU CPU

More information

Deterministic Memory Abstraction and Supporting Multicore System Architecture

Deterministic Memory Abstraction and Supporting Multicore System Architecture Deterministic Memory Abstraction and Supporting Multicore System Architecture Farzad Farshchi $, Prathap Kumar Valsan^, Renato Mancuso *, Heechul Yun $ $ University of Kansas, ^ Intel, * Boston University

More information

Variability Windows for Predictable DDR Controllers, A Technical Report

Variability Windows for Predictable DDR Controllers, A Technical Report Variability Windows for Predictable DDR Controllers, A Technical Report MOHAMED HASSAN 1 INTRODUCTION In this technical report, we detail the derivation of the variability window for the eight predictable

More information

Worst Case Analysis of DRAM Latency in Hard Real Time Systems

Worst Case Analysis of DRAM Latency in Hard Real Time Systems Worst Case Analysis of DRAM Latency in Hard Real Time Systems by Zheng Pei Wu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied

More information

Single Core Equivalent Virtual Machines. for. Hard Real- Time Computing on Multicore Processors

Single Core Equivalent Virtual Machines. for. Hard Real- Time Computing on Multicore Processors Single Core Equivalent Virtual Machines for Hard Real- Time Computing on Multicore Processors Lui Sha, Marco Caccamo, Renato Mancuso, Jung- Eun Kim, and Man- Ki Yoon University of Illinois at Urbana- Champaign

More information

562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016

562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,

More information

An introduction to SDRAM and memory controllers. 5kk73

An introduction to SDRAM and memory controllers. 5kk73 An introduction to SDRAM and memory controllers 5kk73 Presentation Outline (part 1) Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions Followed by part

More information

Understanding Shared Memory Bank Access Interference in Multi-Core Avionics

Understanding Shared Memory Bank Access Interference in Multi-Core Avionics Understanding Shared Memory Bank Access Interference in Multi-Core Avionics Andreas Löfwenmark and Simin Nadjm-Tehrani Department of Computer and Information Science Linköping University, Sweden {andreas.lofwenmark,

More information

Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms

Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms Heechul Yun, Prathap Kumar Valsan University of Kansas {heechul.yun, prathap.kumarvalsan}@ku.edu Abstract Tasks running

More information

This is the published version of a paper presented at MCC14, Seventh Swedish Workshop on Multicore Computing, Lund, Nov , 2014.

This is the published version of a paper presented at MCC14, Seventh Swedish Workshop on Multicore Computing, Lund, Nov , 2014. http://www.diva-portal.org This is the published version of a paper presented at MCC14, Seventh Swedish Workshop on Multicore Computing, Lund, Nov. 27-28, 2014. Citation for the original published paper:

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Spring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand

Spring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand Main Memory & DRAM Nima Honarmand Main Memory Big Picture 1) Last-level cache sends its memory requests to a Memory Controller Over a system bus of other types of interconnect 2) Memory controller translates

More information

Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms

Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms Heechul Yun, Prathap Kumar Valsan University of Kansas {heechul.yun, prathap.kumarvalsan}@ku.edu Abstract Tasks running

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

Administrivia. Mini project is graded. 1 st place: Justin (75.45) 2 nd place: Liia (74.67) 3 rd place: Michael (74.49)

Administrivia. Mini project is graded. 1 st place: Justin (75.45) 2 nd place: Liia (74.67) 3 rd place: Michael (74.49) Administrivia Mini project is graded 1 st place: Justin (75.45) 2 nd place: Liia (74.67) 3 rd place: Michael (74.49) 1 Administrivia Project proposal due: 2/27 Original research Related to real-time embedded

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

Timing analysis and timing predictability

Timing analysis and timing predictability Timing analysis and timing predictability Architectural Dependences Reinhard Wilhelm Saarland University, Saarbrücken, Germany ArtistDesign Summer School in China 2010 What does the execution time depends

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit

More information

A Dual-Criticality Memory Controller (DCmc): Proposal and Evaluation of a Space Case Study

A Dual-Criticality Memory Controller (DCmc): Proposal and Evaluation of a Space Case Study A Dual-Criticality Memory Controller (DCmc): Proposal and Evaluation of a Space Case Study Javier Jalle,, Eduardo Quiñones, Jaume Abella, Luca Fossati, Marco Zulianello, Francisco J. Cazorla, Barcelona

More information

arxiv: v4 [cs.ar] 19 Apr 2018

arxiv: v4 [cs.ar] 19 Apr 2018 Deterministic Memory Abstraction and Supporting Multicore System Architecture Farzad Farshchi, Prathap Kumar Valsan, Renato Mancuso, Heechul Yun University of Kansas, {farshchi, heechul.yun}@ku.edu Intel,

More information

Memory Bandwidth Management for Efficient Performance Isolation in Multi-core Platforms

Memory Bandwidth Management for Efficient Performance Isolation in Multi-core Platforms Memory Bandwidth Management for Efficient Performance Isolation in Multi-core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Marco Caccamo, Lui Sha University of Kansas, USA. heechul@ittc.ku.edu

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Overview of Potential Software solutions making multi-core processors predictable for Avionics real-time applications

Overview of Potential Software solutions making multi-core processors predictable for Avionics real-time applications Overview of Potential Software solutions making multi-core processors predictable for Avionics real-time applications Marc Gatti, Thales Avionics Sylvain Girbal, Xavier Jean, Daniel Gracia Pérez, Jimmy

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

MARACAS: A Real-Time Multicore VCPU Scheduling Framework

MARACAS: A Real-Time Multicore VCPU Scheduling Framework : A Real-Time Framework Computer Science Department Boston University Overview 1 2 3 4 5 6 7 Motivation platforms are gaining popularity in embedded and real-time systems concurrent workload support less

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches Onur Mutlu onur@cmu.edu March 23, 2010 GSRC Modern Memory Systems (Multi-Core) 2 The Memory System The memory system

More information

Deterministic Memory Abstraction and Supporting Multicore System Architecture

Deterministic Memory Abstraction and Supporting Multicore System Architecture Deterministic Memory Abstraction and Supporting Multicore System Architecture Farzad Farshchi University of Kansas, USA farshchi@ku.edu Prathap Kumar Valsan Intel, USA prathap.kumar.valsan@intel.com Renato

More information

A Comparative Study of Predictable DRAM Controllers

A Comparative Study of Predictable DRAM Controllers 0:1 0 A Comparative Study of Predictable DRAM Controllers Real-time embedded systems require hard guarantees on task Worst-Case Execution Time (WCET). For this reason, architectural components employed

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Compositional Analysis for the Multi-Resource Server *

Compositional Analysis for the Multi-Resource Server * Compositional Analysis for the Multi-Resource Server * Rafia Inam, Moris Behnam, Thomas Nolte, Mikael Sjödin Management and Operations of Complex Systems (MOCS), Ericsson AB, Sweden rafia.inam@ericsson.com

More information

NEW REAL-TIME MEMORY CONTROLLER DESIGN FOR EMBEDDED MULTI-CORE SYSTEM By Ahmed Shafik Shafie Mohamed

NEW REAL-TIME MEMORY CONTROLLER DESIGN FOR EMBEDDED MULTI-CORE SYSTEM By Ahmed Shafik Shafie Mohamed NEW REAL-TIME MEMORY CONTROLLER DESIGN FOR EMBEDDED MULTI-CORE SYSTEM By Ahmed Shafik Shafie Mohamed A Thesis Submitted to the Faculty of Engineering at Cairo University in Partial Fulfillment of the Requirements

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 19: Main Memory Prof. Onur Mutlu Carnegie Mellon University Last Time Multi-core issues in caching OS-based cache partitioning (using page coloring) Handling

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

COTS Multicore Processors in Avionics Systems: Challenges and Solutions

COTS Multicore Processors in Avionics Systems: Challenges and Solutions COTS Multicore Processors in Avionics Systems: Challenges and Solutions Dionisio de Niz Bjorn Andersson and Lutz Wrage dionisio@sei.cmu.edu, baandersson@sei.cmu.edu, lwrage@sei.cmu.edu Report Documentation

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

A Comparative Study of Predictable DRAM Controllers

A Comparative Study of Predictable DRAM Controllers A A Comparative Study of Predictable DRAM Controllers Danlu Guo,Mohamed Hassan,Rodolfo Pellizzoni and Hiren Patel, {dlguo,mohamed.hassan,rpellizz,hiren.patel}@uwaterloo.ca, University of Waterloo Recently,

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

The Migra*on of Safety- Cri*cal RT So6ware to Mul*core. Marco Caccamo University of Illinois at Urbana- Champaign

The Migra*on of Safety- Cri*cal RT So6ware to Mul*core. Marco Caccamo University of Illinois at Urbana- Champaign The Migra*on of Safety- Cri*cal RT So6ware to Mul*core Marco Caccamo University of Illinois at Urbana- Champaign Outline Mo*va*on Memory- centric scheduling theory Background: PRedictable Execu*on Model

More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

DRAM Main Memory. Dual Inline Memory Module (DIMM)

DRAM Main Memory. Dual Inline Memory Module (DIMM) DRAM Main Memory Dual Inline Memory Module (DIMM) Memory Technology Main memory serves as input and output to I/O interfaces and the processor. DRAMs for main memory, SRAM for caches Metrics: Latency,

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

MemGuard on Raspberry Pi 3

MemGuard on Raspberry Pi 3 EECS 750 Mini Project #1 MemGuard on Raspberry Pi 3 In this mini-project, you will first learn how to build your own kernel on raspberry pi3. You then will learn to compile and use an out-of-source-tree

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

ECE 5730 Memory Systems

ECE 5730 Memory Systems ECE 5730 Memory Systems Spring 2009 More on Memory Scheduling Lecture 16: 1 Exam I average =? Announcements Course project proposal What you propose to investigate What resources you plan to use (tools,

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Static RAM (SRAM) Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 0.5ns 2.5ns, $2000 $5000 per GB 5.1 Introduction Memory Technology 5ms

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Resource Sharing and Partitioning in Multicore

Resource Sharing and Partitioning in Multicore www.bsc.es Resource Sharing and Partitioning in Multicore Francisco J. Cazorla Mixed Criticality/Reliability Workshop HiPEAC CSW Barcelona May 2014 Transition to Multicore and Manycores Wanted or imposed

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

A Comparative Study of Predictable DRAM Controllers

A Comparative Study of Predictable DRAM Controllers 1 A Comparative Study of Predictable DRAM Controllers DALU GUO, MOHAMED HASSA, RODOLFO PELLIZZOI, and HIRE PATEL, University of Waterloo, CADA Recently, the research community has introduced several predictable

More information

CS311 Lecture 21: SRAM/DRAM/FLASH

CS311 Lecture 21: SRAM/DRAM/FLASH S 14 L21-1 2014 CS311 Lecture 21: SRAM/DRAM/FLASH DARM part based on ISCA 2002 tutorial DRAM: Architectures, Interfaces, and Systems by Bruce Jacob and David Wang Jangwoo Kim (POSTECH) Thomas Wenisch (University

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Analyzing the Effect Of Predictability of Memory References on WCET

Analyzing the Effect Of Predictability of Memory References on WCET NORTH CAROLINA STATE UNIVERSITY Analyzing the Effect Of Predictability of Memory References on WCET Amir Bahmani Department of Computer Science North Carolina State University abahaman@ncsu. edu Vishwanathan

More information

Basics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS

Basics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS Basics DRAM ORGANIZATION DRAM Word Line Bit Line Storage element (capacitor) In/Out Buffers Decoder Sense Amps... Bit Lines... Switching element Decoder... Word Lines... Memory Array Page 1 Basics BUS

More information

Real-Time Cache Management for Multi-Core Virtualization

Real-Time Cache Management for Multi-Core Virtualization Real-Time Cache Management for Multi-Core Virtualization Hyoseung Kim 1,2 Raj Rajkumar 2 1 University of Riverside, California 2 Carnegie Mellon University Benefits of Multi-Core Processors Consolidation

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

COSC 6385 Computer Architecture - Memory Hierarchies (II)

COSC 6385 Computer Architecture - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Lecture: Memory, Multiprocessors. Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models

Lecture: Memory, Multiprocessors. Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models Lecture: Memory, Multiprocessors Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models 1 Refresh Every DRAM cell must be refreshed within a 64 ms window A row

More information

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections ) Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

arxiv: v1 [cs.dc] 25 Jul 2014

arxiv: v1 [cs.dc] 25 Jul 2014 Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems arxiv:1407.7448v1 [cs.dc] 25 Jul 2014 Heechul Yun University of Kansas, USA. heechul.yun@ku.edu July 29, 2014 Abstract In

More information

CEC 450 Real-Time Systems

CEC 450 Real-Time Systems CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation

More information