Worst Case Analysis of DRAM Latency in Multi-Requestor Systems. Zheng Pei Wu, Yogen Krish, Rodolfo Pellizzoni



Multi-Requestor Systems: multiple requestors (CPUs, DMA, I/O) access DRAM through a shared interconnect, causing interference. Hard real-time systems must be predictable!

Multi-Requestor Systems: schedulability analysis needs WCET as input, and WCET depends on the hardware platform, in particular the latency to access shared resources (e.g., cache, DRAM). Existing approaches can bound the interference, but they assume the latency of a DRAM access is constant. Problem: DRAM latency is variable and changes depending on the DRAM's state.

Contribution: a timing analysis that bounds the worst-case latency of DRAM accesses for the requestor under analysis. Since we do not know what the interfering requestors are doing, we assume they cause the worst-case interference.

Outline 1. Background & Related Work 2. Memory Controller Model 3. Worst Case Latency Analysis 4. Results & Conclusion

Background: the DRAM storage array contains the data; reads and writes can only target the row buffer.

Background: consider a READ targeting a row while the row buffer contains data from a different row.

Background: to serve the READ, the front end generates the needed commands (P, A, R); the back end issues the commands on the command bus.

Background: pre-charge (P) stores the data in the row buffer back into the array; ACT (A) loads the requested row from the array into the buffer. The pre-charge command is issued on the command bus, and a timing constraint must be satisfied before the next command can follow; the ACT and READ commands are then issued in turn (P, A, READ).

Background: a READ targeting a row already in the row buffer only needs the READ command, which can be issued immediately.

Background: the latency of a close request (P, A, READ) is much longer than the latency of an open request (READ only). The latency of a memory access is variable!
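As a rough illustration of the gap between the two cases, the sketch below computes both latencies from a handful of timing parameters. The parameter values are assumed DDR3-style numbers chosen for illustration, not figures from the talk; real values come from the device datasheet.

```python
# Sketch: open vs. close request latency from DRAM timing parameters.
# All values are in memory clock cycles and are illustrative only.
T_RP  = 9   # pre-charge period (P)
T_RCD = 9   # ACT-to-READ/WRITE delay (A)
CL    = 9   # CAS (read) latency
BURST = 4   # cycles to transfer one data burst

def open_request_latency():
    """Row hit: only the READ command is needed."""
    return CL + BURST

def close_request_latency():
    """Row miss: P, then A, then READ."""
    return T_RP + T_RCD + CL + BURST

print(open_request_latency())   # 13
print(close_request_latency())  # 31
```

Even with these toy numbers the close request takes more than twice as long, which is the variability the analysis must capture.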

Predictable Memory Controllers, Close Row Policy: after each access, the row buffer is automatically pre-charged (an implicit pre-charge, even if the next request targets the same bank), so memory latency is the same for all requests. Drawbacks: it cannot take advantage of locality (row hits), and its latency is much longer than that of an open request.

Predictable Memory Controllers, Interleaved Banks: each request accesses data in multiple banks (Bank 1 to Bank 4), so the data transfers can be pipelined.

Predictable Memory Controllers, Interleaved Banks: since requestors can access all banks, they can close each other's row buffers; thus the close row policy is used to make the latency predictable. The problem of the long latency of the close row policy still exists!

Predictable Memory Controllers, Interleaved Banks: this is good for systems with a small DRAM data bus width (e.g., 16 bits). Larger data buses can transfer the same amount of data without interleaving as many banks.
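The claim above can be made concrete with a small back-of-the-envelope calculation; the 64-byte request size and burst length of 8 transfers are common values assumed here for illustration, not taken from the talk.

```python
# Sketch: how data-bus width affects the number of banks that must be
# interleaved to serve one request. Assumed values, for illustration.
BURST_LENGTH  = 8    # transfers per READ/WRITE command (BL8)
REQUEST_BYTES = 64   # typical cache-line-sized request

def banks_needed(bus_bits: int) -> int:
    """Banks to interleave so one request is served in one pass."""
    bytes_per_command = (bus_bits // 8) * BURST_LENGTH
    return max(1, REQUEST_BYTES // bytes_per_command)

print(banks_needed(16))  # 4: narrow bus, interleave four banks
print(banks_needed(32))  # 2: wider bus, two banks suffice
print(banks_needed(64))  # 1: no interleaving needed at all
```

This is why, as the slide notes, the effectiveness of interleaving diminishes as the data bus widens.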

Predictable Memory Controllers, Interleaved Banks: interleaving two banks for a wider data bus (e.g., 32 bits) wastes time on the data bus. Interleaving problems: 1. requestors can close each other's rows (interference); 2. it must be used with the close row policy to make latency predictable; 3. for a wider data bus, the effectiveness of interleaving is diminished.

Predictable Memory Controllers, Private Banks: banks can be partitioned among requestors or tasks (e.g., Core 1, Core 2, and the DMA each get their own bank). This can be done in hardware if the memory controller supports it, by the compiler, or in the OS using virtual memory.
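A minimal sketch of the OS/virtual-memory option, assuming a hypothetical physical address mapping in which the bank index sits at bits 13-15; real mappings are device- and controller-specific.

```python
# Sketch of OS-level bank partitioning via page coloring. The bit
# positions are an assumed DRAM address mapping, for illustration only.
BANK_SHIFT = 13   # bank bits sit above an 8 KiB column region (assumed)
BANK_MASK  = 0x7  # 8 banks

def bank_of(phys_addr: int) -> int:
    return (phys_addr >> BANK_SHIFT) & BANK_MASK

def frames_for_bank(bank: int, n_frames: int, frame_size: int = 4096):
    """Collect physical frame addresses that all fall in one bank."""
    addr, out = 0, []
    while len(out) < n_frames:
        if bank_of(addr) == bank:
            out.append(addr)
        addr += frame_size
    return out

# Every frame handed to a core maps to that core's private bank.
print(all(bank_of(a) == 2 for a in frames_for_bank(2, 16)))
```

By only giving a core page frames of its own bank color, the OS guarantees that its accesses never touch another requestor's row buffers.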

Related Work. AMC [1] and Predator [2]: close row policy, interleaved banks. Conservative Open-Page [3]: interleaved banks; leaves the row open for a small window of time. PRET DRAM Controller [4]: close row policy, private banks.

Our Approach: private banks eliminate row buffer interference from other requestors, and the open row policy reduces latency by taking advantage of the row hit ratio (locality). Challenges: 1. the analysis is more complex; 2. there are more than 20 timing constraints; 3. the latency depends on the dynamic state of the DRAM.

Outline 1. Background & Related Work 2. Memory Controller Model 3. Worst Case Latency Analysis 4. Results & Conclusion

Memory Controller Model: we focus on the back-end latency and ignore the constant front-end delay. The front end contains per-requestor buffers (Core 1, Core 2, DMA) and the command generator; the back end contains the global FIFO queue, the command bus, and the data bus.

Memory Controller Model: each requestor has a private buffer for its memory commands; a global FIFO is used for arbitration.

Memory Controller Model: the command at the head of each private buffer is inserted into the global FIFO.

Memory Controller Model: the controller scans the global FIFO from front to back for a command that can be issued.

Memory Controller Model: after a command is issued, the requestor's next command must wait until its timing constraints are satisfied before it can be inserted into the FIFO. Intuitively, the arbitration is fair and similar to a round-robin policy.
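The arbitration described above can be sketched as follows. The Command fields and the single ready_at value are simplified placeholders standing in for the paper's full command and timing-constraint model.

```python
# Minimal sketch of the back-end arbitration: each cycle, scan the
# global FIFO from front to back and issue the first command whose
# timing constraints are satisfied.
from collections import deque

class Command:
    def __init__(self, requestor, kind, ready_at):
        self.requestor = requestor  # which requestor inserted it
        self.kind = kind            # 'P', 'A', 'R', or 'W'
        self.ready_at = ready_at    # earliest cycle its constraints allow

def issue_one(fifo, cycle):
    """Issue the first ready command in FIFO order; return it or None."""
    for cmd in fifo:
        if cmd.ready_at <= cycle:
            fifo.remove(cmd)
            return cmd
    return None  # every queued command is still blocked

fifo = deque([Command('core1', 'R', ready_at=5),
              Command('dma',   'W', ready_at=0)])
first = issue_one(fifo, cycle=0)
print(first.requestor)  # dma: core1's READ is blocked, so DMA goes first
```

Note how a blocked command at the front does not stall the queue; a later ready command can overtake it, which is what makes the latency analysis non-trivial.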

Outline 1. Background & Related Work 2. Memory Controller Model 3. Worst Case Latency Analysis 4. Results & Conclusion

Worst Case Analysis. Inputs: the total number of requestors, the memory device parameters, and, for the task under analysis, the number of open reads, close reads, open writes, and close writes. Part 1 (the main contribution, which works for any type of core): worst-case single request latency analysis, yielding the latency of each type of request (open read, close read, open write, close write). Part 2 (only provided for in-order cores): the cumulative worst-case execution time (WCET). Assumption: we do not know the activity of the other interfering requestors, so we assume they produce the worst-case pattern to cause maximum interference.

Single Request Latency: decomposed into two parts. First, from request arrival until the READ/WRITE command is inserted into the global FIFO (arrival to R/W); second, from the READ/WRITE being inserted into the FIFO until the data finishes transmitting (R/W to data).

Single Request Latency: the arrival-to-R/W part may include pre-charge and ACT commands, so its latency depends on the previous request (i.e., on the state of the DRAM); the R/W-to-data part does not depend on the state of the DRAM.

Single Request Latency: both parts depend on the number of interfering requestors as well as on the DRAM timing constraints.

Single Request Latency: for details on the arrival-to-R/W part, refer to the paper; here we focus on the R/W-to-data part.

Read/Write to Data Latency: read-to-read has no timing constraints, only contention on the data bus; the same holds for write-to-write.

Read/Write to Data Latency: therefore, an alternation of read and write commands produces the longest latency, due to the write-to-read and read-to-write timing constraints.
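A toy model of the data bus makes the effect visible. The switching gaps below are assumed illustrative values standing in for the bounds the paper derives from the write-to-read and read-to-write constraints of a real device.

```python
# Sketch: alternating reads and writes cost more data-bus time than
# same-type streams. All cycle counts are illustrative placeholders.
BURST      = 4   # cycles of data transfer per command
W_TO_R_GAP = 8   # extra cycles a READ waits after a WRITE burst
R_TO_W_GAP = 6   # extra cycles a WRITE waits after a READ burst

def data_bus_time(kinds):
    """Total cycles to serve a sequence of 'R'/'W' commands."""
    total, prev = 0, None
    for k in kinds:
        if prev is not None and prev != k:
            total += W_TO_R_GAP if prev == 'W' else R_TO_W_GAP
        total += BURST
        prev = k
    return total

print(data_bus_time(['R'] * 4))             # 16: back-to-back reads
print(data_bus_time(['W', 'R', 'W', 'R']))  # 38: every switch pays a gap
```

Four same-type commands only pay for their bursts, while the alternating pattern pays a switching gap on every command, so the worst-case interfering requestors are assumed to alternate.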

Read/Write to Data Latency, interference on a WRITE command: all other requestors insert READ/WRITE commands ahead of it to create maximum interference.

Read/Write to Data Latency, interference on a WRITE command: a write command could have finished immediately before t0.

Read/Write to Data Latency, interference on a WRITE command: that preceding write therefore further delays the first READ command.

Worst Case Analysis (recap): Part 1, the worst-case single request latency analysis, yields the latency of each type of request (open read, close read, open write, close write); Part 2, only provided for in-order cores, combines these with the task's per-type request counts into a cumulative worst-case execution time (WCET).

Cumulative Latency: the task under analysis issues a sequence of requests over time, each of type open read, close read, open write, or close write.

Cumulative Latency: the worst-case request order depends on input values, code paths, cache state, etc. If the worst-case request order were known, we could simply sum the latency of each request.

Cumulative Latency: static analysis tools can be used to obtain safe bounds on the number of requests of each type.

Cumulative Latency: which pattern of requests leads to the worst-case latency? This problem can be solved in constant time; see the paper for details.
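Ignoring ordering effects, the simplest cumulative bound is a weighted sum of the per-type counts and per-type worst-case latencies. The latency values below are hypothetical placeholders; the paper's constant-time solution additionally finds the worst-case request ordering.

```python
# Sketch: cumulative latency bound from Part 1's per-type latencies and
# static per-type request counts. Latency values are assumed, not real.
LATENCY = {               # worst-case cycles per request type (assumed)
    'open_read':   40,
    'close_read':  75,
    'open_write':  45,
    'close_write': 80,
}

def cumulative_bound(counts):
    """Upper bound on total memory latency for the task under analysis."""
    return sum(LATENCY[t] * n for t, n in counts.items())

bound = cumulative_bound({'open_read': 10, 'close_read': 2,
                          'open_write': 5, 'close_write': 1})
print(bound)  # 855
```

This weighted sum is safe regardless of the order in which the requests actually occur, because each term already assumes its own worst case.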

Outline 1. Background & Related Work 2. Memory Controller Model 3. Worst Case Latency Analysis (Single Request Latency, Cumulative Latency) 4. Results & Conclusion

Results: comparison against the Analyzable Memory Controller (AMC) [1], since it uses fair (round-robin) arbitration similar to our approach. Synthetic benchmarks show how the worst-case latency varies as parameters change. For the CHStone benchmarks, memory traces are obtained from the gem5 simulator and used as input to the worst-case analysis.

Results: synthetic benchmarks.

Results: as memory devices become faster, the difference between open and close accesses grows, so the close row policy is becoming too pessimistic. (50% row hit ratio, 4 requestors, 20% writes; latency in ns.)

Device          | 800D  | 1066F  | 1333H  | 1600K  | 1866L | 2133N | % better
AMC (64 bits)   | 185   | 185.27 | 180.9  | 178    | 169.84 | 163   | 11.89%
Ours (64 bits)  | 125.2 | 112.47 | 104.85 | 102.18 | 96.97  | 92.85 | 25.84%

Results: CHStone benchmarks for a 64-bit bus.

Conclusion: a novel worst-case analysis that takes the dynamic state of the DRAM into account. The open row policy can reduce memory latency as devices become faster. A private bank scheme eliminates row buffer interference from other requestors.

Future Work: discussion of shared data; bus utilization is still poor due to read/write switching, so read/write optimization could reduce the latency bound; handling multiple ranks; implementation in hardware.

References
[1] M. Paolieri, E. Quiñones, F. Cazorla, and M. Valero, "An Analyzable Memory Controller for Hard Real-Time CMPs," IEEE Embedded Systems Letters, vol. 1, no. 4, pp. 86-90, 2009.
[2] B. Akesson, K. Goossens, and M. Ringhofer, "Predator: A Predictable SDRAM Memory Controller," in CODES+ISSS, 2007, pp. 251-256.
[3] S. Goossens, B. Akesson, and K. Goossens, "Conservative Open-Page Policy for Mixed Time-Criticality Memory Controllers," in DATE, 2013.
[4] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, "PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation," in CODES+ISSS, 2011, pp. 99-108.