Memory Access Scheduling

Memory Access Scheduling
ECE 5900 Computer Engineering Seminar, spring 05
Ying Xu, March 4, 2005. Instructor: Dr. Chigan

Outline
Introduction; modern DRAM architecture; memory access scheduling (structure of the access scheduler, scheduling policies); experimental results (first-ready scheduling, aggressive reordering); conclusions.

Introduction
The bandwidth of memory chips has increased dramatically (e.g., SDRAM, DDR2). Media processors generate streaming memory reference patterns, and for them memory bandwidth is the bottleneck.

Introduction (cont'd)
Pipelining memory accesses maximizes memory bandwidth, but sequential accesses to different rows of the same bank cannot be pipelined. Memory access scheduling reorders the underlying DRAM operations (bank precharge, row activation, column access), so memory references may complete out of order.

Introduction (cont'd) (figure)

Characteristics of DRAM architecture
DRAMs are not truly random-access devices. They are three-dimensional memories, addressed by bank, row, and column, and accessed through three operations: bank precharge, row activation, and column access.
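The three-dimensional addressing above can be sketched as a toy address decomposition. The 4-bank, 2048-row, 512-column geometry and word addressing here are illustrative assumptions, not the configuration used in the paper:

```python
# Toy address decomposition for a hypothetical DRAM geometry:
# 4 banks, 2048 rows, 512 columns per row, word-addressed.
N_BANKS, N_ROWS, N_COLS = 4, 2048, 512

def decompose(addr):
    """Split a linear word address into (bank, row, column)."""
    col = addr % N_COLS
    bank = (addr // N_COLS) % N_BANKS   # interleave banks on row boundaries
    row = addr // (N_COLS * N_BANKS)
    return bank, row, col

# Consecutive addresses walk through a row, then rotate across banks.
print(decompose(0))    # (0, 0, 0)
print(decompose(512))  # (1, 0, 0)
```

With this mapping, a sequential stream stays within an open row for 512 words and then moves to a different bank, which is exactly the kind of pattern the scheduler can pipeline.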

DRAM organization (figure)

Resource constraints of DRAMs
DRAM resources: the internal banks, a single set of address lines, and a single set of data lines. Different operations place different demands on these resources.

Bank state (figure)

Memory access scheduling
Memory access scheduling is the process of ordering DRAM operations subject to these resource constraints. The simplest policy serves the oldest pending reference first, but this is inefficient: when the DRAM is not ready for the oldest reference, the available resources sit idle. More sophisticated scheduling algorithms are needed.

Memory access scheduler structure (figure)

Memory access scheduling policies (table)

Memory access scheduling algorithm
A scheduling algorithm is a combination of the policies used by the precharge manager, the row arbiter, the column arbiter, and the address arbiter. The address arbiter decides which of the selected precharge, row, and column operations to perform. Its choices include: in-order, priority, precharge operations first, row operations first, or column operations first.
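The address arbiter's "operation type first" choices can be sketched as a selection over the operations nominated by the other units. The function name, the tuple representation, and the priority table are assumptions for illustration:

```python
# Hypothetical address arbiter: given the operations nominated by the
# precharge manager, row arbiter, and column arbiter, pick one to issue
# according to a fixed priority order over operation types.
PRIORITY = {"column": 0, "row": 1, "precharge": 2}  # column-first ordering

def address_arbiter(ready_ops, priority=PRIORITY):
    """ready_ops: list of (op_type, bank) pairs that meet timing constraints.
    Return the operation whose type ranks highest (lowest number), or None."""
    if not ready_ops:
        return None
    return min(ready_ops, key=lambda op: priority[op[0]])

ops = [("precharge", 2), ("row", 0), ("column", 1)]
print(address_arbiter(ops))  # ('column', 1) under the column-first ordering
```

Swapping in a different priority table (e.g., precharge-first or row-first) changes the ordering without touching the rest of the scheduler, which is why these policies compose cleanly.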

Experimental setup
Streaming media processors are a good fit for this study: streams lack temporal locality, and stream transfer bandwidth drives processor performance. The Imagine stream processor is simulated at a frequency of 500 MHz, with a DRAM frequency of 125 MHz and a peak system bandwidth of 2 GB/s.

Experimental setup (cont'd): benchmarks and media processing applications (table)

In-order scheduling
The in-order access scheduler performs no access reordering: a column access is performed only for the oldest pending reference, and the same holds for bank precharge and row activation. This serves as the baseline.

First-ready scheduling
First-ready scheduling uses the ordered-priority scheme for all units: subject to resource and timing constraints, it schedules an operation for the oldest pending reference that is ready. Benefits: accesses targeting other banks can be performed while waiting for a precharge or row activation, giving parallelism through multiple references in progress.
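A minimal sketch contrasting the in-order baseline with first-ready scheduling, under a deliberately simplified bank model (a single activation latency, one column access per cycle; all timings are illustrative assumptions, not real DRAM parameters):

```python
# Toy comparison of in-order vs. first-ready scheduling on a 2-bank DRAM.
# Simplified model: a bank's row must be activated before a column access;
# activation occupies the bank for ACT_CYCLES, then one column access
# completes per cycle. Timings are illustrative only.
ACT_CYCLES = 3

def run(queue, first_ready):
    """queue: list of (bank, row) references, oldest first.
    Return the number of cycles needed to retire every reference."""
    open_row = {}                   # bank -> currently active row
    busy_until = {}                 # bank -> cycle when activation finishes
    pending = list(queue)
    cycle = 0
    while pending:
        cycle += 1
        # in-order looks only at the queue head; first-ready scans the queue
        candidates = pending if first_ready else pending[:1]
        for ref in candidates:
            bank, row = ref
            if busy_until.get(bank, 0) > cycle:
                continue            # bank still activating; try next candidate
            if open_row.get(bank) != row:
                open_row[bank] = row
                busy_until[bank] = cycle + ACT_CYCLES
                break               # issue the activation this cycle
            pending.remove(ref)     # row hit: column access completes
            break
    return cycle

refs = [(0, 1), (1, 1), (0, 1), (1, 1)]  # references alternating between banks
print(run(refs, first_ready=False), run(refs, first_ready=True))  # prints: 10 7
```

The in-order run stalls while bank 0 activates even though bank 1 is free; the first-ready run overlaps the two activations, which is exactly the parallelism described above.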

Experimental results: sustained memory bandwidth increased by about 79% (figure).

Experimental results: sustained bandwidth increased by about 17% (figure).

Experimental results: sustained memory bandwidth increased by about 79% (figure).

Aggressive reordering
A drawback of first-ready scheduling: it precharges a bank as soon as the oldest pending reference targets a different row than the bank's active row, even while there are still multiple pending references to the active row. Aggressive reordering addresses this to further increase sustained memory bandwidth.
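The idea of not closing a row that still has pending column accesses can be sketched as a column arbiter that prefers row hits over the oldest reference. The function and data structures here are hypothetical illustrations, not the paper's implementation:

```python
# Sketch of an "aggressive" column arbiter: unlike first-ready, it prefers
# any pending reference that hits a bank's active row, so a row is not
# closed while it still has pending column accesses.
def pick_column_access(pending, open_row):
    """pending: list of (bank, row) references, oldest first.
    open_row: bank -> currently active row.
    Prefer the oldest row hit; otherwise fall back to the oldest reference."""
    for ref in pending:
        bank, row = ref
        if open_row.get(bank) == row:
            return ref              # row hit: serve without a precharge
    return pending[0] if pending else None

open_row = {0: 7, 1: 3}
refs = [(0, 9), (1, 3), (0, 7)]     # the oldest reference misses bank 0's row
print(pick_column_access(refs, open_row))  # (1, 3): the oldest row hit wins
```

Note the trade-off this creates: row hits jump ahead of older row misses, so fairness or starvation avoidance would need extra mechanism in a real scheduler.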

Possible reordering scheduling algorithm policies
There is a large range of possible memory access schedulers; four representative algorithms are evaluated (table).

Experimental results: bandwidth improved by 106-144% (figure).

Experimental results: bandwidth improved by 27-30% (figure).

Experimental results: bandwidth improved by 85-93% (figure).

Row-first policy vs. column-first policy
In the address arbiter, the row-first policy always selects a row operation first, while the column-first policy always selects a column operation first. There is little difference across the benchmarks, with one exception: FFT. This has less to do with the scheduling algorithm than with the characteristics of the benchmark itself. FFT is the most sensitive to stream load latency, and the column-first policy can allow a store stream to delay load streams.

Open or closed precharge policy?
Under the closed precharge policy, a bank is precharged as soon as there are no pending references to its active row. Under the open precharge policy, a bank is precharged only when there are no pending references to the active row but there are pending references to other rows of the same bank. The difference between the two policies is slight. Benchmarks with random access patterns prefer the closed precharge policy: with little reference locality, there is no benefit to keeping a row open. FFT prefers the open precharge policy, since it makes numerous accesses to each row.
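The two precharge policies described above can be sketched as a single decision function in the precharge manager. The names and data structures are assumptions for illustration:

```python
# Hedged sketch of the precharge manager's decision under the two policies.
# pending: list of (bank, row) references still waiting.
# active_row: the row currently open in `bank`.
def should_precharge(bank, active_row, pending, policy):
    hits = any(b == bank and r == active_row for b, r in pending)
    conflicts = any(b == bank and r != active_row for b, r in pending)
    if policy == "closed":
        # close as soon as nothing else targets the active row
        return not hits
    if policy == "open":
        # close only when another row of the same bank is actually wanted
        return not hits and conflicts
    raise ValueError(policy)

pending = []                          # no references left for this bank
print(should_precharge(0, 5, pending, "closed"))  # True: close eagerly
print(should_precharge(0, 5, pending, "open"))    # False: keep the row open
```

The empty-queue case above is where the policies diverge most clearly: closed pays the precharge now in the hope of hiding it, open bets that the next reference will hit the still-active row.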

Effect of bank buffer size: row/closed scheduling algorithm (figure).

Conclusions
Memory access scheduling greatly increases bandwidth utilization by buffering memory references, accessing the internal banks in parallel, and maximizing the number of column accesses per row access. The first-ready scheduling algorithm yields a 79% bandwidth improvement on microbenchmarks and 40% on application traces. Aggressive reordering algorithms yield a 144% improvement on microbenchmarks, 30% on media processing applications, and 93% on application traces.

Conclusions (cont'd)
The closed precharge policy is preferred by most benchmarks: banks are precharged as soon as the last column reference to an active row is completed. There is little performance difference between the row-first and column-first policies. For latency-sensitive applications, scheduling loads ahead of stores is preferred.

Paper reference
Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens. "Memory Access Scheduling." In Proceedings of the 27th Annual International Symposium on Computer Architecture, ACM SIGARCH Computer Architecture News, Vol. 28, Issue 2, May 2000.

Thank you!