Computer Architecture Lecture 24: Memory Scheduling


18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014

Last Two Lectures
- Main Memory Organization and DRAM Operation
- Memory Controllers
- DRAM Design and Enhancements
- More Detailed DRAM Design: Subarrays
- RowClone and In-DRAM Computation
- Tiered-Latency DRAM
- Memory Access Scheduling
- FR-FCFS row-hit-first scheduling

Today
- Row Buffer Management Policies
- Memory Interference (and Techniques to Manage It)
- With a focus on Memory Request Scheduling

Review: DRAM Scheduling Policies (I)
- FCFS (first come first served): oldest request first
- FR-FCFS (first ready, first come first served): 1. row-hit first, 2. oldest first
  - Goal: maximize row buffer hit rate → maximize DRAM throughput
- Actually, scheduling is done at the command level
  - Column commands (read/write) prioritized over row commands (activate/precharge)
  - Within each group, older commands prioritized over younger ones
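As a concrete illustration of the request-level rule, here is a minimal C sketch (the request fields and function name are hypothetical, not from the lecture; real controllers apply the same ordering at the command level, as noted above):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t arrival_time;  /* cycle when the request entered the buffer */
        bool     row_hit;       /* does it target the currently open row?   */
    } request_t;

    /* Returns true if request a should be scheduled before request b
       under FR-FCFS: (1) row-hit requests first, (2) older requests first. */
    bool fr_fcfs_higher_priority(const request_t *a, const request_t *b)
    {
        if (a->row_hit != b->row_hit)
            return a->row_hit;                     /* rule 1: row-hit first */
        return a->arrival_time < b->arrival_time;  /* rule 2: oldest first  */
    }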

Review: DRAM Scheduling Policies (II)
- A scheduling policy is essentially a prioritization order
- Prioritization can be based on:
  - Request age
  - Row buffer hit/miss status
  - Request type (prefetch, read, write)
  - Requestor type (load miss or store miss)
  - Request criticality: Is it the oldest miss in the core? How many instructions in the core depend on it?

Row Buffer Management Policies
- Open row: keep the row open after an access
  + Next access might need the same row → row hit
  -- Next access might need a different row → row conflict, wasted energy
- Closed row: close the row after an access (if no other requests already in the request buffer need the same row)
  + Next access might need a different row → avoid a row conflict
  -- Next access might need the same row → extra activate latency
- Adaptive policies: predict whether or not the next access to the bank will be to the same row

Open vs. Closed Row Policies

Policy       First access   Next access                                       Commands needed for next access
Open row     Row 0          Row 0 (row hit)                                   Read
Open row     Row 0          Row 1 (row conflict)                              Precharge + Activate Row 1 + Read
Closed row   Row 0          Row 0 access in request buffer (row hit)          Read
Closed row   Row 0          Row 0 access not in request buffer (row closed)   Activate Row 0 + Read + Precharge
Closed row   Row 0          Row 1 (row closed)                                Activate Row 1 + Read + Precharge
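The command sequences in the table map directly to code. Below is a minimal sketch of the decision for one bank (the enum, struct-free interface, and function name are assumptions for illustration, not a real controller interface):

    typedef enum { CMD_ACTIVATE, CMD_READ, CMD_PRECHARGE } dram_cmd_t;

    /* Emit the commands needed to read 'row' from a bank whose row buffer
       currently holds 'open_row' (-1 if no row is open), per the table above.
       Returns the number of commands written to cmds[]. */
    int commands_for_access(int open_row, int row, dram_cmd_t cmds[3])
    {
        int n = 0;
        if (open_row == row) {        /* row hit: data already in row buffer */
            cmds[n++] = CMD_READ;
        } else if (open_row == -1) {  /* row closed: just activate and read  */
            cmds[n++] = CMD_ACTIVATE;
            cmds[n++] = CMD_READ;
        } else {                      /* row conflict: close the old row first */
            cmds[n++] = CMD_PRECHARGE;
            cmds[n++] = CMD_ACTIVATE;
            cmds[n++] = CMD_READ;
        }
        return n;  /* access latency grows with each extra command */
    }

Under a closed-row policy, the controller would additionally issue a Precharge after the Read whenever no other buffered request targets the same row, so the next access always starts from the row-closed case.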

Memory Interference and Scheduling in Multi-Core Systems

Review: A Modern DRAM Controller (figure: controller organization)

Review: DRAM Bank Operation (figure) — Accesses to (Row 0, Column 0), (Row 0, Column 1), and (Row 0, Column 85) all hit in the row buffer once Row 0 is activated; the subsequent access to (Row 1, Column 0) is a row conflict, requiring the open row to be closed and Row 1 to be activated.

Scheduling Policy for Single-Core Systems
- A row-conflict memory access takes significantly longer than a row-hit access
- Current controllers take advantage of the row buffer
- FR-FCFS (first ready, first come first served) scheduling policy: 1. row-hit first, 2. oldest first
  - Goal 1: maximize row buffer hit rate → maximize DRAM throughput
  - Goal 2: prioritize older requests → ensure forward progress
- Is this a good policy in a multi-core system?

Trend: Many Cores on Chip
- Simpler and lower power than a single large core
- Large-scale parallelism on chip
- Examples: AMD Barcelona (4 cores), Intel Core i7 (8 cores), IBM Cell BE (8+1 cores), IBM POWER7 (8 cores), Sun Niagara II (8 cores), Nvidia Fermi (448 cores), Intel SCC (48 cores, networked), Tilera TILE-Gx (100 cores, networked)

Many Cores on Chip
- What we want: N times the system performance with N times the cores
- What do we get today?

(Un)expected Slowdowns in Multi-Core (figure) — A high-priority application (Core 0) can be slowed down far more than a co-running low-priority application (Core 1). Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," USENIX Security 2007.

Uncontrolled Interference: An Example (figure) — A multi-core chip with two cores, one running stream and the other running random; each core has a private L2 cache, and both share the interconnect, the DRAM memory controller, and DRAM Banks 0-3 (the shared DRAM memory system). Interference in this shared system causes unfairness.

A Memory Performance Hog

    // initialize large arrays A, B
    for (j = 0; j < n; j++) {
        index = j * linesize;   // streaming access pattern
        A[index] = B[index];
    }

    // initialize large arrays A, B
    for (j = 0; j < n; j++) {
        index = rand();         // random access pattern
        A[index] = B[index];
    }

- STREAM: sequential memory access; very high row buffer locality (96% hit rate); memory intensive
- RANDOM: random memory access; very low row buffer locality (3% hit rate); similarly memory intensive

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.

What Does the Memory Hog Do? (figure) — The memory request buffer fills with T0 (STREAM) requests to Row 0 interleaved with T1 (RANDOM) requests to other rows (e.g., Rows 5, 111, 16). Because Row 0 stays open in the row buffer, row-hit-first scheduling keeps servicing T0's requests first. With a row size of 8KB and a cache block size of 64B, up to 128 (8KB/64B) requests of T0 are serviced before T1. Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.

Effect of the Memory Performance Hog (figure) — Running together, the hog (STREAM) suffers only a 1.18X slowdown while the co-running application (gcc) suffers a 2.82X slowdown. Results on an Intel Pentium D running Windows XP (similar results for Intel Core Duo and AMD Turion, and on Fedora Linux). Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.

Problems due to Uncontrolled Interference (figure) — Main memory is the only shared resource in the illustration; a memory performance hog on one core makes the other cores' threads progress very slowly. Consequences: unfair slowdown of different threads, low system performance, vulnerability to denial of service, and priority inversion (unable to enforce priorities/SLAs).

Problems due to Uncontrolled Interference
- Unfair slowdown of different threads
- Low system performance
- Vulnerability to denial of service
- Priority inversion: unable to enforce priorities/SLAs
- Poor performance predictability (no performance isolation)
→ An uncontrollable, unpredictable system

Inter-Thread Interference in Memory
- Memory controllers, pins, and memory banks are shared
- Pin bandwidth is not increasing as fast as the number of cores → bandwidth per core is decreasing
- Different threads executing on different cores interfere with each other in the main memory system
- Threads delay each other by causing resource contention: bank, bus, and row-buffer conflicts → reduced DRAM throughput
- Threads can also destroy each other's DRAM bank parallelism: otherwise-parallel requests can become serialized

Effects of Inter-Thread Interference in DRAM
- Queueing/contention delays: bank conflicts, bus conflicts, channel conflicts, ...
- Additional delays due to DRAM constraints (called "protocol overhead"), e.g., row conflicts and read-to-write / write-to-read delays
- Loss of intra-thread parallelism: a thread's concurrent requests are serviced serially instead of in parallel

Problem: QoS-Unaware Memory Control
- Existing DRAM controllers are unaware of inter-thread interference in the DRAM system
- They simply aim to maximize DRAM throughput
  - Thread-unaware and thread-unfair
  - No intent to service each thread's requests in parallel
  - FR-FCFS policy: 1) row-hit first, 2) oldest first
    - Unfairly prioritizes threads with high row-buffer locality
    - Unfairly prioritizes threads that are memory intensive (many outstanding memory accesses)

Solution: QoS-Aware Memory Request Scheduling (figure: multiple cores sharing one memory controller in front of memory)
- The memory controller resolves memory contention by scheduling requests
- How to schedule requests to provide: high system performance, high fairness to applications, and configurability to system software
- The memory controller needs to be aware of threads

Stall-Time Fair Memory Scheduling

Onur Mutlu and Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," 40th International Symposium on Microarchitecture (MICRO), pages 146-158, Chicago, IL, December 2007. Slides (ppt) — STFM MICRO 2007 Talk

The Problem: Unfairness
- Vulnerable to denial of service
- Unable to enforce priorities or service-level agreements
- Low system performance
→ An uncontrollable, unpredictable system

How Do We Solve the Problem?
- Stall-time fair memory scheduling [Mutlu+ MICRO'07]
- Goal: threads sharing main memory should experience similar slowdowns compared to when they are run alone → fair scheduling
  - Also improves overall system performance by ensuring cores make proportional progress
- Idea: the memory controller estimates each thread's slowdown due to interference and schedules requests in a way that balances the slowdowns

Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," MICRO 2007.

Stall-Time Fairness in Shared DRAM Systems
- A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system
- DRAM-related stall-time: the time a thread spends waiting for DRAM memory
  - ST_shared: DRAM-related stall-time when the thread runs with other threads
  - ST_alone: DRAM-related stall-time when the thread runs alone
- Memory-slowdown = ST_shared / ST_alone (the relative increase in stall-time)
- The Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads, without sacrificing performance
  - Considers the inherent DRAM performance of each thread
  - Aims to allow proportional progress of threads
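For concreteness, a worked example with invented numbers: if a thread's ST_shared is 60M cycles and its ST_alone is 30M cycles, its Memory-slowdown is 60M / 30M = 2.0; if another thread's Memory-slowdown is 1.25, the unfairness under this metric is 2.0 / 1.25 = 1.6.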

STFM Scheduling Algorithm [MICRO'07]
- For each thread, the DRAM controller
  - Tracks ST_shared
  - Estimates ST_alone
- Each cycle, the DRAM controller
  - Computes Slowdown = ST_shared / ST_alone for threads with legal requests
  - Computes unfairness = MAX Slowdown / MIN Slowdown
  - If unfairness is below a threshold α: use the DRAM-throughput-oriented scheduling policy
  - If unfairness is at or above α: use the fairness-oriented scheduling policy: (1) requests from the thread with MAX Slowdown first, (2) row-hit first, (3) oldest first
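To make the per-cycle decision concrete, here is a minimal C sketch (the thread_t bookkeeping struct, the constant values, and the function name are assumptions for illustration; the real controller estimates ST_alone with dedicated interference-tracking hardware):

    #include <stdbool.h>

    #define NUM_THREADS 4
    #define ALPHA 1.10   /* fairness threshold; tunable (hypothetical value) */

    typedef struct {
        double st_shared;    /* measured DRAM-related stall time, shared run  */
        double st_alone;     /* estimated DRAM-related stall time, alone run  */
        bool   has_request;  /* thread has a schedulable (legal) request      */
    } thread_t;

    /* Each DRAM cycle: compute unfairness and pick the policy. Returns the
       id of the thread whose requests get top priority under the
       fairness-oriented policy, or -1 to fall back to plain FR-FCFS. */
    int stfm_pick_policy(const thread_t threads[NUM_THREADS])
    {
        double max_sd = 0.0, min_sd = 1e9;
        int max_thread = -1;

        for (int i = 0; i < NUM_THREADS; i++) {
            if (!threads[i].has_request)
                continue;
            double sd = threads[i].st_shared / threads[i].st_alone;
            if (sd > max_sd) { max_sd = sd; max_thread = i; }
            if (sd < min_sd) min_sd = sd;
        }

        double unfairness = max_sd / min_sd;
        if (unfairness < ALPHA)
            return -1;       /* throughput-oriented: FR-FCFS */
        return max_thread;   /* fairness-oriented: most-slowed thread first,
                                then row-hit first, then oldest first */
    }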

How Does STFM Prevent Unfairness? (figure) — An animation steps through requests from T0 (to Row 0) and T1 (to Rows 5, 111, and 16) waiting in the buffer. As each request is serviced, the controller updates both threads' estimated slowdowns (values such as 1.00, 1.04, 1.07, 1.10 for T0 and 1.00, 1.03, 1.08, 1.14 for T1 appear across the steps) and the resulting unfairness; once unfairness crosses the threshold, STFM services the most-slowed-down thread's request even though it is not a row hit.

STFM Pros and Cons
- Upsides:
  - First algorithm for fair multi-core memory scheduling
  - Provides a mechanism to estimate the memory slowdown of a thread
  - Good at providing fairness
  - Being fair can improve performance
- Downsides:
  - Does not handle all types of interference
  - (Somewhat) complex to implement
  - Slowdown estimations can be incorrect

Parallelism-Aware Batch Scheduling

Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems," 35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008. Slides (ppt) — PAR-BS ISCA 2008 Talk

Another Problem due to Interference
- Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests
  - Memory-Level Parallelism (MLP)
  - Out-of-order execution, non-blocking caches, runahead execution
- Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks
- Multiple threads share the DRAM controller
- DRAM controllers are not aware of a thread's MLP: they can service each thread's outstanding requests serially, not in parallel

Bank Parallelism of a Thread (figure) — A single thread issues 2 DRAM requests: Thread A computes, then stalls on requests to Bank 0 (Row 1) and Bank 1 (Row 1), then resumes computing. The bank access latencies of the two requests overlap, so the thread stalls for ~ONE bank access latency.

Bank Parallelism Interference in DRAM (figure) — Baseline scheduler, two threads with 2 DRAM requests each: Thread A accesses Bank 0 (Row 1) and Bank 1 (Row 1); Thread B accesses Bank 1 (Row 99) and Bank 0 (Row 99). The scheduler interleaves the two threads' requests, so each thread's bank access latencies are serialized and each thread stalls for ~TWO bank access latencies.

Parallelism-Aware Scheduler (figure) — With the same four requests, a parallelism-aware scheduler services each thread's two requests back to back in different banks: Thread A's two accesses overlap, then Thread B's two accesses overlap, saving cycles relative to the baseline. Average stall-time: ~1.5 bank access latencies.

Parallelism-Aware Batch Scheduling (PAR-BS)
- Principle 1: parallelism-awareness
  - Schedule requests from a thread (to different banks) back to back
  - Preserves each thread's bank parallelism
  - But this can cause starvation...
- Principle 2: request batching
  - Group a fixed number of oldest requests from each thread into a batch
  - Service the batch before all other requests
  - Form a new batch when the current one is done
  - Eliminates starvation, provides fairness
  - Allows parallelism-awareness within a batch
(figure: a batch of requests from threads T0-T3 queued at Bank 0 and Bank 1)

Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling," ISCA 2008.

PAR-BS Components
- Request batching
- Within-batch scheduling: parallelism-aware

Request Batching
- Each memory request has a bit ("marked") associated with it
- Batch formation:
  - Mark up to Marking-Cap oldest requests per bank for each thread
  - Marked requests constitute the batch
  - Form a new batch when no marked requests are left
- Marked requests are prioritized over unmarked ones
  - No reordering of requests across batches: no starvation, high fairness
- How to prioritize requests within a batch?
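A minimal C sketch of batch formation (the queue layout, constants, and names are assumptions for illustration; the key point is that at most Marking-Cap requests per thread are marked at each bank):

    #include <stdbool.h>
    #include <stddef.h>

    #define NUM_THREADS  4
    #define NUM_BANKS    8
    #define MARKING_CAP  5   /* example value; a tunable parameter */

    typedef struct request {
        int  thread_id;
        bool marked;
        struct request *next;   /* per-bank queue, oldest first */
    } request_t;

    /* Form a new batch: mark up to MARKING_CAP oldest requests per bank
       for each thread. Called when no marked requests remain. */
    void form_batch(request_t *bank_queues[NUM_BANKS])
    {
        for (int b = 0; b < NUM_BANKS; b++) {
            int marked_cnt[NUM_THREADS] = {0};
            for (request_t *r = bank_queues[b]; r != NULL; r = r->next) {
                if (marked_cnt[r->thread_id] < MARKING_CAP) {
                    r->marked = true;   /* request joins the current batch */
                    marked_cnt[r->thread_id]++;
                }
            }
        }
    }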

Within-Batch Scheduling
- Can use any existing DRAM scheduling policy
  - FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality
- But we also want to preserve intra-thread bank parallelism: service each thread's requests back to back
- How? The scheduler computes a ranking of threads when the batch is formed
  - Higher-ranked threads are prioritized over lower-ranked ones
  - Improves the likelihood that requests from a thread are serviced in parallel by different banks
  - Different threads are prioritized in the same order across ALL banks

How to Rank Threads within a Batch
- The ranking scheme affects both system throughput and fairness
- Maximize system throughput: minimize the average stall-time of threads within the batch
- Minimize unfairness (equalize the slowdown of threads): service threads with inherently low stall-time early in the batch
  - Insight: delaying memory non-intensive threads results in high slowdown
- Shortest stall-time first (shortest job first) ranking
  - Provides optimal system throughput [Smith, 1956]*
  - The controller estimates each thread's stall-time within the batch and ranks threads with shorter stall-time higher

* W.E. Smith, "Various optimizers for single stage production," Naval Research Logistics Quarterly, 1956.

Shortest Stall-Time First Ranking
- Maximum number of marked requests to any bank (max-bank-load): rank the thread with lower max-bank-load higher (~ low stall-time)
- Total number of marked requests (total-load): breaks ties; rank the thread with lower total-load higher

(figure: marked requests from T0-T3 queued at Banks 0-3)

Thread   max-bank-load   total-load
T0       1               3
T1       2               4
T2       2               6
T3       5               9

Ranking: T0 > T1 > T2 > T3
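In code, the ranking reduces to a two-level comparison over per-thread load counters gathered when the batch is formed. A minimal sketch (the struct and function names are assumptions for illustration):

    /* Marked-request counts per thread, gathered at batch formation. */
    typedef struct {
        int max_bank_load;  /* max marked requests this thread has at any bank */
        int total_load;     /* total marked requests across all banks          */
    } thread_load_t;

    /* Shortest stall-time first: lower max-bank-load ranks higher;
       total-load breaks ties. Returns a negative value if thread a
       ranks above thread b (qsort-style three-way compare). */
    int sjf_rank_cmp(const thread_load_t *a, const thread_load_t *b)
    {
        if (a->max_bank_load != b->max_bank_load)
            return a->max_bank_load - b->max_bank_load;
        return a->total_load - b->total_load;
    }

Applied to the table above, this yields T0 (1, 3) > T1 (2, 4) > T2 (2, 6) > T3 (5, 9); T1 and T2 tie on max-bank-load, and the total-load tie-breaker ranks T1 higher.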

Example Within-Batch Scheduling Order (figure) — Under the baseline (arrival-order) schedule, the four threads' marked requests are interleaved across Banks 0-3, giving stall times of 4, 4, 5, and 7 bank access latencies for T0-T3 (AVG: 5 bank access latencies). Under the PAR-BS schedule with ranking T0 > T1 > T2 > T3, each thread's requests are serviced back to back, giving stall times of 1, 2, 4, and 7 (AVG: 3.5 bank access latencies).

Putting It Together: PAR-BS Scheduling Policy
- PAR-BS scheduling policy:
  (1) Marked requests first (batching)
  (2) Row-hit requests first
  (3) Higher-rank thread first (shortest stall-time first)
  (4) Oldest first
  ((2)-(4) implement parallelism-aware within-batch scheduling)
- Three properties:
  - Exploits row-buffer locality and intra-thread bank parallelism
  - Work-conserving: services unmarked requests to banks without marked requests
  - Marking-Cap is important: too small a cap destroys row-buffer locality; too large a cap penalizes memory non-intensive threads

Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling," ISCA 2008.
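The full prioritization order is just a four-level comparator. A minimal sketch (the request fields and function name are assumptions; rank 0 is taken to be the highest rank):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     marked;       /* in the current batch?                       */
        bool     row_hit;      /* targets the open row in its bank?           */
        int      thread_rank;  /* 0 = highest (shortest stall-time first)     */
        uint64_t arrival_time;
    } parbs_request_t;

    /* Returns true if request a should be scheduled before b under PAR-BS. */
    bool parbs_higher_priority(const parbs_request_t *a,
                               const parbs_request_t *b)
    {
        if (a->marked != b->marked)
            return a->marked;                      /* (1) marked first   */
        if (a->row_hit != b->row_hit)
            return a->row_hit;                     /* (2) row-hit first  */
        if (a->thread_rank != b->thread_rank)
            return a->thread_rank < b->thread_rank;/* (3) higher rank    */
        return a->arrival_time < b->arrival_time;  /* (4) oldest first   */
    }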

Hardware Cost
- <1.5KB storage cost for an 8-core system with a 128-entry memory request buffer
- No complex operations (e.g., divisions)
- Not on the critical path: the scheduler makes a decision only every DRAM cycle

Unfairness on 4-, 8-, and 16-Core Systems (figure) — Unfairness = MAX Memory Slowdown / MIN Memory Slowdown (lower is better) [MICRO 2007]. The chart compares FR-FCFS, FCFS, NFQ, STFM, and PAR-BS; PAR-BS achieves the lowest unfairness across the 4-, 8-, and 16-core configurations.

System Performance (figure) — Normalized harmonic-mean speedup (higher is better) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS provides the highest system performance of the five schedulers.

PAR-BS Pros and Cons
- Upsides:
  - First scheduler to address bank parallelism destruction across multiple threads
  - Simple mechanism (vs. STFM)
  - Batching provides fairness
  - Ranking enables parallelism awareness
- Downsides:
  - Implementation across multiple controllers needs coordination for best performance → coordination becomes too frequent, since batching is done frequently
  - Does not always prioritize latency-sensitive applications