Constructing a Weak Memory Model

Sizhuo Zhang (1), Muralidaran Vijayaraghavan (1), Andrew Wright (1), Mehdi Alipour (2), Arvind (1)
(1) MIT CSAIL, (2) Uppsala University
ISCA 2018, Los Angeles, CA, 06/04/2018

The Specter of Weak Memory Models
Architects find SC constraining, and have unleashed the specter of weak memory models on the world: POWER, Alpha, ARM, RMO. Programmers have never asked for weak memory models.
Two major questions arise about weak memory models. First, do weak memory models improve PPA (performance/power/area) over strong models? This is highly controversial. Second, is there a common semantic base for weak memory models? This paper answers the second question.

Importance of a Common Base for Weak Memory Models
Even experts cannot agree on the precise definitions of different weak models, or on the differences between them.
Example: researchers have been formalizing the definitions of commercial weak memory models empirically, i.e., by coming up with models that match the behaviors observed on commercial machines. As more experiments are performed on those machines, either unexpected behaviors show up or expected behaviors never arise, so the models keep being revised:
POWER: [PLDI 2011] -> [TOPLAS 2014]
ARM: [POPL 2016] -> [POPL 2018]
A common base model helps us understand the nature of weak models without being drowned in the details of commercial machines or ISAs.
[PLDI 2011] Understanding POWER Multiprocessors
[TOPLAS 2014] Herding Cats: Modelling, Simulation, Testing, and Data-mining for Weak Memory
[POPL 2016] Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA
[POPL 2018] Simplifying ARM Concurrency: Multicopy-Atomic Axiomatic and Operational Models for ARMv8

Executive Summary
Our approach to constructing the base memory model, GAM (General Atomic Memory Model):
Step 1: Minimum ordering constraints of a uniprocessor.
Step 2: Additional constraints when processors are connected by an atomic shared memory system*.
Step 3: Optional ordering constraints to match programmers' expectations while ensuring equivalent axiomatic and operational definitions.
*An atomic memory system can be abstracted to a monolithic memory that executes loads and stores instantaneously and forms a natural total order. SC, TSO, PSO, RMO, Alpha, and ARMv8 use atomic memory, while ARMv7 and POWER do not.
Performance evaluation: GAM has similar performance to other weak models, and some optional ordering constraints have little impact on performance.

Meaning of Reorderings in Uniprocessor
[Figure: processor with Fetch, Reorder Buffer (ROB), Load Buffer (LB), and Store Buffer (SB); load and store requests/responses go to the L1 cache of a write-back cache hierarchy and memory.]
Memory execution order differs from commit order.
Commit order: the order in which instructions are committed from the ROB.
Execution order: the order of the times at which instructions finish execution.
A non-memory instruction finishes execution by doing its computation. A load finishes execution by reading its value from L1 or the SB. A store finishes execution by writing its data into L1.
If a store has forwarded its data to a younger load, then the load appears before the store in the execution order.
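
The distinction between the two orders can be sketched in a few lines of Python. This is an illustrative toy, not the authors' model: the `Instr` record, the `commit_idx` field, and the cycle numbers in `finish_time` are all assumptions made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str         # label such as "I1"
    commit_idx: int   # position in commit order (hypothetical field)
    finish_time: int  # hypothetical cycle at which execution finishes

def commit_order(instrs):
    """Instruction labels sorted by the order they leave the ROB."""
    return [i.name for i in sorted(instrs, key=lambda i: i.commit_idx)]

def execution_order(instrs):
    """Instruction labels sorted by the time they finish execution."""
    return [i.name for i in sorted(instrs, key=lambda i: i.finish_time)]

# I1: St [a] 1 finishes only when its data is written into L1 (cycle 9 here).
# I2: r1 = Ld [a] is younger, but takes its value from the store buffer
# at cycle 3, so it finishes first.
trace = [Instr("I1", 0, 9), Instr("I2", 1, 3)]
print(commit_order(trace))     # ['I1', 'I2']
print(execution_order(trace))  # ['I2', 'I1']
```

The forwarded load precedes its forwarding store in execution order even though it follows it in commit order.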

Uniprocessor Ordering Constraints
Ordering constraints are derived from two aspects.
First, to maintain single-thread correctness, some memory instructions for the same address must be ordered.
Second, the execution of some instructions cannot start until older instructions have finished execution: speculative stores cannot be sent to the memory system, and no instruction can start execution until its source operands are ready (which rules out value speculation).

Uniprocessor Constraints for Same-Address Memory Instructions
  I1: r1 = Ld [a]        I1: St [a] 1
  I2: St [a] 1           I2: St [a] 2
GAM Constraint: A younger store cannot be reordered with an older memory instruction to the same address.
  I1: r1 = Ld [a]
  I2: r2 = Ld [a]
No ordering constraints for two loads for the same address.
  I1: r1 = r2 + 1
  I2: St [a] r1
  I3: r1 = Ld [a]        (I2 forwards its data to I3)
GAM Constraint: A younger load cannot be reordered with the instruction that produces the address or data of the immediately preceding store for the same address.
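
The same-address rules on this slide can be written as a small predicate. The sketch below is an assumption-laden illustration rather than the paper's definition: the `Instr` record, symbolic addresses, and the helper names `producer` and `same_addr_ordered` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    kind: str                    # "ld", "st", or "alu"
    addr: Optional[str] = None   # symbolic address for ld/st
    dst: Optional[str] = None    # destination register, if any
    srcs: Tuple[str, ...] = ()   # registers read (address and data operands)

def producer(prog, k, reg):
    """Index of the youngest instruction older than prog[k] that writes reg."""
    for i in range(k - 1, -1, -1):
        if prog[i].dst == reg:
            return i
    return None

def same_addr_ordered(prog, i, j):
    """True if the same-address rules keep prog[i] before prog[j] (i < j)."""
    old, young = prog[i], prog[j]
    # Younger store vs. older memory instruction to the same address.
    if young.kind == "st" and old.kind in ("ld", "st") and old.addr == young.addr:
        return True
    # Younger load vs. the producer of the address or data of the
    # immediately preceding store to the same address.
    if young.kind == "ld":
        for k in range(j - 1, -1, -1):
            if prog[k].kind == "st" and prog[k].addr == young.addr:
                return any(producer(prog, k, r) == i for r in prog[k].srcs)
    return False

# I1: r1 = r2 + 1; I2: St [a] r1; I3: r1 = Ld [a]
fwd = [Instr("alu", dst="r1", srcs=("r2",)),
       Instr("st", addr="a", srcs=("r1",)),
       Instr("ld", addr="a", dst="r1")]
print(same_addr_ordered(fwd, 0, 2))  # True: I3 must wait for I1
print(same_addr_ordered(fwd, 1, 2))  # False: I3 may forward from I2
```

Two loads to the same address fall through to `False`, matching the "no ordering constraints" case at this step of the construction.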

Uniprocessor Constraints for Starting Execution
  I1: r1 = Ld [a]
  I2: r2 = Ld [r1]
GAM Constraint: Cannot reorder instructions with register read-after-write dependencies.
  I1: if(r1==0) goto somewhere
  I2: St [a] 1
GAM Constraint: A younger store cannot be reordered with any older branch.
  I1: r1 = 1+99
  I2: r2 = Ld [a+r1]
  I3: St [a] 1
GAM Constraint: A younger store cannot be reordered with an instruction that produces the address of any older memory instruction.
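
The three constraints on this slide can be checked mechanically on a short instruction sequence. The following is a minimal sketch under stated assumptions: the `Instr` record, the split into `addr_srcs` and `data_srcs`, and the function names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    kind: str                         # "ld", "st", "alu", or "br"
    dst: Optional[str] = None         # destination register, if any
    addr_srcs: Tuple[str, ...] = ()   # registers used to form a memory address
    data_srcs: Tuple[str, ...] = ()   # all other register operands

def last_writer(prog, k, reg):
    """Index of the youngest instruction older than prog[k] that writes reg."""
    for i in range(k - 1, -1, -1):
        if prog[i].dst == reg:
            return i
    return None

def start_ordered(prog, i, j):
    """True if prog[j] cannot start before prog[i] has finished (i < j)."""
    old, young = prog[i], prog[j]
    # Register read-after-write dependency.
    if any(last_writer(prog, j, r) == i
           for r in young.addr_srcs + young.data_srcs):
        return True
    if young.kind == "st":
        # A younger store waits for every older branch.
        if old.kind == "br":
            return True
        # A younger store waits for producers of older memory addresses.
        for k in range(i + 1, j):
            if prog[k].kind in ("ld", "st") and any(
                    last_writer(prog, k, r) == i for r in prog[k].addr_srcs):
                return True
    return False

# I1: r1 = 1+99; I2: r2 = Ld [a+r1]; I3: St [a] 1
prog = [Instr("alu", dst="r1"),
        Instr("ld", dst="r2", addr_srcs=("r1",)),
        Instr("st")]
print(start_ordered(prog, 0, 1))  # True: register dependency on r1
print(start_ordered(prog, 0, 2))  # True: I1 produces the address of older I2
```

The store in the last example has to wait for I1 because, until I2's address is known, the hardware cannot tell whether I2 and I3 touch the same location.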

Additional Constraints on Load Values in Multiprocessor
[Figure: Proc. 1 through Proc. n, each with its own execution order and L1 cache, above a write-back cache hierarchy and memory.]
An atomic memory system can be abstracted to a monolithic memory, which gives a total (atomic) memory order on all instructions that access memory. A load can get its value either from memory or by bypassing from the SB.
Reading from memory: the value is determined by the atomic memory order. The atomic memory order must be consistent with the execution order of each processor.
Bypassing from the SB: the forwarding store is the immediately preceding store for the same address in commit order. The load always appears before the forwarding store in the execution order.
Execution order must also incorporate the constraints imposed by fences.

Putting it Together: Global Memory Order
Define the global memory order as the order of execution finish times of all memory instructions. This is the join of all execution orders and the atomic memory order of all memory instructions.
One load-value constraint covers both cases (similar to RMO and Alpha).
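
A single load-value rule covering both the read-from-memory and the bypass case might look like the sketch below. It assumes the rule has this form: a load reads the latest store in global memory order among the same-address stores that precede it either in global memory order or in its own processor's commit order. That formulation, the `MemOp` record, and the `load_value` helper are assumptions for illustration, not text from the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemOp:
    proc: int      # issuing processor
    idx: int       # position in that processor's commit order
    kind: str      # "ld" or "st"
    addr: str      # symbolic address
    val: int = 0   # value written (stores only)

def load_value(mo, load, init=0):
    """Value read by `load`, given the global memory order `mo` (a list)."""
    pos = {op: n for n, op in enumerate(mo)}
    visible = [
        s for s in mo
        if s.kind == "st" and s.addr == load.addr
        and (pos[s] < pos[load]                              # read from memory
             or (s.proc == load.proc and s.idx < load.idx))  # bypass from SB
    ]
    # The youngest visible store in memory order supplies the value.
    return max(visible, key=pos.get).val if visible else init

st = MemOp(0, 0, "st", "a", 1)
ld_local = MemOp(0, 1, "ld", "a")    # same processor, forwarded from the SB
ld_remote = MemOp(1, 0, "ld", "a")   # another processor
mo = [ld_local, ld_remote, st]       # both loads finish before the store
print(load_value(mo, ld_local))      # 1: sees its own older store
print(load_value(mo, ld_remote))     # 0: the store is not yet in memory
```

The local load precedes the store in memory order yet still reads it, which is exactly the bypass case.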

Programmers' Intuition: Same-Address Loads
  Proc. 1             Proc. 2
  I1: St [a] 1        I2: r1 = Ld [a] (=1)
                      I3: r2 = Ld [a] (=0)
Programmers do not expect the second load to return a stale value, which calls for a stricter constraint.
There are three choices for the reordering of two consecutive loads for the same address; this is the major difference among RMO, ARM, and GAM.
No ordering constraint (e.g., RMO): does not match programmers' intuition.
Only allow reordering when the two loads read from the same store (e.g., ARM): matches programmers' intuition in the example shown, but can lead to confusing behaviors in other cases (see the paper for an example).
Cannot be reordered (e.g., GAM): avoids all confusing behaviors.
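
The litmus test above is small enough to enumerate exhaustively. The sketch below assumes a monolithic memory and models only the two extreme choices (loads freely reorderable, as in RMO, and loads never reorderable, as in GAM); the ARM rule is not modeled. The function name `outcomes` and its flag are hypothetical.

```python
from itertools import permutations

def outcomes(keep_load_order):
    """All (r1, r2) results of the litmus test over a monolithic memory."""
    # Proc. 1: St [a] 1.  Proc. 2: Ld1 (r1) then Ld2 (r2), both from [a].
    ops = ["St", "Ld1", "Ld2"]
    results = set()
    for order in permutations(ops):
        # With the stricter constraint, Ld1 must execute before Ld2.
        if keep_load_order and order.index("Ld1") > order.index("Ld2"):
            continue
        mem, regs = 0, {}
        for op in order:
            if op == "St":
                mem = 1
            else:
                regs[op] = mem
        results.add((regs["Ld1"], regs["Ld2"]))
    return results

print((1, 0) in outcomes(keep_load_order=False))  # True: stale second load
print((1, 0) in outcomes(keep_load_order=True))   # False: behavior forbidden
```

Only the reordered execution lets the first load see the new value while the second load still sees the old one.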

Performance Overheads in Ordering Same-Address Loads
GAM: Stall load issue (i.e., the start of execution) if the load cannot forward from a store younger than an unissued older load. When a load is issued, kill younger loads that have been issued to memory or have had data forwarded from a store older than the issuing load.
ARM: When a load finishes reading memory, kill younger loads whose values have been overwritten by other processors. Optionally stall load issue as in GAM.
A single-threaded application is enough if the goal is to show that the performance overhead of GAM is negligible. Kills and stalls in GAM are not caused by stores from other processors, and a single thread is the best case for ARM: with no overwrite by another processor, there is no kill.
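
The GAM stall condition can be read as a scan over older load/store queue entries. The sketch below is one interpretation, not the evaluated hardware: it assumes all addresses are already resolved, that any older same-address store can forward its data, and that only same-address entries matter. The `Entry` record and `must_stall` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    kind: str             # "ld" or "st"
    addr: str             # resolved address (unresolved ones are not modeled)
    issued: bool = False  # for loads: has the load started execution?

def must_stall(lsq, j):
    """lsq is in commit order, oldest first; lsq[j] is the load to issue."""
    load = lsq[j]
    # Walk from the youngest older entry back to the oldest.
    for k in range(j - 1, -1, -1):
        older = lsq[k]
        if older.addr != load.addr:
            continue
        if older.kind == "st":
            # The load can forward from a store that is younger than any
            # older load still waiting, so no stall is needed.
            return False
        if not older.issued:
            # An unissued older same-address load with no store in between.
            return True
    return False

print(must_stall([Entry("ld", "a"), Entry("ld", "a")], 1))                    # True
print(must_stall([Entry("ld", "a"), Entry("st", "a"), Entry("ld", "a")], 2))  # False
```

Nothing in this check depends on other processors, which is consistent with the slide's argument that a single-threaded workload suffices to measure the GAM overhead.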

Evaluation Results of Same-Address Load-Load Ordering
Modelling RMO, ARM, and GAM using GEM5 with a Haswell-like OOO core.
All SPECCPU ref inputs: take 10 checkpoints for each input and simulate 100M instructions from each checkpoint.

Performance improvement over GAM
          Average   Max
  ARM     0.2%      2.3%
  RMO     0.2%      2.3%

Stalls/kills by same-address load-load ordering (number of events per 1000 micro-ops)
                         Average   Max
  Kills in GAM           0.2       3.24
  Issue stalls in GAM    0.19      2.15
  Issue stalls in ARM    0.19      2.15

The performance impact caused by same-address load-load ordering is negligible.

Programmers' Intuition: Data-Dependent Loads
  Proc. 1             Proc. 2
  I1: St [a] 1        I4: r1 = Ld [b] (=a)
  I2: FenceSS         I5: r2 = Ld [r1] (=0)
  I3: St [b] a
All memory models except Alpha forbid this behavior. Programmers expect data-dependent loads to be ordered, and GAM has already enforced this ordering.
Enforcing this ordering may bring restrictions on implementations: value prediction, load-to-load forwarding, and coherence protocols that delay cache invalidations (e.g., VIPS, Tardis).
A comprehensive performance evaluation is hard: it must weigh implementation restrictions against extra fences.
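
The effect of dependency ordering on this litmus test can also be enumerated. The sketch below makes several assumptions that are not in the slides: memory is monolithic, location b initially holds the address of a hypothetical third location c, and an unordered dependent load runs early with a predicted address of a, surviving only if the prediction turns out to be correct. The `outcomes` function and the operation names are hypothetical.

```python
from itertools import permutations

def outcomes(enforce_dependency):
    """All (r1, r2) results; r1 is an address name, r2 the value loaded."""
    # StA: St [a] 1, StB: St [b] a, LdB: r1 = Ld [b], LdDep: r2 = Ld [r1]
    ops = ["StA", "StB", "LdB", "LdDep"]
    results = set()
    for order in permutations(ops):
        if order.index("StA") > order.index("StB"):
            continue  # FenceSS keeps the two stores in order
        dep_first = order.index("LdDep") < order.index("LdB")
        if enforce_dependency and dep_first:
            continue  # the dependent load may not run ahead of its producer
        mem = {"a": 0, "b": "c", "c": 0}  # "c" is an assumed initial target
        r1 = r2 = None
        for op in order:
            if op == "StA":
                mem["a"] = 1
            elif op == "StB":
                mem["b"] = "a"
            elif op == "LdB":
                r1 = mem["b"]
            else:
                # Run early with predicted address "a", or use the real r1.
                r2 = mem["a"] if dep_first else mem[r1]
        if dep_first and r1 != "a":
            continue  # misprediction: this execution would be squashed
        results.add((r1, r2))
    return results

print(("a", 0) in outcomes(enforce_dependency=False))  # True
print(("a", 0) in outcomes(enforce_dependency=True))   # False
```

With the dependency enforced, reading the new pointer always implies reading the new data, which is the behavior programmers expect.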

Equivalent Axiomatic and Operational Models of GAM
Axiomatic model: the constraints are stated as axioms.
Operational model: models speculative execution; memory accesses are instantaneous.
[Figure: Proc. 1 ROB through Proc. n ROB, connected to a monolithic memory.]

Conclusion
Constructed the common base memory model, GAM. RMO, ARM, and RISC-V differ in same-address load-load ordering; Alpha differs in data-dependency ordering.
The constraint on same-address load-load ordering is a major difference between these memory models, but it has little impact on performance.
Data-dependency ordering is a feature expected by programmers, but it may constrain implementations; its performance implications need further evaluation.

Questions?