TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model

Similar documents
NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

Module 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency

Memory Consistency Models

Distributed Operating Systems Memory Consistency

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Testing Implementations of Transactional Memory

Overview: Memory Consistency

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

Portland State University ECE 588/688. Memory Consistency Models

Multiprocessor Synchronization

Shared Memory Consistency Models: A Tutorial

Lamport Clocks: Verifying A Directory Cache-Coherence Protocol. Computer Sciences Department

Fast Complete Memory Consistency Verification

Sequential Consistency & TSO. Subtitle

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

Beyond Sequential Consistency: Relaxed Memory Models

Relaxed Memory-Consistency Models

Motivations. Shared Memory Consistency Models. Optimizations for Performance. Memory Consistency

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important

4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins

A Memory Model for RISC-V

Advanced Operating Systems (CS 202)

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Using Relaxed Consistency Models

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CSE502: Computer Architecture CSE 502: Computer Architecture

ECE/CS 757: Advanced Computer Architecture II

Multiprocessors and Locking

CS533 Concepts of Operating Systems. Jonathan Walpole

EECS 570 Final Exam - SOLUTIONS Winter 2015

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

Computer Science 146. Computer Architecture

Computer Architecture Crash course

Multiprocessor Memory Model Verification

DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Post-Silicon Validation of Multiprocessor Memory Consistency

GPU Concurrency: Weak Behaviours and Programming Assumptions

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

COSC 6385 Computer Architecture - Memory Hierarchy Design (III)

Computer Architecture Spring 2016

Relaxed Memory Consistency

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

1. Memory technology & Hierarchy

Portland State University ECE 588/688. Cray-1 and Cray T3E

Shared Memory Architecture

Memory Consistency and Multiprocessor Performance

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Constructing a Weak Memory Model

Topic C Memory Models

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Lock-Free, Wait-Free and Multi-core Programming. Roger Deran boilerbay.com Fast, Efficient Concurrent Maps AirMap

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Administrivia. p. 1/20

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

Computer Architecture!

Announcements. ECE4750/CS4420 Computer Architecture L17: Memory Model. Edward Suh Computer Systems Laboratory

CSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)

Multiprocessors Should Support Simple Memory Consistency Models

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Consistency Models. ECE/CS 757: Advanced Computer Architecture II. Readings. ECE 752: Advanced Computer Architecture I 1. Instructor:Mikko H Lipasti

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

OPENSPARC T1 OVERVIEW

EECS578 Computer-Aided Design Verification of Digital Systems

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Shared Memory Consistency Models: A Tutorial

Verification of Cache Coherency Formal Test Generation

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

Consistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Shared Memory Consistency Models: A Tutorial

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

Chapter 12. CPU Structure and Function. Yonsei University

CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency

Portland State University ECE 587/687. Memory Ordering

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Lamport Clocks: Reasoning About Shared Memory Correctness 1

RELAXED CONSISTENCY 1

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Hardware Support for NVM Programming

Computer Architecture

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

CS252 Spring 2017 Graduate Computer Architecture. Lecture 14: Multithreading Part 2 Synchronization 1

The Cache-Coherence Problem

EE382 Processor Design. Processor Issues for MP

RAMP-White / FAST-MP

Transcription:

TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model Sudheendra Hangal, Durgam Vahia, Chaiyasit Manovit, Joseph Lu and Sridhar Narayanan tsotool@sun.com ISCA-2004 Sun Microsystems India Pvt. Ltd., Bangalore (India) Sun Microsystems, Sunnyvale, CA (USA) P1

Goal: Find the hard memory-related bugs in Sun's multiprocessor systems P2

Motivation Many bugs in MP memory subsystem Hard to find, hard to debug Require silicon revs; impact time-to-market; cost $$ Almost all Sun systems are MP (on-a-chip) Throughput computing: SMT, CMP, CMT Now it gets interesting P3

Why is MP Verification Hard? Many elements in memory hierarchy Asynchronous load-store units, L1,L2,L3 caches, prefetchers, bus protocols, system interconnects, memory controllers, DRAM technologies, etc. Many optimizations For performance & scalability Many derivative processors & systems Each makes changes to the memory system P4

Prior Methodology Use a reference model to check design Can't fully check pseudo-random MP programs with data races; only for obvious problems Creating 100% cycle-accurate reference is hard False Sharing Within a $-line, CPUs write non-overlapping bytes Specific MP idioms Litmus tests, MP code for mutexes, locks, etc. P5

TSOtool Methodology Create a short, pseudo-random program with intense sharing Hopefully, hit corner cases faster Analyze correctness of observed architectural results w.r.t. memory model Microarchitecture agnostic No dependence on simulation observations Faster simulation (big win for h/w accelerators) Use observability if available P6

Memory Consistency Model Contract between programmer system A formal specification of how memory appears to behave to a programmer. Examples: SC, PC, TSO, PSO, RMO, RC, etc. Challenge to correctly implement for architects, designers, system programmers,... All Sun systems support TSO Ref: Adve and Gharachorloo's tutorial http://citeseer.nj.nec.com/adve95shared.html P7

TSO Specification Loosely speaking: Instructions appear to execute in program order EXCEPT S->L order relaxed; loads may overtake prior stores on same CPU Store is visible on same processor before others (Allows store buffer bypass) P8

Notation 2 orders: ';' (program order) & ' ' (global order) 4 operations: L for loads, S for stores [L;S] for swaps L X i S Y j [ L Z k ; S Z k ] M Load to address X on processor i Store to address Y on processor j Atomic swap to address Z on processor k Memory barrier P9

TSO Formal Specification LoadOp StoreStore Order Termination Atomicity Value i L X i S X S X i S X i i i i ;Op Y L X Op Y i i ; S Y S X S Y i S j Y S j i Y S X j L X ; L j j i X L X ; suchthat S X i i i [ L X ; S X ] L X S X i L X j S j Y : S j i i Y L X S X S j Y i Val[ L X ]=Val[ MAX { S k X S k i i i i X L X S X S X ; L X }] Membar Op 1 ; M ;Op 2 Op 1 Op 2 Ref: Formal specifications of Memory models, P. Sindhu et al (Xerox PARC) P10

TSOtool Usage Other Generators TSOtool Generator Bias file Block-level testbenches SPARC Program Bus Agents MP System or MP Simulation (Software/Accelerated) Program results TSOtool Analyzer User provided TSOtool program or generated OK Reason Not OK P11

TSOtool Test Generation Real-world instruction set (SPARC) All operations related to memory LD,DWLD,QWLD,BLD,AQLD,PREFETCH ST,DWST,QWST,BST,BSTC SWAP,CAS,CASX, MEMBAR, FLUSH Branches, ASI, Replacements, Interrupts, Macros,... All stores write a unique value read-mapping : Map(L) = S iff Val[L] = Val[S] P12

TSOtool Analysis Is result compatible with TSO axioms? Represent loads/stores as nodes in a graph Add edges to create constraints on If graph has a cycle, is not a valid order If graph has no cycle, is an order compatible with program results and axioms No false failures But not guaranteed to catch a true failure P13

Analysis Algorithm Add Edges L ; Op L Op, S ; S' S S' (LdOp and StoreStore Axioms) S ; M ; L S L (Membar Axiom) Op S [L;S] Op L (Atomicity Axiom) L Op [L;S] S Op (Atomicity Axiom) all Ops to the same address: Val[L] = Val[S] S;L S L (Value Axiom) Val[L] = Val[S] S';L S' S (Value Axiom) Val[L] = Val[S] S' L S' S (Value Axiom) Val[L] = Val[S] S S' L S' (Value Axiom) Last 2 steps used on L.H.S. (but we're deriving ) So, iterate over these steps till fixed point P14

Analysis Example (Cycle) P0 P1 P2 ST [Y] = 2 ST [Y] = 3 ST [X] = 1 LD [X] = 1 LD [Y] = 3 LD [Y] = 3 LD [Y] = 2 P15

Analysis Example (Cycle) P0 P1 P2 ST [Y] = 2 ST [Y] = 3 ST [X] = 1 LD [X] = 1 LD [Y] = 3 LD [Y] = 3 LD [Y] = 2 P16

Analysis Example (Cycle) P0 P1 P2 ST [Y] = 2 ST [Y] = 3 ST [X] = 1 LD [X] = 1 LD [Y] = 3 LD [Y] = 3 LD [Y] = 2 P17

Analysis Example (Cycle) P0 P1 P2 ST [Y] = 2 ST [Y] = 3 ST [X] = 1 LD [X] = 1 LD [Y] = 3 LD [Y] = 3 LD [Y] = 2 P18

Bugs Found Run on all recent Sun CPUs and systems Many problems caught early (Data corruption, atomicity violations, invalid instruction reordering) Lost tag write to Write cache Lock for atomic instruction released too early Prefetch cache missed an invalidate Ordering between cacheable and non-cacheable queues DRAM controller corrupted speculative load request Software emulation routines P19

Analysis Complexity Complexity results for Verifying SC [GK94] n is # memory operations NP-Complete for unlimited # of processors (even if read-mapping is known) n^k for k processors TSO version also NP-Complete TSOtool analysis is polynomial time See paper for details P20

A Missed Relation Write Order axiom may not be satisfied ST [Y] = 3 ST [Y] = 4 ST [X] = 1 ST [X] = 2 LD [Y] = 3 LD [Y] = 4 P21

Summary TSOtool finds hard bugs in real designs Incomplete algorithm is useful Allows checking of pseudo-random MP with races Exposes failures with shorter tests Microarchitecture independent Useful pre-si, post-si P22

Related Work Complexity Proofs [Gibbons, Korach] Complexity of verifying SC (VSC) and flavors [Cantin] Verifying Memory Coherence (VMC) and flavors Testing Approaches Industrial Design Groups: Sun, Intel, IBM, HP, etc. [Collier] Litmus tests (ARCHTEST) Protocol Verification [Sorin et al] Lamport clocks for protocol verification Formal Methods [Nalumasu et al] Test model checking [Park, Dill] Executable specification Constraint Graphs [Landin et al] Access Graphs [Cain et al] Constraint graph analysis P23

Questions? tsotool@sun.com P24

Backup slides P25

TSOtool Test Generation (2) Real-world environments MP systems (with OS), simulation models (full-chip, block-level, high-level/rtl/gate, accelerated) Every load value is saved In program (some perturbation) Directly from simulation if possible Users control bias Typically few shared addresses (1-1000) Typically few instructions (100-100K) P26