Productive Design of Extensible Cache Coherence Protocols!

Similar documents
Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture: Consistency Models, TM

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Synchronization. CSCI 5103 Operating Systems. Semaphore Usage. Bounded-Buffer Problem With Semaphores. Monitors Other Approaches

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Computer Architecture

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

A Data-flow Approach to Solving the von Neumann Bottlenecks

Verilog Tutorial. Verilog Fundamentals. Originally designers used manual translation + bread boards for verification

Improving the Practicality of Transactional Memory

Verilog Tutorial 9/28/2015. Verilog Fundamentals. Originally designers used manual translation + bread boards for verification

Transactional Memory

Spring Prof. Hyesoon Kim

The Bulk Multicore Architecture for Programmability

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Kami: A Framework for (RISC-V) HW Verification

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

Diplomatic Design Patterns

4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins

Foundations of the C++ Concurrency Memory Model

6 Transactional Memory. Robert Mullins

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

WaveScalar. Winter 2006 CSE WaveScalar 1

Lecture 06: Rule Scheduling

Michael Adler 2017/05

Flexible Architecture Research Machine (FARM)

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

Verification of Cache Coherency Formal Test Generation

1. Memory technology & Hierarchy

Potential violations of Serializability: Example 1

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

Mohamed M. Saad & Binoy Ravindran

Reminder from last time

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

CS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

Lecture: Transactional Memory. Topics: TM implementations

Multiprocessors & Thread Level Parallelism

Chapter 5 Concurrency: Mutual Exclusion. and. Synchronization. Operating Systems: Internals. and. Design Principles

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano

wait with priority An enhanced version of the wait operation accepts an optional priority argument:

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9

Handout 3 Multiprocessor and thread level parallelism

Program logics for relaxed consistency

Transactum Business Process Manager with High-Performance Elastic Scaling. November 2011 Ivan Klianev

Scalable Cache Coherence

Cache Coherence and Atomic Operations in Hardware

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Chisel: Constructing Hardware In a Scala Embedded Language

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

A Basic Snooping-Based Multi-processor

CS 152 Laboratory Exercise 5 (Version C)

Operating Systems (2INC0) 2018/19. Introduction (01) Dr. Tanir Ozcelebi. Courtesy of Prof. Dr. Johan Lukkien. System Architecture and Networking Group

Chapter 5. Multiprocessors and Thread-Level Parallelism

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

The Java Memory Model

Multiprocessors and Locking

Sistemas Operacionais I. Valeria Menezes Bastos

Cache Coherence in Bus-Based Shared Memory Multiprocessors

Programming Kotlin. Familiarize yourself with all of Kotlin s features with this in-depth guide. Stephen Samuel Stefan Bocutiu BIRMINGHAM - MUMBAI

EC 513 Computer Architecture

Chí Cao Minh 28 May 2008

Monitors; Software Transactional Memory

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Cache Coherence in Scalable Machines

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Bluespec-4: Rule Scheduling and Synthesis. Synthesis: From State & Rules into Synchronous FSMs

Monitors; Software Transactional Memory

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Manchester University Transactions for Scala

Atomic Coherence in Gem5 with a Neural Network Application

Limitations of parallel processing

Software Architecture

Comparing Memory Systems for Chip Multiprocessors

IT 540 Operating Systems ECE519 Advanced Operating Systems

Chapter 5 Concurrency: Mutual Exclusion and Synchronization

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Transcription:

P A R A L L E L C O M P U T I N G L A B O R A T O R Y Productive Design of Extensible Cache Coherence Protocols!! Henry Cook, Jonathan Bachrach,! Krste Asanovic! Par Lab Summer Retreat, Santa Cruz June 1, 2011 1

Background Cache coherence is important! Not just functionality, but performance and energy! Major implications for programming models! Cache coherence is difficult! To implement and verify a single protocol! To explore design space of multiple protocols! Hardware design in general is not productive! Often lacking modularity, extensibility, composability! 2

Motivating Hypotheses Hypothesis: We can write protocols using succinct, declarative descriptions, and generate effective hardware implementations! Produce verified implementations from verified specifications! Experiment with more designs! Hypothesis: Customization of protocol behavior is important for energy efficiency! On a per-motif or per-specializer basis! Heterogeneity in memory hierarchy! 3

Chisel Constructing Hardware In a Scala Embedded Language! Embed a hardware-description language in Scala, using Scalaʼs extension facilities! A hardware module is just a data structure in Scala! Different backends can generate different types of output (C, Verilog) from same Chisel representation! Full power of Scala for writing hardware generators! Object-Oriented: Factory objects, traits, overloading etc! Functional: Higher-order funcs, anonymous funcs, currying! Compiles to JVM: Good performance, Java interoperability! 4

Chisel Vision Apply the best of SW design practices to HW design! Write reusable modules! Capture design patterns as generators! Declarative design and search! Write it the way you do on the whiteboard! 5

Coherence on the Whiteboard Three views! First view: coherence protocol as abstract state machine! Node types! States! Invariants! 6

Coherence on the Whiteboard Second view: coherence protocol as set of sequences of request/reply messages! Set of all sequences! Order of messages in each sequence! Messages, payloads! 7

Coherence on the Whiteboard Third view: coherence protocol as rule tables! Given state and input, emit messages and update state! Set of rules for each node type! 8

Complexity Concerns We can verify this protocolʼs rules, but are there additional sources of complexity?! Turning message sequences into transactions! Making multi-step, intra-node behavior atomic! 9

Are we still on the whiteboard? Complexity at the inter-node message sequence level! Interleave messages re: particular block! Add transient/busy states to protocol! Handle races! Provide write serializability and atomicity! Avoid deadlock, livelock, starvation! Address externalities: type of network used, amount of message buffering available! 10

Are we still on the whiteboard? Complexity at the intra-node atomicity level! Arbitrating for finite number of SRAM ports! Dequeuing and buffering requests! Enqueuing requests and responses! Filling and draining MSHRs! Multi-cycle ops with potentially conflicting updates lead to additional transient states! Any modification to deal with the above (or area/timing constraints) could render original verification work useless! 11

Current Focus Address intra-node complexity using BlueChisel, a declarative, embedded DSL built on top of Chisel! Goal: Generate not only control logic for protocol-defined activity but also:! Arbitration logic for access to SRAM ports! Skid buffers and queuing logic! Logic implementing intermediate/transient protocol states! 12

Inspiration: Bluespec High-level, functional HDL compiled to a term rewriting system and translated into HW! Natural way to describe many HW devices! Understandable, well-defined semantics! Conditional atomic execution of state updates, based on rules! Guarded atomic actions! Scheduler dynamically tries to fire as many as possible! Limitations:! In general, guarded atomic actions are a productive abstraction in some cases, but not in others! Rules can only express actions that take single cycle! 13

BlueChisel Core functionality: conditional evaluation of rules leads to atomic state updates!! For cache coherence, automatically generate:! Extra transient/implicit protocol states! Additional rules to govern multi-cycle operations! Fairness and rule priority with urgency annotations! Extension built on top of Chisel, which we choose to apply where appropriate! 14

BlueCHISEL Inputs and Outputs Inputs: rule ( cond ) { updates }! Ouputs:! Old States Rules Condition Action Condition Action CAN FIRE Scheduler WILL FIRE New States Condition Action Muxing 15

Hardware Integration Cache controller protocol rule engine Directory controller protocol rule engine Dir 16

Design Flow Abstract State Machines Independent Message Sequences Per-Node Rule Tables To formal verifier! Race-Free Coherence Transactions Integrated Memory Hierarchy Per-Node Rule Engine Old States Rules CAN FIRE Scheduler WILL FIRE New States Chisel! Conditio n Action Conditio n Action Conditio n Action Muxing 17

Future Work: Extending Chisel Try to address inter-node, message sequence interleaving complexity! Generate sufficient transient/busy states! Compatible physical networks and buffering! High-level composition of new transactions with existing protocols! MESI = MI + E + S +?! Patterns in message sequences! 18

Future Work: Design Flow Abstract State Machines Independent Message Sequences??? Per-Node Rule Tables Race-Free Coherence Transactions Integrated Memory Hierarchy Per-Node Rule Engine Old States Rules CAN FIRE Scheduler WILL FIRE New States Chisel Conditio n Action Conditio n Action Conditio n Action Muxing 19

Future Work: Protocols Protocol extensions that! Exploit SW knowledge via explicit SW->HW directives! Allow for heterogeneous memory hierarchy behavior! Assignment of data to particular state! Clean this cache block! Assignment of data to particular sub-protocol! Keep this block coherent using an update protocol! Exemption of data from any protocol other than SW-directed actions! Only move this block in response to a DMA command! 20

Future Work: Specialization Programs have been found to employ a set of common types of sharing behavior! Write-once, Private, Write-many, Result, Synchronization, Migratory, Producer/Consumer, Read-mostly, Streaming! Sharing behavior is often known by expert or even application programmer! Utilize high-level info instead of reconstructing in HW! SEJITS: emit optimized code from high level abstractions! Code that can control cache behavior according to pattern! Even define user-level protocols! Bloom: Use consistency-analysis to inform decisions about which sub-protocols to employ! 21

Conclusion We can write protocols using succinct, declarative descriptions, and generate effective hardware implementations! Declarative extension to Chisel based on ideas from Bluespec! Produce verified implementations from verified specifications! Experiment with more designs! Explore space of protocols that use SW input to create heterogeneous behavior! 22

Thanks 23

Related Work Murphi! DSL for finite-state analysis/model checking! Teapot! DSL for user-space software coherence protocols! SLICC! DSL for emitting SW modules for GEMS simulator! Bluespec SystemVerilog! HDL based on guarded atomic actions! Bloom! DSL for high-level consistency analysis of distributed parallel algorithms! 24

Consistency For now, left up to enforcement at processor by compiler-issued ISA constructs (e.g. fences)! In the future, would like to consider protocols that exploit very relaxed consistency models (e.g. location consistency)! What patterns? What whiteboard design?! 25

BlueChisel Inputs:!!rule ( cond ) { update }! Outputs:! Rule engines, consisting of scheduler and muxing logic, that conditional modify state! CHISEL internals:! During creation, log rule with component! During elaboration, analyze domain and range of rules to create scheduler and mux code! 26

Applying CHISEL to Intra-node Complexity What are the abstractions?! Data (states, messages, payloads)! Actions (send messages, update states)! Rules! What are the reusable modules?! Collections of rules encapsulated in rule engines! What are the design patterns?! Control logic for queues, arbiters, MSHRs! 27

Flow of Information and Control Applications Pattern Compositions PLL Knowledge Specializations ELL/Runtime Implementation SW-Management Mechanisms Knowledge Allocation Placement Eviction Prefetch Synchronization Bulk Transfers Predictability Isolation Reconfigurability Opt-in Efficiency Functional Correctness Manycore HW Features 28

Data object access patterns Write-once! Initialized but then only read! Best supported by replication (selected portions of large objects)! Private! Need not be managed! On violation, demote/activate management?! Write-many! Frequently modified by multiple threads between synch points! Use delayed-update protocol! Result! Restricted subset of write-many, no reads until all writes complete! Lack of conflicts allows maximum utilization of delayed-update protocol! Synchronization! Distributed locks, atomic operands! Migratory! Read and written by single thread at a time, as object in critical sections of code! Associate w/ lock movement, look for signature pattern! Producer/Consumer! Produced by one thread and consumed by fixed set of other threads! Eager object movement in update protocol! Read-mostly! Replicate and update infrequently via broadcast! Streaming! Read once (or few times), too large to keep in particular level for reuse! General read/write! Default, rare! 29

Case Study 1: SVM Training def train (val, data) def val_compare(x): return compare(x, val) z = map(val_compare, data) data val 30

Case Study 2: Stencils Producer/consumer! Update protocol with proactive transmission of ghost cells to static neighbors! 31

Case Study 3: nbody Migratory! Migrate on read miss rather than replicate! 32

Specialization: Update protocols Known at compile time (prod/con)! Fixed per run (prod/con)! Dynamic (migratory)! 33

MI -> MESI MI MEI MSI MESI 34