Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Similar documents
Lecture: Transactional Memory. Topics: TM implementations

Lecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Lecture 8: Transactional Memory TCC. Topics: lazy implementation (TCC)

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture: Interconnection Networks

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Lecture: Consistency Models, TM

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

Lecture 10: TM Implementations. Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation

Lecture 6: TM Eager Implementations. Topics: Eager conflict detection (LogTM), TM pathologies

Lecture 8: Eager Transactional Memory. Topics: implementation details of eager TM, various TM pathologies

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 3: Flow-Control

TDT Appendix E Interconnection Networks

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Routing Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus)

A More Sophisticated Snooping-Based Multi-Processor

Lecture 17: Transactional Memories I

Advanced Computer Networks. Flow Control

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger

Advanced Computer Networks. Flow Control

Lecture 22: Router Design

EECS 570 Final Exam - SOLUTIONS Winter 2015

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Interconnection Networks

Lecture: Networks, Disks, Datacenters, GPUs. Topics: networks wrap-up, disks and reliability, datacenters, GPU intro (Sections

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Interconnection Networks

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

A Basic Snooping-Based Multi-Processor Implementation

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

Lecture 23: Router Design

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin

Basic Low Level Concepts

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Multiprocessor Cache Coherency. What is Cache Coherence?

Lecture 18: Communication Models and Architectures: Interconnection Networks

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Interconnection topologies (cont.) [ ] In meshes and hypercubes, the average distance increases with the dth root of N.

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal

Deadlock and Router Micro-Architecture

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

Packet Switch Architecture

Flow Control can be viewed as a problem of

Packet Switch Architecture

4. Networks. in parallel computers. Advances in Computer Architecture

Deadlock: Part II. Reading Assignment. Deadlock: A Closer Look. Types of Deadlock

Switching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching.

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Chapter 9 Multiprocessors

A Basic Snooping-Based Multi-Processor Implementation

Routing Algorithms. Review

A Basic Snooping-Based Multi-processor

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

This Lecture. BUS Computer Facilities Network Management. Switching Network. Simple Switching Network

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

ECE 669 Parallel Computer Architecture

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Interconnection Networks

Lecture 15: NoC Innovations. Today: power and performance innovations for NoCs

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization

Concurrent Preliminaries

Lecture 24: Board Notes: Cache Coherency

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Advanced Topic: Efficient Synchronization

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Deadlock and Livelock. Maurizio Palesi

Potential violations of Serializability: Example 1

ES1 An Introduction to On-chip Networks

6.852: Distributed Algorithms Fall, Class 20

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

CSE502: Computer Architecture CSE 502: Computer Architecture

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

Lecture 7: Flow Control - I

NOC Deadlock and Livelock

Computer Science 146. Computer Architecture

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms

Lecture 25: Multiprocessors

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

Outline: Connecting Many Computers

Architecture or Parallel Computers CSC / ECE 506

Transcription:

Lecture: Transactional Memory, Networks Topics: TM implementations, on-chip networks 1

Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock 2

Design Space Data Versioning Eager: based on an undo log Lazy: based on a write buffer Conflict Detection Optimistic detection: check for conflicts at commit time (proceed optimistically thru transaction) Pessimistic detection: every read/write checks for conflicts (reduces work during commit) 3

Lazy Implementation An implementation for a small-scale multiprocessor with a snooping-based protocol Lazy versioning and lazy conflict detection Does not allow transactions to commit in parallel 4

Lazy Implementation When a transaction issues a read, fetch the block in read-only mode (if not already in cache) and set the rd-bit for that cache line When a transaction issues a write, fetch that block in read-only mode (if not already in cache), set the wr-bit for that cache line and make changes in cache If a line with wr-bit set is evicted, the transaction must be aborted (or must rely on some software mechanism to handle saving overflowed data) 5

Lazy Implementation When a transaction reaches its end, it must now make its writes permanent A central arbiter is contacted (easy on a bus-based system), the winning transaction holds on to the bus until all written cache line addresses are broadcasted (this is the commit) (need not do a writeback until the line is evicted must simply invalidate other readers of these cache lines) When another transaction (that has not yet begun to commit) sees an invalidation for a line in its rd-set, it realizes its lack of atomicity and aborts (clears its rd- and wr-bits and re-starts) 6

Lazy Implementation Lazy versioning: changes are made locally the master copy is updated only at the end of the transaction Lazy conflict detection: we are checking for conflicts only when one of the transactions reaches its end Aborts are quick (must just clear bits in cache, flush pipeline and reinstate a register checkpoint) Commit is slow (must check for conflicts, all the coherence operations for writes are deferred until transaction end) No fear of deadlock/livelock the first transaction to acquire the bus will commit successfully Starvation is possible need additional mechanisms 7

Lazy Implementation Parallel Commits Writes cannot be rolled back hence, before allowing two transactions to commit in parallel, we must ensure that they do not conflict with each other One possible implementation: the central arbiter can collect signatures from each committing transaction (a compressed representation of all touched addresses) Arbiter does not grant commit permissions if it detects a possible conflict with the rd-wr-sets of transactions that are in the process of committing The lazy design can also work with directory protocols 8

Eager Implementation A write is made permanent immediately (we do not wait until the end of the transaction) This means that if some other transaction attempts a read, the latest value is returned and the memory may also be updated with this latest value Can t lose the old value (in case this transaction is aborted) hence, before the write, we copy the old value into a log (the log is some space in virtual memory -- the log itself may be in cache, so not too expensive) This is eager versioning 9

Eager Implementation Since Transaction-A s writes are made permanent rightaway, it is possible that another Transaction-B s rd/wr miss is re-directed to Tr-A At this point, we detect a conflict (neither transaction has reached its end, hence, eager conflict detection): two transactions handling the same cache line and at least one of them does a write One solution: requester stalls: Tr-A sends a NACK to Tr-B; Tr-B waits and re-tries again; hopefully, Tr-A has committed and can hand off the latest cache line to B neither transaction needs to abort 10

Eager Implementation Can lead to deadlocks: each transaction is waiting for the other to finish Need a separate (hw/sw) contention manager to detect such deadlocks and force one of them to abort Tr-A write X read Y Tr-B write Y read X 11

Eager Implementation Note that if Tr-B is doing a write, it may be forced to stall because Tr-A may have done a read and does not want to invalidate its cache line just yet If new reading transactions keep emerging, Tr-B may be starved again, need other sw/hw mechanisms to handle starvation Commits are inexpensive (no additional step required); Aborts are expensive, but rare (must reinstate data from logs) 12

Other Issues Nesting: when one transaction calls another flat nesting: collapse all nested transactions into one large transaction closed nesting: inner transaction s rd-wr set are included in outer transaction s rd-wr set on inner commit; on an inner conflict, only the inner transaction is re-started open nesting: on inner commit, its writes are committed and not merged with outer transaction s commit set What if a transaction performs I/O? What if a transaction overflows out of cache? 13

Useful Rules of Thumb Transactions are often short more than 95% of them will fit in cache Transactions often commit successfully less than 10% are aborted 99.9% of transactions don t perform I/O Transaction nesting is not common Amdahl s Law again: optimize the common case! 14

Discussion Eager optimizes the common case and does not waste energy when there s a potential conflict TM implementations require relatively low hardware support Multiple commercial examples: Sun Rock, AMD ASF, IBM BG/Q, Intel Haswell 15

Network Topology Examples Grid Torus Hypercube 16

Routing Deterministic routing: given the source and destination, there exists a unique route Adaptive routing: a switch may alter the route in order to deal with unexpected events (faults, congestion) more complexity in the router vs. potentially better performance Example of deterministic routing: dimension order routing: send packet along first dimension until destination co-ord (in that dimension) is reached, then next dimension, etc. 17

Deadlock Deadlock happens when there is a cycle of resource dependencies a process holds on to a resource (A) and attempts to acquire another resource (B) A is not relinquished until B is acquired 18

Deadlock Example 4-way switch Input ports Output ports Packets of message 1 Packets of message 2 Packets of message 3 Packets of message 4 Each message is attempting to make a left turn it must acquire an output port, while still holding on to a series of input and output ports 19

Deadlock-Free Proofs Number edges and show that all routes will traverse edges in increasing (or decreasing) order therefore, it will be impossible to have cyclic dependencies Example: k-ary 2-d array with dimension routing: first route along x-dimension, then along y 17 18 19 1 2 3 2 1 0 18 1 2 3 2 1 0 17 1 2 3 2 1 0 16 1 2 3 2 1 0 20

Breaking Deadlock Consider the eight possible turns in a 2-d array (note that turns lead to cycles) By preventing just two turns, cycles can be eliminated Dimension-order routing disallows four turns Helps avoid deadlock even in adaptive routing West-First North-Last Negative-First Can allow deadlocks 21

Packets/Flits A message is broken into multiple packets (each packet has header information that allows the receiver to re-construct the original message) A packet may itself be broken into flits flits do not contain additional headers Two packets can follow different paths to the destination Flits are always ordered and follow the same path Such an architecture allows the use of a large packet size (low header overhead) and yet allows fine-grained resource allocation on a per-flit basis 22

Flow Control The routing of a message requires allocation of various resources: the channel (or link), buffers, control state Bufferless: flits are dropped if there is contention for a link, NACKs are sent back, and the original sender has to re-transmit the packet Circuit switching: a request is first sent to reserve the channels, the request may be held at an intermediate router until the channel is available (hence, not truly bufferless), ACKs are sent back, and subsequent packets/flits are routed with little effort (good for bulk transfers) 23

Buffered Flow Control A buffer between two channels decouples the resource allocation for each channel Packet-buffer flow control: channels and buffers are allocated per packet H B B B T Store-and-forward Cut-through Channel Channel 0 1 2 3 0 1 2 3 H B B B T H B B B T H B B B T Time-Space diagrams H B B B T H B B B T 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Cycle Wormhole routing: same as cut-through, but buffers in each router are allocated on a per-flit basis, not per-packet 24

Virtual Channels Buffers channel Buffers Flits do not carry headers. Once a packet starts going over a channel, another packet cannot cut in (else, the receiving buffer will confuse the flits of the two packets). If the packet is stalled, other packets can t use the channel. With virtual channels, the flit can be received into one of N buffers. This allows N packets to be in transit over a given physical channel. The packet must carry an ID to indicate its virtual channel. Buffers Buffers Physical channel Buffers Buffers 25

Example Wormhole: Node-0 A is going from Node-1 to Node-4; B is going from Node-0 to Node-5 Node-1 A Virtual channel: B idle B idle Node-2 Node-3 Node-4 Node-5 (blocked, no free VCs/buffers) Node-0 Traffic Analogy: B is trying to make a left turn; A is trying to go straight; there is no left-only lane with wormhole, but there is one with VC Node-1 A B A B A Node-2 Node-3 Node-4 Node-5 (blocked, no free VCs/buffers) 26

Virtual Channel Flow Control Incoming flits are placed in buffers For this flit to jump to the next router, it must acquire three resources: A free virtual channel on its intended hop We know that a virtual channel is free when the tail flit goes through Free buffer entries for that virtual channel This is determined with credit or on/off management A free cycle on the physical channel Competition among the packets that share a physical channel 27

Buffer Management Credit-based: keep track of the number of free buffers in the downstream node; the downstream node sends back signals to increment the count when a buffer is freed; need enough buffers to hide the round-trip latency On/Off: the upstream node sends back a signal when its buffers are close to being full reduces upstream signaling and counters, but can waste buffer space 28

Deadlock Avoidance with VCs VCs provide another way to number the links such that a route always uses ascending link numbers 17 18 19 2 1 0 18 1 2 3 2 1 0 17 1 2 3 2 1 0 16 1 2 3 2 1 0 117 118 119 102 101 100 118 117 101 102 103 116 202 201 200 217 218 218 217 201 202 203 Alternatively, use West-first routing on the 1 st plane and cross over to the 2 nd plane in 219 216 case you need to go West again (the 2 nd plane uses North-last, for example) 29

Title Bullet 30