A Scalable SAS Machine


Parallel Computer Organization and Design: Lecture 8
Per Stenström 2008, Sally A. McKee 2009
2/18/2009

Slide 1: Scalable Cache Coherence
Design principles of scalable cache protocols
- Overview of design space (8.1)
- Basic operation of directory protocols (8.2)
- Performance issues (8.3)
- Correctness issues (8.4)
- Case studies to focus on detailed issues ( )

Slide 2: Scalable SAS Machine
[Figure: nodes attached to a scalable interconnection network]
Three important design decisions:
- Scalable interconnection network
- Distributed memory organization
- Scalable cache coherence protocol

Slide 3: Directory Protocols
[Figure: N nodes attached to a scalable interconnection network]
- Snooping protocols use broadcasting and do not scale
- Directory entry associated with each memory block
- Bookkeeping tracks which nodes have copies, along with state of memory
- All global requests for that block are sent to its directory

Slide 4: Common Approach
SMPs form building blocks in larger systems
[Figure: four two-level organizations]
(a) Snooping-snooping: bus-based nodes (B1, main memory) joined by snooping adapters to a second bus (B2)
(b) Snooping-directory: bus-based nodes (B1, main memory, directory) joined by assists to a network
(c) Directory-directory: Network1 clusters joined by directory adapters to Network2
(d) Directory-snooping: Network1 clusters joined by dir/snoopy adapters to a bus (or ring)
Examples:
- Convex Exemplar (directory-directory)
- SGI Origin, Sequent NUMA-Q, HAL (snooping-directory)

Slide 5: Operation of a Simple Directory Protocol 1(2)
[Figure: Local, Home, and Remote nodes on an interconnection network]
- Local node: node initiating request
- Home node: node with directory entry for block
- Remote node: other node(s) involved in transaction

Slide 6: Operation of a Simple Directory Protocol 2(2)
(a) Read miss to a block in dirty state
1. Read request (requestor to directory node for block)
2. Reply with owner identity (directory node to requestor)
3. Read req. to owner (requestor to node with dirty copy)
4a. Data reply (owner to requestor)
4b. Revision message (owner to directory node)
(b) Write miss to a block with two sharers
1. RdEx request (requestor to directory node)
2. Reply with sharers identity (directory node to requestor)
3a, 3b. Inval. req. to sharer (requestor to each sharer)
4a, 4b. Inval. ack (each sharer to requestor)
Important performance issues: number, latency, and traffic of transactions
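The two transactions on slide 6 can be traced with a small sketch. This is a minimal illustration, not the lecture's implementation: the names `DirEntry`, `read_miss`, and `write_miss` are hypothetical, the clean-block read path (home supplies the data) is an assumption, and a write miss to a dirty block is left out because the slide does not show it.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    """Hypothetical directory entry for one memory block."""
    dirty: bool = False
    sharers: set = field(default_factory=set)  # ids of nodes holding a copy


def read_miss(entry, requestor):
    """Return the message trace (name, src, dst) for a read miss."""
    msgs = [("read_request", requestor, "home")]
    if entry.dirty:
        (owner,) = entry.sharers  # a dirty block has exactly one holder
        msgs += [
            ("reply_owner_identity", "home", requestor),
            ("read_request_to_owner", requestor, owner),
            ("data_reply", owner, requestor),
            ("revision", owner, "home"),
        ]
        entry.dirty = False
    else:
        # Assumed path for a clean block: home memory supplies the data.
        msgs.append(("data_reply", "home", requestor))
    entry.sharers.add(requestor)
    return msgs


def write_miss(entry, requestor):
    """Return the message trace for a write miss to a clean, shared block."""
    if entry.dirty:
        raise NotImplementedError("dirty case is not shown on the slide")
    others = sorted(entry.sharers - {requestor})
    msgs = [
        ("rdex_request", requestor, "home"),
        ("reply_sharers_identity", "home", requestor),
    ]
    msgs += [("inval_request", requestor, s) for s in others]
    msgs += [("inval_ack", s, requestor) for s in others]
    entry.sharers = {requestor}
    entry.dirty = True
    return msgs
```

Calling `read_miss(DirEntry(dirty=True, sharers={2}), 0)` yields the five messages of case (a), and `write_miss(DirEntry(sharers={1, 2}), 0)` yields the six messages of case (b).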

Slide 7: Implementation of a Simple Protocol
[Figure: nodes with cache, memory, and directory on an interconnection network; each directory entry holds presence bits and a dirty bit]
- Full vector directory: P + 1 bits/block
- Directory entries are distributed
Scalability considerations:
- Performance: how do latency and bandwidth scale?
- Cost: how does directory grow in size with P?

Slide 8: Performance Insights
Inherent program characteristics:
- determine whether directories provide big advantages over broadcast
- provide insights into how to organize and store directory information
Characteristics that matter:
- frequency of write misses
- how many sharers on a write miss
- how these scale
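The full vector entry from slide 7 (one presence bit per node plus a dirty bit) might be represented as follows. The class name `FullVectorEntry` and its methods are illustrative assumptions; the presence bits are packed into a single Python integer.

```python
class FullVectorEntry:
    """Hypothetical full-bit-vector directory entry: P presence bits + 1 dirty bit."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.presence = 0   # bit i set => node i holds a copy of the block
        self.dirty = False

    def _check(self, node):
        if not 0 <= node < self.num_nodes:
            raise ValueError(f"node {node} out of range")

    def add_sharer(self, node):
        self._check(node)
        self.presence |= 1 << node

    def remove_sharer(self, node):
        self._check(node)
        self.presence &= ~(1 << node)

    def sharers(self):
        """List the node ids whose presence bit is set."""
        return [i for i in range(self.num_nodes) if (self.presence >> i) & 1]

    def bits(self):
        """Storage per block: P presence bits plus the dirty bit."""
        return self.num_nodes + 1
```

For an 8-node machine, `FullVectorEntry(8).bits()` is 9, matching the P + 1 bits/block figure.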

Slide 9: Cache Invalidation Patterns
[Figure: histograms "LU Invalidation Patterns" and "Ocean Invalidation Patterns"; x-axis: # of invalidations, y-axis: % of shared writes]

Slide 10: Sharing Patterns: Summary
Common case: only a few sharers at a write, scales slowly with P
- Code and read-only objects: no problem, never or rarely written
- Migratory objects: only 1-2 invalidations per write
- Mostly read objects: large but infrequent invalidations
- Frequently read/written objects: small but frequent invalidations
- Synchronization objects: low contention -> small invalidations
Implications:
- directories useful in containing traffic (as opposed to snoop)
- techniques to reduce storage overhead can be important

Slide 11: Directory Protocol Taxonomy
Directory schemes
- Centralized
- Distributed
  How to find source of directory information:
  - Hierarchical
  - Flat
    How to locate copies:
    - Memory-based
    - Cache-based
All approaches have different tradeoffs wrt scalability considerations

Slide 12: Centralized Directory
[Figure: a single directory and nodes 1..N on a scalable interconnection network]
- All transactions to all blocks go to a centralized directory
- May become a bottleneck
- Has only been popular for a small number of nodes

Slide 13: Hierarchical Directories
[Figure: tree of directories (DIR)]
- Extension of snooping concept
- Bandwidth: limited at the root
- Latency: multiple directory lookups on the way
- Cost: duplication of entries but smaller entries
Therefore, not a popular approach

Slide 14: Flat Memory-based Schemes
Example: the simple directory protocol (full bit vector)
Scaling of performance characteristics
- write traffic: proportional to number of sharers
- write latency: invalidations can issue in parallel
Scaling of storage for directory
Example (assuming 64-byte lines):
- 64 nodes: 12.5% overhead
- 256 nodes: 50% overhead
- 1024 nodes: 200% overhead
Storage grows as P * M (P nodes, M memory blocks)
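The overhead figures on slide 14 follow from dividing the presence bits per block by the data bits per block. The sketch below reproduces that arithmetic; the function name is hypothetical, and the dirty bit is ignored, which is an assumption that matches the rounded figures on the slide.

```python
def full_vector_overhead(nodes, line_bytes=64):
    """Directory bits per block divided by data bits per block.

    One presence bit per node; the single dirty bit is ignored
    (an assumption that matches the slide's figures).
    """
    return nodes / (line_bytes * 8)


for p in (64, 256, 1024):
    print(f"{p} nodes: {full_vector_overhead(p):.1%} overhead")
```

Because each block carries P bits and there are M blocks, the total directory storage is P * M bits, which is why the overhead doubles every time the node count doubles.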

Slide 15: Reducing Storage Overhead
Optimizations for full bit vector schemes
- increase cache block size
- use multiprocessor nodes
- 256 procs, 4/cluster, 128B line: 6.25% overhead
Provide pointers to a few nodes (address the P term)
- intuition: most blocks cached by only a few nodes
- P=1024 => 10 bit pointers, can accommodate 100 pointers
- need an overflow strategy when there are more sharers
Reducing height (address the M term)
- intuition: # memory blocks >> # cache blocks
- organize directory as a cache, rather than one entry/block

Slide 16: Flat, Cache-based Schemes
How they work:
- home has a single pointer that points to head of list
- cache has pointer to next sharer
- on read, cache is linked into list
- on write, send invalidations down the list
[Figure: main memory (home) pointing to a list of caches at Node 0, Node 1, Node 2]
Example: Scalable Coherent Interface (SCI), IEEE Standard
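The limited-pointer idea on slide 15 can be sketched as below. The slide only says that an overflow strategy is needed; falling back to broadcast is one possible strategy chosen here as an assumption, and the class `LimitedPointerEntry` is hypothetical.

```python
class LimitedPointerEntry:
    """Hypothetical directory entry holding a few node pointers.

    Overflow strategy (an assumption, not from the slides): once more
    sharers exist than pointers, set an overflow bit and broadcast
    invalidations to every node.
    """

    def __init__(self, num_nodes, max_pointers):
        self.num_nodes = num_nodes
        self.max_pointers = max_pointers
        self.pointers = []
        self.overflow = False

    def add_sharer(self, node):
        if self.overflow or node in self.pointers:
            return
        if len(self.pointers) < self.max_pointers:
            self.pointers.append(node)
        else:
            self.overflow = True
            self.pointers.clear()

    def invalidation_targets(self, writer):
        """Nodes that must receive an invalidation on a write by `writer`."""
        pool = range(self.num_nodes) if self.overflow else self.pointers
        return [n for n in pool if n != writer]

    def bits(self):
        """Pointer storage plus one overflow bit."""
        pointer_bits = (self.num_nodes - 1).bit_length()  # 10 for 1024 nodes
        return self.max_pointers * pointer_bits + 1
```

With 1024 nodes and 4 pointers the entry needs 41 bits instead of 1024, at the cost of broadcasting when a block turns out to be widely shared.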

Slide 17: Scaling Properties (Cache-based)
Performance:
- Traffic on write: proportional to number of sharers
- Latency on write: proportional to number of sharers
Storage overhead: quite good scaling along both P and M axes
Other properties:
- good: mature, IEEE Standard, fair
- bad: complex

Slide 18: Correctness Issues
Ensure basics of coherence at state transition level
- lines are updated/invalidated/fetched
- correct state transitions and actions happen
Ensure ordering and serialization constraints are met
- coherence (single location), consistency (multiple locations)
Avoid deadlock, livelock, starvation
Problems amplified in comparison with bus-based machines
- multiple copies AND multiple paths through network
- large latency makes optimizations attractive
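For the cache-based scheme of slides 16 and 17, a short sketch shows why write latency grows with the number of sharers: the invalidations walk the list one node at a time. This is a simplified singly linked model under stated assumptions; the class `SharingList` is hypothetical and is not the actual SCI protocol, which is considerably more complex.

```python
class SharingList:
    """Home keeps only the head pointer; each cache points to the next sharer."""

    def __init__(self):
        self.head = None   # pointer stored at the home node
        self.next = {}     # per-cache pointer: node id -> next sharer or None

    def read(self, node):
        """On a read miss, link the cache in at the head of the list."""
        if node in self.next:
            return
        self.next[node] = self.head
        self.head = node

    def write(self, writer):
        """Invalidate down the list; return the sharers in visiting order."""
        visited = []
        cur = self.head
        while cur is not None:       # serial walk: one hop per sharer
            if cur != writer:
                visited.append(cur)
            cur = self.next[cur]
        self.next = {writer: None}   # writer becomes the only list member
        self.head = writer
        return visited
```

The length of the list returned by `write` is the number of serial invalidation steps, in contrast to the memory-based scheme where invalidations can issue in parallel.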

Slide 19: Coherence Enforcement
Revisit the simple directory protocol
[Figure: (a) Read miss to a block in dirty state; (b) Write miss to a block with two sharers, as on slide 6]
Coherence is enforced because
- writes are serialized through home memory module
- invalidations are serialized if single path between any two nodes

Slide 20: Sequential Consistency
P1: A=1;
P2: while (A==0) ; B=1;
P3: while (B==0) ; print A;
[Figure: three nodes (P, Mem, Cache) on an interconnection network; cache states A:0->1, B:0->1, and a stale A:0; the A=1 message to the node with the stale copy is delayed while B=1 arrives]
How do we guarantee write atomicity?
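The problem posed on slide 20 can be replayed as one fixed interleaving. The sketch below is a deterministic model under stated assumptions, not a simulator: the function `run`, the single stale cache, and the `atomic_writes` switch are all hypothetical.

```python
def run(atomic_writes):
    """Replay the slide-20 example; return the value the third processor prints."""
    memory = {"A": 0, "B": 0}
    p3_cache = {"A": 0}      # third processor holds an old copy of A
    in_flight = []           # invalidations not yet delivered to that cache

    # First processor: A = 1
    memory["A"] = 1
    in_flight.append("A")
    if atomic_writes:
        # The write completes only once every invalidation is acknowledged.
        for block in in_flight:
            p3_cache.pop(block, None)
        in_flight.clear()

    # Second processor: while (A == 0); B = 1  (it already observes A == 1)
    if memory["A"] == 1:
        memory["B"] = 1

    # Third processor: while (B == 0); print A
    # B misses and is read from memory; A hits if the stale copy survives.
    if memory["B"] == 1:
        return p3_cache.get("A", memory["A"])
    return None
```

With `run(False)` the delayed invalidation lets the stale copy survive and the result is 0, which sequential consistency forbids; with `run(True)` the result is 1.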

Slide 21: Enforcing Write Atomicity with the Simple Protocol
[Figure: (a) Read miss to a block in dirty state; (b) Write miss to a block with two sharers, as on slide 6]
Requestor may not issue another global transaction until all invalidations have been acknowledged

Slide 22: Deadlock, Livelock, Starvation
Request-response protocol
Similar issues to those discussed earlier
- a node may receive too many messages
- flow control can cause deadlock
- separate request and reply networks with request-reply protocol
New problem: protocols often are not strict request-reply
- e.g. rd-excl generates inval requests (which generate ack replies)
- other cases to reduce latency and allow concurrency
Must address livelock and starvation
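The rule on slide 21 amounts to counting outstanding acknowledgements at the requestor. A minimal sketch, assuming a hypothetical `Requestor` class that the slides do not define:

```python
class Requestor:
    """Blocks new global transactions until all invalidation acks arrive."""

    def __init__(self):
        self.pending_acks = 0

    def can_issue(self):
        return self.pending_acks == 0

    def start_write(self, sharers):
        """Send one invalidation per sharer and wait for the same number of acks."""
        if not self.can_issue():
            raise RuntimeError("previous write not yet acknowledged")
        self.pending_acks = len(sharers)

    def receive_ack(self):
        if self.pending_acks == 0:
            raise RuntimeError("unexpected ack")
        self.pending_acks -= 1
```

After `start_write([1, 2])`, `can_issue()` stays false until `receive_ack()` has been called twice.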

Slide 23: Protocol Enhancements for Latency
Forwarding messages: memory-based protocols (L = local, H = home, R = remote)
(a) Strict request-reply: 1: req (L to H), 2: reply (H to L), 3: intervention (L to R), 4a: revise (R to H), 4b: response (R to L)
(b) Intervention forwarding: 1: req (L to H), 2: intervention (H to R), 3: response (R to H), 4: reply (H to L)
(c) Reply forwarding: 1: req (L to H), 2: intervention (H to R), 3a: revise (R to H), 3b: response (R to L)

Slide 24: Protocol Enhancements for Latency
Forwarding messages: cache-based protocols (H = home, S1 to S3 = sharers)
(a) 1: inval (H to S1), 2: ack (S1 to H), 3: inval (H to S2), 4: ack (S2 to H), 5: inval (H to S3), 6: ack (S3 to H)
(b) 1: inval (H to S1), 2a: inval (S1 to S2), 2b: ack (S1 to H), 3a: inval (S2 to S3), 3b: ack (S2 to H), 4b: ack (S3 to H)
(c) 1: inval (H to S1), 2: inval (S1 to S2), 3: inval (S2 to S3), 4: ack (S3 to H)
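The latency benefit of the memory-based forwarding variants on slide 23 can be checked by counting messages on the critical path. In the sketch below, each message lists the messages it must wait for; these dependencies are inferred from the step numbers on the slide and are an assumption, as is the helper `critical_path`.

```python
def critical_path(messages):
    """Length of the longest dependency chain.

    `messages` maps a message name to the names it must wait for.
    """
    depth = {}

    def chain(name):
        if name not in depth:
            depth[name] = 1 + max((chain(p) for p in messages[name]), default=0)
        return depth[name]

    return max(chain(name) for name in messages)


strict_request_reply = {
    "1:req": [],
    "2:reply": ["1:req"],
    "3:intervention": ["2:reply"],
    "4a:revise": ["3:intervention"],
    "4b:response": ["3:intervention"],
}
intervention_forwarding = {
    "1:req": [],
    "2:intervention": ["1:req"],
    "3:response": ["2:intervention"],
    "4:reply": ["3:response"],
}
reply_forwarding = {
    "1:req": [],
    "2:intervention": ["1:req"],
    "3a:revise": ["2:intervention"],
    "3b:response": ["2:intervention"],
}

for name, msgs in [
    ("strict request-reply", strict_request_reply),
    ("intervention forwarding", intervention_forwarding),
    ("reply forwarding", reply_forwarding),
]:
    print(name, "total:", len(msgs), "critical path:", critical_path(msgs))
```

Under these assumptions, intervention forwarding saves one message in total, while reply forwarding is the only variant that shortens the critical path from four messages to three.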
