5 Chip Multiprocessors (II) Robert Mullins

Size: px

Start display at page:

Download "5 Chip Multiprocessors (II) Robert Mullins"

Clifford Singleton
5 years ago
Views:

1 5 Chip Multiprocessors (II) ( MPhil Chip Multiprocessors (ACS Robert Mullins

2 Overview Synchronization hardware primitives Cache Coherency Issues Coherence misses Cache coherence and interconnects Directory-based Coherency Protocols 2 ( MPhil Chip Multiprocessors (ACS

3 Synchronization The lock problem The lock is suppose to provide atomicity for critical sections Unfortunately, as implemented this lock is lacking atomicity in its own implementation Multiple processors could read the lock as free and progress past the branch simultaneously lock: ld reg, lock-addr cmp reg, #0 bnz lock st lock-addr, #1 ret unlock: st lock-addr, #0 ret Culler p ( MPhil Chip Multiprocessors (ACS

4 Synchronization Test and Set Executes the following atomically: reg=m[lock-addr] m[lock-addr]=1 The branch makes sure that if the lock was already taken we try again A more general, but similar, instruction is swap reg1=m[lock-addr] m[lock-addr]=reg2 lock: t&s reg, lock-addr bnz reg, lock ret unlock: st lock-addr, #0 ret 4 ( MPhil Chip Multiprocessors (ACS

5 Synchronization We could implement test&set with two bus transactions A read and a write transaction We could lock down the bus for these two cycles to ensure the sequence is atomic More difficult with a split-transaction bus performance and deadlock issues 5 ( MPhil Chip Multiprocessors (ACS Culler p.391

6 Synchronization If we assume an invalidation-based CC protocol with a WB cache, a better approach is to: Issue a read exclusive (BusRdX) transaction then perform the read and write (in the ( cache without giving up ownership Any incoming requests to the block are buffered until the data is written in the cache Any other processors are forced to wait 6 ( MPhil Chip Multiprocessors (ACS

7 Synchronization Other common synchronization instructions: swap fetch&op fetch&inc fetch&add compare&swap Many x86 instructions can be prefixed with the lock modifier to make them atomic A simpler general purpose solution? 7 ( MPhil Chip Multiprocessors (ACS

8 LL/SC LL/SC Load-Linked (LL) Read memory Set lock flag and put address in lock register Intervening writes to the address in the lock register will cause the lock flag to be reset Store-Conditional (SC) Check lock flag to ensure an intervening conflicting write has not occurred If lock flag is not set, SC will fail If (atomic_update) then mem[addr]=rt, rt=1 else rt=0 8 ( MPhil Chip Multiprocessors (ACS

9 LL/SC reg2=1 lock: ll reg1, lock-addr bnz reg1, lock ; lock already taken? sc lock-addr, reg2 beqz lock ; if SC failed goto lock ret unlock: st lock-addr, #0 ret 9 ( MPhil Chip Multiprocessors (ACS

10 LL/SC This SC will fail as the lock flag will be reset by the store from P2 Culler p ( MPhil Chip Multiprocessors (ACS

11 LL/SC LL/SC can be implemented using the CC protocol: LL loads cache line with write permission (issues ( M BusRdX, holds line in state SC Only succeeds if cache line is still in state M, otherwise fails 11 ( MPhil Chip Multiprocessors (ACS

12 LL/SC Need to ensure forward progress. May prevent LL giving up M state for n cycles (or after repeated fails ( state guarantee success, i.e. simply don't give up M We normally implement a restricted form of LL/SC called RLL/RSC: SC may experience spurious failures e.g. due to context switches and TLB misses We add restrictions to avoid cache line (holding lock ( variable from being replaced Disallow memory memory-referencing instructions between LL and SC Prohibit out-of-order execution between LL and SC 12 ( MPhil Chip Multiprocessors (ACS

13 Coherence misses Remember your 3 C's! Compulsory Cold-start of first-reference misses Capacity If cache is not large enough to store all the blocks needed during the execution of the program ( collision Conflict (or Conflict misses occur due to direct-mapped or set associative block placement strategies Coherence Misses that arise due to interprocessor communication 13 ( MPhil Chip Multiprocessors (ACS

14 True sharing A block typically contains many words (e.g. 4- Coherency is maintained at the granularity.( 8 of cache blocks True sharing miss Misses that arise from the communication of data e.g., the 1 st write to a shared block ( S ) will causes an invalidation to establish ownership Additionally, subsequent reads of the invalidated block by another processor will also cause a miss Both these misses are classified as true sharing if data is communicated and they would occur irrespective of block size. 14 ( MPhil Chip Multiprocessors (ACS

15 False sharing False sharing miss Different processors are writing and reading different words in a block, but no communication is taking place e.g. a block may contain words X and Y P1 repeatedly writes to X, P2 repeatedly writes to Y The block will be repeatedly invalidated (leading to cache ( misses even though no communication is taking place These are false misses and are due to the fact that the block contains multiple words They would not occur if the block size = a single word For more details see Coherence miss classification for performance debugging in multi-core processors, Venkataramani et al. Interact ( MPhil Chip Multiprocessors (ACS

16 Cache coherence and interconnects Broadcast-based snoopy protocols These protocols rely on bus-based interconnects Buses have limited scalability Energy and bandwidth implications of broadcasting They permit direct cache-to-cache transfers Low-latency communication 2 hops» 1. broadcast» 2. receive data from remote cache Very useful for applications with lots of fine-grain sharing 16 ( MPhil Chip Multiprocessors (ACS

17 Cache coherence and interconnects Totally-ordered interconnects All messages are delivered to all destinations in the same order. Totally-ordered interconnects often employ a centralised arbiter or switch e.g. a bus or pipelined broadcast tree Traditional snoopy protocols are built around the concept of a bus (or virtual :( bus ( 1 ) Broadcast - All transactions are visible to all components connected to the bus ( 2 ) The interconnect provides a total order of messages Chip Multiprocessors (ACS ( MPhil 17

18 Cache coherence and interconnects A pipelined broadcast tree is sufficiently similar to a bus to support traditional [( Wisconsin ) snooping protocols. [Reproduced from Milo Martin's PhD thesis The centralised switch guarantees a total ordering of messages, i.e. messages are sent to the root switch then broadcast. 18 ( MPhil Chip Multiprocessors (ACS

19 Cache coherence and interconnects Unordered interconnects Networks (e.g. mesh, ( torus can't typically provide strong ordering guarantees, i.e. nodes don't perceive transactions in a single global order. Point-to-point ordering Networks may be able to ensure messages sent between a pair of nodes are guaranteed not to be reordered. e.g. a mesh with a single VC and deterministic dimension ordered ( XY ) routing 19 ( MPhil Chip Multiprocessors (ACS

20 Directory-based cache coherence The state of the blocks in each cache in a snoopy protocol is maintained by broadcasting all memory operations on the bus We want to avoid the need to broadcast. So maintain the state of each block explicitly We store this information in the directory Requests can be made to the appropriate directory entry to read or write a particular block The directory orchestrates the appropriate actions necessary to satisfy the request 20 ( MPhil Chip Multiprocessors (ACS

21 Directory-based cache coherence The directory provides a per-block ordering point to resolve races All requests for a particular block are made to the same directory. The directory decides the order the requests will be satisfied. Directory protocols can operate over unordered interconnects 21 ( MPhil Chip Multiprocessors (ACS

22 Broadcast-based directory protocols A number of recent coherence protocols broadcast transactions over unordered interconnects: Similar to snoopy coherence protocols They provide a directory, or coherence hub, that serves as an ordering point. The directory simply broadcasts requests ( maintained to all nodes (no sharer state is The ordering point also buffers subsequent coherence requests to the same cache line to prevent races with a request in progress e.g. early example AMD's Hammer protocol High bandwidth requirements, but simple, no need to maintain/read sharer state 22 ( MPhil Chip Multiprocessors (ACS

23 Directory-based cache coherence The directory keeps track of who has a copy of the block and their states Broadcasting is replaced by cheaper point-to-point communications by maintaining a list of sharers The number of invalidations on a write is typically small in real applications, giving us a significant reduction in communication costs (esp. in systems ( processors with a large number of 23 ( MPhil Chip Multiprocessors (ACS

24 Directory-based cache coherence Read Miss to a block in a modified state in a cache ( 8.5 Fig. (Culler, An example of a simple protocol. This is only meant to introduce the concept of a directory 24 ( MPhil Chip Multiprocessors (ACS

25 Directory-based cache coherence Write miss to a block with two sharers 25 ( MPhil Chip Multiprocessors (ACS

26 Directory-based cache coherence Let's consider the requester, directory and sharer state transitions for the previous slide... Requester State Directory State Sharer State The processor executes a store I->P (1) The block is initially in the ( I(nvalid state We make a ExclReq to the directory and move to a pending state P->E (4) We receive write permission and data from the directory Block is initially marked as shared. The directory holds a list of the sharers Shared->TransWaitForInvalidate (2) The directory receives a ExclReq from cache 'id', id is not in the sharers list and the sharers list is not empty. It must send invalidate requests to all sharers and wait for their responses TransWaitForInvalidate->M (4) All invalidate acks are recieved, directory can reply to requester and provide data + write permissions. It moves to a state that records that the requester has the only copy S->I (3) On receiving a InvReq each sharer invalidates its copy of the block and moves to state I. It then acks with a InvRep message 26 ( MPhil Chip Multiprocessors (ACS

27 Directory-based cache coherence We now have two types of controller, one at each directory and one at each private cache The complete cache coherence protocol is specified in state-diagrams for both controllers The stable cache states are often MESI as in a snoopy protocol There are some complete example protocols ( Gold available on the wiki (Courtesy of Brian Exercise: try and understand how each of these protocols handle read and write misses 27 ( MPhil Chip Multiprocessors (ACS

28 Organising directory information How do we know which directory to send our request to? How is directory state actually stored? 28 ( MPhil Chip Multiprocessors (ACS

29 Organising directory information Directory Schemes Centralized Distributed How to find source of directory information How to locate copies Memory-based Information about all sharers is stored at the directory using a full bit-vector organization, limited-pointer scheme etc. ( book Figure 8.7 (reproduced from Culler Parallel Flat Cache-based Hierarchical Requests traverse up a tree to find a node with information on the block information is distributed amongst sharers, e.g. sharers form a linked list ( NUMA-Q (IEEE SCI, Sequent Typically operations to: add to head, ( neighbours remove a node (by contacting ( only and invalidate all nodes (from head we won't discuss 29 ( MPhil Chip Multiprocessors (ACS

30 Organising directory information How do we store the list of sharers in a flat, memorybased directory scheme? Full bit-vector P presence bits, which indicate for each of the p processors whether the processor has a copy of the block Limited-pointer schemes Maintain a fixed (and ( limited number of pointers [Culler p.568] Typically the number of sharers is small (4 pointers may often ( suffice Need a backup or overflow strategy Overflow to memory or resort to broadcast ( Origin Or a coarse vector scheme (e.g. SGI ( processors (where each bit represents groups of Extract from duplicated L1 tags (reverse-mapped) Query local copy of tags to find sharers 30 ( MPhil Chip Multiprocessors (ACS

31 Organising directory information Four examples of how we might store our directory information in a CMP: Append state to L2 tags ( 1 Duplicate L1 tags at the directory ( 2 Store directory state in main memory and include ( 3 a directory cache at each node A hierarchical directory ( 4 I assume the L2 is the first shared cache. In a real system this could as easily be the L3 or interface to main memory. The directory is placed at the first shared memory regardless of the number of levels of cache. 31 ( MPhil Chip Multiprocessors (ACS

32 Organising directory information 1. Append state to L2 tags Perhaps conceptually the simplest scheme Assume a shared banked inclusive L2 cache The location of the directory depends only on the block address Directory state can simply be appended to the L2 cache tags Reproduced from Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, Zhang/Asanovic, ISCA'05 32 ( MPhil Chip Multiprocessors (ACS

33 Organising directory information 1. Append state to L2 tags May be expensive in terms of memory L2 may contain many more cache lines than can reside in the aggregated L1s (or on a per bank basis, those L1 ( bank lines that can map to the L2 May be unnecessarily power and area hungry Doesn't provide support for non-inclusive L2 caches Assumes the L2 is always caching anything in the L1's Problematic if L2 is small in comparison to aggregated L1 capacity 33 ( MPhil Chip Multiprocessors (ACS

34 Organising directory information 2. Duplicating L1 tags (A reverse mapped directory CAM) :( bank At each directory (e.g. L2 Duplicate the L1 tags of those L1 lines that can map to the bank We can interrogate the duplicated tags to determine the sharers list At what granularity do we interleave addresses across banks for the directory and L2 cache? Simpler if we interleave the directory and L2 in the same way What about the impact of granularity on the directory? 34 ( MPhil Chip Multiprocessors (ACS

35 Organising directory information 2. Duplicating L1 tags In this example precisely one quarter of the L1 lines map to each of the 4 L2 banks 35 ( MPhil Chip Multiprocessors (ACS

36 Organising directory information 2. Duplicating L1 tags A fine-grain interleaving as illustrated on the previous slide means that only a subset of each L1's lines may map to a particular L2 bank Each directory is organised as: s/n sets of n*a ways Where n = no. of processors, a = associativity, s = no. of sets in L1 If a coarse-grain interleaving is selected (where the L2 bank is selected from bits outside the L1s index ( bits any L1 line could map to any L2 bank, hence each directory is organised as: s sets of n*a ways each 36 ( MPhil Chip Multiprocessors (ACS

37 Organising directory information 2. Duplicating L1 tags Example: Sun Niagara T1 L1 caches are write-through, 16-byte lines Allocate on load, no-allocate on a store L2 maintains directory by duplicating L1 tags L2 is banked and interleaved at a 64-byte granularity No. of L1 lines that may map to each L2 bank is much less than the total number of L2 lines in a bank. Duplicating L1 tags saves area and power over adding directory state to each L2 tag. 37 ( MPhil Chip Multiprocessors (ACS

38 Organising directory information 3. Directory-caches Directory state is stored in main memory and cached at each node Note: The L2 caches are private in this example Figure reproduced from Proximity- Aware Directory-based Coherence for Multi-core Processor Architectures, Brown/Kumar/Tullsen, SPAA'07 38 ( MPhil Chip Multiprocessors (ACS

39 Organising directory information 3. Directory-caches Each tile and corresponding memory channel has access to a different range of physical memory locations ( directory There is only one possible home (location of the associated for each memory block Two different directories never share directory state, so there are no coherence worries between directory caches! Each cache line in the directory cache may hold state corresponding to multiple contiguous memory blocks to exploit spatial locality (as you ( cache would in a normal We typically assign home nodes at a page-granularity using a firsttouch policy The cited work use a 4-way set associative, 16KB cache at each tile. (The proximity-aware protocol described is able to request data from a (. L2 nearby sharer if it is not present in the home node's 39 ( MPhil Chip Multiprocessors (ACS

40 Organising directory information 4. A hierarchical directory Reproduced from A consistency architecture for hierarchical shared caches, Ladan-Mozes/Lesierson, SPAA'08 40 ( MPhil Chip Multiprocessors (ACS

41 Organising directory information 4. A hierarchical directory Aimed at processors with large number of cores The black dots indicate where a particular block may be cached or stored in memory There is only one place as we move up each level of the tree Example: If a L3 cache holds write permissions for a block (holds block in state ( M it can manage the line in its subtree as if it were main memory No need to tell its parent (! proofs See paper for details (and See also Fractal Coherence paper from MICRO'10 41 ( MPhil Chip Multiprocessors (ACS

42 Organising directory information 4. A hierarchical directory Less extreme examples of hierarchical schemes are common where larger-scale machines exploit bus-based first-level coherence (commodity ( hardware and a directory protocol at the secondlevel. In such schemes a bridge between the two protocols monitors activity on the bus and when necessary intervenes to ensure coherence actions are handled at the second level (removing the transaction from the bus, completing the coherence actions at the 2 nd level and then ( bus replaying the request on the 42 ( MPhil Chip Multiprocessors (ACS

43 Sharing patterns Invalidation frequency and size distribution How many writes require copies in other caches to be invalidated? (invalidating ( writes i.e. the local private cache does not already hold block in M state What is the distribution of the no. of invalidations ( sharers ) required upon these writes? 43 ( MPhil Chip Multiprocessors (ACS

44 Sharing patterns Barnes-Hut Invalidation Patterns to to to to to to to to to to to to to to 63 # of invalidations Radiosity Invalidation Patterns to to to to to to to to to to to to to to 63 # of invalidations See Culler p.574 for more (assumes inf. large private ( caches 44 ( MPhil Chip Multiprocessors (ACS

45 Sharing patterns Read-only No invalidating writes Producer-consumer Processor A writes, then one or more processors read the data, then processor A writes again, the data is read again, and so on Invalidation size is often 1, all or a few This categorization is originally from Cache invalidation patterns in shared memory multiprocessors, Gupta/Weber, See also Culler Section ( MPhil Chip Multiprocessors (ACS

46 Sharing patterns Migratory Data migrates from one processor to another. Often being read as well as written along the way Invalidation size = 1, only previous writer has a copy (it ( copy invalidated the previous Irregular read-write Irregular/unpredictable read/write access patterns Invalidation size is normally concentrated around the small end of the spectrum 46 ( MPhil Chip Multiprocessors (ACS

47 Protocol optimisations Goals? Performance, power, complexity and area! Aim to lower the average memory access time If we look at the protocol in isolation, the typical approach is to: Aim to reduce the number of network transactions ( 1 Reduce the number of transactions on the critical ( 2 path of the processor 47 ( MPhil Chip Multiprocessors (ACS Culler Section 8.4.1

48 Protocol optimisations Let's look again at the simple protocol we introduced in slides 24/25 In the case of a read miss to a block in a modified state in another cache we required: 5 transactions in total 4 transactions are on the critical path Let's look at forwarding as a protocol optimisation An intervention here is just like a request, but issued in reaction to a request to a cache 48 ( MPhil Chip Multiprocessors (ACS

49 Directory-based cache coherence Read Miss to a block in a modified state in a cache ( 8.5 Fig. (Culler, 49 ( MPhil Chip Multiprocessors (ACS

50 Directory-based cache coherence 1: req 3:intervention 4a:revise L H R 2:reply 4b:response (a) Strict request-reply 1: req 2:intervention L H R 4:reply 3:response (a) Intervention forwar ding 1: req 2:intervention 3a:revise L H R 3b:response (a) Reply forwarding Culler, p ( MPhil Chip Multiprocessors (ACS

51 Protocol optimisations Other possible ways improvements can be made: Optimise the protocol for common sharing patterns e.g. producer-consumer and migratory Exploit a particular network topology or hierarchical directory structure Perhaps multiple networks tuned to different types of traffic ( sense Exploit locality (in a physical Obtain required data using a cache-to-cache transfer from the nearest sharer or an immediate neighbour Perform speculative transactions to accelerate acquisition of permissions or data Compiler assistance ( MPhil Chip Multiprocessors (ACS

52 Correctness Directory protocols can quickly become very complicated Timeouts, retries, negative acknowledgements have all been used in different protocols to avoid deadlock and livelock issues (and guarantee ( progress forward 52 ( MPhil Chip Multiprocessors (ACS

5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins

5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins Overview Synchronization hardware primitives Cache Coherency Issues Coherence misses, false sharing Cache coherence and interconnects