5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins
1 5 Chip Multiprocessors (II) Robert Mullins
2 Overview
Synchronization: hardware primitives. Cache coherency issues: coherence misses, false sharing. Cache coherence and interconnects. Directory-based coherency protocols: introduction, organising directories in a CMP, sharing patterns and protocol optimisations, correctness, broadcasting over unordered interconnects.
3 Synchronization: the lock problem
The lock is supposed to provide atomicity for critical sections. Unfortunately, as implemented below, the lock lacks atomicity in its own implementation: multiple processors could read the lock as free and progress past the branch simultaneously.

lock:   ld   reg, lock-addr
        cmp  reg, #0
        bnz  lock
        st   lock-addr, #1
        ret
unlock: st   lock-addr, #0
        ret

(Culler p. 338)
4 Synchronization: test&set
Test&set executes the following atomically:
        reg = m[lock-addr]
        m[lock-addr] = 1
The branch makes sure that if the lock was already taken we try again. A more general, but similar, instruction is swap:
        reg1 = m[lock-addr]
        m[lock-addr] = reg2

lock:   t&s  reg, lock-addr
        bnz  reg, lock
        ret
unlock: st   lock-addr, #0
        ret
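The test&set lock above can be sketched in C11, where atomic_exchange plays the role of the t&s instruction (it atomically returns the old value and stores 1). This is a minimal illustrative sketch, not the lecture's assembly; the type and function names are my own.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    /* atomic_exchange = test&set: if the old value was non-zero the
     * lock was already held, so spin, like the bnz loop above. */
    while (atomic_exchange(&l->locked, 1) != 0)
        ;
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);  /* an ordinary store releases the lock */
}
```

Note that each spin re-runs the atomic exchange, which (as the next slides discuss) generates write traffic even while the lock is held.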
5 Synchronization
We could implement test&set with two bus transactions, a read and a write, and lock down the bus for these two cycles to ensure the sequence is atomic. This is more difficult with a split-transaction bus, raising performance and deadlock issues. (Culler p. 391)
6 Synchronization
If we assume an invalidation-based cache coherence protocol with a write-back cache, a better approach is to issue a read-exclusive (BusRdX) transaction, then perform the read and write in the cache without giving up ownership. Any incoming requests for the block are buffered until the data is written in the cache, so any other processors are forced to wait.
7 Synchronization
Other common synchronization instructions: swap, fetch&op (e.g. fetch&inc, fetch&add), compare&swap. Many x86 instructions can be prefixed with the lock modifier to make them atomic. Is there a simpler general-purpose solution?
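As a sketch of how fetch&add is used as a primitive, here is a ticket lock built on C11's atomic_fetch_add, which atomically returns the old value and increments. The ticket-lock structure is a standard technique, not something from the slides, and the names are my own.

```c
#include <stdatomic.h>

/* A ticket lock: fetch&add hands out tickets; each waiter spins
 * until the "now serving" counter reaches its ticket. */
typedef struct {
    atomic_uint next;      /* next ticket to hand out */
    atomic_uint serving;   /* ticket currently being served */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *t) {
    unsigned my = atomic_fetch_add(&t->next, 1);  /* take a ticket */
    while (atomic_load(&t->serving) != my)
        ;                                         /* wait your turn */
}

static void ticket_release(ticket_lock_t *t) {
    atomic_fetch_add(&t->serving, 1);             /* serve the next ticket */
}
```

Unlike the plain test&set lock, the ticket lock grants the lock in FIFO order, which avoids starvation.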
8 Synchronization: LL/SC
Load-Linked (LL): reads memory, sets the lock flag and puts the address in the lock register. Intervening writes to the address in the lock register cause the lock flag to be reset.
Store-Conditional (SC): checks the lock flag to ensure no intervening conflicting write has occurred. If the lock flag is not set, SC fails.
        if (atomic_update) then mem[addr] = rt; rt = 1
        else rt = 0
9 Synchronization
        reg2 = 1
lock:   ll   reg1, lock-addr
        bnz  reg1, lock      ; lock already taken?
        sc   lock-addr, reg2
        beqz lock            ; if SC failed goto lock
        ret
unlock: st   lock-addr, #0
        ret
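C has no direct LL/SC, but the same retry loop can be sketched with compare-and-swap; on LL/SC machines (ARM, RISC-V) compilers lower atomic_compare_exchange loops to LL/SC pairs (ldrex/strex, lr/sc). A minimal sketch with my own function names:

```c
#include <stdatomic.h>

static void ll_sc_style_lock(atomic_int *lock_addr) {
    int expected;
    do {
        expected = 0;   /* like LL: we only proceed if we saw the lock free */
        /* like SC: store 1 only if nothing intervened; otherwise it
         * fails and we go around the loop again */
    } while (!atomic_compare_exchange_weak(lock_addr, &expected, 1));
}

static void ll_sc_style_unlock(atomic_int *lock_addr) {
    atomic_store(lock_addr, 0);
}
```

The weak form of compare-exchange is used deliberately: like SC, it is allowed to fail spuriously, which is harmless inside a retry loop.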
10 Synchronization
This SC will fail, as the lock flag will be reset by the store from P2. (Culler p. 391)
11 Synchronization
LL/SC can be implemented using the cache coherence protocol: LL loads the cache line with write permission (issues BusRdX and holds the line in state M); SC succeeds only if the cache line is still in state M, and fails otherwise. Implementations often come with caveats:
SC may experience spurious failures, e.g. due to context switches and TLB misses
Restrictions to prevent the cache line holding the lock variable from being replaced
Memory-referencing instructions may be disallowed between LL and SC
Out-of-order execution may be prohibited between LL and SC
12 Coherence misses
Remember your 3 C's!
Compulsory: cold-start or first-reference misses
Capacity: the cache is not large enough to store all the blocks needed during execution of the program
Conflict (or collision): misses that occur due to direct-mapped or set-associative block placement strategies
Coherence: misses that arise due to interprocessor communication
13 True sharing
A block typically contains many words (e.g. 4-8), and coherency is maintained at the granularity of cache blocks. True sharing misses arise from the communication of data: the first write to a shared block causes an invalidate, and a subsequent read of the block by another processor will also cause a miss. Both of these misses are classified as true sharing.
14 False sharing
A false sharing miss occurs when different processors are writing and reading different words in a block, but no communication is taking place. For example, a block may contain words X and Y; P1 repeatedly writes to X, P2 repeatedly writes to Y. The block will be repeatedly invalidated (leading to cache misses) even though no communication is taking place. These misses are false: they are due to the fact that the block contains multiple words, and would not occur if the block size were a single word.
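The X/Y scenario above can be sketched in C: in the first layout the two counters share a cache block, so writes by P1 and P2 ping-pong the block; padding each counter to its own block removes the false misses at the cost of memory. The 64-byte block size is an assumption for illustration.

```c
#include <stdalign.h>

/* Both words land in the same 64-byte cache block: P1 writing x and
 * P2 writing y repeatedly invalidate each other's copies. */
struct shared_bad {
    long x;   /* written by P1 */
    long y;   /* written by P2, same block as x: false sharing */
};

/* Each word is aligned to (and therefore starts) its own 64-byte
 * block, so writes to x never invalidate the block holding y. */
struct shared_good {
    alignas(64) long x;
    alignas(64) long y;
};
```

The same trick is why per-thread counters in real code are often padded to the cache-line size.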
15 Cache coherence and interconnects
Broadcast-based snoopy protocols (discussed in Seminar 4) rely on bus-based interconnects. Buses have limited scalability, and broadcasting has energy and bandwidth implications. On the other hand, they permit direct cache-to-cache transfers and hence low-latency communication in 2 hops (1. broadcast, 2. receive data from the remote cache), which is very useful for applications with lots of fine-grain sharing.
16 Cache coherence and interconnects
Totally-ordered interconnects deliver all messages to all destinations in the same order. They often employ a centralised arbiter or switch, e.g. a bus or a pipelined broadcast tree. Traditional snoopy protocols are built around the concept of a bus (or virtual bus): (1) broadcast, so all transactions are visible to all components connected to the bus; (2) the interconnect provides a total order of messages.
17 Cache coherence and interconnects
A pipelined broadcast tree is sufficiently similar to a bus to support traditional snooping protocols: the centralised switch guarantees a total ordering of messages. [Figure reproduced from Milo Martin's PhD thesis (Wisconsin)]
18 Cache coherence and interconnects
Unordered interconnects: networks (e.g. mesh, torus) typically can't provide strong ordering guarantees, i.e. nodes don't perceive transactions in a single global order.
Point-to-point ordering: networks may be able to guarantee that messages sent between a given pair of nodes are not reordered, e.g. a mesh with a single VC and XY routing.
19 Directory-based cache coherence
In a snoopy protocol, the state of the blocks in each cache is maintained by broadcasting all memory operations on the bus. We want to avoid the need to broadcast, so we maintain the state of each block explicitly and store this information in the directory. Requests to read or write a particular block are made to the appropriate directory entry, and the directory orchestrates the actions necessary to satisfy the request.
20 Directory-based cache coherence
The directory provides a per-block ordering point to resolve races: all requests for a particular block are made to the same directory entry, and the directory decides the order in which the requests will be satisfied. Directory protocols can therefore operate over unordered interconnects.
21 Broadcasting over unordered interconnects
A number of recent commercial designs broadcast transactions over unordered interconnects. They cannot use snoopy protocols directly, but they don't require directory state to be stored; they just provide an ordering point. The ordering point also blocks subsequent coherent requests to the same cache line, to prevent races with a request in progress. Examples: AMD's Hammer, Intel's E8870 Scalability Port, IBM's Power4 and xSeries Summit systems. Disadvantage: high bandwidth requirements.
22 Directory-based cache coherence
The directory keeps track of who has a copy of the block and their states. By maintaining a list of sharers, broadcasting is replaced by cheaper point-to-point communication. The number of invalidations on a write is typically small in real applications, giving a significant reduction in communication costs.
23 Directory-based cache coherence
Read miss to a block in modified state in a cache (Culler, Fig. 8.5). This is an example of a simple protocol, only meant to introduce the concept of a directory.
24 Directory-based cache coherence
Write miss to a block with two sharers.
25 Directory-based cache coherence
Let's consider the requester, directory and sharer state transitions for the previous slide.

Requester state:
I->P (1): the block is initially in the I(nvalid) state. The processor executes a store, so we make an ExclReq to the directory and move to a pending state.
P->E (4): we receive write permission and the data from the directory.

Directory state:
Shared->TransWaitForInvalidate (2): the block is initially marked as shared, and the directory holds a list of the sharers. The directory receives an ExclReq from cache 'id'; id is not in the sharers list and the sharers list is not empty, so it must send invalidate requests to all sharers and wait for their responses.
TransWaitForInvalidate->M (4): all invalidate acks are received, so the directory can reply to the requester and provide the data plus write permission. It moves to a state that records that the requester has the only copy.

Sharer state:
S->I (3): on receiving an InvReq, each sharer invalidates its copy of the block and moves to state I. It then acks with an InvRep message.
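The directory side of this transaction can be sketched in C. The message names (ExclReq, InvReq, InvRep) follow the slides; everything else (types, bit-vector, collapsing the transient state into one step) is a simplifying assumption of mine: a real controller would sit in TransWaitForInvalidate counting InvRep acks rather than completing atomically.

```c
#include <stdbool.h>

#define MAX_PROCS 8

typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;

typedef struct {
    dir_state_t state;
    bool sharer[MAX_PROCS];   /* full bit-vector of sharers */
} dir_entry_t;

/* Handle an ExclReq from 'requester'; returns the number of InvReq
 * messages sent (= invalidation size for this write). */
static int handle_excl_req(dir_entry_t *e, int requester) {
    int invals = 0;
    for (int p = 0; p < MAX_PROCS; p++) {
        if (e->sharer[p] && p != requester) {
            e->sharer[p] = false;   /* send InvReq to p; p replies InvRep */
            invals++;
        }
    }
    e->sharer[requester] = true;    /* requester now has the only copy */
    e->state = DIR_M;               /* block is Modified at the requester */
    return invals;
}
```

For the write miss with two sharers on the previous slide, this sketch sends exactly two invalidations before granting ownership.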
26 Directory-based cache coherence
We now have two types of controller, one at each directory and one at each private cache. The complete cache coherence protocol is specified as state diagrams for both controllers. The stable cache states are often MESI, as in a snoopy protocol. There are some complete example protocols available on the wiki (courtesy of Brian Gold). Exercise: try to understand how each of these protocols handles the situations described in slides 22 and 23.
27 Organising directory information
How do we know which directory to send our request to? How is the directory state actually stored?
28 Organising directory information
Directory schemes may be centralized or distributed, and are classified by how the source of directory information is found and how copies are located (Figure 8.7, reproduced from Culler's Parallel Computer Architecture book):
Flat, memory-based: information about all sharers is stored at the directory, e.g. using a full bit-vector organisation or a limited-pointer scheme
Flat, cache-based: information is distributed amongst the sharers, e.g. the sharers form a linked list (IEEE SCI, Sequent NUMA-Q)
Hierarchical: requests traverse up a tree to find a node with information on the block
29 Organising directory information
How do we store the sharers list in a flat, memory-based directory scheme?
Full bit-vector: P presence bits, indicating for each of the P processors whether it has a copy of the block
Limited-pointer schemes: maintain a fixed (and limited) number of pointers. Typically the number of sharers is small (4 pointers may often suffice), but we need a backup or overflow strategy: overflow to memory, resort to broadcast, or use a coarse-vector scheme (where each bit represents a group of processors)
Extract from duplicated L1 tags: query a local copy of the tags to find the sharers
[Culler p. 568]
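The storage trade-off between the full bit-vector and a limited-pointer scheme can be made concrete with a small sketch (my own helper functions; the +1 bit models a single overflow/broadcast flag, one common but not universal choice):

```c
/* Directory storage per block for a full bit-vector: one presence
 * bit per processor. */
static int bitvector_bits(int p) {
    return p;
}

/* Smallest number of bits needed to name one of p processors. */
static int ceil_log2(int p) {
    int bits = 0;
    while ((1 << bits) < p)
        bits++;
    return bits;
}

/* Directory storage per block for a limited-pointer scheme with k
 * pointers, plus one overflow/broadcast flag bit. */
static int limited_ptr_bits(int p, int k) {
    return k * ceil_log2(p) + 1;
}
```

For 256 processors, the bit-vector needs 256 bits per block while 4 pointers need only 4*8+1 = 33, which is why limited pointers win once the machine is large and sharer counts stay small.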
30 Organising directory information
Four examples of how we might store our directory information in a CMP:
1) Append state to the L2 tags
2) Duplicate the L1 tags at the directory
3) Store directory state in main memory and include a directory cache at each node
4) A hierarchical directory
Here the L2 is assumed to be the first shared cache. In a real system this could just as easily be the L3 or the interface to main memory: the directory is placed at the first shared level of the memory hierarchy regardless of the number of levels of cache.
31 Organising directory information
1. Append state to L2 tags. Perhaps conceptually the simplest scheme: assume a shared, banked, inclusive L2 cache. The location of the directory depends only on the block address, and the directory state can simply be appended to the L2 cache tags. [Figure reproduced from "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors", Zhang/Asanovic, ISCA'05]
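Because the directory location depends only on the block address, home-bank selection reduces to arithmetic on the physical address. A minimal sketch of a fine-grain (block-granularity) interleave; the 64-byte block size and 4 banks are assumptions for illustration:

```c
#include <stdint.h>

enum { BLOCK_BYTES = 64, NUM_BANKS = 4 };

/* Home L2 bank (and hence directory slice) for a physical address:
 * drop the block offset, then interleave block addresses across
 * the banks. */
static unsigned home_bank(uint64_t paddr) {
    uint64_t block = paddr / BLOCK_BYTES;   /* block address */
    return (unsigned)(block % NUM_BANKS);   /* fine-grain interleave */
}
```

Consecutive blocks land in consecutive banks, spreading directory traffic; a coarse-grain interleave would instead select the bank from higher address bits.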
32 Organising directory information
1. Append state to L2 tags. This may be expensive in terms of memory: the L2 may contain many more cache lines than can reside in the aggregated L1s (or, on a per-bank basis, than the L1 lines that can map to the L2 bank), so it may be unnecessarily power- and area-hungry. It also doesn't support non-inclusive L2 caches: it assumes the L2 is always caching anything in the L1s, which is problematic if the L2 is small in comparison to the aggregate L1 capacity.
33 Organising directory information
2. Duplicating L1 tags (a reverse-mapped directory CAM). At each directory (e.g. L2 bank), duplicate the L1 tags of those L1 lines that can map to the bank; we can interrogate the duplicated tags to determine the sharers list. At what granularity do we interleave addresses across banks for the directory and the L2 cache? It is simpler if we interleave the directory and L2 in the same way. What is the impact of the interleaving granularity on the directory?
34 Organising directory information
2. Duplicating L1 tags. In this example, precisely one quarter of the L1 lines map to each of the 4 L2 banks.
35 Organising directory information
2. Duplicating L1 tags. A fine-grain interleaving, as illustrated on the previous slide, means that only a subset of each L1's lines may map to a particular L2 bank. Each directory is then organised as s/n sets of n*a ways, where s is the number of L1 sets, a is the L1 associativity and n is the number of processors. If a coarse-grain interleaving is selected (where the L2 bank is selected from bits outside the L1's index bits), any L1 line could map to any L2 bank, so each directory must be organised as s sets of n*a ways.
36 Organising directory information
2. Duplicating L1 tags: example, Sun Niagara T1. The L1 caches are write-through with 16-byte lines (allocate on load, no-allocate on store). The L2 maintains the directory by duplicating the L1 tags; the L2 is banked and interleaved at a 64-byte granularity. The number of L1 lines that may map to each L2 bank is much less than the total number of L2 lines in a bank, so duplicating L1 tags saves area and power over adding directory state to each L2 tag.
37 Organising directory information
3. Directory caches. Directory state is stored in main memory and cached at each node. Note: the L2 caches are private in this example. [Figure reproduced from "Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures", Brown/Kumar/Tullsen, SPAA'07]
38 Organising directory information
3. Directory caches. Each tile and corresponding memory channel has access to a different range of physical memory locations, so there is only one possible home (the location of the associated directory) for each memory block. Two different directories never share directory state, so there are no coherence worries between directory caches. Directory information can be associated with multiple contiguous memory blocks to take advantage of spatial locality; we typically assign home nodes at page granularity using a first-touch policy.
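A first-touch, page-granularity home policy can be sketched as a small lookup table: the first node to touch a page becomes its home. The table, its size, and the 4 KiB page size are assumptions for illustration; in a real system this mapping lives in the OS page tables.

```c
#include <stdint.h>

enum { PAGE_BYTES = 4096, MAX_PAGES = 1024 };

static int home_of[MAX_PAGES];   /* -1 = page not yet touched */

static void init_homes(void) {
    for (int i = 0; i < MAX_PAGES; i++)
        home_of[i] = -1;
}

/* Return the home node for paddr; on the first touch of a page,
 * the touching node is recorded as its home. */
static int home_node(uint64_t paddr, int touching_node) {
    uint64_t page = paddr / PAGE_BYTES;
    if (home_of[page] == -1)
        home_of[page] = touching_node;   /* first touch assigns the home */
    return home_of[page];
}
```

First-touch tends to place directory state (and memory) near the node that uses the page most, at the cost of poor placement when initialisation is done by a single thread.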
39 Organising directory information
4. A hierarchical directory. [Figure reproduced from "A consistency architecture for hierarchical shared caches", Ladan-Mozes/Leiserson, SPAA'08]
40 Organising directory information
4. A hierarchical directory, aimed at processors with a large number of cores. The black dots indicate where a particular block may be cached or stored in memory; there is only one place as we move up each level of the tree. Example: if an L3 cache holds write permission for a block (holds the block in state M), it can manage the line in its subtree as if it were main memory, with no need to tell its parent. See the paper for details (and proofs!), and also the Fractal Coherence paper from MICRO'10.
41 Organising directory information
Less extreme hierarchical schemes are common, where larger-scale machines exploit bus-based first-level coherence (commodity hardware) and a directory protocol at the second level. In such schemes a bridge between the two protocols monitors activity on the bus and intervenes when necessary to ensure coherence actions are handled at the second level: removing the transaction from the bus, completing the coherence actions at the second level, and then replaying the request on the bus.
42 Sharing patterns
Invalidation frequency: how many writes might require invalidating other copies (invalidating writes)? i.e. writes where the local private cache does not already hold the block in state M.
Invalidation size distribution: what is the distribution of the number of invalidations (sharers) required by these writes?
43 Sharing patterns
[Histograms of the invalidation patterns for the Barnes-Hut and Radiosity benchmarks: the number of invalidations per invalidating write, binned from 0 up to 63. See Culler p. 574 for more.]
44 Sharing patterns
Read-only: no invalidating writes.
Producer-consumer: a processor writes, then one or more processors read the data, the processor writes again, the data is read again, and so on. The invalidation size is often 1, all, or a few.
This categorisation is originally from "Cache invalidation patterns in shared memory multiprocessors", Gupta/Weber; see also Culler Section 8.3.
45 Sharing patterns
Migratory: data migrates from one processor to another, often being read as well as written along the way. The invalidation size is 1: only the previous writer has a copy (it invalidated the copy before that).
Irregular read-write: irregular/unpredictable read/write access patterns; the invalidation size is normally concentrated at the small end of the spectrum.
46 Protocol optimisations
Goals? Performance, power, complexity and area! We aim to lower the average memory access time. If we look at the protocol in isolation, the typical approach is to: (1) reduce the number of network transactions, and (2) reduce the number of transactions on the critical path of the processor. (Culler Section 8.4.1)
47 Protocol optimisations
Let's look again at the simple protocol we introduced in slides 22/23. In the case of a read miss to a block in modified state in another cache, we required 5 transactions in total, 4 of them on the critical path. Let's look at forwarding as a protocol optimisation. An intervention here is just like a request, but issued to a cache in reaction to a request.
48 Directory-based cache coherence
Read miss to a block in modified state in a cache (Culler, Fig. 8.5).
49 Directory-based cache coherence
Three ways of handling the miss, with L = local requester, H = home/directory, R = remote owner (Culler p. 586):
(a) Strict request-reply: 1: request (L to H); 2: reply (H to L); 3: intervention (L to R); 4a: revise (R to H); 4b: response (R to L)
(b) Intervention forwarding: 1: request (L to H); 2: intervention (H to R); 3: response (R to H); 4: reply (H to L)
(c) Reply forwarding: 1: request (L to H); 2: intervention (H to R); 3a: revise (R to H); 3b: response (R to L)
50 Protocol optimisations
Other possible improvements:
Optimise the protocol for common sharing patterns, e.g. producer-consumer and migratory
Exploit a particular network topology or hierarchical directory structure, perhaps with multiple networks tuned to different types of traffic
Exploit locality (in a physical sense): obtain the required data using a cache-to-cache transfer from the nearest sharer or an immediate neighbour
Perform speculative transactions to accelerate the acquisition of permissions or data
Compiler assistance...
51 Correctness
Directory protocols can quickly become very complicated. Timeouts, retries and negative acknowledgements have all been used in different protocols to avoid deadlock and livelock (and to guarantee forward progress).
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 18: Directory-Based Cache Protocols John Wawrzynek EECS, University of California at Berkeley http://inst.eecs.berkeley.edu/~cs152 Administrivia 2 Recap:
More informationOverview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware
Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and
More informationOverview: Shared Memory Hardware
Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing
More informationCache Coherence in Scalable Machines
Cache Coherence in Scalable Machines COE 502 arallel rocessing Architectures rof. Muhamed Mudawar Computer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor
More informationScalable Multiprocessors
Scalable Multiprocessors [ 11.1] scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. s the size of the system
More informationCache Coherence: Part II Scalable Approaches
ache oherence: art II Scalable pproaches Hierarchical ache oherence Todd. Mowry S 74 October 27, 2 (a) 1 2 1 2 (b) 1 Topics Hierarchies Directory rotocols Hierarchies arise in different ways: (a) processor
More informationScalable Cache Coherent Systems
NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication
More informationLecture 5: Directory Protocols. Topics: directory-based cache coherence implementations
Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes
More informationSpecial Topics. Module 14: "Directory-based Cache Coherence" Lecture 33: "SCI Protocol" Directory-based Cache Coherence: Sequent NUMA-Q.
Directory-based Cache Coherence: Special Topics Sequent NUMA-Q SCI protocol Directory overhead Cache overhead Handling read miss Handling write miss Handling writebacks Roll-out protocol Snoop interaction
More informationLecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols
Lecture 3: Directory Protocol Implementations Topics: coherence vs. msg-passing, corner cases in directory protocols 1 Future Scalable Designs Intel s Single Cloud Computer (SCC): an example prototype
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core
More informationParallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence
Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly
More informationComputer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>
Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: This is a closed book, closed notes exam. 80 Minutes 19 pages Notes: Not all questions
More informationLecture: Consistency Models, TM
Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) No class on Monday (please watch TM videos) Wednesday: TM wrap-up, interconnection networks 1 Coherence Vs. Consistency
More informationFoundations of Computer Systems
18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationChapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST
Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will
More informationA Basic Snooping-Based Multi-Processor Implementation
Lecture 15: A Basic Snooping-Based Multi-Processor Implementation Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Pushing On (Oliver $ & Jimi Jules) Time for the second
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationSHARED-MEMORY COMMUNICATION
SHARED-MEMORY COMMUNICATION IMPLICITELY VIA MEMORY PROCESSORS SHARE SOME MEMORY COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES NEED TO SYNCHRONIZE NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES
More informationLecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM
Lecture 4: Directory Protocols and TM Topics: corner cases in directory protocols, lazy TM 1 Handling Reads When the home receives a read request, it looks up memory (speculative read) and directory in
More informationParallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University
18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution
More informationMultiprocessors and Locking
Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access
More informationLect. 6: Directory Coherence Protocol
Lect. 6: Directory Coherence Protocol Snooping coherence Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere All cache controllers monitor
More informationModule 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:
The Lecture Contains: Shared Memory Multiprocessors Shared Cache Private Cache/Dancehall Distributed Shared Memory Shared vs. Private in CMPs Cache Coherence Cache Coherence: Example What Went Wrong? Implementations
More informationCS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols
CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationCMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today
More informationNOW Handout Page 1. Context for Scalable Cache Coherence. Cache Coherence in Scalable Machines. A Cache Coherent System Must:
ontext for Scalable ache oherence ache oherence in Scalable Machines Realizing gm Models through net transaction protocols - efficient node-to-net interface - interprets transactions Switch Scalable network
More informationEN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors
EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationLecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations
Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking,
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationChapter 5. Thread-Level Parallelism
Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated
More informationShared Memory Multiprocessors
Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationCache Coherence. Todd C. Mowry CS 740 November 10, Topics. The Cache Coherence Problem Snoopy Protocols Directory Protocols
Cache Coherence Todd C. Mowry CS 740 November 10, 1998 Topics The Cache Coherence roblem Snoopy rotocols Directory rotocols The Cache Coherence roblem Caches are critical to modern high-speed processors
More informationCMSC 611: Advanced. Distributed & Shared Memory
CMSC 611: Advanced Computer Architecture Distributed & Shared Memory Centralized Shared Memory MIMD Processors share a single centralized memory through a bus interconnect Feasible for small processor
More informationRole of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout
CS 28 Parallel Computer Architecture Lecture 23 Hardware-Software Trade-offs in Synchronization and Data Layout April 21, 2008 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs28 Role of
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationModule 9: "Introduction to Shared Memory Multiprocessors" Lecture 16: "Multiprocessor Organizations and Cache Coherence" Shared Memory Multiprocessors
Shared Memory Multiprocessors Shared memory multiprocessors Shared cache Private cache/dancehall Distributed shared memory Shared vs. private in CMPs Cache coherence Cache coherence: Example What went
More informationCache Coherence and Atomic Operations in Hardware
Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some
More informationCS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols
CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Protocols Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory http://inst.eecs.berkeley.edu/~cs152
More informationSpeculative Locks. Dept. of Computer Science
Speculative Locks José éf. Martínez and djosep Torrellas Dept. of Computer Science University it of Illinois i at Urbana-Champaign Motivation Lock granularity a trade-off: Fine grain greater concurrency
More informationA Scalable SAS Machine
arallel omputer Organization and Design : Lecture 8 er Stenström. 2008, Sally. ckee 2009 Scalable ache oherence Design principles of scalable cache protocols Overview of design space (8.1) Basic operation
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationRecall: Sequential Consistency Example. Implications for Implementation. Issues for Directory Protocols
ecall: Sequential onsistency Example S252 Graduate omputer rchitecture Lecture 21 pril 14 th, 2010 Distributed Shared ory rof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs252 rocessor 1 rocessor
More informationConsistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt
Consistency & Coherence 4/14/2016 Sec5on 12 Colin Schmidt Agenda Brief mo5va5on Consistency vs Coherence Synchroniza5on Fences Mutexs, locks, semaphores Hardware Coherence Snoopy MSI, MESI Power, Frequency,
More informationLecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)
Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) 1 Coherence Vs. Consistency Recall that coherence guarantees (i) that a write will eventually be seen by other processors,
More informationSynchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based
More informationM4 Parallelism. Implementation of Locks Cache Coherence
M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory
More informationWhy memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho
Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide
More informationLecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )
Lecture: Coherence, Synchronization Topics: directory-based coherence, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory) keeps track
More informationInterconnect Routing
Interconnect Routing store-and-forward routing switch buffers entire message before passing it on latency = [(message length / bandwidth) + fixed overhead] * # hops wormhole routing pipeline message through
More information