Performance study example (5.3)

Laurence Craig
5 years ago


Coherence misses:
- True sharing misses
  - Write to a shared block
  - Read an invalid block
- False sharing misses
  - Read an unmodified word in an invalidated block

[Figure: CPI for commercial benchmarks.]

[Figure: false sharing. Two caches hold the block (a, b, c, d). One processor writes a, so its copy becomes Modified and the other copy becomes Invalid; the other processor's later read of b then misses.]

Performance study example

How do you handle coherence if you do not have a shared bus?
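The distinction between the two kinds of coherence miss can be sketched in a few lines. This is only an illustration: the helper name `classify_miss` and the four-word block (a, b, c, d) are assumptions made for the example, not something defined in the slides.

```python
def classify_miss(written_words, accessed_word):
    """Classify a coherence miss on a block invalidated by a remote write.

    written_words: words of the block that the remote processor modified
    accessed_word: the word the local processor touches next
    (Both parameters are hypothetical names used only in this sketch.)
    """
    if accessed_word in written_words:
        # The local access needs a value that was really communicated.
        return "true sharing"
    # Only the block was shared; the accessed word itself was unmodified.
    return "false sharing"


# A remote processor writes a, which invalidates the local copy of (a, b, c, d).
print(classify_miss({"a"}, "b"))  # false sharing
print(classify_miss({"a"}, "a"))  # true sharing
```

A false sharing miss would disappear if the block held a single word, which is why larger blocks tend to increase it.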

Sample Machines

Intel Pentium Pro Quad
- Coherent
- 4 processors
- P-Pro bus (64-bit data, 36-bit address, 66 MHz)
[Figure: P-Pro modules (CPU, interrupt controller, bus interface, 256-KB L2) on the P-Pro bus, with PCI bridges to PCI buses and PCI I/O cards, and a memory controller with MIU to 1-, 2-, or 4-way interleaved DRAM.]

Sun Enterprise server
- Coherent
- Up to 16 processor and/or memory-I/O cards
- Gigaplane bus (256 data, 41 address, 83 MHz)
[Figure: CPU/mem cards (memory controller, bus interface/switch) and I/O cards (bus interface, SBUS, 100bT, SCSI, 2 FiberChannel) on the Gigaplane bus.]

Directory-based Coherence (5.4)

Idea: Implement a directory that keeps track of where each copy of a block is cached and its state in each cache (note that with snooping, the state of a block was kept only in the cache).
- Processors must consult the directory before caching blocks from memory.
- If a block is exclusive, then its owner should provide the most up-to-date copy.
- When a block in memory is updated (written), the directory is consulted to either update or invalidate other cached copies.
- Eliminates the overhead of broadcasting/snooping (bus bandwidth). Hence, it scales to numbers of processors that would saturate a single bus.
- Slower in terms of latency?

[Figure: processors P1, P2, ..., Pn connected through a network/bus to a shared space (memory, ...).]


Directory-based Coherence

The memory and the directory can be centralized
[Figure: processors P0, P1, ..., Pn connected by a network to a shared memory built from Mem/Dir modules.]

Or distributed
[Figure: nodes P0, P1, ..., Pn, each with its own Mem and Dir, connected by a network; together the memories form the shared memory.]

Alternatively, the memory may be distributed but the directory can be centralized. Or the memory may be centralized but the directory can be distributed (as we will discuss in the case of CMPs with private caches).

Distributed directory-based coherence

- The location (home) of each memory block is determined by its address.
- A controller decides if an access is Local or Remote.
- As in snooping caches, the state of every block in every cache is tracked in that cache (exclusive/dirty, shared/clean, invalid) to avoid the need for write-through and unnecessary write-backs.
- In addition, with each block in memory, a directory entry keeps track of where the block is cached. Accordingly, a block can be in one of the following states:
  - Uncached: no processor has it (not valid in any cache)
  - Shared/clean: cached in one or more processors and memory is up-to-date
  - Exclusive/modified/dirty: one processor (owner) has the data; memory is out-of-date
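The three directory states listed above could be represented roughly as follows. This is a minimal sketch: the names `DirState`, `DirEntry`, and `is_consistent` are hypothetical, and a real directory would store the sharer set as hardware bits rather than a Python set.

```python
from dataclasses import dataclass, field
from enum import Enum


class DirState(Enum):
    UNCACHED = "uncached"    # no processor has the block
    SHARED = "shared"        # one or more clean copies; memory is up-to-date
    EXCLUSIVE = "exclusive"  # one owner holds the data; memory is out-of-date


@dataclass
class DirEntry:
    """Directory entry kept with one memory block at its home node."""
    state: DirState = DirState.UNCACHED
    sharers: set = field(default_factory=set)  # ids of nodes caching the block

    def is_consistent(self):
        """Check that the sharer set matches the recorded state."""
        if self.state is DirState.UNCACHED:
            return len(self.sharers) == 0
        if self.state is DirState.EXCLUSIVE:
            return len(self.sharers) == 1
        return len(self.sharers) >= 1


entry = DirEntry(DirState.SHARED, {1, 3})
print(entry.is_consistent())  # True
```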

Enforcing coherence

Coherence is enforced by exchanging messages between nodes. Three types of nodes may be involved:
- Local requestor node (L): the node that reads or writes the cache block
- Home node (H): the node that stores the block (and its directory entry) in its memory; may be the same as L
- Remote nodes (R): other nodes that have a cached copy of the requested block

When L encounters a Read Hit, it just reads the data.
When L encounters a Read Miss, it sends a message to the home node, H, of the requested block. Three cases may arise:
- The directory indicates that the block is not cached
- The directory indicates that the block is shared/clean
- The directory indicates that the block is exclusive/modified

What happens on a read miss? (when block is invalid in local cache)

(a) Read miss (if block is shared or uncached)
- L sends a request to H
- H sends the block to L
- state of the block is shared in the directory
- state of the block is shared in L
[Figure: 1: L sends request to home node H; 2: H returns data.]

(b) Read miss (if block is exclusive in another cache)
- L sends a request to H
- H informs L about the block owner, R
- L requests the block from R
- R sends the block to L
- L and R set the state of the block to shared
- R informs H that it should change the state of the block to shared
[Figure: 1: L sends request to home node H; 2: H returns owner; 3: L sends request to owner R; 4: R returns data to L and tells H to revise its entry.]
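The two read-miss cases can be traced with a small message log. This is a sketch under assumptions: the function name `read_miss`, the dictionary layout of the directory entry, and the single-letter state codes 'U', 'S', 'E' are all made up for the example.

```python
def read_miss(entry, requester):
    """Return the messages exchanged on a read miss and update the entry.

    entry: dict with 'state' in {'U', 'S', 'E'} and 'sharers' (set of node ids);
    this layout is a hypothetical stand-in for a real directory entry.
    """
    msgs = [("L", "H", "read request")]
    if entry["state"] in ("U", "S"):
        # Case (a): memory is up-to-date, so the home node supplies the block.
        msgs.append(("H", "L", "data"))
    else:
        # Case (b): one owner R holds the only valid copy.
        msgs.append(("H", "L", "owner id"))
        msgs.append(("L", "R", "read request"))
        msgs.append(("R", "L", "data"))
        msgs.append(("R", "H", "revise entry to shared"))
    entry["state"] = "S"
    entry["sharers"].add(requester)  # the previous owner stays on as a sharer
    return msgs


entry = {"state": "E", "sharers": {2}}
for src, dst, what in read_miss(entry, requester=1):
    print(f"{src} -> {dst}: {what}")
```

In this sketch case (a) costs two messages and case (b) costs five, which is one way to see why a miss to a block that is exclusive elsewhere is the slow path.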

What happens on a write miss? (when block is invalid in local cache)

(a) Write miss to an uncached block
- similar to a read miss to an uncached block, except that the state of the block is set to exclusive

(b) Write miss to a block that is exclusive in another cache
- similar to a read miss to an exclusive block, except that the state of the block is set to exclusive in H and L and to invalid in R

(c) Write miss to a shared block
- L sends a request to H
- H sets the state to exclusive
- H sends the block to L
- H sends to L the list of other sharers
- L sets the block's state to exclusive
- L sends invalidating messages to each sharer (R)
- Each R sets the block's state to invalid
[Figure: 1: L sends request to home node H; 2: H returns sharers and data; 3: L sends Invalidate to each R; 4: each R acks; 5: H revises its entry.]

What happens on a write hit? (when block is shared or exclusive in local cache)

(a) If the block is exclusive in L, just write the data

(b) If the block is shared in L
- L sends a request to H to have the block as exclusive
- H sets the state to exclusive
- H informs L of the block's other sharers
- L sets the block's state to exclusive
- L sends invalidating messages to each sharer (R)
- Each R sets the block's state to invalid
[Figure: 1: L sends request to home node H; 2: H returns sharers and data; 3: L sends Invalidate to each R; 4: each R acks; 5: H revises its entry.]

A degree of complexity that we will ignore: we need a busy state to handle simultaneous requests to the same block. For example, if there are two writes to the same block, they have to be serialized. Reason: the order of events depends on the order of messages, which is non-deterministic.
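For a write to a shared block, the number of messages grows with the number of other sharers. The sketch below makes that explicit; `write_to_shared` and the dictionary-based directory entry are hypothetical names, and the busy state mentioned above is deliberately left out.

```python
def write_to_shared(entry, requester):
    """Messages for a write by `requester` to a block the directory lists as shared.

    Covers both the write miss (requester is not a sharer) and the write hit
    (requester is one of the sharers). Simultaneous requests are not handled.
    """
    others = sorted(entry["sharers"] - {requester})
    msgs = [
        ("L", "H", "write request"),
        ("H", "L", "sharers list (plus data on a miss)"),
    ]
    for r in others:
        msgs.append(("L", f"R{r}", "invalidate"))
        msgs.append((f"R{r}", "L", "ack"))
    # The directory now records a single exclusive owner.
    entry["state"] = "E"
    entry["sharers"] = {requester}
    return msgs


entry = {"state": "S", "sharers": {1, 2, 3}}
print(len(write_to_shared(entry, requester=1)))  # 6: request, reply, 2 invalidates, 2 acks
```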

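The coherence protocol at a node's cache controller
[Figure: state-transition diagram for the cache controller.]

The coherence protocol (directory response to a coherence message)
[Figure: state-transition diagram for the directory.]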

MSI Directory-based coherence - example

Case 1: X is in the uncached (U) state in the home directory
[Figure: nodes Pi, Pj, Pk. Pi is the home of X with directory state U; X is not cached anywhere.]
Possible scenario:
- Pj reads X
- Then Pj writes to X

MSI Directory-based coherence - example

Case 2: X is exclusive (E) in the home directory and owned by Pj (dirty, d, in Pj)
[Figure: nodes Pi, Pj, Pk. Pi is the home of X with directory state E{j}; X is cached dirty (d) in Pj.]
Trace the state of X if:
- Pk reads X

MSI Directory-based coherence - example

Case 3: X is exclusive (E) in the home directory and owned by Pj (dirty, d, in Pj)
[Figure: nodes Pi, Pj, Pk. Pi is the home of X with directory state E{j}; X is cached dirty (d) in Pj.]
Trace the state of X if:
- Pk writes to X

MSI Directory-based coherence - example

Case 4: X is shared (S) in the home directory and clean (c) in Pj and Pk
[Figure: nodes Pi, Pj, Pk. Pi is the home of X with directory state S{j,k}; X is cached clean (c) in Pj and Pk.]
Trace the state of X if:
- Pj reads X
- Then Pk writes into X
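One way to check a hand trace of Cases 1 to 4 is a tiny state model of the directory entry for X. This is a sketch only: the class name `Directory`, its methods, and the `S{j,k}` snapshot notation are assumptions modeled on the slide figures, and messages and busy states are not modeled.

```python
class Directory:
    """Directory entry for one block X plus the per-cache state of X."""

    def __init__(self, state="U", caches=None):
        self.state = state                # 'U', 'S', or 'E' at the home node
        self.caches = dict(caches or {})  # node -> 'c' (clean) or 'd' (dirty)

    def read(self, node):
        if node in self.caches:           # read hit: nothing changes
            return
        if self.state == "E":             # the owner supplies the data and
            for owner in self.caches:     # keeps a clean, shared copy
                self.caches[owner] = "c"
        self.caches[node] = "c"
        self.state = "S"

    def write(self, node):
        if self.caches.get(node) == "d":  # write hit on an exclusive block
            return
        self.caches = {node: "d"}         # every other copy is invalidated
        self.state = "E"

    def snapshot(self):
        holders = ",".join(sorted(self.caches))
        return f"{self.state}{{{holders}}}" if holders else self.state


# Case 1: X starts uncached; j reads X, then j writes X.
x = Directory()
x.read("j")
print(x.snapshot())   # S{j}
x.write("j")
print(x.snapshot())   # E{j}

# Case 2: X is E{j}; k reads X.
x = Directory("E", {"j": "d"})
x.read("k")
print(x.snapshot())   # S{j,k}

# Case 4: X is S{j,k}; j reads X (a hit), then k writes X.
x = Directory("S", {"j": "c", "k": "c"})
x.read("j")
x.write("k")
print(x.snapshot())   # E{k}
```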

The MESI protocol

As described earlier, in MSI, a cache block can be in one of three states:
- Invalid (uncached): not in the cache (not valid in any cache)
- Shared/clean: cached in one or more processors and memory is up-to-date
- Modified/dirty/exclusive: one processor (owner) has the data; memory is out-of-date

The MESI protocol divides the Exclusive state into two states, giving four states in total:
- Invalid (uncached): same as in MSI
- Shared: cached in more than one processor and memory is up-to-date
- Exclusive: one processor (owner) has the data and it is clean
- Modified: one processor (owner) has the data, but it is dirty

If MESI is implemented using a directory, then the information kept for each block in the directory is the same as in the three-state protocol:
- Shared in MESI = shared/clean but more than one sharer
- Exclusive in MESI = shared/clean but only one sharer
- Modified in MESI = exclusive/modified/dirty
However, at each cached copy, a distinction is made between shared, exclusive and modified (rather than only shared and modified).

The MESI protocol

- On a read miss (local block is invalid), load the block and change its state to
  - exclusive if it was uncached in memory
  - shared if it was already shared, modified or exclusive
    - if it was modified, the owner will send you a clean copy
    - if it was modified or exclusive, the previous owner will change the state of the block to shared in its cache
- On a write miss: same as a read miss, except
  - set the state to modified
  - copies in other caches (if any) are invalidated
- On a write hit to a modified block, do nothing.
- On a write hit to an exclusive block, change the block to modified. There is no need for invalidation; this is the main advantage of MESI over MSI.
- On a write hit to a shared block, change the block to modified and invalidate the other cached copies.
- When a modified block is evicted, write it back.
- In snooping bus implementations of MESI, on a read miss, we need to determine whether the block is in some other cache(s) in order to set its state correctly to shared or exclusive.
- To take full advantage of MESI, we should know when a block is to be changed from shared to exclusive.
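The per-cache MESI rules listed above fit in one small transition function. This is a sketch, not a full protocol: `mesi_next` and its `other_copies` flag are hypothetical names, and evictions and remote-initiated transitions are omitted.

```python
def mesi_next(state, event, other_copies=False):
    """Return (new_state, invalidate_others) for one cache's copy of a block.

    state: 'M', 'E', 'S', or 'I'; event: 'read' or 'write'.
    other_copies: whether another cache holds the block (matters on a miss).
    """
    if event not in ("read", "write"):
        raise ValueError(f"unknown event: {event}")
    if event == "read":
        if state == "I":
            # Read miss: exclusive only if nobody else caches the block.
            return ("S" if other_copies else "E", False)
        return (state, False)           # read hit: no change
    if state in ("M", "E"):
        return ("M", False)             # write hit: no invalidation needed
    if state == "S":
        return ("M", True)              # write hit: invalidate other copies
    return ("M", other_copies)          # write miss: invalidate copies, if any


print(mesi_next("I", "read"))   # ('E', False)
print(mesi_next("E", "write"))  # ('M', False)
print(mesi_next("S", "write"))  # ('M', True)
```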

The MESI protocol

If MESI is implemented as a snooping protocol, then the main advantage over the three-state protocol appears when a read to an uncached block is followed by a write to that block.
- After the uncached block is read, it is marked exclusive.
- Note that, when writing to a shared block, the transaction has to be posted on the bus so that other sharers invalidate their copies.
- But when writing to an exclusive block, there is no need to post the transaction on the bus.
- Hence, by distinguishing between shared and exclusive states, we can avoid bus transactions when writing to an exclusive block.
- However, a cache that has an exclusive block now has to monitor the bus for any read to that block. Such a read will change the state to shared.

This advantage disappears in a directory protocol, since after a write to an exclusive block, the directory has to be notified to change the state to modified.

Latency optimization

1) Forwarding requests
[Figure: three message-flow diagrams among L, H, and the owner, with numbered steps labeled req, reply, forward, respond, and revise.]
2) Use SRAM for directories (hardware optimization)
3) Overlap activities on the critical path
- multiple invalidations in parallel
- parallel lookup of directory and memory at the home node
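The snooping-bus saving described above can be counted directly. The sketch below assumes a single cache accessing a block that no other cache holds; `bus_transactions` is a hypothetical helper, not part of either protocol's definition.

```python
def bus_transactions(protocol, accesses):
    """Count bus transactions one cache issues for a block nobody else caches.

    protocol: 'MSI' or 'MESI'; accesses: sequence of 'read' / 'write'.
    """
    state, count = "I", 0
    for op in accesses:
        if state == "I":
            count += 1                  # any miss goes on the bus
            if op == "write":
                state = "M"
            elif protocol == "MESI":
                state = "E"             # no other copy, so exclusive
            else:
                state = "S"             # MSI has no exclusive-clean state
        elif op == "write":
            if state == "S":
                count += 1              # invalidation must be posted on the bus
            state = "M"                 # E -> M happens silently under MESI
    return count


print(bus_transactions("MSI", ["read", "write"]))   # 2
print(bus_transactions("MESI", ["read", "write"]))  # 1
```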

Storage overhead

- In the simplest representation of a directory entry, a full bit vector is used for each entry (one bit indicates presence in each node).
  - The storage overhead doesn't scale well with the number of nodes.
  - Larger blocks (cache lines) mean lower overhead.
- For a very large number of nodes, may use a list of sharers instead of a bit vector.
  - Lower overhead if there are only a few sharers
  - Example: for 1024 processors, overhead is reduced if there are fewer than 100 sharers
- May reduce overhead further by keeping directory entries only for the blocks that are cached (uncached blocks do not need an entry).
  - Can keep the directory entries for the cached blocks in a hash table (associative cache structure)
  - Should invalidate cached copies when the directory entry is removed (evicted) from the hash table

Cache-based Directory Schemes

[Figure: block x in memory and copies of x in several caches, linked together.]
- Keep the information about the sharers of a cached block in the caches, by linking the replicated cached entries in a linked list, rather than storing a list of sharers with the block in main memory.
- When a processor caches a block, it inserts itself at the front of the linked list.
- To invalidate a cache block in the other caches, follow the linked list (easier with a doubly linked list).
- Scalable Coherent Interface (SCI) IEEE Standard
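The bit-vector versus sharer-list trade-off comes down to simple arithmetic. In the sketch below, the function names are hypothetical and the 64-byte block size is an assumption chosen for illustration; each sharer pointer is assumed to take just enough bits to name one node.

```python
def bit_vector_bits(nodes):
    """Full bit vector: one presence bit per node."""
    return nodes


def sharer_list_bits(nodes, sharers):
    """List of sharers: one node id (pointer) per sharer."""
    pointer_bits = max(1, (nodes - 1).bit_length())  # 10 bits for 1024 nodes
    return sharers * pointer_bits


def overhead_fraction(nodes, block_bytes=64):
    """Bit-vector directory bits per data bit (block size is an assumption)."""
    return bit_vector_bits(nodes) / (block_bytes * 8)


print(bit_vector_bits(1024))        # 1024
print(sharer_list_bits(1024, 100))  # 1000
print(overhead_fraction(1024))      # 2.0
print(overhead_fraction(64))        # 0.125
```

With 1024 nodes a sharer id takes 10 bits, so the list stays below the 1024-bit vector up to about 100 sharers, which matches the example on the slide. The last two lines show why the full vector does not scale: with 64-byte blocks, the directory for 1024 nodes is twice the size of the data it tracks.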

Hierarchical approaches to coherence

Multi-level: especially useful for multi-node systems, when each node is a multiprocessor (example: multiple SMPs).

Examples of two-level systems:
[Figure: two-level organizations with panels labeled snooping-directory, directory-directory, and directory-snooping; labels include buses (B1), main memory with directory, network adapters, Network1, Network 2, and Bus.]

Cache organization in multicore systems

[Figure: two organizations side by side, each with a memory controller and a memory system; one also shows a system interconnect.]

Shared L2 systems
- Examples: Intel Core Duo Pentium
- Uses the MESI (Modified, Exclusive, Shared, Invalid) cache coherence protocol to keep the data coherent

Private L2 systems
- Examples: AMD Dual Core Opteron
- Uses the MOESI (M + Owned + ESI) cache coherence protocol to keep the data coherent
- ( is inclusive to )

Example of distributed directories in CMPs

[Figure: cores P0, P1, ..., Pn, each with a directory (Dir), over a distributed shared L2 cache, connected by an on-chip network to off-chip memory (or on-chip L3).]
- Directories are used to keep track of the state of shared entities that are cached in multiple private caches.
- If the L2 modules form a shared L2 cache space, then the directories perform a role very similar to their roles in distributed shared memory systems.
  - Preserve coherence in the private L1 caches
  - One directory entry for each entry in L2
  - The location of a cache line in L2 is determined by the address of the cache entry

Example of distributed directories in CMPs

[Figure: left, tiles with per-tile directories (dir) on an on-chip network above shared memory; right, tiles on an on-chip network with a single directory in front of shared memory.]
- If each L2 module is private to the corresponding core, then on-chip directories may be used as a replacement for a centralized directory.
- Each cache block is associated with a directory entry.
- Only cache blocks that are on chip need to have directory entries.
- How do you organize and distribute the directory entries among tiles? The location of a directory entry (called its home) is determined by the address.
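The slides say only that the home of a directory entry is determined by the address. One common choice, shown here purely as an assumption, is to interleave consecutive blocks across tiles; the function name `home_tile` and the 64-byte block size are hypothetical.

```python
def home_tile(address, num_tiles, block_bytes=64):
    """Map a physical address to the tile holding its directory entry.

    Assumes block-interleaved placement: consecutive blocks go to consecutive
    tiles. Real chips may hash the address bits differently.
    """
    block_number = address // block_bytes
    return block_number % num_tiles


# Every address within one block shares a home; the next block moves one tile over.
print(home_tile(0, num_tiles=36))   # 0
print(home_tile(63, num_tiles=36))  # 0
print(home_tile(64, num_tiles=36))  # 1
```

Because the mapping is a pure function of the address, any tile can compute a block's home locally without consulting a central table.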

The Tilera TILE-Gx36 Architecture

- 36 processor cores
- 866M, 1.2GHz, 1.5GHz clk
- 12 MBytes total cache
- 40 Gbps total packet I/O
  - 4 ports 10GbE (XAUI)
  - 16 ports 1GbE ()
- 48 Gbps PCIe I/O
- 2 16Gbps Stream IO ports
- Wire-speed packet engine: 60Mpps
- MiCA engine: 20 Gbps crypto, compress & decompress
[Figure: chip block diagram with MiCA, UART x2, USB x2, JTAG, I2C, SPI, three PCIe lanes, flexible I/O, two DDR3 memory controllers, mPIPE, and four XAUI ports.]

TILE-Gx100: Complete System-on-a-Chip with bit cores

- 1.2GHz, 1.5GHz
- 32 MBytes total cache
- 546 Gbps peak mem BW
- 200 Tbps iMesh BW
- Gbps packet I/O
  - 8 ports XAUI / 2 XAUI
  - 2 40Gb Interlaken
  - 32 ports 1GbE ()
- 80 Gbps PCIe I/O
- 3 StreamIO ports (20Gb)
- Wire-speed packet engine: 120Mpps
- MiCA engines: 40 Gbps crypto, compress & decompress
[Figure: chip block diagram with two MiCA engines, UART x2, USB x2, JTAG, I2C, SPI, three PCIe lanes, flexible I/O, four DDR3 memory controllers, mPIPE, two Interlaken ports, and eight XAUI ports.]


The Tilera core

Processor
- Each core is a complete computer
- 3-way VLIW CPU
- Protection and interrupts

Memory
- L1 cache and L2 cache
- Virtual and physical address space
- Instruction and data TLBs
- Cache-integrated 2D DMA engine

- Runs SMP Linux
- Runs off-the-shelf C/C++ programs
- Signal processing and general apps

[Figure: one core, showing the register file, three execution pipelines, 16K L1-I cache with I-TLB, 8K L1-D cache with D-TLB, 2D DMA, 64K L2 cache, and the terabit switch.]

[Figure: Tilera Tile64.]

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem
