CS 152 Computer Architecture and Engineering


1 CS 152 Computer Architecture and Engineering Lecture 14 - Cache Design and Coherence John Lazzaro (not a prof - John is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/

2 Today: Shared Cache Design and Coherence. Multi-threading: keeps the memory system busy. Crossbars and rings: how to do on-chip sharing. Concurrent requests: interfaces that don't stall. Coherency protocols: building coherent caches. [Figure: CPUs with private caches share lower-level caches, DRAM, shared ports, and I/O.]

3 Multithreading. Sun Microsystems Niagara series.

4 The case for multithreading. Amdahl's Law tells us that optimizing C is the wrong thing to do when some applications spend their lives waiting for memory (C = compute, M = waiting). [Figure: execution timelines for single issue, ILP, and TLP on a shared single-issue pipeline; interleaving threads overlaps compute with memory latency, saving time.] Idea: create a design that can multiplex threads onto one pipeline. Goal: maximize throughput of a large number of threads.

5 Multi-threading, assuming perfect caches: interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe. T1: LW r1, 0(r2); T2: ADD r7, r1, r4; T3: XORI r5, r4, #12; T4: SW 0(r7), r5; T1: LW r5, 12(r1). [Figure: pipeline diagram over cycles t0-t9, each instruction flowing through F D X M W.] The last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile. In effect: 4 CPUs, each at 1/4 the clock (S. Cray). [Datapath: four PCs and four register files; a 2-bit thread-select counter chooses among T1-T4.]
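A minimal simulation sketch of this interleaving (my own illustration, not from the lecture): with 4 threads rotating through a 5-stage pipe, a thread's instruction reaches writeback at cycle i+4 while that thread's next instruction does not decode until cycle i+5, so no bypassing or interlocks are needed.

```python
# Round-robin (fine-grained) multithreading on a 5-stage pipeline.
STAGES = ["F", "D", "X", "M", "W"]
N_THREADS = 4

def timeline(n_instructions=8, n_cycles=12):
    rows = []
    for i in range(n_instructions):          # instruction i issues at cycle i
        thread = i % N_THREADS                # round-robin thread select
        row = ["."] * n_cycles
        for s, stage in enumerate(STAGES):
            if i + s < n_cycles:
                row[i + s] = stage            # stage s is occupied at cycle i+s
        rows.append((thread, row))
    return rows

# T1's first instruction writes back at cycle 4; T1's next instruction
# (issued at cycle 4) reads the regfile in decode at cycle 5.
for thread, row in timeline():
    print(f"T{thread + 1}: " + " ".join(row))
```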

6 The bypass network is no longer needed... Result: the critical path shortens -- which can be traded for speed or power. [Figure: the 5-stage datapath (RegFile, ALU, data memory, writeback mux) with the forwarding paths removed.]

7 Multi-threading: supporting cache misses. A thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss penalty period (see the sketch below). [Datapath: the thread-select counter of the previous design is replaced by a thread scheduler.]
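A sketch of such a scheduler (my own toy model; the queue structure, the MISS_PENALTY value, and the method names are assumptions, not the Niagara design):

```python
from collections import deque

MISS_PENALTY = 20  # cycles; assumed value for illustration

class ThreadScheduler:
    def __init__(self, n_threads):
        self.ready = deque(range(n_threads))   # threads eligible to issue
        self.sleeping = {}                     # thread id -> cycle it wakes up

    def pick(self, cycle):
        """Wake threads whose miss has been serviced, then issue round-robin."""
        for t, wake in list(self.sleeping.items()):
            if cycle >= wake:
                del self.sleeping[t]
                self.ready.append(t)
        if not self.ready:
            return None                        # pipeline bubble: nothing to issue
        t = self.ready.popleft()
        self.ready.append(t)
        return t

    def report_miss(self, thread, cycle):
        """Take a thread off the pipeline until its miss returns."""
        self.ready.remove(thread)
        self.sleeping[thread] = cycle + MISS_PENALTY
```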

8 Sun Niagara II: how many threads per core? 8 threads/core: enough to keep one core busy, given clock speed, memory system latency, and target application characteristics.

9 Crossbar Networks

10 Shared-memory multiprocessor: CPUs share the lower levels of the memory system and the I/O. Common address space, one operating system image. Communication occurs through the memory system (about 100 ns latency, 20 GB/s bandwidth). [Figure: CPUs with private caches above shared caches, DRAM, shared ports, and I/O.]

11 Sun's Niagara II: single-chip implementation... SPC == SPARC Core. Only DRAM is not on chip.

12 Crossbar: like N ports on an N-register file. Flexible, but reads slow down as O(N^2). Why? The number of loads on each register's Q output grows as O(N), and the wire length to each read-port mux also grows as O(N). [Figure: a register file with a write demux (sel(ws), WE, wd), registers R0 (the constant 0) through R31, and per-port read muxes driven by sel(rs1) and sel(rs2).]

13 Design challenge: a high-performance crossbar. Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels. Apps are locality-poor; the goal is to saturate DRAM bandwidth. Each DRAM channel: 50 GB/s read, 25 GB/s write. Crossbar bandwidth: 270 GB/s total (read + write).

14 Sun Niagara II 8 x 9 crossbar: a tri-state distributed mux, as in the microcode lecture. Every crossing of blue and purple is a tri-state buffer with a unique control signal: 72 control signals (if distributed unencoded).

15 Sun Niagara II 8 x 9 crossbar: 8 ports on the CPU side (one per core), and 8 ports for the L2 banks plus one for I/O; wires per port run each way. 4-cycle latency (715 ps/cycle): cycles 1-3 are for arbitration, and data transmits on cycle 4. Pipelined.

16 A complete switch transfer (4 epochs). Epoch 1: all input ports that are ready to send data request an output port. Epoch 2: the allocation algorithm decides which inputs get to write. Epoch 3: the allocation system informs the winning inputs and outputs. Epoch 4: the actual data transfer takes place. Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, each for a different set of requests.

17 Epoch 3: The Allocation Problem (4 x 4). Rows are input ports (A, B, C, D); columns are output ports (W, X, Y, Z). A 1 codes that an input has data ready to send to an output. The allocator returns a matrix with at most one 1 in each row and column, used to set the switches. The algorithm should be fair, so no port always loses... and should also scale to run large matrices fast (a simple allocator sketch follows below).
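A sketch of one such allocator (a simple rotating-priority arbiter of my own, not Niagara II's actual allocation logic): given a request matrix, it produces a grant matrix with at most one 1 per row and per column, and rotating the starting input keeps it fair.

```python
def allocate(requests, start=0):
    """requests[i][j] == 1 means input i has data for output j."""
    n_in, n_out = len(requests), len(requests[0])
    grants = [[0] * n_out for _ in range(n_in)]
    out_taken = [False] * n_out
    for k in range(n_in):
        i = (start + k) % n_in               # rotate input priority each epoch
        for j in range(n_out):
            if requests[i][j] and not out_taken[j]:
                grants[i][j] = 1             # at most one grant per row...
                out_taken[j] = True          # ...and at most one per column
                break
    return grants

# Example: inputs A-D request outputs W-Z.
R = [[1, 1, 0, 0],   # A wants W or X
     [1, 0, 0, 0],   # B wants W
     [0, 0, 1, 0],   # C wants Y
     [0, 1, 0, 1]]   # D wants X or Z
for row in allocate(R, start=1):   # start=1: input B gets first pick this epoch
    print(row)
```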

18 Sun Niagara II crossbar notes. Low latency: 4 cycles (less than 3 ns). Uniform latency between all port pairs. The crossbar defines the floorplan: all port devices should be equidistant from the crossbar.

19

20 Sun Niagara II energy facts: the crossbar is only 1% of total power.

21 Sun Niagara II crossbar notes. Low latency: 4 cycles (less than 3 ns). Uniform latency between all port pairs. The crossbar defines the floorplan: all port devices should be equidistant from the crossbar. It did not scale up for the 16-core Rainbow Falls: Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port between two cores. Design alternatives to the crossbar?

22 CLOS networks: from the telecom world... Build a high-port-count switch by tiling fixed-size shuffle units. Pipeline registers naturally fit between tiles. Gains scalability at the cost of latency.

23 CLOS networks: an example route. Numbers on the left and right are port numbers. Colors show routing paths for an exchange. Arbitration is still needed to prevent blocking.

24 Ring Networks

25 Intel Xeon: data-center server chip. 20% of Intel's revenues, 40% of its profits. Why? The cloud is growing, and Xeon is dominant.

26 Compiled chips: Xeon is a chip family, varying by # of cores and L3 cache size. The family's mask layouts are generated automatically, by adding core/cache slices along the ring bus.

27 The bi-directional ring bus connects cores, cache banks, DRAM controllers, and off-chip I/O. The chip compiler might size the ring bus to scale bandwidth with the # of cores. Ring latency increases with the # of cores, but compared to the baseline latency the increase is small. [Figure: a ring with a Ring Stop at each slice.]

28 A 2.5 MB L3 cache slice from the Xeon E5. Tiles along the x-axis are the 20 ways of the cache. The ring-stop interface lives in the Cache Control Box (CBOX). [Figure: 2.5 MB L3 cache floor-plan.]

29 Ring bus (perhaps 1024 wires), with address, data, and header fields (sender #, recipient #, command). [Figure: Ring Stops #1-#3 chained together; each ring-stop interface has Data Out, Data In, and Control.] Reading: sense Data Out to see if the message is for Ring Stop #2. If so, latch the data and mux Empty onto the ring. Writing: check if Data Out is Empty. If so, mux a message onto the ring via the Data In port.
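A toy model of those read/write rules (my own simplified, unidirectional sketch, not Intel's design): each ring stop either consumes a message addressed to it, forwards other traffic, or injects a new message into an empty slot.

```python
EMPTY = None

class RingStop:
    def __init__(self, stop_id):
        self.stop_id = stop_id
        self.tx_queue = []        # messages waiting to be injected

    def clock(self, slot_in):
        """Process the slot from the upstream neighbor; return what goes
        downstream. A slot is EMPTY or (sender, recipient, payload)."""
        # Reading: if the message is for us, latch it and put Empty on the ring.
        if slot_in is not EMPTY and slot_in[1] == self.stop_id:
            self.deliver(slot_in)
            slot_in = EMPTY
        # Writing: if the outgoing slot is Empty, mux one of our messages in.
        if slot_in is EMPTY and self.tx_queue:
            slot_in = self.tx_queue.pop(0)
        return slot_in

    def deliver(self, message):
        sender, _, payload = message
        print(f"stop {self.stop_id}: got {payload!r} from stop {sender}")

# Example: three stops on the ring, one message in flight.
stops = [RingStop(i) for i in range(3)]
stops[0].tx_queue.append((0, 2, "read 0x40"))
slot = EMPTY
for cycle in range(6):
    slot = stops[cycle % 3].clock(slot)
```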

30 In practice: extreme EE work to co-optimize bandwidth and reliability.

31 Debugging: a network analyzer built into the chip captures ring messages of a particular kind and sends them off chip via an aux port.

32 A derivative of this ring bus is also used on laptop and desktop chips.

33 Break

34 Hit-over-Miss Caches

35 Recall: a CPU-cache port that doesn't stall on a miss. The CPU makes a request by placing the following items in Queue 1 (Queue 1 carries requests from the CPU; Queue 2 carries responses back to the CPU). CMD: read, write, etc. MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit. TAG: a 9-bit number identifying the request. MADDR: memory address of the first byte. STORE-DATA: for stores, the data to store.

36 This cache is used in an ASPIRE CPU (Rocket). When the request is ready, the cache places the following items in Queue 2. TAG: identity of the completed command. LOAD-DATA: for loads, the requested data. The CPU saves info about requests, indexed by TAG. Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order.
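A sketch of the CPU side of this tagged, non-stalling port (the queue and field names follow the slides, but the functions and free-tag pool are my own illustration, not the Rocket interface):

```python
from collections import deque

queue1 = deque()          # CPU -> cache: (cmd, mtype, tag, maddr, store_data)
queue2 = deque()          # cache -> CPU: (tag, load_data)
pending = {}              # tag -> info the CPU saved about the request
free_tags = deque(range(512))   # 9-bit tag space

def issue_load(maddr, mtype, dest_reg):
    tag = free_tags.popleft()              # CPU never reuses an in-flight tag
    pending[tag] = ("load", dest_reg)
    queue1.append(("READ", mtype, tag, maddr, None))

def drain_responses(regfile):
    while queue2:
        tag, load_data = queue2.popleft()
        kind, dest_reg = pending.pop(tag)  # responses may arrive in any order
        if kind == "load":
            regfile[dest_reg] = load_data
        free_tags.append(tag)              # tag may now be reused
```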

37 Today: how a read request proceeds in the L1 D-cache. The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1. (Here, "we" == the L1 D-cache controller.) We do a normal cache access. If there is a hit, we place the load result in Queue 2... In the case of a miss, we use the Inverted Miss Status Holding Register.

38 Inverted MSHR (Miss Status Holding Register), part 1. To look up a memory address: associatively look up the block # of the memory address in the table. If there are no hits, issue the memory request. [Figure: a 512-entry table, so that every 9-bit TAG value has an entry; each entry holds a cache block #, a valid bit, MTYPE, the 1st byte in the block, and a hard-wired Tag ID; a per-entry comparator produces a Hit signal qualified by Valid.] Assumptions: 32-byte blocks, 48-bit physical address space.

39 Inverted MSHR, part 2. On a miss, index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V=0, because the CPU promises not to reuse in-flight tags. [Figure: the same 512-entry table, indexed by TAG.] Assumptions: 32-byte blocks, 48-bit physical address space.

40 Inverted MSHR, part 3. Whenever the memory system returns data, associatively look up the block # to find all pending transactions. Place transaction data for all hits in Queue 2, and clear their valid bits. Also update the L1 cache. [Figure: the same 512-entry table.] Assumptions: 32-byte blocks, 48-bit physical address space.
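Pulling the three slides together, a behavioral sketch of the inverted MSHR (my own model; the class and method names are assumptions, and data extraction from the returned block is simplified):

```python
BLOCK_BYTES = 32   # 32-byte blocks, as assumed on the slides

class Entry:
    def __init__(self):
        self.valid = False
        self.block = None      # cache block # of the miss address
        self.offset = 0        # 1st byte in block
        self.mtype = None

class InvertedMSHR:
    def __init__(self):
        self.table = [Entry() for _ in range(512)]   # one entry per 9-bit TAG

    def lookup_and_allocate(self, tag, maddr, mtype):
        """Parts 1+2: record the miss under its TAG; return True if this is the
        first outstanding miss to the block (so a memory request is needed)."""
        block = maddr // BLOCK_BYTES
        already_pending = any(e.valid and e.block == block for e in self.table)
        e = self.table[tag]
        assert not e.valid                  # CPU never reuses in-flight tags
        e.valid, e.block, e.mtype = True, block, mtype
        e.offset = maddr % BLOCK_BYTES
        return not already_pending

    def fill(self, block, block_data, queue2):
        """Part 3: memory returned a block; retire every matching pending
        transaction by placing (TAG, LOAD-DATA) in Queue 2."""
        for tag, e in enumerate(self.table):
            if e.valid and e.block == block:
                queue2.append((tag, block_data[e.offset:]))  # simplified extract
                e.valid = False
```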

41 Inverted MSHR notes. Structural hazards only occur when the TAG space is exhausted by the CPU. High cost (# of comparators + SRAM cells). See Farkas and Jouppi on the class website for low-cost designs that are often good enough. We will return to MSHRs to discuss CPI performance later in the semester.

42 Coherency Hardware

43 Cache Placement

44 Two CPUs, two caches, shared DRAM, write-through caches. Main memory initially holds the value 5 at address 16. CPU0: LW R2, 16(R0). CPU1: LW R2, 16(R0). CPU1: SW R0, 16(R0). After the store, CPU1's cache and main memory hold 0, but CPU0's cache still holds 5: the view of memory is no longer coherent, and loads of location 16 from CPU0 and CPU1 see different values! Today: what to do...
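A toy model reproducing this stale-value problem (my own illustration; write-through caches with no coherence mechanism at all):

```python
memory = {16: 5}

class WriteThruCache:
    def __init__(self):
        self.lines = {}                     # addr -> value

    def load(self, addr):
        if addr not in self.lines:          # miss: fill from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value            # update own copy...
        memory[addr] = value                # ...and write through to memory,
                                            # but nobody invalidates other caches

cpu0, cpu1 = WriteThruCache(), WriteThruCache()
cpu0.load(16)          # CPU0: LW R2, 16(R0)  -> 5
cpu1.load(16)          # CPU1: LW R2, 16(R0)  -> 5
cpu1.store(16, 0)      # CPU1: SW R0, 16(R0)  -> memory now 0
print(cpu0.load(16), cpu1.load(16))   # 5 0 : views of memory have diverged
```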

45 The simplest solution... one cache! The CPUs do not have internal caches: there is only one cache, so different values for a memory address cannot appear in 2 caches! Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank; in that case, one request is stalled. [Figure: CPU0 and CPU1 connect through a memory switch to a shared multi-bank cache above shared main memory.]

46 Not a complete solution... but good for L2. For modern clock rates, access to a shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing the clock down 10X for LWs. Not good. Sequent Systems (1980s): this approach was a complete solution in the days when DRAM row access time and the CPU clock period were well matched.

47 Modified form: private L1s, shared L2. Thus, we need to solve the cache coherency problem for the L1 caches. Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed, and there is constructive interference if both CPUs need the same data/instructions. Disadvantage: the CPUs share bandwidth to the L2 cache... [Figure: CPU0 and CPU1 with private L1 caches, a memory switch or bus, a shared multi-bank L2 cache, and shared main memory.]

48 IBM Power 4 (2001): dual core, with a shared multi-bank L2 cache, private L1 caches, and off-chip L3 caches.

49 Cache Coherency

50 Cache coherency goals: 1. Only one processor at a time has write permission for a memory location. 2. No processor can load a stale copy of a location after a write. [Figure: CPU0 and CPU1 caches above a shared memory hierarchy holding address 16.]

51 Simple implementation: snoopy caches. Each cache has the ability to snoop on the memory bus transactions of other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate cache lines of other CPUs. [Figure: each CPU's cache has a snooper attached to the shared memory bus above the shared main memory hierarchy.]

52 Writes from 10,000 feet, for write-thru L1 caches: a two-state protocol (cache lines are valid or invalid). 1. The writing CPU takes control of the bus. 2. The address being written is invalidated in all other caches, so reads will no longer hit in those caches and get stale data. 3. The write is sent to main memory, so subsequent reads will miss and retrieve the new value from main memory. To first order, reads will just work if write-thru caches implement this policy (see the sketch below).
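A toy model of this two-state, write-through invalidate policy (my own sketch; bus arbitration and timing are ignored): every write broadcasts an invalidate that the other caches' snoopers obey.

```python
memory = {16: 5}
all_caches = []

class SnoopyWriteThruCache:
    def __init__(self):
        self.lines = {}                     # addr -> value (valid lines only)
        all_caches.append(self)

    def load(self, addr):
        if addr not in self.lines:
            self.lines[addr] = memory[addr]     # miss: refill from memory
        return self.lines[addr]

    def store(self, addr, value):
        for cache in all_caches:                # 1+2: take the bus and invalidate
            if cache is not self:               #      the line in all other caches
                cache.lines.pop(addr, None)
        self.lines[addr] = value
        memory[addr] = value                    # 3: write through to main memory

cpu0, cpu1 = SnoopyWriteThruCache(), SnoopyWriteThruCache()
cpu0.load(16); cpu1.load(16)
cpu1.store(16, 0)
print(cpu0.load(16), cpu1.load(16))   # 0 0 : CPU0 re-misses and sees the new value
```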

53 Limitations of the write-thru approach: every write goes to the bus, and total bus write bandwidth does not support more than 2 CPUs in modern practice. To scale further, we need to use write-back caches. The write-back big trick: add extra states. The simplest version is MSI -- Modified, Shared, Invalid. More efficient versions add more states (MESI adds Exclusive). State definitions are subtle... (an MSI sketch follows below).
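A simplified MSI transition table for one cache line (written from the standard textbook protocol rather than the course's full state machine; bus transactions and data movement are elided):

```python
M, S, I = "Modified", "Shared", "Invalid"

def on_processor(state, op):
    """Transition for the requesting cache's own load/store."""
    if op == "load":
        return S if state == I else state          # I -> S via BusRd; M/S stay put
    if op == "store":
        return M                                   # I/S -> M via BusRdX/upgrade
    raise ValueError(op)

def on_snoop(state, bus_op):
    """Transition when another cache's request is seen on the bus."""
    if bus_op == "BusRd":                          # someone else reads the line
        return S if state == M else state          # M supplies data, drops to S
    if bus_op == "BusRdX":                         # someone else writes the line
        return I                                   # all other copies invalidated
    raise ValueError(bus_op)

# Example: CPU0 writes a Shared line; CPU1's Shared copy is invalidated.
cpu0_line, cpu1_line = S, S
cpu0_line = on_processor(cpu0_line, "store")       # -> Modified
cpu1_line = on_snoop(cpu1_line, "BusRdX")          # -> Invalid
print(cpu0_line, cpu1_line)
```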

54 Figure 5.5 of the textbook is the best starting point.

55 Read misses, for a MESI protocol with write-back caches. These sketches are just to give you a sense of how coherency protocols work; deep understanding requires the complete state machine for the protocol. 1. A cache requests a cache-line fill for a read miss. 2. Another cache holding the line exclusively responds with fresh data, so the read miss does not go to main memory and retrieve stale data. 3. The responding cache changes the line from exclusive to shared, so future writes will go to the bus to be snooped.

56 The snoopy mechanism doesn't scale... Single-chip implementations have moved to a centralized directory service that tracks the status of each line of each private cache. Multi-socket systems use distributed directories.

57 Directories attached to the on-chip cache network...

58 A 2-socket system: each socket holds a multi-core chip, and each chip has its own bank of DRAM.

59 Distributed directories for multi-socket systems. [Figure: each chip's L1 and L2 caches sit above a directory; Chip 0 holds the directory for Chip 0's DRAM, and Chip 1 holds the directory for Chip 1's DRAM.]

60 Figure 5.21 of the textbook covers directory message basics. Conceptually similar to snoopy caches... but the different mechanisms require rethinking the protocol to get correct behaviors.
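A minimal directory model (my own sketch, not the textbook's Figure 5.21 protocol): the home directory tracks, per block, which caches hold a copy and whether one of them owns it dirty, and generates invalidate/fetch messages as needed.

```python
class DirectoryEntry:
    def __init__(self):
        self.sharers = set()     # cache ids holding a clean copy
        self.owner = None        # cache id holding the block dirty, if any

class Directory:
    def __init__(self):
        self.entries = {}        # block # -> DirectoryEntry

    def entry(self, block):
        return self.entries.setdefault(block, DirectoryEntry())

    def read_miss(self, block, requester):
        e = self.entry(block)
        msgs = []
        if e.owner is not None:                       # fetch dirty data back
            msgs.append(("fetch", e.owner, block))
            e.sharers.add(e.owner)
            e.owner = None
        e.sharers.add(requester)
        msgs.append(("data_reply", requester, block))
        return msgs

    def write_miss(self, block, requester):
        e = self.entry(block)
        msgs = [("invalidate", s, block) for s in e.sharers if s != requester]
        if e.owner is not None and e.owner != requester:
            msgs.append(("fetch_invalidate", e.owner, block))
        e.sharers.clear()
        e.owner = requester
        msgs.append(("data_reply", requester, block))
        return msgs

home = Directory()
print(home.read_miss(0x40, requester=0))   # [('data_reply', 0, 64)]
print(home.write_miss(0x40, requester=1))  # invalidate cache 0, then reply to 1
```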

61 Other Machine Architectures

62 NUMA: Non-Uniform Memory Access. Each CPU (CPU 0 ... CPU 1023) has part of main memory attached to it; to access other parts of main memory, it uses the interconnection network. The network presents a coherent global address space, using directory protocols over fiber networking. For best results, applications take the non-uniform memory latency into account.

63 Clusters: the supercomputing version of WSCs. Connect large numbers of 1-CPU or 2-CPU rack-mount computers together with high-end network technology (not normal Ethernet). Example: the University of Illinois Apple Xserve cluster, connected with Myrinet (3.5 μs ping time -- a low-latency network). Instead of using hardware to create a shared-memory abstraction, let each application build its own memory model.

64 On Tuesday we return to CPU design... Have a good weekend!
