SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto

Size: px

Start display at page:

Download "SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto"

Griffin Reynolds
5 years ago
Views:

1 SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES University of Toronto

2 Interaction of Coherence and Network 2 Cache coherence protocol drives network-on-chip traffic Scalable coherence protocols needed for many-core architectures Consider interconnection network optimizations to help facilitate scalable coherence

3 Talk Outline 3 Introduction Network-on-Chip Challenges with Scalable Coherence Protocol SigNet Architecture: Network filtering solution Evaluation Conclusion

4 Many-Core Cache Coherence Challenges 4

5 Many-Core Cache Coherence Challenges 4 Broadcast Good latency Poor scaling due to bandwidth requirements

6 Many-Core Cache Coherence Challenges 4 Broadcast Good latency Poor scaling due to bandwidth requirements

7 Many-Core Cache Coherence Challenges 4 Broadcast Good latency Poor scaling due to bandwidth requirements Directory Good scalability due to point to point communication Storage overheads

8 Many-Core Cache Coherence Challenges 4 Store Miss Broadcast Good latency Poor scaling due to bandwidth requirements Directory Good scalability due to point to point communication Storage overheads

9 Many-Core Cache Coherence Challenges 4 ForwardX Store Miss Broadcast Good latency Poor scaling due to bandwidth requirements Directory Good scalability due to point to point communication Storage overheads

10 Many-Core Cache Coherence Challenges 4 Response ForwardX Store Miss Broadcast Good latency Poor scaling due to bandwidth requirements Directory Good scalability due to point to point communication Storage overheads

11 Scalable Cache Coherence 5

12 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!

13 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!

14 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!

15 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!

16 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!

17 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line

18 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage

19 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage

20 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage

21 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15

22 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15

23 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15

24 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15

25 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15 3 rd sharer: overflow

26 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15 3 rd sharer: overflow

27 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15 3 rd sharer: overflow

28 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15 3 rd sharer: overflow

29 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage 3 sharers but 6 invalidations Only 2 sharers: 1 & 15 3 rd sharer: overflow

30 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Represents cores 0 & 1 3 sharers but 6 invalidations Cores 1, 2, 5 & 15 share cache line 2 pointers Only 2 sharers: 1 & 15 3 rd sharer: overflow

31 Extraneous Invalidations with Coarse Vectors 6 For many applications: number of sharers is small 2 to 3 When number of sharers exceeds i, pointers overflow Directory entry operates in coarse mode 1 bit represents multiple cores Imprecise sharing list Extra processors will receive and acknowledge invalidation Consumes network power Requires additional cache lookups

32 Latency Impact of Coarse Vectors 7 Normalized Packet Latency 5% 10% 12% 15% Increase in coarseness leads to significant contention

33 Bandwidth Impact 8 Normalized Link Traversals Increased load results in decreased effectiveness of pipeline optimizations Increased dynamic power consumption

34 System-level impact Inval Complete Local Remote Coarse vectors increase average packet latency Average Cycles SPECjbb SPECweb TPC-H TPC-W Barnes Ocean Radiosity Raytrace Increase completion time for invalidations All acknowledgments must be received Delay subsequent requests to pending cache line

35 SigNet Overview 10 Coarseness reduces directory storage But with significant potential impact on network Due to extraneous invalidations Safely remove extraneous invalidations Save power and reduce network contention Place cache summary information in routers Counting Bloom filters used for cache summary signatures Use summary information to filter network packets

36 SigNet Bloom Filters 11 Counting Bloom filters summarize cache information in routers Signature of cache contents Bloom Filter Hit Core exists between current node and destination that is caching an address mapping to same entry Bloom Filter Miss None of the downstream caches are caching lines that map to this entry

37 Modifying Bloom Filters 12 Bloom Filter Insertion Cache misses increment counter as they travel to directory Bloom Filter Deletion Writeback and invalidation acknowledgments decrement associated counter at routers between cache and directory

38 SigNet Architecture 13!"#$%& '"()#$*+",& <%0"=%& 24>&?& 24>&2& -'&:&-'&F&.0C&D%66*>%& E"1(*+",& 8,)#$&:& -'&:& -'&,& 8,)#$&7#B%16& 24>&A& -'&.//"0*$"1& 234$05&.//"0*$"1& ;#$)#$&:& -'&:& 8,)#$&9& -'&,& 8,)#$&7#B%16& '1"667*1&634$05& ;#$)#$&9&

39 SigNet Pipeline 14 BW RC VA SA ST LT 3'4*#5-6#!"# $%# &'()*'+,-.#,-.#3-6#,-.#8-99# /0#,0# 0(7#89.#,1# 21# /0#,0# :);<#,1# 21# Header flits traverse modified pipeline If the packet needs to check/update signature

40 SigNet Example 15 Requests follow X-Y path Responses follow Y-X path

41 SigNet Example 15 1 Cache miss to A Send request to directory Requests follow X-Y path Responses follow Y-X path

42 SigNet Example 15 1 Cache miss to A Send request to directory Requests follow X-Y path Responses follow Y-X path

43 SigNet Example 15 1 Cache miss to A Send request to directory Requests follow X-Y path Responses follow Y-X path

44 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature Requests follow X-Y path Responses follow Y-X path

45 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature Requests follow X-Y path Responses follow Y-X path

46 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature Requests follow X-Y path Responses follow Y-X path

47 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

48 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

49 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

50 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

51 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

52 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

53 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

54 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

55 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path

56 SigNet Example (2) 16

57 SigNet Example (2) 16 Write request for A 4

58 SigNet Example (2) 16 Write request for A 4

59 SigNet Example (2) 16 Write request for A 4

60 SigNet Example (2) Directory forwards invalidation requests Write request for A

61 SigNet Example (2) Directory forwards invalidation requests Write request for A

62 SigNet Example (2) 16 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A

63 SigNet Example (2) 16 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A

64 SigNet Example (2) 16 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A

65 SigNet Example (2) 16 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A

66 SigNet Example (2) 16 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A

67 SigNet Example (2) 16 Invalidations reach 2 destination cores and acks sent to directory 8 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 4 5 Directory forwards invalidation requests Write request for A

68 SigNet Implementation 17 Recall: Full-map directory requires 32 bytes per sharing vector for 256 cores 50% overhead per cache line Evaluation uses 8K entry Bloom filters at each output port Reduces overhead to 12.5% to 25% per cache line Depends on size of counters and number of pointers in coarse vector directory

69 Correctness and Utilization 18 All cores caching a block must receive invalidation request Bloom filter can have false positives Lessens performance benefit but correct Cannot have false negatives Differences in utilization due to location and memory usage

70 Simulation Methodology 19 Network configuration Number of Nodes 256 Topology Virtual Channels and Buffers Link Width 16 x16 mesh 4 VC/port 8 Buffers/VC 16 Bytes Signature Size 8192

71 Simulation Methodology (2) 20 Workload Parameters Name %Invalidates Average Sharers Database Web Java Scientific A Scientific B Create synthetic benchmarks based on characteristics of 16-core workloads

72 Results: 2 pointers, 16 cores per region 21 Normalized Average Packet Latency CV Directory SigNet Full Map Database Web Java SciA SciB Filters effectively reduce network contention Lower average packet latency Additional research needed to further close gap with Full Map Directory

73 Results: Invalidation Completion Time 22 Cycles for Invalidations to Complete CV Directory SigNet Full Map Database Web Java SciA SciB SigNet improves invalidation completion time Comparison with Pruning Caches in paper

74 Related Work 23 Interconnection Network Support Pruning Caches In-network coherence filters Cache Coherence Optimizations with Bloom filters Jetty filters: reduce cache snoops Tagless Coherence Directories Reduce storage overheads

75 Conclusions 24 Characterize impact of CV directories Significant power consumption and performance degradation Interconnect support to facilitate scalable cache coherence SigNet: network filters to reduce extraneous invalidations

76 Thank you 25 Questions?

SigNet: Network-on-Chip Filtering for Coarse Vector Directories

SigNet: Network-on-Chip Filtering for Coarse Vector Directories Natalie Enright Jerger Department of Electrical and Computer Engineering University of Toronto Toronto, ON enright@eecg.toronto.edu Abstract