SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto
|
|
- Griffin Reynolds
- 5 years ago
- Views:
Transcription
1 SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES University of Toronto
2 Interaction of Coherence and Network 2 Cache coherence protocol drives network-on-chip traffic Scalable coherence protocols needed for many-core architectures Consider interconnection network optimizations to help facilitate scalable coherence
3 Talk Outline 3 Introduction Network-on-Chip Challenges with Scalable Coherence Protocol SigNet Architecture: Network filtering solution Evaluation Conclusion
4 Many-Core Cache Coherence Challenges 4
5 Many-Core Cache Coherence Challenges 4 Broadcast Good latency Poor scaling due to bandwidth requirements
6 Many-Core Cache Coherence Challenges 4 Broadcast Good latency Poor scaling due to bandwidth requirements
7 Many-Core Cache Coherence Challenges 4 Broadcast Good latency Poor scaling due to bandwidth requirements Directory Good scalability due to point to point communication Storage overheads
8 Many-Core Cache Coherence Challenges 4 Store Miss Broadcast Good latency Poor scaling due to bandwidth requirements Directory Good scalability due to point to point communication Storage overheads
9 Many-Core Cache Coherence Challenges 4 ForwardX Store Miss Broadcast Good latency Poor scaling due to bandwidth requirements Directory Good scalability due to point to point communication Storage overheads
10 Many-Core Cache Coherence Challenges 4 Response ForwardX Store Miss Broadcast Good latency Poor scaling due to bandwidth requirements Directory Good scalability due to point to point communication Storage overheads
11 Scalable Cache Coherence 5
12 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!
13 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!
14 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!
15 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!
16 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line!
17 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line
18 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage
19 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage
20 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage
21 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15
22 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15
23 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15
24 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15
25 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15 3 rd sharer: overflow
26 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15 3 rd sharer: overflow
27 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15 3 rd sharer: overflow
28 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Only 2 sharers: 1 & 15 3 rd sharer: overflow
29 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Cores 1, 2, 5 & 15 share cache line 2 pointers Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage 3 sharers but 6 invalidations Only 2 sharers: 1 & 15 3 rd sharer: overflow
30 Scalable Cache Coherence 5 Directory protocol storage overheads Single bit per core in sharing vector (full map) 256 cores 32 Bytes of overhead per cache line! Coarse Vector Directories DiriCVr i: # of pointers r: # of cores in region Example: Dir2CV2 Requires 1/2 as much storage Represents cores 0 & 1 3 sharers but 6 invalidations Cores 1, 2, 5 & 15 share cache line 2 pointers Only 2 sharers: 1 & 15 3 rd sharer: overflow
31 Extraneous Invalidations with Coarse Vectors 6 For many applications: number of sharers is small 2 to 3 When number of sharers exceeds i, pointers overflow Directory entry operates in coarse mode 1 bit represents multiple cores Imprecise sharing list Extra processors will receive and acknowledge invalidation Consumes network power Requires additional cache lookups
32 Latency Impact of Coarse Vectors 7 Normalized Packet Latency 5% 10% 12% 15% Increase in coarseness leads to significant contention
33 Bandwidth Impact 8 Normalized Link Traversals Increased load results in decreased effectiveness of pipeline optimizations Increased dynamic power consumption
34 System-level impact Inval Complete Local Remote Coarse vectors increase average packet latency Average Cycles SPECjbb SPECweb TPC-H TPC-W Barnes Ocean Radiosity Raytrace Increase completion time for invalidations All acknowledgments must be received Delay subsequent requests to pending cache line
35 SigNet Overview 10 Coarseness reduces directory storage But with significant potential impact on network Due to extraneous invalidations Safely remove extraneous invalidations Save power and reduce network contention Place cache summary information in routers Counting Bloom filters used for cache summary signatures Use summary information to filter network packets
36 SigNet Bloom Filters 11 Counting Bloom filters summarize cache information in routers Signature of cache contents Bloom Filter Hit Core exists between current node and destination that is caching an address mapping to same entry Bloom Filter Miss None of the downstream caches are caching lines that map to this entry
37 Modifying Bloom Filters 12 Bloom Filter Insertion Cache misses increment counter as they travel to directory Bloom Filter Deletion Writeback and invalidation acknowledgments decrement associated counter at routers between cache and directory
38 SigNet Architecture 13!"#$%& '"()#$*+",& <%0"=%& 24>&?& 24>&2& -'&:&-'&F&.0C&D%66*>%& E"1(*+",& 8,)#$&:& -'&:& -'&,& 8,)#$&7#B%16& 24>&A& -'&.//"0*$"1& 234$05&.//"0*$"1& ;#$)#$&:& -'&:& 8,)#$&9& -'&,& 8,)#$&7#B%16& '1"667*1&634$05& ;#$)#$&9&
39 SigNet Pipeline 14 BW RC VA SA ST LT 3'4*#5-6#!"# $%# &'()*'+,-.#,-.#3-6#,-.#8-99# /0#,0# 0(7#89.#,1# 21# /0#,0# :);<#,1# 21# Header flits traverse modified pipeline If the packet needs to check/update signature
40 SigNet Example 15 Requests follow X-Y path Responses follow Y-X path
41 SigNet Example 15 1 Cache miss to A Send request to directory Requests follow X-Y path Responses follow Y-X path
42 SigNet Example 15 1 Cache miss to A Send request to directory Requests follow X-Y path Responses follow Y-X path
43 SigNet Example 15 1 Cache miss to A Send request to directory Requests follow X-Y path Responses follow Y-X path
44 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature Requests follow X-Y path Responses follow Y-X path
45 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature Requests follow X-Y path Responses follow Y-X path
46 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature Requests follow X-Y path Responses follow Y-X path
47 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
48 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
49 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
50 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
51 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
52 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
53 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
54 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
55 SigNet Example 15 1 Cache miss to A Send request to directory 2 At each hop, insert address into input port signature 3 Directory operating in coarse mode Bit for region set Directory responds to request Requests follow X-Y path Responses follow Y-X path
56 SigNet Example (2) 16
57 SigNet Example (2) 16 Write request for A 4
58 SigNet Example (2) 16 Write request for A 4
59 SigNet Example (2) 16 Write request for A 4
60 SigNet Example (2) Directory forwards invalidation requests Write request for A
61 SigNet Example (2) Directory forwards invalidation requests Write request for A
62 SigNet Example (2) 16 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A
63 SigNet Example (2) 16 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A
64 SigNet Example (2) 16 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A
65 SigNet Example (2) 16 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A
66 SigNet Example (2) 16 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 5 4 Directory forwards invalidation requests Write request for A
67 SigNet Example (2) 16 Invalidations reach 2 destination cores and acks sent to directory 8 7 Spawn 2 Acks 6 North Port Filter Hit West Port Filter Miss 4 5 Directory forwards invalidation requests Write request for A
68 SigNet Implementation 17 Recall: Full-map directory requires 32 bytes per sharing vector for 256 cores 50% overhead per cache line Evaluation uses 8K entry Bloom filters at each output port Reduces overhead to 12.5% to 25% per cache line Depends on size of counters and number of pointers in coarse vector directory
69 Correctness and Utilization 18 All cores caching a block must receive invalidation request Bloom filter can have false positives Lessens performance benefit but correct Cannot have false negatives Differences in utilization due to location and memory usage
70 Simulation Methodology 19 Network configuration Number of Nodes 256 Topology Virtual Channels and Buffers Link Width 16 x16 mesh 4 VC/port 8 Buffers/VC 16 Bytes Signature Size 8192
71 Simulation Methodology (2) 20 Workload Parameters Name %Invalidates Average Sharers Database Web Java Scientific A Scientific B Create synthetic benchmarks based on characteristics of 16-core workloads
72 Results: 2 pointers, 16 cores per region 21 Normalized Average Packet Latency CV Directory SigNet Full Map Database Web Java SciA SciB Filters effectively reduce network contention Lower average packet latency Additional research needed to further close gap with Full Map Directory
73 Results: Invalidation Completion Time 22 Cycles for Invalidations to Complete CV Directory SigNet Full Map Database Web Java SciA SciB SigNet improves invalidation completion time Comparison with Pruning Caches in paper
74 Related Work 23 Interconnection Network Support Pruning Caches In-network coherence filters Cache Coherence Optimizations with Bloom filters Jetty filters: reduce cache snoops Tagless Coherence Directories Reduce storage overheads
75 Conclusions 24 Characterize impact of CV directories Significant power consumption and performance degradation Interconnect support to facilitate scalable cache coherence SigNet: network filters to reduce extraneous invalidations
76 Thank you 25 Questions?
SigNet: Network-on-Chip Filtering for Coarse Vector Directories
SigNet: Network-on-Chip Filtering for Coarse Vector Directories Natalie Enright Jerger Department of Electrical and Computer Engineering University of Toronto Toronto, ON enright@eecg.toronto.edu Abstract
More informationShared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network
Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache
More informationLecture 3: Flow-Control
High-Performance On-Chip Interconnects for Emerging SoCs http://tusharkrishna.ece.gatech.edu/teaching/nocs_acaces17/ ACACES Summer School 2017 Lecture 3: Flow-Control Tushar Krishna Assistant Professor
More informationScalable Cache Coherence
arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy
More informationLecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks
Lecture: Transactional Memory, Networks Topics: TM implementations, on-chip networks 1 Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock
More informationCCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers
CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers Stavros Volos, Ciprian Seiculescu, Boris Grot, Naser Khosro Pour, Babak Falsafi, and Giovanni De Micheli Toward
More informationPhastlane: A Rapid Transit Optical Routing Network
Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens
More informationInterconnection Networks: Flow Control. Prof. Natalie Enright Jerger
Interconnection Networks: Flow Control Prof. Natalie Enright Jerger Switching/Flow Control Overview Topology: determines connectivity of network Routing: determines paths through network Flow Control:
More informationLecture: Interconnection Networks
Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet
More informationBandwidth Adaptive Snooping
Two classes of multiprocessors Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet Project Computer Sciences Department University of Wisconsin
More informationEECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal
Lecture 19 Interconnects: Flow Control Winter 2018 Subhankar Pal http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk,
More informationLecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)
Lecture 12: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) 1 Topologies Internet topologies are not very regular they grew
More informationEECS 570 Final Exam - SOLUTIONS Winter 2015
EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32
More informationScalable Cache Coherence
Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient
More informationLecture 16: On-Chip Networks. Topics: Cache networks, NoC basics
Lecture 16: On-Chip Networks Topics: Cache networks, NoC basics 1 Traditional Networks Huh et al. ICS 05, Beckmann MICRO 04 Example designs for contiguous L2 cache regions 2 Explorations for Optimality
More informationLecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance
Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,
More informationLecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background
Lecture 15: PCM, Networks Today: PCM wrap-up, projects discussion, on-chip networks background 1 Hard Error Tolerance in PCM PCM cells will eventually fail; important to cause gradual capacity degradation
More informationCache Coherence in Scalable Machines
ache oherence in Scalable Machines SE 661 arallel and Vector Architectures rof. Muhamed Mudawar omputer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor
More informationScalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels
More informationLecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID
Lecture 25: Interconnection Networks, Disks Topics: flow control, router microarchitecture, RAID 1 Virtual Channel Flow Control Each switch has multiple virtual channels per phys. channel Each virtual
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationLecture 5: Directory Protocols. Topics: directory-based cache coherence implementations
Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes
More informationLecture 7: Flow Control - I
ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 7: Flow Control - I Tushar Krishna Assistant Professor School of Electrical
More informationLecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)
Lecture 8: Snooping and Directory Protocols Topics: split-transaction implementation details, directory implementations (memory- and cache-based) 1 Split Transaction Bus So far, we have assumed that a
More informationSCORPIO: 36-Core Shared Memory Processor
: 36- Shared Memory Processor Demonstrating Snoopy Coherence on a Mesh Interconnect Chia-Hsin Owen Chen Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon, Brett
More informationVirtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti Electrical and Computer Engineering Department, University of Wisconsin-Madison
More informationCache Coherence: Part II Scalable Approaches
ache oherence: art II Scalable pproaches Hierarchical ache oherence Todd. Mowry S 74 October 27, 2 (a) 1 2 1 2 (b) 1 Topics Hierarchies Directory rotocols Hierarchies arise in different ways: (a) processor
More informationCache Coherence in Scalable Machines
Cache Coherence in Scalable Machines COE 502 arallel rocessing Architectures rof. Muhamed Mudawar Computer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor
More informationLecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control
Lecture 12: Interconnection Networks Topics: dimension/arity, routing, deadlock, flow control 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees, butterflies,
More informationLecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control
Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control 1 Topology Examples Grid Torus Hypercube Criteria Bus Ring 2Dtorus 6-cube Fully connected Performance Bisection
More informationInterconnect Routing
Interconnect Routing store-and-forward routing switch buffers entire message before passing it on latency = [(message length / bandwidth) + fixed overhead] * # hops wormhole routing pipeline message through
More informationLect. 6: Directory Coherence Protocol
Lect. 6: Directory Coherence Protocol Snooping coherence Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere All cache controllers monitor
More informationLecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations
Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations 1 Split Transaction Bus So far, we have assumed that a coherence operation (request, snoops, responses,
More informationEECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch
Directory-based Coherence Winter 2019 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk, Reinhardt,
More informationSPACE : Sharing Pattern-based Directory Coherence for Multicore Scalability
SPACE : Sharing Pattern-based Directory Coherence for Multicore Scalability Hongzhou Zhao, Arrvindh Shriraman, and Sandhya Dwarkadas Deptartment of Computer Science, University of Rochester {hozhao,ashriram,
More informationToken Coherence. Milo M. K. Martin Dissertation Defense
Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin Madison (C) 2003 Milo Martin Overview Technology and software
More informationVirtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti Electrical and Computer Engineering Department, University of Wisconsin-Madison
More informationCache Coherence. Todd C. Mowry CS 740 November 10, Topics. The Cache Coherence Problem Snoopy Protocols Directory Protocols
Cache Coherence Todd C. Mowry CS 740 November 10, 1998 Topics The Cache Coherence roblem Snoopy rotocols Directory rotocols The Cache Coherence roblem Caches are critical to modern high-speed processors
More informationLecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics
Lecture 14: Large Cache Design III Topics: Replacement policies, associativity, cache networks, networking basics 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that simultaneously
More informationLecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols
Lecture 3: Directory Protocol Implementations Topics: coherence vs. msg-passing, corner cases in directory protocols 1 Future Scalable Designs Intel s Single Cloud Computer (SCC): an example prototype
More informationCMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3
MS 411 omputer Systems rchitecture Lecture 21 Multiprocessors 3 Outline Review oherence Write onsistency dministrivia Snooping Building Blocks Snooping protocols and examples oherence traffic and performance
More informationLecture 8: Directory-Based Cache Coherence. Topics: scalable multiprocessor organizations, directory protocol design issues
Lecture 8: Directory-Based Cache Coherence Topics: scalable multiprocessor organizations, directory protocol design issues 1 Scalable Multiprocessors P1 P2 Pn C1 C2 Cn 1 CA1 2 CA2 n CAn Scalable interconnection
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationScalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions:
Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication assist
More informationLow-Power Interconnection Networks
Low-Power Interconnection Networks Li-Shiuan Peh Associate Professor EECS, CSAIL & MTL MIT 1 Moore s Law: Double the number of transistors on chip every 2 years 1970: Clock speed: 108kHz No. transistors:
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #8 2/7/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationLecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels
Lecture: Interconnection Networks Topics: TM wrap-up, routing, deadlock, flow control, virtual channels 1 TM wrap-up Eager versioning: create a log of old values Handling problematic situations with a
More informationSwitching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching.
Switching/Flow Control Overview Interconnection Networks: Flow Control and Microarchitecture Topology: determines connectivity of network Routing: determines paths through network Flow Control: determine
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationTDT Appendix E Interconnection Networks
TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages
More informationLecture 18: Communication Models and Architectures: Interconnection Networks
Design & Co-design of Embedded Systems Lecture 18: Communication Models and Architectures: Interconnection Networks Sharif University of Technology Computer Engineering g Dept. Winter-Spring 2008 Mehdi
More informationFCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture
More informationLecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections
Lecture 18: Coherence and Synchronization Topics: directory-based coherence protocols, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory)
More informationPseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks
Department of Computer Science and Engineering, Texas A&M University Technical eport #2010-3-1 seudo-circuit: Accelerating Communication for On-Chip Interconnection Networks Minseon Ahn, Eun Jung Kim Department
More informationThomas Moscibroda Microsoft Research. Onur Mutlu CMU
Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank
More informationCommunication Cost in Parallel Computing
Communication Cost in Parallel Computing Ned Nedialkov McMaster University Canada SE/CS 4F03 January 2016 Outline Cost Startup time Pre-hop time Pre-word time Store-and-forward Packet routing Cut-through
More informationLecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization
Lecture 25: Multiprocessors Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 Snooping-Based Protocols Three states for a block: invalid,
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationLecture 25: Multiprocessors
Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed
More informationLecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections
Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System
More informationLecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationNetworks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin
Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220 Admin Homework #5 Due Dec 3 Projects Final (yes it will be cumulative) CPS 220 2 1 Review: Terms Network characterized
More informationScalable Cache Coherent Systems
NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication
More informationThread- Level Parallelism. ECE 154B Dmitri Strukov
Thread- Level Parallelism ECE 154B Dmitri Strukov Introduc?on Thread- Level parallelism Have mul?ple program counters and resources Uses MIMD model Targeted for?ghtly- coupled shared- memory mul?processors
More informationLecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment
More informationEFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES
EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University
More informationCMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis Interconnection Networks Massively processor networks (MPP) Thousands of nodes
More informationTiny Directory: Efficient Shared Memory in Many-core Systems with Ultra-low-overhead Coherence Tracking
Tiny Directory: Efficient Shared Memory in Many-core Systems with Ultra-low-overhead Coherence Tracking Sudhanshu Shukla Mainak Chaudhuri Department of Computer Science and Engineering, Indian Institute
More informationES1 An Introduction to On-chip Networks
December 17th, 2015 ES1 An Introduction to On-chip Networks Davide Zoni PhD mail: davide.zoni@polimi.it webpage: home.dei.polimi.it/zoni Sources Main Reference Book (for the examination) Designing Network-on-Chip
More informationECE 551 System on Chip Design
ECE 551 System on Chip Design Introducing Bus Communications Garrett S. Rose Fall 2018 Emerging Applications Requirements Data Flow vs. Processing µp µp Mem Bus DRAMC Core 2 Core N Main Bus µp Core 1 SoCs
More informationSwizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems
1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna
More informationLecture 24: Board Notes: Cache Coherency
Lecture 24: Board Notes: Cache Coherency Part A: What makes a memory system coherent? Generally, 3 qualities that must be preserved (SUGGESTIONS?) (1) Preserve program order: - A read of A by P 1 will
More information5008: Computer Architecture
5008: Computer Architecture Chapter 4 Multiprocessors and Thread-Level Parallelism --II CA Lecture08 - multiprocessors and TLP (cwliu@twins.ee.nctu.edu.tw) 09-1 Review Caches contain all information on
More informationEXPLOITING PROPERTIES OF CMP CACHE TRAFFIC IN DESIGNING HYBRID PACKET/CIRCUIT SWITCHED NOCS
EXPLOITING PROPERTIES OF CMP CACHE TRAFFIC IN DESIGNING HYBRID PACKET/CIRCUIT SWITCHED NOCS by Ahmed Abousamra B.Sc., Alexandria University, Egypt, 2000 M.S., University of Pittsburgh, 2012 Submitted to
More informationAchieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation
Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation
More informationArchitecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC
BWCCA 2010 Fukuoka, Japan November 4-6 2010 Architecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC Akram Ben Ahmed, Abderazek Ben Abdallah, Kenichi Kuroda The University of Aizu
More informationLecture 22: Router Design
Lecture 22: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO 03, Princeton A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip
More informationInterconnection Networks: Topology. Prof. Natalie Enright Jerger
Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design
More informationBrief Background in Fiber Optics
The Future of Photonics in Upcoming Processors ECE 4750 Fall 08 Brief Background in Fiber Optics Light can travel down an optical fiber if it is completely confined Determined by Snells Law Various modes
More informationCache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas
Cache Coherence (II) Instructor: Josep Torrellas CS533 Copyright Josep Torrellas 2003 1 Sparse Directories Since total # of cache blocks in machine is much less than total # of memory blocks, most directory
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationAgenda. System Performance Scaling of IBM POWER6 TM Based Servers
System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies
More informationDeadlock and Livelock. Maurizio Palesi
Deadlock and Livelock 1 Deadlock (When?) Deadlock can occur in an interconnection network, when a group of packets cannot make progress, because they are waiting on each other to release resource (buffers,
More informationPower and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip
2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip Nasibeh Teimouri
More informationPortland State University ECE 588/688. Directory-Based Cache Coherence Protocols
Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All
More informationScalable Multiprocessors
Scalable Multiprocessors [ 11.1] scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. s the size of the system
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit
More informationTDT4260/DT8803 COMPUTER ARCHITECTURE EXAM
Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:
More informationSTLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip
STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip Codesign for Tiled Manycore Systems Mingyu Wang and Zhaolin Li Institute of Microelectronics, Tsinghua University, Beijing 100084,
More informationImproving Bandwidth Efficiency When Bridging on RPR. November 2001
Improving Bandwidth Efficiency When Bridging on RPR November 2001, Nortel Networks Anoop Ghanwani, Lantern Communications Raj Sharma, Luminous Robin Olsson, Vitesse CP Fu, NEC 11/1/01 Page 1 Components
More informationA Scalable SAS Machine
arallel omputer Organization and Design : Lecture 8 er Stenström. 2008, Sally. ckee 2009 Scalable ache oherence Design principles of scalable cache protocols Overview of design space (8.1) Basic operation
More informationWe are IntechOpen, the world s leading publisher of Open Access books Built by scientists, for scientists. International authors and editors
We are IntechOpen, the world s leading publisher of Open Access books Built by scientists, for scientists 3,900 116,000 120M Open access books available International authors and editors Downloads Our
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationLecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations
Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in
More informationECE/CS 757: Homework 1
ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)
More informationMinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect
MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu*, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu Carnegie Mellon University *CMU
More informationLecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )
Lecture: Coherence, Synchronization Topics: directory-based coherence, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory) keeps track
More informationEECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects
1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More information