SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto


2 Interaction of Coherence and Network
- Cache coherence protocol drives network-on-chip traffic
- Scalable coherence protocols needed for many-core architectures
- Consider interconnection network optimizations to help facilitate scalable coherence

3 Talk Outline
- Introduction
- Network-on-Chip Challenges with Scalable Coherence Protocol
- SigNet Architecture: Network filtering solution
- Evaluation
- Conclusion

10 Many-Core Cache Coherence Challenges
Broadcast
- Good latency
- Poor scaling due to bandwidth requirements
Directory
- Good scalability due to point-to-point communication
- Storage overheads
[Figure: directory transaction with message labels Store Miss, ForwardX, Response]

30 Scalable Cache Coherence
Directory protocol storage overheads
- Single bit per core in sharing vector (full map)
- 256 cores: 32 bytes of overhead per cache line!
- Example: cores 1, 2, 5 & 15 share cache line
Coarse Vector Directories
- Dir_iCV_r, where i: # of pointers, r: # of cores in region
- Example: Dir2CV2 (2 pointers) requires 1/2 as much storage
- Only 2 sharers: 1 & 15
- 3rd sharer: overflow
- In coarse mode, one bit represents a region (e.g., cores 0 & 1)
- 3 sharers but 6 invalidations
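The storage and invalidation counts on this slide can be reproduced with a short calculation. The sketch below is illustrative only: the function names are hypothetical, the 16-core system size for the Dir2CV2 example is an assumption inferred from the core numbers on the slide, and the choice of core 5 as the third sharer is also an assumption.

```python
import math

def full_map_bits(num_cores: int) -> int:
    """Full-map sharing vector: one presence bit per core."""
    return num_cores

def pointer_bits(num_cores: int, num_pointers: int) -> int:
    """Pointer storage: i pointers, each wide enough to name any core."""
    return num_pointers * math.ceil(math.log2(num_cores))

def coarse_targets(sharers, cores_per_region: int):
    """Cores invalidated in coarse mode: every core in a region that holds a sharer."""
    regions = sorted({s // cores_per_region for s in sharers})
    return [r * cores_per_region + k for r in regions for k in range(cores_per_region)]

# 256-core full map: 256 bits, i.e. 32 bytes per cache line.
print(full_map_bits(256) // 8)

# Assumed 16-core Dir2CV2 entry: 2 pointers x 4 bits = 8 bits vs. a 16-bit full map.
print(pointer_bits(16, 2), full_map_bits(16))

# Assumed sharers 1, 5 and 15 overflow the 2 pointers; with 2 cores per region,
# coarse mode sends invalidations to 6 cores.
print(coarse_targets([1, 5, 15], 2))
```

Three sharers that fall in three different regions always produce 3 x r invalidations, which is where the "3 sharers but 6 invalidations" figure comes from when r = 2.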

31 Extraneous Invalidations with Coarse Vectors
- For many applications, the number of sharers is small: 2 to 3
- When the number of sharers exceeds i, pointers overflow
  - Directory entry operates in coarse mode
  - 1 bit represents multiple cores
  - Imprecise sharing list
- Extra processors will receive and acknowledge invalidation
  - Consumes network power
  - Requires additional cache lookups

32 Latency Impact of Coarse Vectors
[Figure: Normalized Packet Latency; data labels 5%, 10%, 12%, 15%]
- Increase in coarseness leads to significant contention

33 Bandwidth Impact
[Figure: Normalized Link Traversals]
- Increased load results in decreased effectiveness of pipeline optimizations
- Increased dynamic power consumption

34 System-level impact
[Figure: Average Cycles for SPECjbb, SPECweb, TPC-H, TPC-W, Barnes, Ocean, Radiosity, Raytrace; series: Inval Complete, Local, Remote]
- Coarse vectors increase average packet latency
- Increase completion time for invalidations
  - All acknowledgments must be received
  - Delay subsequent requests to pending cache line

35 SigNet Overview
- Coarseness reduces directory storage
  - But with significant potential impact on network
  - Due to extraneous invalidations
- Safely remove extraneous invalidations
  - Save power and reduce network contention
- Place cache summary information in routers
  - Counting Bloom filters used for cache summary signatures
  - Use summary information to filter network packets

36 SigNet Bloom Filters
- Counting Bloom filters summarize cache information in routers (signature of cache contents)
- Bloom Filter Hit: a core between the current node and the destination is caching an address that maps to the same entry
- Bloom Filter Miss: none of the downstream caches are caching lines that map to this entry

37 Modifying Bloom Filters
- Bloom Filter Insertion: cache misses increment the counter as they travel to the directory
- Bloom Filter Deletion: writeback and invalidation acknowledgments decrement the associated counter at routers between cache and directory
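The insert, delete, and lookup behavior described on the two Bloom filter slides can be sketched as a small counting filter. This is a minimal illustration, not the hardware design: the class name, the single modulo hash, and the 64-byte line size are assumptions, and counter width and saturation are not modeled. Only the 8192-entry size comes from the slides.

```python
class CountingBloomSignature:
    """Counting Bloom filter summarizing cache lines reachable through one router port."""

    def __init__(self, num_entries: int = 8192, line_bytes: int = 64):
        self.num_entries = num_entries
        self.line_bytes = line_bytes  # assumed cache line size
        self.counters = [0] * num_entries

    def _index(self, address: int) -> int:
        # Hypothetical single hash: cache-line number modulo the table size.
        return (address // self.line_bytes) % self.num_entries

    def insert(self, address: int) -> None:
        """Apply when a cache miss request passes on its way to the directory."""
        self.counters[self._index(address)] += 1

    def delete(self, address: int) -> None:
        """Apply when a writeback or invalidation acknowledgment passes."""
        i = self._index(address)
        if self.counters[i] > 0:
            self.counters[i] -= 1

    def hit(self, address: int) -> bool:
        """True if some downstream cache may hold a line mapping to this entry."""
        return self.counters[self._index(address)] > 0


sig = CountingBloomSignature()
sig.insert(0x1000)
alias = 0x1000 + 8192 * 64       # different line, same entry under this hash
print(sig.hit(0x1000))           # inserted line is always found
print(sig.hit(alias))            # aliasing line also hits: a false positive
sig.delete(0x1000)
print(sig.hit(0x1000))           # counter back to zero: filter miss
```

Counters rather than single bits are what make deletion possible: several cached lines can map to one entry, and the entry must stay set until the last of them is written back or invalidated.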

38 SigNet Architecture
[Figure: SigNet router microarchitecture. Recoverable block labels: Route Computation, Decode, per-port signatures (Sig), Ack Message Formation, Input Buffers with multiple VCs per input port, VC Allocator, Switch Allocator, Crossbar switch, input and output ports]

39 SigNet Pipeline
[Figure: router pipeline stages BW, RC, VA, SA, ST, LT, and the modified head-flit pipeline. Partially recoverable labels: Head flit, Decode, Sig, Sig Hit, Sig Miss, Ack Msg]
- Header flits traverse the modified pipeline if the packet needs to check/update the signature

55 SigNet Example
1. Cache miss to A: send request to directory
2. At each hop, insert address into input port signature
3. Directory operating in coarse mode: bit for region set; directory responds to request
Requests follow X-Y path; responses follow Y-X path

67 SigNet Example (2)
4. Write request for A
5. Directory forwards invalidation requests
6. North port: filter hit; west port: filter miss
7. Spawn 2 acks
8. Invalidations reach 2 destination cores and acks are sent to directory
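Steps 6 to 8 can be sketched as a per-router filtering decision. The sketch below reflects one reading of the slide, namely that a filtered branch produces one acknowledgment per destination it would have reached. The function name, the port names, the list-of-counters signature, and the core numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

def filter_invalidation(
    entry: int,
    signatures: Dict[str, List[int]],
    dests_by_port: Dict[str, List[int]],
) -> Tuple[Dict[str, List[int]], List[int]]:
    """Split an invalidation's destinations into forwarded ones and locally acked ones."""
    forwarded: Dict[str, List[int]] = {}
    spawned_acks: List[int] = []
    for port, dests in dests_by_port.items():
        if signatures[port][entry] > 0:
            # Filter hit: a downstream cache may hold the line, so keep forwarding.
            forwarded[port] = dests
        else:
            # Filter miss: no downstream cache holds the line, so the router
            # acknowledges on behalf of each destination behind this port.
            spawned_acks.extend(dests)
    return forwarded, spawned_acks

# Hypothetical router state: entry 3 is present behind the north port only.
signatures = {"north": [0, 0, 0, 1], "west": [0, 0, 0, 0]}
dests_by_port = {"north": [2, 3], "west": [8, 9]}
forwarded, acks = filter_invalidation(3, signatures, dests_by_port)
print(forwarded)  # {'north': [2, 3]}
print(acks)       # [8, 9]
```

In this toy case two invalidations continue north and two acknowledgments are generated at the router, matching the counts in steps 7 and 8.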

68 SigNet Implementation
- Recall: full-map directory requires 32 bytes per sharing vector for 256 cores (50% overhead per cache line)
- Evaluation uses 8K-entry Bloom filters at each output port
- Reduces overhead to 12.5% to 25% per cache line, depending on size of counters and number of pointers in coarse vector directory
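The 32-byte and 50% figures can be checked with a short calculation. The 64-byte cache line is an assumption inferred from the 50% figure; the slides do not state the line size.

```latex
\[
\frac{256\ \text{cores} \times 1\ \text{bit/core}}{8\ \text{bits/byte}} = 32\ \text{bytes},
\qquad
\frac{32\ \text{bytes}}{64\ \text{bytes (assumed line size)}} = 50\%.
\]
```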

69 Correctness and Utilization
- All cores caching a block must receive the invalidation request
- Bloom filter can have false positives: lessens performance benefit but remains correct
- Bloom filter cannot have false negatives
- Differences in utilization due to location and memory usage

70 Simulation Methodology
Network configuration
  Number of Nodes               256
  Topology                      16x16 mesh
  Virtual Channels and Buffers  4 VC/port, 8 Buffers/VC
  Link Width                    16 Bytes
  Signature Size                8192

71 Simulation Methodology (2)
Workload Parameters
[Table: columns Name, % Invalidates, Average Sharers; rows Database, Web, Java, Scientific A, Scientific B; values not preserved in the transcription]
- Create synthetic benchmarks based on characteristics of 16-core workloads

72 Results: 2 pointers, 16 cores per region
[Figure: Normalized Average Packet Latency for CV Directory, SigNet, and Full Map on Database, Web, Java, SciA, SciB]
- Filters effectively reduce network contention
- Lower average packet latency
- Additional research needed to further close gap with Full Map Directory

73 Results: Invalidation Completion Time
[Figure: Cycles for Invalidations to Complete for CV Directory, SigNet, and Full Map on Database, Web, Java, SciA, SciB]
- SigNet improves invalidation completion time
- Comparison with Pruning Caches in paper

74 Related Work
Interconnection Network Support
- Pruning Caches
- In-network coherence filters
Cache Coherence Optimizations with Bloom filters
- Jetty filters: reduce cache snoops
- Tagless Coherence Directories: reduce storage overheads

75 Conclusions
- Characterize impact of CV directories: significant power consumption and performance degradation
- Interconnect support to facilitate scalable cache coherence
- SigNet: network filters to reduce extraneous invalidations

76 Thank you
Questions?


More information
