On-Chip Interconnect Implications of Shared Memory Multicores


1 On-Chip Interconnect Implications of Shared Memory Multicores
Srini Devadas
Computer Science and Artificial Intelligence Laboratory (CSAIL)
Massachusetts Institute of Technology

2 Programming 1000 cores
MPI has been used to exploit large-scale parallelism in some applications (e.g., 3D rendering)
  Requires individual tasks to be large; becomes difficult to apply at a fine-grained level
Paradigms such as MapReduce and shard-based databases have been successful in particular application domains
A shared memory abstraction is required for general-purpose programming and running an operating system

3 The Problem
Will familiar shared memory programming models be feasible at 1000 cores?

4 The Problem
Will familiar shared memory programming models be feasible at 1000 cores?
Cache Coherence Challenges
Interconnection Network Challenges

5 The Problem
Will familiar shared memory programming models be feasible at 1000 cores?
Cache Coherence Problem
  Existing full-map directory-based protocols do not scale
  High area overhead [O(N^2)]
  Energy overhead proportional to area
  Hotspots on networks due to frequent invalidations
Interconnection Network Problem

6 The Problem
Will familiar shared memory programming models be feasible at 1000 cores?
Cache Coherence Problem
Interconnection Network Problem
  Existing meshes, rings unlikely to scale
  Energy-inefficient due to multiple routers and links
  High latency from one core to another

7 Directory Cache Coherence Background
Directory-based protocols
  Need to keep track of who is sharing a cache block

8 Directory Cache Coherence Background
Directory-based protocols
  Need to keep track of who is sharing a cache block
Full-map directories
  Maintain a bit for every possible sharer
  Baseline protocol requires invalidation of all read copies (potentially at every core) and collection of acknowledgements (potentially from every core)

9 Full-Map Directories
Full-map directories
  Maintain a bit for every possible sharer
  For 1000 cores, need a 1000-bit vector for each 512-bit cache block (in naïve implementation)
Full-map directories consume too much area.
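The area argument can be checked with a few lines of arithmetic. This is a minimal sketch, not the speaker's model: the function names and the `blocks_per_core` parameter are hypothetical, and it assumes one directory entry per cached block per core.

```python
def sharer_vector_bits(num_cores: int) -> int:
    """Full-map directory: one presence bit per possible sharer."""
    return num_cores


def overhead_ratio(num_cores: int, block_bits: int = 512) -> float:
    """Directory bits stored per data bit for a single cache block."""
    return sharer_vector_bits(num_cores) / block_bits


def total_directory_bits(num_cores: int, blocks_per_core: int) -> int:
    """Total sharer-vector storage on the chip.

    Assumption (not from the slides): every core tracks `blocks_per_core`
    blocks, each with its own full-map vector. The result grows as N^2.
    """
    return num_cores * blocks_per_core * sharer_vector_bits(num_cores)


if __name__ == "__main__":
    # 1000 presence bits guarding a 512-bit block: the directory is
    # larger than the data it tracks.
    print(f"overhead at 1000 cores: {overhead_ratio(1000):.2f}x")
    # Doubling the core count quadruples total directory storage.
    print(total_directory_bits(2000, 1024) / total_directory_bits(1000, 1024))
```

With 1000 cores the sharer vector is almost twice the size of the block it describes, which is the sense in which the naïve scheme consumes too much area.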

10 Limited Directories
Limited directories
  Limited number of hardware pointers (k) in the sharer list

Address   State   Sharer 1   Sharer 2   ...   Sharer k

11 Limited Directories
Limited directories
  Limited number of hardware pointers (k) in the sharer list
Dir(k)B
  Allow unlimited sharers, but if (# sharers > k), use broadcast invalidate on exclusive request
  Requires ACKs from ALL cores
If sharers > k, we are broadcasting to 1000 cores and waiting for acknowledgements from 1000 cores
Interconnect network will need to handle this traffic efficiently, else performance suffers.
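The Dir(k)B behaviour can be sketched as a small directory-entry model. The class and method names are hypothetical, and the sketch assumes the broadcast goes to every core other than the requester; the slides do not spell out either detail.

```python
from dataclasses import dataclass, field


@dataclass
class DirKBEntry:
    """Illustrative Dir(k)B directory entry (hypothetical names)."""
    k: int                      # number of hardware sharer pointers
    num_cores: int
    pointers: list = field(default_factory=list)
    overflow: bool = False      # set once more than k sharers exist

    def add_sharer(self, core: int) -> None:
        if self.overflow or core in self.pointers:
            return
        if len(self.pointers) < self.k:
            self.pointers.append(core)
        else:
            # Out of pointers: sharer identities are no longer tracked.
            self.overflow = True

    def exclusive_request(self, requester: int):
        """Return (cores to invalidate, number of ACKs to collect)."""
        if self.overflow:
            # Broadcast invalidate; every other core must acknowledge.
            targets = [c for c in range(self.num_cores) if c != requester]
        else:
            targets = [c for c in self.pointers if c != requester]
        # The requester becomes the only holder of the block.
        self.pointers, self.overflow = [requester], False
        return targets, len(targets)


if __name__ == "__main__":
    entry = DirKBEntry(k=3, num_cores=1000)
    for core in (10, 20, 30, 40):       # the fourth sharer overflows k=3
        entry.add_sharer(core)
    targets, acks = entry.exclusive_request(requester=7)
    print(acks)                         # ACKs awaited despite only 4 real sharers
```

In this model a block with four sharers still costs 999 invalidations and 999 acknowledgements once the pointers overflow, which is the traffic the interconnect has to absorb.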

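12 1-to-M (Broadcast/Multicast) and M-to-1 (Acks) occurrence?
[Figure: occurrence of 1-to-M and M-to-1 messages under AMD HyperTransport and Token Coherence, from 64-core full-system simulations. Percentages shown in the figure: 14%, 14%, 2%, 51%, 71%, 47%.]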

13 Why are 1-to-M and M-to-1 bad?
1-to-M => multicasts; M-to-1 => ACKs
Increased bandwidth consumption
  Up to M times
Increased network contention
  M messages at src/dest links
  More packets in network
  Worse as M increases
Increased power consumption
Bursty, not sustained
Can we handle using an efficient network design?

14 Problem: how to route broadcasts?
[Figure: Sparse Multicast Tree versus Broadcast/Dense Multicast Tree, annotated "Contention!" and "Idle links!"]
Tree constructed dynamically based on destination set
Same destination set (all nodes) => same tree structure
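One way to see why the tree depends only on the destination set is to build it from per-destination routes. The sketch below assumes a 2-D mesh with dimension-ordered (XY) routing, which the slides do not specify; `xy_path` and `multicast_tree` are hypothetical helper names.

```python
def xy_path(src, dst):
    """Links on the XY route from src to dst: move along x first, then y."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def multicast_tree(src, dests):
    """Union of XY routes; shared prefixes collapse into one tree."""
    tree = set()
    for d in dests:
        tree.update(xy_path(src, d))
    return tree


if __name__ == "__main__":
    mesh = [(x, y) for x in range(4) for y in range(4)]
    src = (0, 0)
    sparse = multicast_tree(src, [(3, 0), (0, 3)])
    dense = multicast_tree(src, [n for n in mesh if n != src])
    # A broadcast tree spans all 16 nodes, so it uses 15 links.
    print(len(sparse), len(dense))
    # The same source and destination set always give the same tree.
    assert dense == multicast_tree(src, [n for n in mesh if n != src])
```

Under this routing assumption every broadcast from a given source reuses the same links, so repeated broadcasts load those links while the others sit idle.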

15 Network-Centric Approaches
ATAC and ACKwise architecture that leverages photonics (Agarwal group)
  Assumes a fast, energy-efficient optical network that enables energy-efficient broadcasts and long-distance messages

16 ATAC From 10,000 Feet
Tiled Multicore Processor with Optical Network Overlay
  2-D array of simple cores connected by an electrical mesh network
  Electrical network provides efficient short-range communication
  Optical overlay network provides fast broadcast and long-distance communication
[Figure labels: Electrical Mesh Interconnect, Optical WDM Interconnect]

17 ACKwise Protocol
Extension of Dir(k)B protocol
Designed to leverage the ATAC broadcast network

Address   State    Global   Sharers 1-3
addr      shared   false    Core-A  Core-B  Core-C

Structure of an ACKwise(3) Directory Entry

18 ACKwise Protocol
Number of Sharers > Number of Hardware Pointers (k)
  Tracks the number of sharers
  If (# sharers > k), use broadcast invalidate on exclusive request
  Requires ACKs from ONLY sharers

Address   State    Global   Sharers 1-3
addr      shared   true     4

Structure of an ACKwise(3) Directory Entry
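The two ACKwise(3) directory entries shown on these slides can be modelled as one small state machine. This is an illustrative sketch with hypothetical names; it assumes the directory only sees `add_sharer` for cores that do not already hold the block, and that the requester of an exclusive copy is not itself a sharer.

```python
from dataclasses import dataclass, field


@dataclass
class AckwiseEntry:
    """Illustrative ACKwise(k) directory entry (hypothetical names)."""
    k: int
    pointers: list = field(default_factory=list)  # valid while global_bit is False
    global_bit: bool = False                      # True once sharers exceed k
    num_sharers: int = 0                          # always maintained

    def add_sharer(self, core: str) -> None:
        if not self.global_bit and core in self.pointers:
            return
        self.num_sharers += 1
        if not self.global_bit and len(self.pointers) < self.k:
            self.pointers.append(core)
        else:
            # Pointer storage is given up; only the count survives.
            self.global_bit = True
            self.pointers.clear()

    def exclusive_request(self):
        """Return (broadcast needed?, ACKs to wait for)."""
        return self.global_bit, self.num_sharers


if __name__ == "__main__":
    entry = AckwiseEntry(k=3)
    for core in ("Core-A", "Core-B", "Core-C"):
        entry.add_sharer(core)
    print(entry.global_bit, entry.pointers)   # pointers still identify sharers
    entry.add_sharer("Core-D")
    print(entry.exclusive_request())          # broadcast, but wait for 4 ACKs only
```

The invalidation is still a broadcast once the pointers overflow, but the directory waits for as many acknowledgements as there are sharers rather than one per core.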

19 ATAC Architecture Details
[Figure: (a) 64 Optically-Connected Clusters (ONet); (b) Electrical Networks Connecting 16 cores. Labels: Hub, ENet, StarNet.]
Takeaway: Optical network necessary but not sufficient for efficient coherence and high performance


20 Evaluation Requires New Toolflow
[Figure: toolflow diagram.
  Inputs: Cache Models, Benchmark, Network Models
  Tools: Graphite, McPAT, Modified Orion 2.0, Optical Models
  Outputs: Completion Time; Cache Energy & Area; Electrical Router & Link Energy & Area; Optical Link Energy & Area
  Other labels: Cache Counters, NM Electrical Router & Link Counters, Optical Link Counters, Electrical Technology Parameters, Optical Technology Parameters]


21 Network-Centric Approaches
ATAC and ACKwise architecture that leverages photonics (Agarwal group)
  Assumes a fast, energy-efficient optical network that enables energy-efficient broadcasts and long-distance messages
Directoryless coherence via execution migration
  Migrate threads as opposed to migrating data for faster data access
  Requires high-bandwidth interconnect network

22 Execution Migration Machine (EM²)
No data replication: Remote Access (RA) memory organization
Idea: send thread on 1st memory access
  move context (RF etc.) to core where data lives
  possibly evict a context currently at destination and execute in its place
  migration entirely at hardware level for speed
Avoids directories and directory protocols but poses challenges in interconnect design
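The migration rule can be illustrated with a toy model. Everything here is a sketch under stated assumptions, not the EM² hardware: addresses are mapped to home cores by `addr % num_cores`, each thread keeps a reserved context at a "native" core, each core has a fixed number of guest slots, and an evicted thread returns to its native core. All names are hypothetical.

```python
class EM2Model:
    """Toy model of execution migration (illustrative assumptions only)."""

    def __init__(self, num_cores: int, guest_slots: int = 1):
        self.num_cores = num_cores
        self.guest_slots = guest_slots
        self.guests = [[] for _ in range(num_cores)]  # visiting threads per core
        self.native = {}      # thread id -> core holding its reserved context
        self.location = {}    # thread id -> core it currently runs on
        self.migrations = 0

    def home(self, addr: int) -> int:
        # Assumption: each address is cached at exactly one home core.
        return addr % self.num_cores

    def spawn(self, tid: int, core: int) -> None:
        self.native[tid] = core
        self.location[tid] = core

    def _move(self, tid: int, dest: int) -> None:
        cur = self.location[tid]
        if cur != self.native[tid]:
            self.guests[cur].remove(tid)
        if dest != self.native[tid]:
            if len(self.guests[dest]) >= self.guest_slots:
                # Evict the oldest guest back to its native core.
                victim = self.guests[dest].pop(0)
                self.location[victim] = self.native[victim]
                self.migrations += 1
            self.guests[dest].append(tid)
        self.location[tid] = dest
        self.migrations += 1

    def access(self, tid: int, addr: int) -> int:
        """Run the access at the data's home core, migrating if needed."""
        dest = self.home(addr)
        if self.location[tid] != dest:
            self._move(tid, dest)
        return dest


if __name__ == "__main__":
    m = EM2Model(num_cores=4)
    m.spawn(0, core=0)
    m.spawn(1, core=1)
    m.access(0, addr=2)   # thread 0 migrates to core 2
    m.access(1, addr=6)   # thread 1 also needs core 2 and evicts thread 0
    print(m.location, m.migrations)
```

Each migration or eviction in this model is a context-sized message on the network, which is why the scheme shifts the difficulty from directory design to interconnect design.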

23 EM² Pluses and Minuses
+ One-way data access through thread migration
    Determine core miss and remote core destination in parallel with L1 lookup
+ No replication across on-chip caches lowers off-chip memory access rates in comparison to DirCC
+ No directories, broadcast or multicast required
- Context size of 2-4Kb significantly greater than 1 word (RA), and greater than 512-bit cache block size (Directory protocols)
- Contention especially for reads of shared data (even read-only data is not replicated)
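The context-size drawback is easy to quantify. The sketch below reads "Kb" as 1024 bits and assumes a 32-bit word; both are assumptions, since the slide gives neither, and the names are hypothetical.

```python
BITS_PER_KB = 1024        # assumption: "Kb" read as 1024 bits
CACHE_BLOCK_BITS = 512    # from the slide
WORD_BITS = 32            # assumption: word size is not given on the slide


def transfer_ratio(context_bits: int, payload_bits: int) -> float:
    """How many times larger a migrated context is than another payload."""
    return context_bits / payload_bits


if __name__ == "__main__":
    for kb in (2, 4):
        ctx = kb * BITS_PER_KB
        print(
            f"{kb} Kb context: "
            f"{transfer_ratio(ctx, CACHE_BLOCK_BITS):.0f}x a cache block, "
            f"{transfer_ratio(ctx, WORD_BITS):.0f}x a word"
        )
```

Under these assumptions a migration moves 4 to 8 times as many bits as a cache-block transfer, so it pays off only when it replaces several remote accesses.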

24 AML versus network bandwidth
Needs high-bandwidth, low-contention network to be competitive

25 Summary
Network design and implementation is going to be hugely important in providing shared memory abstractions for multicores regardless of particular approach used!
