SCORPIO: 36-Core Shared Memory Processor

Size: px

Start display at page:

Download "SCORPIO: 36-Core Shared Memory Processor"

Hilary Buddy Kelly
5 years ago
Views:

: 36- Shared Memory Processor Demonstrating Snoopy Coherence on a Mesh Interconnect Chia-Hsin Owen Chen Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon,

implementation (Bhavya) Memory interface controller implementation (Owen) High-level idea of notification network (Woo-Cheol) Network architecture (Woo-Cheol, Bhavya, Owen, Tushar, Suvinay) Network

1 : 36- Shared Memory Processor Demonstrating Snoopy Coherence on a Mesh Interconnect Chia-Hsin Owen Chen Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon, Brett Wilkerson, John Arends, Anantha Chandrakasan, Li-Shiuan Peh Contributions: integration (Bhavya and Owen), Cache coherence protocol design (Bhavya and Woo Cheol) L2 cache controller implementation (Bhavya) Memory interface controller implementation (Owen) High-level idea of notification network (Woo-Cheol) Network architecture (Woo-Cheol, Bhavya, Owen, Tushar, Suvinay) Network implementation (Suvinay) DDR2 and PHY integration (Sunghyun and Owen) Backend of entire chip (Owen) FPGA interfaces, on-chip testers and scan chains (Tushar) RTL functional simulations (Bhavya, Owen, Suvinay) Full-system GEMS simulations (Woo-Cheol) Board Design (Sunghyun) Software Stack (Bhavya and Owen) Package Design (Freescale) MTL

2 Overview 0 1 Memory controller IBM 45nm SOI, 143mm 2 600M transistors 13 mm cores with total 4.5MB L2 Focus on how snoopy coherence is enabled on a mesh interconnect 6 6 mesh on-chip network supporting snoopy coherence Memory controller 1 11 mm 35 Dual channel DDR2 memory controller Owen Chen / MIT 2

3 Architecture Freescale e200 z760n3 In-order Dual-issue L1-I L Network 33 L1-D Writethrough Private L1 cache Split 16KB for Inst / Data 4-way set associative Private L2 cache 128KB 4-way set associative Inclusive Backinvalidate Owen Chen / MIT 3

Snoopy Coherence 0 1 35 L1-I L2 L1-D Writethrough L1-I L2 L1-D L1-I L2 L1-D Backinvalidate REQX

cacheline) Network Network Directory Directory Cache Cache Memory Memory Controller Controller

4 Snoopy Coherence L1-I L2 L1-D Writethrough L1-I L2 L1-D L1-I L2 L1-D Backinvalidate REQX Network Network Network DATA Scalable On-Chip Network Low overhead directory entry (2 bits per cacheline) Network Network Directory Directory Cache Cache Memory Memory Controller Controller MOSI coherence protocol Additional O dirty state Data forwarding list Destination filtering Owen Chen / MIT 4

5 Scalable On-Chip Network L1-I L1-D L2 Network L1-I L1-D L2 Network 6 6 mesh interconnect 137b wide data-path One network node / tile Owen Chen / MIT 5

Scalable On-Chip Network VC Allocators Switch Allocators East Input Port Router Other Controls 5 5 Crossbar Local East 6 6 mesh interconnect 137b wide data-path One network node /

Resp Virtual Network Buffer Read (BR) Switch Allocation Outport (SA-O) VC Allocation (VA) Lookahead/Header Generation Switch Traversal (ST) North Optimizations Multiple virtual

6 Scalable On-Chip Network VC Allocators Switch Allocators East Input Port Router Other Controls 5 5 Crossbar Local East 6 6 mesh interconnect 137b wide data-path One network node / tile East Req Virtual Network South West Deadlock avoidance Two virtual networks Dimensional X-Y routing Regular Pipeline Stages Buffer Write (BW) Switch Arbitration Inport (SA-I) Resp Virtual Network Buffer Read (BR) Switch Allocation Outport (SA-O) VC Allocation (VA) Lookahead/Header Generation Switch Traversal (ST) North Optimizations Multiple virtual channels / VN In-network broadcast support Virtual router pipeline bypass Bypass Pipeline Stages Bypass Intermediate Pipelines Switch Traversal (ST) 3 cycle 1 cycle Owen Chen / MIT 6

7 Globally-Ordered Virtual Network Problem: Broadcast Messages delivered to different nodes in Solution: different Decouple orders on message unordered delivery networks from ordering We want: Every node to see all messages in the same global order L1-I L1-D L1-I L1-D L1-I L1-D L2 Network Req2 Req1 Req1 Req2 L2 Network Req2 Req1 Req1 Req2 L2 Network Req1 Req2 Unordered Interconnect Owen Chen / MIT 7

all messages in the same global order 30 31 32 33 34 35 Main network Message delivery 24 25 26 27 28 29 18

8 Globally-Ordered Virtual Network Problem: Broadcast Messages delivered to different nodes in Solution: different Decouple orders on message unordered delivery networks from ordering We want: Every node to see all messages in the same global order Main network Message delivery Notification network Message ordering Owen Chen / MIT 8

DFF + ORs Notification Router Bitwise- OR All tiles determine the global order locally Broadcast messages on main

9 Notification Network In east In south In west In north In nic Notification DFF Out local Out north Out west Out south Out east Bounded latency ( 12 cycle ) Non-blocking 1 cycle / hop broadcast mesh Dedicated 1 bit / tile Low cost Only DFF + ORs Notification Router Bitwise- OR All tiles determine the global order locally Broadcast messages on main network Inject corresponding notifications All tiles receive the same notifications Timeline Time Window Owen Chen / MIT 9

10 Walkthrough T1. 11 injects T3. Both cores inject notification N1 N2 Timeline T2. 1 injects 1, 2, 3, 5, 6, 9 receive Main network Notification network GETS Addr2 T GETX Addr1 T T N2 N1 T3 Owen Chen / MIT 10 N1 Broadcast notification for N2 Broadcast notification for

11 Walkthrough T3. Both cores inject notification N1 N2 T4. Notifications N1 N2 guaranteed to reach all nodes now Timeline 1, 2, 3, 5, 6, 9 receive T5. 1, 2, 3, 5, 6, 9 processed T4 N1 N2 Merged Notification T Owen Chen / MIT 11

12 Walkthrough s receive and process in any order, followed by T7. 6, owner of Addr1, responds R1 with data to 11 Timeline T5. 1, 2, 3, 5, 6, 9 processed Data Addr2 T6 T6. 13, owner of Addr2, responds R2 with data to 1 R Data Addr1 T7 R Owen Chen / MIT 12

13 Synchronization Primitives lwarx, stwcx Link in L2 cacheline granularity Detect modifications after load-link using coherence protocol msync Broadcast sync requests Gather acks from all cores when they complete the sync request Owen Chen / MIT 13

Evaluation Setup Simulator Access times GEMS + GARNET L1 1 cycle; L2 10 cycles; DRAM 90 cycles Limited Pointer Directory Coherence AMD HyperTransport Coherence Snoopy Coherence:

14 Evaluation Setup Simulator Access times GEMS + GARNET L1 1 cycle; L2 10 cycles; DRAM 90 cycles Limited Pointer Directory Coherence AMD HyperTransport Coherence Snoopy Coherence: MOSI What is tracked? Who orders requests? Few sharers Presence of owner Presence of owner Directory Directory Network Isolate Storage overhead Indirection latency Owen Chen / MIT 14

15 Normalized Runtime Runtime Comparison % better than Limited Pointer Directory 13% better than Hyper-Transport Owen Chen / MIT 15

Cycles Cycles L2 Service Latency Requests served by other caches Network: Req to Dir Dir Access

19% lower than 18% lower than 0 barnes fft lu blackscholes canneal fluidanimate average Requests served

Network: Bcast Req Dir Access Req Ordering Network: Resp 90% requests served by other caches 150 100 50

16 Cycles Cycles L2 Service Latency Requests served by other caches Network: Req to Dir Dir Access Network: Dir to Sharer Network: Bcast Req Req Ordering Sharer Access Network: Resp % lower than 18% lower than 0 barnes fft lu blackscholes canneal fluidanimate average Requests served by directory -- MC Average L2 service latency 17% lower than 14% lower than Network: Req to Dir Network: Bcast Req Dir Access Req Ordering Network: Resp 90% requests served by other caches % lower than 4.2% higher than barnes fft lu blackscholes canneal fluidanimate average Owen Chen / MIT 16

17 Network Cost Network occupies only 10% of the area Area Network consumes 20% of the power Power Post-layout frequency: 833 MHz Owen Chen / MIT 17

18 Contributions : A 36-core shared-memory processor Snoopy coherency on a mesh interconnect: Runtime: 24% better than, 13% better than Cost: 833MHz Novel network-on-chip for scalable snoopy coherence New ideas: Distributed in-network ordering mechanism Decouple message delivery from message ordering Owen Chen / MIT 18

19 Ongoing Work Software stack development Boot Linux Run PARSEC, SPLASH,, etc Chip measurement Power, timing Performance Owen Chen / MIT 19

20 MTL THANK YOU!

SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In- Network Ordering

SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In- Network Ordering The MIT Faculty has made this article openly available. Please share how this access benefits