Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

Size: px

Start display at page:

Download "Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems"

Magnus Stokes
5 years ago
Views:

1 1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna Das, Thomas Wenisch, Dennis Sylvester, David Blaauw, and Trevor Mudge 1 1

2 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 2 2 2

3 Swizzle Switch 3 Conventional Matrix Crossbar Swizzle Switch Embeds arbitration within crossbar single cycle arbitration Re-use input/output data buses for arbitration SRAM-like layout with priority bits at cross-points Low-power optimizations Excellent scalability 3 3

4 Data Routing 4 Multicast & Broadcast Bitlines discharged if Data = 1 Crosspoint = 1 4 4

5 Swizzle Switch Architecture 5 Priority vector Priority vectors Data routing, arbitration, And priority update control embedded within crosspoints 5 5

6 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 6 6 6

7 Inhibit Based Arbitration 7 This diagram is a single column in the Swizzle-Switch (output), each output arbitrates/transfers data independently Each Crosspoint has a sense amp/ latch to indicate connectivity. Each input samples a unique bit of the output bus to determine if it has been granted the channel Priority vectors are stored and when a request is issued they discharge bits along the output columns to INHIBIT lower priority requests Finally, the priority vectors are updated when the data transfer completes. 7 7

8 Least Recently Granted(LRG) 8 LOWEST Priority Discharges NO Priority Lines INTERMEDIATE Priority Discharges SOME Priority Lines HIGHEST Priority Discharges ALL Priority Lines 8 8

9 Least Recently Granted(LRG) 9 Example Arbitration: (1) Req l and Req m Request the bus (red lines) (2) Req m discharges Priority line l, priority lines m and n remain charged (green lines) (3) Req l senses Priority line l and is inhibited (not granted), Req m senses Priority line m and is not inhibited (4) The crosspoint records the connectivity at input m 9 9

10 Least Recently Granted(LRG) 10 Example Priority Update: SET Input m signals it is done with data transfer by asserting Rel m RESET 10 10

11 Least Recently Granted(LRG) 11 INTERMEDIATE Priority Discharges SOME Priority Lines LOWEST Priority Discharges NO Priority Lines HIGHEST Priority Discharges ALL Priority Lines 11 11

12 Least Recently Granted(LRG)

13 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation

14 64x64 Prototype

15 Measurement Results

16 Measurement Results

17 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation

18 Scaling Interconnect for Many-Cores 18 Existing interconnects Buses, Crossbars, Rings Limited to ~16 cores Other s Interconnect proposals for Many-Cores Packet-switched, multi-hop, network-on-chip (NoC) Grid of routers meshes, tori and flattened butterfly Our Proposal Swizzle Switch Networks Flat single-stage, one-hop, crossbar++ interconnect 18 18

73 mm ARM Cortex A5 256 kb L2 Cache Bank 32 kb DCache R 1.

19 Mesh Network-on-Chip 19 Memory Controller Memory Controller Memory Controller Memory Controller Memory Controller Memory Controller 13.8 mm 32 kb ICache 1.73 mm ARM Cortex A5 256 kb L2 Cache Bank 32 kb DCache R 1.73 mm Memory Controller Memory Controller Area = 3.0 mm mm Area = 190 mm 2 (a) Router R 19 19

20 Flattened Butterfly Network-on-Chip 20 Memory Controller Memory Controller Memory Controller Memory Controller Router Router Router Router Router Router Router Router Router Router Router Router Router Router Router Router Memory Controller Memory Controller 13.8 mm Memory Controller Memory Controller 13.8 mm Area = 190 mm

21 Motivating Swizzle Switch Networks 21 Uniform access latency Ease of programming, data placement, thread placement,... Low Power Simplicity Packet-switched NoCs need routing, congestion management, flow control, wormhole switching,

22 Motivating Swizzle Switch Networks 22 Mesh SSN!"#$%&#'(')*)+,'-./01234#10#5' 627%.$2#00'('89*9'''!"#$%&#'(')*)+,'-./01234#10#5' 627%.$2#00'('+*))8''' accepted throughput [flits/node/cycle] node index (X) node index (Y) node index (X) node index (Y) Unfairness = Node highest_throughput / Node lowest_throughput Hotspot Traffic = All nodes sending data to node 8,8 Under Hotspot traffic, the Crossbar has a slightly less throughput than the Mesh but is 40x more fair

23 Motivating Swizzle Switch Networks 23 Mesh!"#$%&#'(')*+,'-./01234#10#5' 627%.$2#00'('8*9:''' SSN!"#$%&#'(')*+,'-./01234#10#5' 627%.$2#00'('8*)9''' node index (X) node index (Y) node index (X) node index (Y) In the Mesh, nodes closest to the center receive the highest throughput Under Uniform Random traffic, the Crossbar has more throughput than the Mesh and is 87% more fair

24 Motivating Swizzle Switch Networks

25 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation

26 Top-Level Floorplan 26 Memory Controller Memory Controller Memory Controller Memory Controller NW WN WS SW Swizzle Switch x 3 NE EN ES SE Memory Controller Memory Controller 14.3 mm 3.27 mm 512 kb L2 cache L2 Area = 4.50 mm 2 32 kb Icache 0.86 mm Arm Cortex A5 32 kb Dcache 0.86 mm 1.38 mm Memory Controller 14.3 mm Total Area = 204 mm 2 Memory Controller Core + L1 Area =.74 mm

27 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation

28 Evaluation 28 Simulation Parameters Feature NoC (Mesh/FBFly) SSN Processors L1 Cache 64 in-order cores, 1 IPC, 1.5 GHz 32kB I/D Caches, 4-way associative, 64-byte line size, 1 cycle latency L2 Cache Shared L2, 16 MB, 64-way banked, 8- way associative, 64-byte line size, 10 cycle latency Interconnect Main Memory 3.0 GHz, 128-bit, 4-stage Routers, 3 virt. networks w/ 3 virt. channels 4096MB, 50 cycle latency Shared L2, 16MB, 32-way banked, 16-way associative, 64-byte line size, 11 cycle latency 1.5 GHz, 64x32x128bit Swizzle Switch Network Benchmarks SPLASH 2 : Scientific parallel application suite 28 28

29 Results Performance & QoS 29 Normalized Execution Time 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Synchronization Stall Core Active Memory Stall Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Barnes. Cholesky. FFT. FMM. Lu Contig. Lu NonContig. Ocean Contig. Ocean NonContig. Radix. Raytrace. Water NSquared. Water Spatial Overall Performance Quality-of-Service 29 29

30 Results Power 30 Interconnect Power (mw)! 7,000! 6,000! 5,000! 4,000! 3,000! 2,000! 1,000! 0! Wire Dynamic! SSN/Router Dynamic! SSN/Router Leakage! Clock! Buffer! Barnes!.! Cholesky!.! FFT!.! FMM!.! Lu Contig!.! Lu NonContig!.! Ocean Contig!.! Ocean NonContig!.! Radix!.! Raytrace!.! Water NSquared!.! Water Spatial!.! Weighted Average! On average the SSN uses 28% less power in the interconnect compared to a flattened butterfly Total Benchmark Energy! (pj/inst)! 900! 800! 700! 600! 500! 400! 300! 200! 100! 0! Interconnect! L2 Dynamic! L2 Leakage! Core/L1 Dynamic! Core/L1 Leakage! Barnes!.! Cholesky!.! FFT!.! FMM!.! LuContig!.! Lu NonContig!.! Ocean Contig!.! Ocean NonContig!.! Radix!.! Raytrace!.! Water NSquared! Which results in an average reduction in total system energy to complete the task of 11%.! Water Spatial!.! Average! 30 30

31 Summary Swizzle Switch Prototype (45nm) 64x64 Crossbar with 128-bit busses Embedded LRG priority arbitration Achieved 4.4 ~600MHz consuming only 1.3W of power 31 Swizzle Switch Network Evaluation Improved performance by 21% Reduced power by 28% Reduced latency variability by 3x 31 31

32 32 Additional Detailed Slides 32 32

33 Arbitration Mechanism (Matrix View) 33 Inhibits (X) Requests (R) X 0 X 1 X 2 X 3 X 4 Priority R 0 X R 1 0 X R X R X 1 4 R X

34 Least Recently Granted (LRG) 34 S et Reset X 0 X 1 X 2 X 3 X 4 Priority In 0 X In 1 0 X In X In X 1 4 In X 3 X 0 X 1 X 2 X 3 X 4 Priority In 0 X In 1 0 X In X In X 1 4 In X

35 Round Robin Arbitration 35 SET RESET 35 35

36 Round Robin Arbitration

37 QoS Arbitration 37 QoS Arbitration LRG Arbitration 37 37

38 Timing Diagram

39 Crosspoint Circuit 39 Priority-line Ack 39 39

Regenerative Bit-line Repeater 40 16 16 Macro Bit-line

40 Regenerative Bit-line Repeater Macro Bit-line Repeaters SSN Regeneration and Decoupling improves speed 40 40

41 Simulated bit-line delay improvement 41 Technology : 45nm Supply : 1.1V Temperature : 25 C 41 41

42 SSN Scaling: Simulation Technology : 45nm Supply : 1.1V Temperature : 25 C 42 wo Repeater 256 bit Bus with Repeater 128 bit Bus with Repeater Regenerative repeaters improve SSN scalability 42 42

43 Swizzle Switch Network-on-Chip 43 (1%)*%3, (1%)!%-./ ('%)!%-./ ('%)*%+, (1%)!%+, (1%)*%-./ ('%)*%-./ ('%)!%+, Memory Controller Memory Controller $ # 4 $ # $ $ 4 Memory Controller Memory Controller NW WN WS SW Swizzle Switch x 3 Memory Controller L mm NE EN ES SE Memory Controller Destination L2 Memory Controller Memory Controller 14.3 mm 1(%)*%-./ 1(%)!%-./ 1(%)*%+, 10%)*%+, 10%)!%-./ 1(%)!%+, 10%)!%+, 10%)*%-./ $ 4 $ $ # 4 # $ $!"#$%&$!'# ()*+'*"', -./0012$ -./ '%)!%-./ 01%)!%-./ 01%)*%3, 4 # $ 4!"#$%&$!"- ()*()*"', -./ /345 "6"88 9!"$:;*!'#$%&$!"# +'*()*"', -./0012$-./ $ $ # 0'%)!%+, $ # $ # 4 $ $ 4 '(%)*%+,% '(%)!%-./% '(%)*%-./!"#$%&& '(%)!%+, '0%)!%+, '0%)*%+, '0%)*%-./ '0%)!%-./ Source L1 L2 Shared Data Data Forwarding Responses Invalidations Requests Writebacks 0'%)*%-./ 01%)*%-./ 01%2!%+, 0'%)*%+,!"#$%&& (b) 43 43

44 Results 64-core with A9 O3 cores

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip