EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

Size: px

Start display at page:

Download "EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects"

Beryl Boone
5 years ago
Views:

1 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter

2 Announcements Upcoming lecture schedule Today: On-chip Interconnects (Circuits and Photonics) 2/25: Off-chip Interconnects 3/1 & 3/3: Spring Break 3/8: Midterm Review, Structure for rest of lectures, Project Specification Midterm Exam: In-Class March 1 th, closed notes 2 Reading Assignments: For this week: "Swizzle-Switch Networks for Many-Core Systems," Korey Sewell et al. JETCAS IEEE Journal on Emerging and Selected Topics in Circuits and Systems, June 212 "Corona: System implications of emerging nanophotonic technology." Dana Vantrease et al. ISCA

3 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 3 3 3

4 Swizzle Switch 4 Conventional Matrix Crossbar Swizzle Switch Embeds arbitration within crossbar single cycle arbitration Re-use input/output data buses for arbitration SRAM-like layout with priority bits at cross-points Low-power optimizations Excellent scalability 4 4

5 Data Routing Source IP Source IP Pre-charge 1 Multicast & Broadcast Bitlines discharged if Data = 1 Crosspoint = 1 5 Source IP 1 Read Buffers 128 Dest. IP Dest. IP Dest. IP 5 5

6 Swizzle Switch Architecture 6 Output bus used as prioritybus Source IP Source IP Pre-charge 1 Req k Rel k Req l Rel l vector Source IP vectors 1 Req m Rel m Dest. IP Read Buffers Dest. IP Dest. IP Sense Amp + Latch Req n Rel n 1 1 Data routing, arbitration, And priority update control embedded within crosspoints 6 6

7 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 7 7 7

8 Inhibit Based Arbitration Req l Req m Sense Amp + Latch Req n Output bus used as priority bus x y z line l x < y line m 1 z > y 1 1 line n This diagram is a single column in the Swizzle-Switch (output), each output arbitrates/transfers data independently Each Crosspoint has a sense amp/ latch to indicate connectivity. Each input samples a unique bit of the output bus to determine if it has been granted the channel vectors are stored and when a request is issued they discharge bits along the output columns to INHIBIT lower priority requests Finally, the priority vectors are updated when the data transfer completes

9 Least Recently Granted(LRG) Output bus used as priority bus 9 Req l x line l x < y line n LOWEST Discharges NO Lines Req m Sense Amp + Latch Req n y z line m 1 z > y 1 1 INTERMEDIATE Discharges SOME Lines HIGHEST Discharges ALL Lines 9 9

10 Least Recently Granted(LRG) Output bus used as priority bus 1 Req l Req m Sense Amp + Latch Req n x 1 y z line l x < y line m 1 z > y 1 1 line n Example Arbitration: (1) Req l and Req m Request the bus (red lines) (2) Req m discharges line l, priority lines m and n remain charged (green lines) (3) Req l senses line l and is inhibited (not granted), Req m senses line m and is not inhibited (4) The crosspoint records the connectivity at input m 1 1

11 Least Recently Granted(LRG) Output bus used as priority bus 11 Example Update: Req l Rel l x SET Input m signals it is done with data transfer by asserting Rel m Req m Rel m 1 y 1 RESET Req n Rel n z

12 Least Recently Granted(LRG) Output bus used as priority bus 12 Req l Rel l x INTERMEDIATE Discharges SOME Lines Req m Rel m LOWEST Discharges NO Lines Req n Rel n z 1 1 HIGHEST Discharges ALL Lines 12 12

13 Least Recently Granted(LRG) Output bus used as priority bus 13 Req l Rel l Req m Rel m x Increasing x y 62 z LRG Algorithm l m 62 n m m releases channel l m-1 62 n Least priority upgraded unchanged Req n Rel n z

14 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation

15 64x64 Prototype

16 Measurement Results

17 Measurement Results

18 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation

19 Scaling Interconnect for Many-Cores 19 Existing interconnects Buses, Crossbars, Rings Limited to ~16 cores Other s Interconnect proposals for Many-Cores Packet-switched, multi-hop, network-on-chip (NoC) Grid of routers meshes, tori and flattened butterfly Our Proposal Swizzle Switch Networks Flat single-stage, one-hop, crossbar++ interconnect 19 19

73 mm ARM Cortex A5 256 kb L2 Cache Bank 32 kb DCache R 1.

20 Mesh Network-on-Chip 2 Memory Controller Memory Controller Memory Controller Memory Controller Memory Controller Memory Controller 13.8 mm 32 kb ICache 1.73 mm ARM Cortex A5 256 kb L2 Cache Bank 32 kb DCache R 1.73 mm Memory Controller Memory Controller Area = 3. mm mm Area = 19 mm 2 (a) Router R 2 2

21 Flattened Butterfly Network-on-Chip 21 Memory Controller Memory Controller Memory Controller Memory Controller Router Router Router Router Router Router Router Router Router Router Router Router Router Router Router Router Memory Controller Memory Controller 13.8 mm Memory Controller Memory Controller 13.8 mm Area = 19 mm

22 Motivating Swizzle Switch Networks 22 Uniform access latency Ease of programming, data placement, thread placement,... Low Power Simplicity Packet-switched NoCs need routing, congestion management, flow control, wormhole switching,

23 Motivating Swizzle Switch Networks 23 Mesh SSN Average'='.17'flits/node/sec' Unfairness'='42.2''' Average'='.13'flits/node/sec' Unfairness'='1.8''' accepted throughput [flits/node/cycle] node index (X) node index (Y) node index (X) node index (Y) Unfairness = Node highest_throughput / Node lowest_throughput Hotspot Traffic = All nodes sending data to node 8,8 Under Hotspot traffic, the Crossbar has a slightly less throughput than the Mesh but is 4x more fair

24 Motivating Swizzle Switch Networks 24 Mesh Average'='.36'flits/node/sec' Unfairness'='1.89''' SSN Average'='.47'flits/node/sec' Unfairness'='1.2''' node index (X) node index (Y) node index (X) node index (Y) In the Mesh, nodes closest to the center receive the highest throughput Under Uniform Random traffic, the Crossbar has more throughput than the Mesh and is 87% more fair

25 Motivating Swizzle Switch Networks

26 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation

27 Top-Level Floorplan 27 Memory Controller Memory Controller Memory Controller Memory Controller NW WN WS SW Swizzle Switch x 3 NE EN ES SE Memory Controller Memory Controller 14.3 mm 3.27 mm 512 kb L2 cache L2 Area = 4.5 mm 2 32 kb Icache.86 mm Arm Cortex A5 32 kb Dcache.86 mm 1.38 mm Memory Controller 14.3 mm Total Area = 24 mm 2 Memory Controller Core + L1 Area =.74 mm

28 Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation

29 Evaluation 29 Simulation Parameters Feature NoC (Mesh/FBFly) SSN Processors L1 Cache 64 in-order cores, 1 IPC, 1.5 GHz 32kB I/D Caches, 4-way associative, 64-byte line size, 1 cycle latency L2 Cache Shared L2, 16 MB, 64-way banked, 8- way associative, 64-byte line size, 1 cycle latency Interconnect Main Memory 3. GHz, 128-bit, 4-stage Routers, 3 virt. networks w/ 3 virt. channels 496MB, 5 cycle latency Shared L2, 16MB, 32-way banked, 16-way associative, 64-byte line size, 11 cycle latency 1.5 GHz, 64x32x128bit Swizzle Switch Network Benchmarks SPLASH 2 : Scientific parallel application suite 29 29

30 Results Performance & QoS 3 Normalized Execution Time 1% 9% 8% 7% 6% 5% 4% 3% 2% 1% % Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Synchronization Stall Core Active Memory Stall Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Barnes. Cholesky. FFT. FMM. Lu Contig. Lu NonContig. Ocean Contig. Ocean NonContig. Radix. Raytrace. Water NSquared. Water Spatial Overall Performance Quality-of-Service 3 3

31 Results Power 31 Interconnect Power (mw)! 7,! 6,! 5,! 4,! 3,! 2,! 1,!! Wire Dynamic! SSN/Router Dynamic! SSN/Router Leakage! Clock! Buffer! Barnes!.! Cholesky!.! FFT!.! FMM!.! Lu Contig!.! Lu NonContig!.! Ocean Contig!.! Ocean NonContig!.! Radix!.! Raytrace!.! Water NSquared!.! Water Spatial!.! Weighted Average! On average the SSN uses 28% less power in the interconnect compared to a flattened butterfly Total Benchmark Energy! (pj/inst)! 9! 8! 7! 6! 5! 4! 3! 2! 1!! Interconnect! L2 Dynamic! L2 Leakage! Core/L1 Dynamic! Core/L1 Leakage! Barnes!.! Cholesky!.! FFT!.! FMM!.! LuContig!.! Lu NonContig!.! Ocean Contig!.! Ocean NonContig!.! Radix!.! Raytrace!.! Water NSquared! Which results in an average reduction in total system energy to complete the task of 11%.! Water Spatial!.! Average! 31 31

32 Summary Swizzle Switch Prototype (45nm) 64x64 Crossbar with 128-bit busses Embedded LRG priority arbitration Achieved 4.4 ~6MHz consuming only 1.3W of power 32 Swizzle Switch Network Evaluation Improved performance by 21% Reduced power by 28% Reduced latency variability by 3x 32 32

33 33 Additional Detailed Slides 33 33

34 Arbitration Mechanism (Matrix View) 34 Pre-charge 128 Source IP Req Rel Source IP Req Rel Inhibits (X) Source IP Req Rel 128 Dest. IP Read Buffers Dest. IP Dest. IP Requests (R) X X 1 X 2 X 3 X 4 R X 1 1 R 1 X R X 2 R X 1 4 R X

Least Recently Granted (LRG) 35 S et Reset X X 1 X 2 X 3 X 4 In X 1 1 In 1 X In 2 1 1 X 2 In 3 1 1 1 X

35 Least Recently Granted (LRG) 35 S et Reset X X 1 X 2 X 3 X 4 In X 1 1 In 1 X In X 2 In X 1 4 In X 3 X X 1 X 2 X 3 X 4 In X In 1 X 1 1 In X 1 3 In X 1 4 In 4 X 35 35

36 Round Robin Arbitration Output bus used as priority bus 36 Req l Rel l x SET Req m Rel m y 1 Req n Rel n z 1 1 RESET 36 36

37 Round Robin Arbitration Output bus used as priority bus 37 Req l Rel l Req m Rel m x + 1 y Increasing Round Robin Algorithm x y 62 z l m 62 n m l m-1 62 n Least priority upgraded Req n Rel n 37 37

38 QoS Arbitration 38 Increasing QoS Arbitration QoS values LRG Arbitration Highest QoS Winner

39 Timing Diagram

40 Crosspoint Circuit 4 wl out<> out<64> clock_b Crosspoint(x,y) Rel_b in<> in<64> wl -line bit clock Sense_ amp out<y> Reset_b Rel in<y> out<y> (priority_line) Rel Req wl Ack in<x+64> clock_b in<x> used to index Req/Rel out<y> sampled for Arbitration in<x+64> used for sending ack out<y> discharged for LRG update in<63> Rel_b wl out<63> out<127> in<127> 4 4

Regenerative Bit-line Repeater 41 16 16 Macro clock_b clock_b Bit-line Repeaters

41 Regenerative Bit-line Repeater Macro clock_b clock_b Bit-line Repeaters Bitline_R SSN Bitline_L clock Regeneration and Decoupling improves speed Bitline_L Bitline_R 41 41

42 Simulated bit-line delay improvement 42 Technology : 45nm Supply : 1.1V Temperature : 25 C 42 42

43 SSN Scaling: Simulation Technology : 45nm Supply : 1.1V Temperature : 25 C 43 wo Repeater 256 bit Bus with Repeater 128 bit Bus with Repeater Regenerative repeaters improve SSN scalability 43 43

44 Swizzle Switch Network-on-Chip 44 (1%)*%3, (1%)!%-./ ('%)!%-./ ('%)*%+, (1%)!%+, (1%)*%-./ ('%)*%-./ ('%)!%+, Memory Controller Memory Controller $ # 4 $ # $ $ 4 Memory Controller Memory Controller NW WN WS SW Swizzle Switch x 3 Memory Controller L mm NE EN ES SE Memory Controller Destination L2 Memory Controller Memory Controller 14.3 mm 1(%)*%-./ 1(%)!%-./ 1(%)*%+, 1%)*%+, 1%)!%-./ 1(%)!%+, 1%)!%+, 1%)*%-./ $ 4 $ $ # 4 # $ $!"#$%&$!'# ()*+'*"', -./12$ -./ '%)!%-./ 1%)!%-./ 1%)*%3, 4 # $ 4!"#$%&$!"- ()*()*"', -./12 -./345 "6"88 9!"$:;*!'#$%&$!"# +'*()*"', -./12$-./ $ $ # '%)!%+, $ # $ # 4 $ $ 4 '(%)*%+,% '(%)!%-./% '(%)*%-./!"#$%&& '(%)!%+, '%)!%+, '%)*%+, '%)*%-./ '%)!%-./ Source L1 L2 Shared Data Data Forwarding Responses Invalidations Requests Writebacks '%)*%-./ 1%)*%-./ 1%2!%+, '%)*%+,!"#$%&& (b) 44 44

45 Results 64-core with A9 O3 cores

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna