1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1
Announcements Upcoming lecture schedule Today: On-chip Interconnects (Circuits and Photonics) 2/25: Off-chip Interconnects 3/1 & 3/3: Spring Break 3/8: Midterm Review, Structure for rest of lectures, Project Specification Midterm Exam: In-Class March 1 th, closed notes 2 Reading Assignments: For this week: "Swizzle-Switch Networks for Many-Core Systems," Korey Sewell et al. JETCAS IEEE Journal on Emerging and Selected Topics in Circuits and Systems, June 212 "Corona: System implications of emerging nanophotonic technology." Dana Vantrease et al. ISCA 28. 2 2
Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 3 3 3
Swizzle Switch 4 Conventional Matrix Crossbar Swizzle Switch Embeds arbitration within crossbar single cycle arbitration Re-use input/output data buses for arbitration SRAM-like layout with priority bits at cross-points Low-power optimizations Excellent scalability 4 4
Data Routing Source IP Source IP 128 1 Pre-charge 1 Multicast & Broadcast Bitlines discharged if Data = 1 Crosspoint = 1 5 Source IP 1 Read Buffers 128 Dest. IP Dest. IP Dest. IP 5 5
Swizzle Switch Architecture 6 Output bus used as prioritybus Source IP Source IP 128 1 Pre-charge 1 Req k Rel k Req l Rel l vector 1 1 1 1 Source IP vectors 1 Req m Rel m 1 1 1 1 1 128 Dest. IP Read Buffers Dest. IP Dest. IP Sense Amp + Latch Req n Rel n 1 1 Data routing, arbitration, And priority update control embedded within crosspoints 6 6
Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 7 7 7
Inhibit Based Arbitration Req l Req m Sense Amp + Latch Req n Output bus used as priority bus x y z line l x < y line m 1 z > y 1 1 line n This diagram is a single column in the Swizzle-Switch (output), each output arbitrates/transfers data independently Each Crosspoint has a sense amp/ latch to indicate connectivity. Each input samples a unique bit of the output bus to determine if it has been granted the channel vectors are stored and when a request is issued they discharge bits along the output columns to INHIBIT lower priority requests Finally, the priority vectors are updated when the data transfer completes. 8 8 8
Least Recently Granted(LRG) Output bus used as priority bus 9 Req l x line l x < y line n LOWEST Discharges NO Lines Req m Sense Amp + Latch Req n y z line m 1 z > y 1 1 INTERMEDIATE Discharges SOME Lines HIGHEST Discharges ALL Lines 9 9
Least Recently Granted(LRG) Output bus used as priority bus 1 Req l Req m Sense Amp + Latch Req n x 1 y z line l x < y line m 1 z > y 1 1 line n Example Arbitration: (1) Req l and Req m Request the bus (red lines) (2) Req m discharges line l, priority lines m and n remain charged (green lines) (3) Req l senses line l and is inhibited (not granted), Req m senses line m and is not inhibited (4) The crosspoint records the connectivity at input m 1 1
Least Recently Granted(LRG) Output bus used as priority bus 11 Example Update: Req l Rel l x SET Input m signals it is done with data transfer by asserting Rel m Req m Rel m 1 y 1 RESET Req n Rel n z 1 1 11 11
Least Recently Granted(LRG) Output bus used as priority bus 12 Req l Rel l x + 1 1 INTERMEDIATE Discharges SOME Lines Req m Rel m LOWEST Discharges NO Lines Req n Rel n z 1 1 HIGHEST Discharges ALL Lines 12 12
Least Recently Granted(LRG) Output bus used as priority bus 13 Req l Rel l Req m Rel m x + 1 1 Increasing x y 62 z LRG Algorithm l m 62 n m m releases channel l m-1 62 n Least priority upgraded unchanged Req n Rel n z 1 1 13 13
Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 14 14 14
64x64 Prototype 15 15 15
Measurement Results 16 16 16
Measurement Results 17 17 17
Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 18 18 18
Scaling Interconnect for Many-Cores 19 Existing interconnects Buses, Crossbars, Rings Limited to ~16 cores Other s Interconnect proposals for Many-Cores Packet-switched, multi-hop, network-on-chip (NoC) Grid of routers meshes, tori and flattened butterfly Our Proposal Swizzle Switch Networks Flat single-stage, one-hop, crossbar++ interconnect 19 19
Mesh Network-on-Chip 2 Memory Controller Memory Controller Memory Controller Memory Controller Memory Controller Memory Controller 13.8 mm 32 kb ICache 1.73 mm ARM Cortex A5 256 kb L2 Cache Bank 32 kb DCache R 1.73 mm Memory Controller Memory Controller Area = 3. mm 2 13.8 mm Area = 19 mm 2 (a) Router R 2 2
Flattened Butterfly Network-on-Chip 21 Memory Controller Memory Controller Memory Controller Memory Controller Router Router Router Router Router Router Router Router Router Router Router Router Router Router Router Router Memory Controller Memory Controller 13.8 mm Memory Controller Memory Controller 13.8 mm Area = 19 mm 2 21 21
Motivating Swizzle Switch Networks 22 Uniform access latency Ease of programming, data placement, thread placement,... Low Power Simplicity Packet-switched NoCs need routing, congestion management, flow control, wormhole switching,... 22 22
Motivating Swizzle Switch Networks 23 Mesh SSN Average'='.17'flits/node/sec' Unfairness'='42.2''' Average'='.13'flits/node/sec' Unfairness'='1.8''' accepted throughput [flits/node/cycle].6.5.4.3.2.1. 1 2 3 4 5 6 7 8 node index (X) 1 3 5 7 node index (Y).6.5.4.3.2.1. 1 2 3 4 5 6 7 8 node index (X) 1 3 5 7 node index (Y) Unfairness = Node highest_throughput / Node lowest_throughput Hotspot Traffic = All nodes sending data to node 8,8 Under Hotspot traffic, the Crossbar has a slightly less throughput than the Mesh but is 4x more fair. 23 23
Motivating Swizzle Switch Networks 24 Mesh Average'='.36'flits/node/sec' Unfairness'='1.89''' SSN Average'='.47'flits/node/sec' Unfairness'='1.2'''.6.6.5.5.4.4.3.2.1. 1 2 3 4 5 6 7 8 node index (X) 1 3 5 7 node index (Y).3.2.1. 1 2 3 4 5 6 7 8 node index (X) 1 3 5 7 node index (Y) In the Mesh, nodes closest to the center receive the highest throughput Under Uniform Random traffic, the Crossbar has more throughput than the Mesh and is 87% more fair. 24 24
Motivating Swizzle Switch Networks 25 25 25
Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 26 26 26
Top-Level Floorplan 27 Memory Controller Memory Controller Memory Controller Memory Controller NW WN WS SW Swizzle Switch x 3 NE EN ES SE Memory Controller Memory Controller 14.3 mm 3.27 mm 512 kb L2 cache L2 Area = 4.5 mm 2 32 kb Icache.86 mm Arm Cortex A5 32 kb Dcache.86 mm 1.38 mm Memory Controller 14.3 mm Total Area = 24 mm 2 Memory Controller Core + L1 Area =.74 mm 2 27 27
Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 28 28 28
Evaluation 29 Simulation Parameters Feature NoC (Mesh/FBFly) SSN Processors L1 Cache 64 in-order cores, 1 IPC, 1.5 GHz 32kB I/D Caches, 4-way associative, 64-byte line size, 1 cycle latency L2 Cache Shared L2, 16 MB, 64-way banked, 8- way associative, 64-byte line size, 1 cycle latency Interconnect Main Memory 3. GHz, 128-bit, 4-stage Routers, 3 virt. networks w/ 3 virt. channels 496MB, 5 cycle latency Shared L2, 16MB, 32-way banked, 16-way associative, 64-byte line size, 11 cycle latency 1.5 GHz, 64x32x128bit Swizzle Switch Network Benchmarks SPLASH 2 : Scientific parallel application suite 29 29
Results Performance & QoS 3 Normalized Execution Time 1% 9% 8% 7% 6% 5% 4% 3% 2% 1% % Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Synchronization Stall Core Active Memory Stall Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Barnes. Cholesky. FFT. FMM. Lu Contig. Lu NonContig. Ocean Contig. Ocean NonContig. Radix. Raytrace. Water NSquared. Water Spatial Overall Performance Quality-of-Service 3 3
Results Power 31 Interconnect Power (mw)! 7,! 6,! 5,! 4,! 3,! 2,! 1,!! Wire Dynamic! SSN/Router Dynamic! SSN/Router Leakage! Clock! Buffer! Barnes!.! Cholesky!.! FFT!.! FMM!.! Lu Contig!.! Lu NonContig!.! Ocean Contig!.! Ocean NonContig!.! Radix!.! Raytrace!.! Water NSquared!.! Water Spatial!.! Weighted Average! On average the SSN uses 28% less power in the interconnect compared to a flattened butterfly Total Benchmark Energy! (pj/inst)! 9! 8! 7! 6! 5! 4! 3! 2! 1!! Interconnect! L2 Dynamic! L2 Leakage! Core/L1 Dynamic! Core/L1 Leakage! Barnes!.! Cholesky!.! FFT!.! FMM!.! LuContig!.! Lu NonContig!.! Ocean Contig!.! Ocean NonContig!.! Radix!.! Raytrace!.! Water NSquared! Which results in an average reduction in total system energy to complete the task of 11%.! Water Spatial!.! Average! 31 31
Summary Swizzle Switch Prototype (45nm) 64x64 Crossbar with 128-bit busses Embedded LRG priority arbitration Achieved 4.4 Tbps @ ~6MHz consuming only 1.3W of power 32 Swizzle Switch Network Evaluation Improved performance by 21% Reduced power by 28% Reduced latency variability by 3x 32 32
33 Additional Detailed Slides 33 33
Arbitration Mechanism (Matrix View) 34 Pre-charge 128 Source IP Req Rel Source IP Req Rel Inhibits (X) Source IP Req Rel 128 Dest. IP Read Buffers Dest. IP Dest. IP Requests (R) X X 1 X 2 X 3 X 4 R X 1 1 R 1 X R 2 1 1 X 2 R 3 1 1 1 X 1 4 R 4 1 1 1 X 3 34 34
Least Recently Granted (LRG) 35 S et Reset X X 1 X 2 X 3 X 4 In X 1 1 In 1 X In 2 1 1 X 2 In 3 1 1 1 X 1 4 In 4 1 1 1 X 3 X X 1 X 2 X 3 X 4 In X 1 1 2 In 1 X 1 1 In 2 1 1 X 1 3 In 3 1 1 1 X 1 4 In 4 X 35 35
Round Robin Arbitration Output bus used as priority bus 36 Req l Rel l x SET Req m Rel m y 1 Req n Rel n z 1 1 RESET 36 36
Round Robin Arbitration Output bus used as priority bus 37 Req l Rel l Req m Rel m x + 1 y + 1 1 1 1 Increasing Round Robin Algorithm x y 62 z l m 62 n m l m-1 62 n Least priority upgraded Req n Rel n 37 37
QoS Arbitration 38 Increasing QoS Arbitration 3 3 3 2 QoS values LRG Arbitration 3 3 3 Highest QoS Winner 1 38 38
Timing Diagram 39 39 39
Crosspoint Circuit 4 wl out<> out<64> clock_b Crosspoint(x,y) Rel_b in<> in<64> wl -line bit clock Sense_ amp out<y> Reset_b Rel in<y> out<y> (priority_line) Rel Req wl Ack in<x+64> clock_b in<x> used to index Req/Rel out<y> sampled for Arbitration in<x+64> used for sending ack out<y> discharged for LRG update in<63> Rel_b wl out<63> out<127> in<127> 4 4
Regenerative Bit-line Repeater 41 16 16 Macro clock_b clock_b Bit-line Repeaters Bitline_R SSN Bitline_L clock Regeneration and Decoupling improves speed Bitline_L Bitline_R 41 41
Simulated bit-line delay improvement 42 Technology : 45nm Supply : 1.1V Temperature : 25 C 42 42
SSN Scaling: Simulation Technology : 45nm Supply : 1.1V Temperature : 25 C 43 wo Repeater 256 bit Bus with Repeater 128 bit Bus with Repeater Regenerative repeaters improve SSN scalability 43 43
Swizzle Switch Network-on-Chip 44 (1%)*%3, (1%)!%-./ ('%)!%-./ ('%)*%+, (1%)!%+, (1%)*%-./ ('%)*%-./ ('%)!%+, Memory Controller Memory Controller $ # 4 $ # $ $ 4 Memory Controller Memory Controller NW WN WS SW Swizzle Switch x 3 Memory Controller L1 14.3 mm NE EN ES SE Memory Controller Destination L2 Memory Controller Memory Controller 14.3 mm 1(%)*%-./ 1(%)!%-./ 1(%)*%+, 1%)*%+, 1%)!%-./ 1(%)!%+, 1%)!%+, 1%)*%-./ $ 4 $ $ # 4 # $ $!"#$%&$!'# ()*+'*"', -./12$ -./345 67+88 9 '%)!%-./ 1%)!%-./ 1%)*%3, 4 # $ 4!"#$%&$!"- ()*()*"', -./12 -./345 "6"88 9!"$:;*!'#$%&$!"# +'*()*"', -./12$-./345 67+88 9 $ $ # '%)!%+, $ # $ # 4 $ $ 4 '(%)*%+,% '(%)!%-./% '(%)*%-./!"#$%&& '(%)!%+, '%)!%+, '%)*%+, '%)*%-./ '%)!%-./ Source L1 L2 Shared Data Data Forwarding Responses Invalidations Requests Writebacks '%)*%-./ 1%)*%-./ 1%2!%+, '%)*%+,!"#$%&& (b) 44 44
Results 64-core with A9 O3 cores 45 45 45