FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

Size: px

Start display at page:

Download "FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs"

Beryl Blake
5 years ago
Views:

1 1/29 FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu

2 2/29 Claim FPGA overlay NoCs designed to exploit interconnect properties of the FPGA fabric can surpass existing state-of-the-art NoCs by: throughput 2.2 energy at 2.5 LUT cost Xilinx Virtex-7 485T FPGA, 8 8 system size, synthetic+real-world traffic.

3 3/29 Context FPGAs finding comfortable home in datacenters Offloading compute intensive workloads to the FPGA Energy-efficiency, fast coupling to networking Common Infrastructure: NoCs for apps + system IO

4 3/29 Context FPGAs finding comfortable home in datacenters Offloading compute intensive workloads to the FPGA Energy-efficiency, fast coupling to networking Common Infrastructure: NoCs for apps + system IO

5 Landscape of contemporary FPGA NoC Routers 4/29

6 5/29 Landscape of contemporary FPGA NoC Routers Peak S/W Bandwidth (packets/ns) Split Merge CONNECT BLESS OpenSMART Cost per Switch max(luts,ffs) ASIC clones transplanted onto FPGAs fare poorly! expensive buffers, virtual channels, multi-ported itches Even contemporary FPGA routers are expensive and slow FastTrack: Deflection-routing + Bufferless + Torus

7 5/29 Landscape of contemporary FPGA NoC Routers Peak S/W Bandwidth (packets/ns) Split Merge CONNECT BLESS OpenSMART Cost per Switch max(luts,ffs) ASIC clones transplanted onto FPGAs fare poorly! expensive buffers, virtual channels, multi-ported itches Even contemporary FPGA routers are expensive and slow FastTrack: Deflection-routing + Bufferless + Torus

8 5/29 Landscape of contemporary FPGA NoC Routers Peak S/W Bandwidth (packets/ns) Split Merge CONNECT BLESS OpenSMART Cost per Switch max(luts,ffs) ASIC clones transplanted onto FPGAs fare poorly! expensive buffers, virtual channels, multi-ported itches Even contemporary FPGA routers are expensive and slow FastTrack: Deflection-routing + Bufferless + Torus

9 5/29 Landscape of contemporary FPGA NoC Routers Peak S/W Bandwidth (packets/ns) Split Merge CONNECT BLESS OpenSMART Cost per Switch max(luts,ffs) Hoplite FastTrack ASIC clones transplanted onto FPGAs fare poorly! expensive buffers, virtual channels, multi-ported itches Even contemporary FPGA routers are expensive and slow FastTrack: Deflection-routing + Bufferless + Torus

10 6/29 Qualitative Comparison of FPGA NoC Routers Router Cost Xbar+Arb Buffers VCs OpenSMART BLESS CONNECT Split-Merge Hoplite

11 Quick Tutorial on Hoplite 0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs, TRETS 2017 Hoplite: Building Austere Overlay NoCs for FPGAs, FPL /29

12 8/29 Quick Tutorial on HopliteRT 0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 HopliteRT: An Efficient FPGA NoC for Real-Time Applications, FPT 2017

13 9/29 Qualitative Comparison of FPGA NoC Routers Router Cost Xbar+Arb Buffers VCs OpenSMART BLESS CONNECT Split-Merge Hoplite

14 9/29 Qualitative Comparison of FPGA NoC Routers Router Cost Perf Xbar+Arb Buffers VCs Tput Latency OpenSMART BLESS CONNECT Split-Merge Hoplite

15 10/29 Challenge Deflection routing inefficient use of wiring resources Deflected packets stay in network for longer latency Steal bandwidth from other traffic throughput Can we allow improve NoC performance under deflection routing? Are there unique opportunities provided by the FPGA fabric? Hoplite cheap in LUT cost... FastTrack inspect FPGA interconnect

16 11/29 Outline Introduction and Motivation FastTrack NoC Organization FastTrack Router Operation Evaluation

17 12/29 Outline Introduction and Motivation FastTrack NoC Organization FastTrack Router Operation Evaluation

18 13/29 FPGA Wire Speeds distances not to scale

19 14/29 FastTrack NoC Organization

20 15/29 Depopulated Topology Generation

21 16/29 Parametric Topology generation FPGA NoC parameterized by three terms: N System size D Distance of express link R Depopulation parameter controls how many routers are FastTrack vs. vanilla Hoplite Fully populated 4 4 NoC FT(16,2,1) Half population 4 4 NoC FT(16,2,2)

22 17/29 Outline Introduction and Motivation FastTrack NoC Organization FastTrack Router Operation Evaluation

23 18/29 FastTrack Switch Organization N Sh N Ex N W Ex 5:1 E Ex W 3:1 E W Sh 4:1 E Sh 3:1 4:1 4:1 PE S (a) Base HopliteRT PE S Sh S Ex (b) FastTrack

24 19/29 Switch Operation W Ex W Sh PE N Sh N Ex 4:1 4:1 S Sh S Ex 5:1 4:1 E Ex E Sh Packets can start in either short or express links DOR routing function: travel in X first, then Y Packets can upgrade to fast links if they can Packets can downgrade to slow links only on turn! Livelock avoidance: W S > N S Express links=higher priority, deflected packets acquire higher priority progress

25 19/29 Switch Operation W Ex W Sh PE N Sh N Ex 4:1 4:1 S Sh S Ex 5:1 4:1 E Ex E Sh Packets can start in either short or express links DOR routing function: travel in X first, then Y Packets can upgrade to fast links if they can Packets can downgrade to slow links only on turn! Livelock avoidance: W S > N S Express links=higher priority, deflected packets acquire higher priority progress

26 19/29 Switch Operation W Ex W Sh PE N Sh N Ex 4:1 4:1 S Sh S Ex 5:1 4:1 E Ex E Sh Packets can start in either short or express links DOR routing function: travel in X first, then Y Packets can upgrade to fast links if they can Packets can downgrade to slow links only on turn! Livelock avoidance: W S > N S Express links=higher priority, deflected packets acquire higher priority progress

27 19/29 Switch Operation W Ex W Sh PE N Sh N Ex 4:1 4:1 S Sh S Ex 5:1 4:1 E Ex E Sh Packets can start in either short or express links DOR routing function: travel in X first, then Y Packets can upgrade to fast links if they can Packets can downgrade to slow links only on turn! Livelock avoidance: W S > N S Express links=higher priority, deflected packets acquire higher priority progress

28 19/29 Switch Operation W Ex W Sh PE N Sh N Ex 4:1 4:1 S Sh S Ex 5:1 4:1 E Ex E Sh Packets can start in either short or express links DOR routing function: travel in X first, then Y Packets can upgrade to fast links if they can Packets can downgrade to slow links only on turn! Livelock avoidance: W S > N S Express links=higher priority, deflected packets acquire higher priority progress

29 19/29 Switch Operation W Ex W Sh PE N Sh N Ex 4:1 4:1 S Sh S Ex 5:1 4:1 E Ex E Sh Packets can start in either short or express links DOR routing function: travel in X first, then Y Packets can upgrade to fast links if they can Packets can downgrade to slow links only on turn! Livelock avoidance: W S > N S Express links=higher priority, deflected packets acquire higher priority progress

30 19/29 Switch Operation W Ex W Sh PE N Sh N Ex 4:1 4:1 S Sh S Ex 5:1 4:1 E Ex E Sh Packets can start in either short or express links DOR routing function: travel in X first, then Y Packets can upgrade to fast links if they can Packets can downgrade to slow links only on turn! Livelock avoidance: W S > N S Express links=higher priority, deflected packets acquire higher priority progress

31 19/29 Switch Operation W Ex W Sh PE N Sh N Ex 4:1 4:1 S Sh S Ex 5:1 4:1 E Ex E Sh Packets can start in either short or express links DOR routing function: travel in X first, then Y Packets can upgrade to fast links if they can Packets can downgrade to slow links only on turn! Livelock avoidance: W S > N S Express links=higher priority, deflected packets acquire higher priority progress

32 20/29 Outline Introduction and Motivation FastTrack NoC Organization FastTrack Router Operation Evaluation

33 21/29 Experimental Setup RTL implementation of Routers parameterized D, R parameters control cost Cycle-accurate simulations Verilator FPGA synthesis + out-of-context place-and-route + XDC floorplanning constraints Vivado Benchmarking: Synthetic traffic patterns at various injection rates Traces from real workloads SpMV, Graph Analytics, Multi-processing Measure sustained throughput, average latency, power model

34 22/29 Avg. Latency RANDOM traffic 8 8 NoC Avg Latency (cyc) Hoplite FT(N,2,2) FT(N,2,1) 0.1 Injection Rate 0.2

35 22/29 Avg. Latency RANDOM traffic 8 8 NoC Avg Latency (cyc) Hoplite 3x Hoplite FT(N,2,2) FT(N,2,1) 0.1 Injection Rate 0.2

36 23/29 Avg. Latency RANDOM traffic Avg Latency (cyc) Avg Latency (cyc) Hoplite FT(N,2,2) FT(N,2,1) 0.1 Injection Rate Injection Rate 0.2 Hoplite 3x Hoplite FT(N,2,2) FT(N,2,1) 0.2 FastTrack saturates at 4 5 higher injection rate than Hoplite vs Replicated Hoplite, still better but by smaller margin Replicated Hoplite has a new kind of livelock possibility (delivery)

37 24/29 Results LUT vs Throughput 8 8 NoC 200 Sustained Rate (Million Packets/s) Hoplite Area (LUTs)

38 24/29 Results LUT vs Throughput 8 8 NoC 200 Sustained Rate (Million Packets/s) Hoplite 2x Hoplite Hoplite 3x Area (LUTs)

39 24/29 Results LUT vs Throughput 8 8 NoC Sustained Rate (Million Packets/s) FT R=1 Hoplite 2x Hoplite FT R=2 Hoplite 3x Area (LUTs)

40 25/29 Results Wiring vs. Throughput 8 8 NoC 200 Sustained Rate (Million Packets/s) Hoplite Wire Count

41 25/29 Results Wiring vs. Throughput 8 8 NoC 200 Sustained Rate (Million Packets/s) Hoplite Hoplite 2x Hoplite 3x Wire Count

42 25/29 Results Wiring vs. Throughput 8 8 NoC 200 Sustained Rate (Million Packets/s) Hoplite 2x Hoplite FT R=2 FT R=1 Hoplite 3x Wire Count

43 26/29 Results Cost vs. Throughput 8 8 NoC Sustained Rate (Million Packets/s) Sustained Rate (Million Packets/s) Hoplite 2x Hoplite FT R=2 FT R=1 Hoplite 3x Area (LUTs) Hoplite 2x Hoplite FT R=2 FT R=1 Hoplite 3x Wire Count FastTrack makes better use of FPGA resources (LUTs, and wires) Packets are allowed to leave the NoC faster, freeing up resources Must pick proper combination of FT design parameters

44 27/29 Qualitative Comparison of FPGA NoC Routers Router Cost Xbar+Arb Buffers VCs OpenSMART BLESS CONNECT Split-Merge Hoplite

45 27/29 Qualitative Comparison of FPGA NoC Routers Router Cost Perf Xbar+Arb Buffers VCs Tput Latency OpenSMART BLESS CONNECT Split-Merge Hoplite

46 27/29 Qualitative Comparison of FPGA NoC Routers Router Cost Perf Xbar+Arb Buffers VCs Tput Latency OpenSMART BLESS CONNECT Split-Merge Hoplite FastTrack

47 28/29 FPGA Mapping Frequency 8 8 NoC NoC Frequency (MHz) FastTrack (64,2) FastTrack (64,4) Hoplite Calibration studies showed express links can travel quickly on chip NoC Datawidth Fmax for 2-hop FastTrack keeps up with original Hoplite 4-hop express link distance too large, some noticeable slowdown

48 29/29 Conclusions FastTrack outperforms state-of-the-art Hoplite FPGA NoC by 2.5 for synthetic traffic, 2.8 for real-world traces 2.2 on energy efficiency 2.5 more LUTs required FastTrack better at larger system sizes Ideal hop distance is 2 4 (4 256 PEs) Fmax gap between FastTrack and Hoplite is small

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs Nachiket Kapre University of Waterloo Ontario, Canada nachiket@uwaterloo.ca Tushar Krishna Georgia Institute