Hoplite-DSP Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs. Chethan Kumar H B and Nachiket Kapre

1 Hoplite-DSP: Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs. Chethan Kumar H B and Nachiket Kapre, nachiket@ieee.org

2 Hoplite: FPL 2015 paper, Jan Gray co-author. Specs: 60 LUTs + 100 FFs, 2.9 ns clock. Smallest FPGA router available + RTL code.

3 32b payload, Virtex-6 240T (table). Columns: Router, LUTs, FFs, Clock. Rows: Penn (1.7K, 41, 4.ns); CMU (1.K, ns); Hoplite FPL (ns). Annotated ratios, in column order: 2x, x, 1.x.

7 47b payload, Virtex-7 T (table). Columns: Router, LUTs, FFs, Clock. Rows: Hoplite (FPL 2015); Hoplite-DSP (FPL). Annotated ratios, in column order: x, 8x, ~. Hoplite-DSP adds 1 DSP.

14 Motivation. Close the gap vs. embedded NoCs: do we really want clean-slate hard NoCs? Return resources to the FPGA application: reduce NoC overheads. Find clever ways to reuse existing FPGA elements.

15 Outline. (1) Adapting the Hoplite architecture to the DSP. (2) Scaling to 2D layouts using DSP carry chains. (3) Performance and resource evaluation.

17 Overview of Hoplite switch organization. The NoC is organised as a unidirectional torus. Each switch has 2 inputs and 2 outputs into the network, plus a PE connection. Uses deflection routing: no buffering, no allocation, etc. (Figure from Jan Gray.)
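To make the deflection idea concrete, here is a minimal Python sketch of one routing decision in a single switch. It is an illustration under stated assumptions, not the authors' RTL: the `Packet` fields, the `route` function, and the priority order (a packet from N always wins the S output, and a W packet that wants to turn is deflected E when S is busy) are assumptions chosen to match the mux inputs named later in the deck (EAST picks from W and PE; SOUTH/PE picks from N, W and PE).

```python
from typing import NamedTuple, Optional, Tuple


class Packet(NamedTuple):
    # Hypothetical packet layout: destination coordinates plus payload.
    dest_x: int
    dest_y: int
    payload: int


def route(
    x: int,
    west: Optional[Packet],
    north: Optional[Packet],
    pe: Optional[Packet],
) -> Tuple[Optional[Packet], Optional[Packet], bool]:
    """Route one cycle for the switch in column x.

    Returns (east_out, south_out, pe_injected). There are no buffers:
    every packet that arrives leaves in the same cycle.
    """
    east: Optional[Packet] = None
    south: Optional[Packet] = None

    # Assumption: a packet from N can only continue S, so it wins S.
    if north is not None:
        south = north

    if west is not None:
        # Dimension-ordered routing: travel along X first, then turn S.
        if west.dest_x == x and south is None:
            south = west
        else:
            # Either not at its column yet, or deflected around the X ring.
            east = west

    # The PE may inject only if the port it needs is still free.
    injected = False
    if pe is not None:
        if pe.dest_x == x:
            if south is None:
                south, injected = pe, True
        elif east is None:
            east, injected = pe, True

    return east, south, injected
```

For example, when a W packet wants to turn S in the same cycle that an N packet occupies S, the W packet leaves on E and gets another chance after circling the X ring.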

18 Hoplite internals (diagram): inputs W, PE and N; outputs E and S/PE; DOR logic drives the mux selects sel0 and sel1,2.

19 Hoplite summary (same diagram). Bulk of the LUT footprint comes from the LUT blocks that implement the packet multiplexers. DOR logic takes a handful of LUTs and only reads address fields and valid signals. Inter-router links are pipelined registers. Idea: move (1) multiplexers and (2) registers into the Xilinx DSP block.

20 Xilinx DSP block (diagram): data inputs A, D, B, C and PCIN; X, Y and Z multiplexers feeding the ALU; outputs P and PCOUT; control inputs INMODE, OPMODE and ALUMODE.

22 Xilinx DSP block: programmable elements, very versatile! Typical use case: signal processing and streaming computations, so mainly arithmetic. INMODE: 27b multiplexer between A and D. OPMODE: b multiplexers between A:B and C. Exploit the cascade links from PCIN to PCOUT!
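The way OPMODE turns an arithmetic block into a packet multiplexer can be sketched with a small behavioural model. This is a simplified Python model, not vendor code: the function `dsp_p` and the `SELECT_*` constants are hypothetical names, the OPMODE encodings are the DSP48E1 X/Y/Z mux settings as I recall them from the Xilinx user guide (worth checking against the guide before relying on them), and the multiplier, pre-adder, carry-in and pipeline registers are deliberately left out.

```python
MASK48 = (1 << 48) - 1  # the ALU datapath is 48 bits wide


def dsp_p(opmode: int, ab: int, c: int, pcin: int, p_prev: int = 0) -> int:
    """P output for ALUMODE = 0000, i.e. P = Z + X + Y, without the multiplier."""
    x_sel = opmode & 0b11          # OPMODE[1:0] drives the X mux
    y_sel = (opmode >> 2) & 0b11   # OPMODE[3:2] drives the Y mux
    z_sel = (opmode >> 4) & 0b111  # OPMODE[6:4] drives the Z mux

    # Only the settings needed for multiplexing are modelled here.
    x_mux = {0b00: 0, 0b10: p_prev, 0b11: ab}
    y_mux = {0b00: 0, 0b10: MASK48, 0b11: c}
    z_mux = {0b000: 0, 0b001: pcin, 0b010: p_prev, 0b011: c}
    try:
        x, y, z = x_mux[x_sel], y_mux[y_sel], z_mux[z_sel]
    except KeyError:
        raise ValueError("OPMODE setting not modelled in this sketch")
    return (x + y + z) & MASK48


# Adding zeros to one selected operand makes the adder act as a mux.
SELECT_AB = 0b000_00_11    # X = A:B, Y = 0, Z = 0
SELECT_C = 0b000_11_00     # X = 0,   Y = C, Z = 0
SELECT_PCIN = 0b001_00_00  # X = 0,   Y = 0, Z = PCIN
```

Because OPMODE is a dynamic input, the selection can change every clock cycle, which is what lets routing logic outside the DSP steer a different input to P on each cycle.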

23 Input + multiplexer mapping (diagram): the Hoplite inputs (WEST, PE, N) and outputs (S/PE, EAST) are mapped onto the DSP ports A, D, B, C, PCIN, P and PCOUT; the DOR logic drives sel0 and sel1,2 through OPMODE, ALUMODE and INMODE.

27 Multi-cycling. Problem: Hoplite has two outputs (three in fact, with the S/PE output port shared). Solution: multi-pump, so the DSP block runs at 2x the frequency of the PEs. First sub-cycle: resolve the EAST output. Second sub-cycle: resolve the SOUTH/PE output.
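The two sub-cycles can be sketched as a schedule that pushes both output decisions through one shared multiplexer per PE clock cycle. The Python below is only an illustration: `multi_pumped_cycle` and the string selector values are hypothetical, and the candidate inputs for each sub-cycle follow the first-cycle and second-cycle diagrams in this deck (EAST from W or PE; SOUTH/PE from N, W or PE).

```python
def multi_pumped_cycle(west, north, pe, sel_east, sel_south):
    """Model one PE clock cycle as two DSP sub-cycles.

    sel_east is "W", "PE" or None; sel_south is "N", "W", "PE" or None.
    Returns {port: (sub_cycle, value)} so the timing is visible.
    """
    schedule = [
        # Sub-cycle 0: the single mux resolves the EAST output.
        ("EAST", {"W": west, "PE": pe}, sel_east),
        # Sub-cycle 1: the same mux resolves the shared SOUTH/PE output.
        ("SOUTH/PE", {"N": north, "W": west, "PE": pe}, sel_south),
    ]
    outputs = {}
    for sub_cycle, (port, candidates, sel) in enumerate(schedule):
        # An unselected port (sel is None) carries no packet this cycle.
        outputs[port] = (sub_cycle, candidates.get(sel))
    return outputs
```

The design choice this illustrates is a trade of clock frequency for area: one multiplexer does the work of two, at the cost of the PEs seeing half the DSP clock rate.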

28 First cycle (diagram): the PE input and West input feed the DSP; P drives the East output.

29 Second cycle (diagram): the North input, PE input and West input feed the DSP; P drives the South/PE output.

31 DSP columnar layout (diagram): within a DSP column, dedicated cascade routes connect PCOUT of one DSP to PCIN of the next; the A:B, C and P ports connect to user logic and DOR logic over the programmable FPGA interconnect.

32 Layout considerations. FPGA DSPs are organised into vertical columns: ~100s of DSPs in a column, ~10s of columns. Restrictions: (1) cascade links only extend within a column; (2) horizontal links must use general interconnect. Key question: adjusting NoC size vs. DSP count. Answer: use pass-through DSPs.
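One way to picture the pass-through idea is as a role assignment along a single DSP column. The Python sketch below is purely illustrative and assumes details the slides do not state: the function name `column_roles`, the even spacing of routers, and the choice of index 0 as the bottom of the column are all hypothetical. It only captures the constraint that a cascade stays within one column, entering at a bottom-turn DSP and leaving at a top-turn DSP, with unused DSPs in between acting as pass-throughs.

```python
def column_roles(column_height: int, routers: int) -> list:
    """Assign a role to every DSP in one column (index 0 = bottom).

    Routers are spread evenly; every other interior DSP is a pass-through
    that simply forwards the cascade from PCIN to PCOUT.
    """
    interior = column_height - 2  # the two ends are reserved for turn DSPs
    if routers < 1 or routers > interior:
        raise ValueError("column too short for the requested routers")

    roles = ["pass-thru"] * column_height
    roles[0] = "bottom-turn"   # fabric enters the cascade here
    roles[-1] = "top-turn"     # cascade returns to the fabric here
    for i in range(routers):
        # Integer spacing keeps positions distinct since interior >= routers.
        roles[1 + (i * interior) // routers] = "router"
    return roles
```

Calling `column_roles(10, 4)` places routers at indices 1, 3, 5 and 7, with pass-through DSPs filling the gaps between them.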

33 Embedded layout (diagram). Top-turn DSPs: PCIN to P. Pass-thru DSPs: PCOUT to PCIN. Bottom-turn DSPs: A:B to PCOUT. Router DSPs sit between them, linked vertically by the cascade and horizontally by the fabric.

34 Comparing Xilinx Virtex-6 and Virtex-7 layouts: 8x8 NoC (ML605 board) and 16x16 NoC (VC707 board).

36 LUTs vs DSPs. Simple tradeoff: substantially fewer LUTs vs. DSPs. Importantly, FFs are absorbed into the DSP. Power and effective BW for random traffic are mostly identical.

38 Commentary on hard NoCs. Area: hard router = 12.4 LABs; 1 Altera DSP block = 11.9 LABs (Stratix-III); Hoplite-DSP marginally smaller. Speed: hard router ~996 MHz; Hoplite-DSP ~60 MHz (multi-pumped); Hoplite-DSP limits the freq advantage to 3x. Power: hard router ~1.8 W; Hoplite-DSP model ~1.1 W at 1% activity; Hoplite-DSP uses ~0% less power. Hard-router numbers from Abdelfattah + Betz [TRETS2014] (extrapolated results for b-wide 1VC).

39 Wish-list for DSPs Gen2. Configurable cascades: b switched bidirectional routing instead of just cascades (approach hard NoC wiring); option to skip DSP blocks (segment lengths). DOR routing: pattern detection logic with multiple masks (similar to Altera DSP units). SIMD multiplexing: fracturing b-wide lanes into multiple lanes.

40 Conclusions. Hoplite muxes mapped to DSP blocks: use the dynamic OPMODE feature. Reduce cost by x LUTs, 8x FFs per router. Exploit cascade links to absorb NoC wiring. Significantly close the gap with hard NoCs.

41 Embedded layout: three kinds of DSPs. Router DSPs: a small fraction of DSPs, used for the switching fabric. Pass-through DSPs (PCOUT to PCIN): glorified pipelined wires; multi-pumping, 0% back to user. Corner-turn DSPs connect cascades to fabric: top-turn (PCIN to P) and bottom-turn (A:B to PCOUT).

42 Physical FPGA layout (diagram): corner-turn and pass-thru DSPs with fabric and cascade links, 2x2 NoC (ML605 board).

44 Efficiency (plots): DSPs less efficient than LUT-based Hoplite!
