Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion

1 Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion
Minghua Shen and Guojie Luo, Peking University
FPGA, February 23,

2 Contents
Motivation
Background
Search Space Reduction for Routing
Parallelism Exploration for GPU-Based Routing
Evaluation and Conclusion

3 High Performance FPGA
100x faster, 10,000x lower cost, 5000x lower power.
Pictures: S. Trimberger, "Three Ages of the FPGAs," Proceedings of the IEEE, Vol. 103, No. 3, March

4 FPGA in Datacenters
Increasing interest in the use of FPGAs as application accelerators in datacenters.
Key advantage: performance per watt.

5 High Performance FPGA
100x faster, 10,000x lower cost, 5000x lower power, 10,000x more logic.
Pictures: S. Trimberger, "Three Ages of the FPGAs," Proceedings of the IEEE, Vol. 103, No. 3, March

6 FPGA Routing
Find a physical path for every net in the circuit.
A disjoint-path problem, and NP-complete.
The most complex and critical step in the CAD flow.
[Figure: four CLBs with pins a to h connected through numbered routing resources.]
Challenge: time-consuming.

8 PathFinder Routing Algorithm
Basis for both Xilinx and Altera commercial routers.
[Figure: routing example with sources S1 and S2, intermediate nodes A and B, sinks D1 and D2, and per-node costs in parentheses, shown in three panels: initial routing with base cost; penalized by present congestion; 2nd iteration, penalized by history cost.]
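The three cost terms named in the figure (base, present congestion, history) are usually combined in PathFinder as cost = (base + history) x present penalty. The sketch below illustrates that combination. The function names, the parameters `pres_fac` and `hist_fac`, and the exact penalty shape are assumptions modeled on common VPR-style routers, not details taken from these slides.

```python
def present_penalty(occupancy, capacity, pres_fac):
    """Present-congestion penalty: 1.0 while one more net still fits,
    growing linearly with the overuse that adding this net would cause."""
    overuse = occupancy + 1 - capacity
    return 1.0 + pres_fac * max(0, overuse)


def updated_history(history, occupancy, capacity, hist_fac):
    """History cost accumulates after each iteration on overused nodes only."""
    return history + hist_fac * max(0, occupancy - capacity)


def node_cost(base, history, occupancy, capacity, pres_fac):
    """Negotiated-congestion cost of using a routing node for one more net."""
    return (base + history) * present_penalty(occupancy, capacity, pres_fac)
```

For example, a free node with base cost 1.0 keeps cost 1.0, while a node already at capacity becomes more expensive, which pushes later nets onto alternative paths.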

9 Dynamic Parallelism on GPU
Operator
Active node: a node where computation is needed.
Activity: application of the operator to an active element.
Multiple active nodes can be processed in parallel, subject to neighborhood constraints.
Algorithm = repeated application of the operator to the graph.
[1] Pingali et al. The Tao of Parallelism in Algorithms, in PLDI
[2] Y. Moctar and P. Brisk. Parallel FPGA routing based on the operator formulation, in DAC

10 Overall Design Flow
Each iteration contains two steps:
Search space reduction: subgraph dynamic expansion.
Parallelism exploration: SSSP with dynamic parallelism on GPU.
[Flowchart: Placed Design -> subgraph dynamic expansion -> GPU-based SSSP for routing -> Feasible? If no, update cost and repeat; if yes, Routed Design. Inset: a net with one source and two sinks.]

12 Single-Net Routing
Kernel: single-source shortest path (SSSP) solver.
Dijkstra -> VPR router
Bellman-Ford -> GPU-friendly router
Serial Bellman-Ford excels when running on a small graph [1].
Parallel Bellman-Ford provides the greatest speedup [2].
Challenge: the routing resource graph is large.
[1] A. DeHon et al. GraphStep: A System Architecture for Sparse-Graph Algorithms, in FCCM
[2] F. Busato et al. An Efficient Implementation of the Bellman-Ford Algorithm for Kepler GPU Architectures, in IEEE Trans. on PDS
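To make the Bellman-Ford kernel concrete, here is a minimal serial sketch. Within one round the edge relaxations may run in any order, which is the property a GPU version can exploit by assigning edges or nodes to threads. The edge-list layout and the helper `trace_path` are illustrative assumptions, not the data structures used in the talk.

```python
import math


def bellman_ford(num_nodes, edges, source):
    """Round-based SSSP. edges is a list of (u, v, cost) directed edges."""
    dist = [math.inf] * num_nodes
    pred = [None] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):
        changed = False
        # Every edge is relaxed once per round; order does not affect the result.
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                pred[v] = u
                changed = True
        if not changed:  # converged early, typical on small graphs
            break
    return dist, pred


def trace_path(pred, source, sink):
    """Walk predecessors back from sink; returns None if sink is unreachable."""
    path = [sink]
    while path[-1] != source:
        parent = pred[path[-1]]
        if parent is None:
            return None
        path.append(parent)
    return path[::-1]
```

The early exit is why a small graph helps: the number of rounds is bounded by the longest shortest path in hops, and each round touches every edge of the graph it is given.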

13 Subgraph Dynamic Expansion
[Flowchart with the steps: subgraph extraction (initial coverage of the source and sinks), boundary detection, subgraph expansion, and a feasibility check (yes/no). The example net has one source and two sinks, with regions labeled C and D.]
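The idea of starting from a small covering subgraph and growing it only when needed can be sketched on a plain grid. This is a simplified illustration under stated assumptions: the subgraph is a bounding box, a breadth-first search stands in for the SSSP solver, and the box is expanded whenever a sink cannot be reached, which replaces the slide's boundary-detection step with a simpler retry-on-failure rule. All names are hypothetical.

```python
from collections import deque


def bounding_box(points):
    """Initial coverage: the smallest box containing the source and all sinks."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def expand(box, margin, width, height):
    """Grow the box by margin cells on every side, clipped to the grid."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width - 1, x1 + margin), min(height - 1, y1 + margin))


def reachable_in_box(blocked, box, source, sink):
    """BFS restricted to unblocked cells inside box."""
    x0, y0, x1, y1 = box
    seen = {source}
    queue = deque([source])
    while queue:
        x, y = queue.popleft()
        if (x, y) == sink:
            return True
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = x0 <= nx <= x1 and y0 <= ny <= y1
            if inside and nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def route_with_expansion(blocked, width, height, source, sinks, margin=1):
    """Return the first box in which all sinks are reachable, or None."""
    box = bounding_box([source] + sinks)
    full = (0, 0, width - 1, height - 1)
    while True:
        if all(reachable_in_box(blocked, box, source, s) for s in sinks):
            return box
        if box == full:  # nothing left to expand into
            return None
        box = expand(box, margin, width, height)
```

On an uncongested grid the search never leaves the initial box, so the solver sees a graph far smaller than the full routing resource graph.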

14 Effectiveness
Runtime-quality comparison:
Baseline: VPR router (Dijkstra + large routing graph)
BSDE: GPU-friendly router (Bellman-Ford + small routing subgraph)
Runtime: 10% reduction
Quality: 2.7% degradation

16 Single-Net Parallel Routing
Static node-level parallelism
[Figure: example schedule over time with levels L0, L1, L2, ... and per-node computations labeled Comp 0, Comp 1, ...]
[1] Pingali et al. A Quantitative Study of Irregular Programs on GPUs, in IISWC
[2] Pingali et al. Data-Driven versus Topology-Driven Irregular Computations on GPUs, in IPDPS

17 Single-Net Parallel Routing
Dynamic node-level parallelism
[Figure: example]
[1] Pingali et al. A Quantitative Study of Irregular Programs on GPUs, in IISWC
[2] Pingali et al. Data-Driven versus Topology-Driven Irregular Computations on GPUs, in IPDPS
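One way to read dynamic node-level parallelism is as a data-driven worklist: only nodes whose distance changed in the previous round are processed, instead of every node in every round as in a static scheme. The serial sketch below illustrates that reading; on a GPU, the inner loop over the worklist would be spread across threads. The adjacency-dict layout and the function name are assumptions for illustration.

```python
import math


def sssp_data_driven(adj, source):
    """Worklist-based SSSP. adj maps each node to a list of (neighbor, cost)."""
    dist = {node: math.inf for node in adj}
    dist[source] = 0.0
    worklist = [source]  # active nodes: those whose distance just improved
    while worklist:
        next_active = set()
        for u in worklist:  # conceptually one thread per active node
            for v, cost in adj[u]:
                if dist[u] + cost < dist[v]:
                    dist[v] = dist[u] + cost
                    next_active.add(v)
        worklist = sorted(next_active)
    return dist
```

The worklist shrinks to empty once no distance can be improved, so work is proportional to the number of active nodes rather than to the graph size.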

18 Single-Net Parallel Routing
Dynamic edge-level parallelism:
Applies to edges only.
Similar to dynamic node-level parallelism.
Hybrid approach:
Nets differ in their sinks.
Static node-level parallelism for low-fanout nets.
Dynamic edge-level parallelism for high-fanout nets.
[1] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal, in PPoPP

19 Multi-Net Parallel Routing
Parallelization for multiple independent nets.
[Figure: a virtual source joined to the sources of net 1, net 2, and net 3, each with its own sinks; a plot of worklist size over time contrasting single-net parallelism with multi-net parallelism as nets fill and are evicted from the worklist.]

20 Multi-Net Parallel Routing
Challenge: dependency.
Solution: extract independent nets for parallelization.
[Figure: example with net 0 through net 12, showing scheduling and parallelization of the nets into sets S0 and S1, with fine-grained and coarse-grained parallelism.]
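The slide does not say how independence between nets is decided. A minimal sketch, assuming two nets are independent when their bounding boxes do not overlap, is a greedy first-fit grouping in which every batch contains only mutually independent nets. The function names and the box representation `(x0, y0, x1, y1)` are hypothetical.

```python
def boxes_overlap(a, b):
    """True if two inclusive boxes (x0, y0, x1, y1) share at least one cell."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def schedule_independent(net_boxes):
    """Greedy first-fit batching of nets. net_boxes maps net name -> box.
    Nets inside one batch never overlap, so they can be routed together."""
    batches = []
    for net, box in net_boxes.items():
        for batch in batches:
            if all(not boxes_overlap(box, net_boxes[other]) for other in batch):
                batch.append(net)
                break
        else:
            batches.append([net])  # conflicts with every existing batch
    return batches
```

Batches are then processed one after another, while the nets inside a batch are candidates for concurrent routing.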

22 Methodology
Experimental platform:
Intel Xeon E CPU + 32GB memory
Tesla K40c GPU + 12GB memory
Experimental setup:
Baseline: VPR 7.0 router
Channel width: 1.3x minimal channel width
Benchmark: VTR benchmark

23 Speedup
Single-net parallel routing:
Static node-level parallelism: 4.15x
Dynamic node-level parallelism: 5.82x
Dynamic edge-level parallelism: 9.75x
Hybrid approach: 10.86x
Multi-net parallel routing:
Hybrid approach: 18.72x

24 Comparison
Compared with recent existing work:
3.43x over the fine-grained parallel router
2.67x over the coarse-grained parallel router

25 Quality
Wirelength: 2.7% degradation

26 Scalability
Construct synthetic designs.
Stretch the locations of the sink and source of each net.
FPGA size: 100x x1000
Maintain a similar speedup.
Reduce the problem size.
Worklist that behaves similarly to a queue.

27 Take-Away Points
Search space reduction is very effective in mitigating the time complexity of the GPU-friendly routing algorithm.
More speedup can be achieved with both node-level and net-level parallelization.
Special thanks to the LonestarGPU team!

28 Thanks! Questions?
