Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion
Minghua Shen and Guojie Luo, Peking University
FPGA, February 23, 2017
Contents
Motivation
Background
Search Space Reduction for Routing
Parallelism Exploration for GPU-Based Routing
Evaluation and Conclusion
High Performance FPGA
100x faster, 10,000x lower cost, 5,000x lower power
Pictures: S. Trimberger, Three Ages of FPGAs, Proceedings of the IEEE, vol. 103, no. 3, March 2015
FPGAs in Datacenters
Increasing interest in the use of FPGAs as application accelerators in datacenters
Key advantage: performance per watt
High Performance FPGA
100x faster, 10,000x lower cost, 5,000x lower power, 10,000x more logic
FPGA Routing
Find a physical path for every net in the circuit
A disjoint-path problem; NP-complete
The most complex and critical step in the CAD flow
[Figure: a net routed through CLB pins and numbered switch-block tracks]
Challenge: time-consuming
PathFinder Routing Algorithm
Basis for both Xilinx and Altera commercial routers
[Figure: sources S1 and S2 compete for node B; the initial routing is congested, nodes are penalized by the present cost, and by the 2nd iteration the history cost added to the base cost resolves the congestion]
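The negotiated-congestion idea above can be sketched in a few lines. This is a minimal illustration of the classic PathFinder cost c(n) = (b(n) + h(n)) * p(n), not the authors' implementation; the parameter names (`pres_fac`, `hist_fac`) follow common VPR-style conventions and are assumptions here.

```python
def node_cost(base, history, occupancy, capacity, pres_fac):
    """Congestion-aware cost of using one routing resource node:
    (base cost + history cost) scaled by a present-congestion penalty."""
    overuse = max(0, occupancy + 1 - capacity)  # overuse if this net also uses the node
    present = 1.0 + overuse * pres_fac          # present-sharing penalty p(n)
    return (base + history) * present

def update_history(history, occupancy, capacity, hist_fac):
    """After each routing iteration, accumulate persistent congestion
    into the history cost so contested nodes stay expensive."""
    overuse = max(0, occupancy - capacity)
    return history + overuse * hist_fac
```

The present penalty pushes nets apart within an iteration, while the history term remembers congestion across iterations, which is what lets the 2nd iteration in the figure resolve the conflict.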
Dynamic Parallelism on GPU
Operator formulation:
Active node: node where computation is needed
Activity: application of the operator to an active element
Multiple active nodes can be processed in parallel, subject to neighborhood constraints
Algorithm = repeated application of the operator to the graph
[1] Pingali et al. The Tao of Parallelism in Algorithms, in PLDI 2011.
[2] Y. Moctar and P. Brisk. Parallel FPGA Routing Based on the Operator Formulation, in DAC 2014.
Overall Design Flow
Each routing iteration contains two steps:
Search space reduction: subgraph dynamic expansion
Parallelism exploration: SSSP with dynamic parallelism on GPU
[Flowchart: placed design -> subgraph dynamic expansion -> GPU-based SSSP for routing -> feasible? no: update costs and iterate; yes: routed design]
Single-Net Routing
Kernel: single-source shortest-path (SSSP) solver
Dijkstra -> VPR router
Bellman-Ford -> GPU-friendly router
Serial Bellman-Ford excels when running on a small graph [1]
Parallel Bellman-Ford provides the greatest speedup [2]
Challenge: the routing resource graph is large
[1] A. DeHon et al. GraphStep: A System Architecture for Sparse-Graph Algorithms, in FCCM 2006.
[2] F. Busato et al. An Efficient Implementation of the Bellman-Ford Algorithm for Kepler GPU Architectures, in IEEE Trans. on Parallel and Distributed Systems, 2016.
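The reason Bellman-Ford is the GPU-friendly choice is that its relaxations need no global priority order: every node in the frontier can relax its edges independently. A minimal frontier-based sketch, with an adjacency-dict graph format chosen here only for illustration:

```python
import math

def bellman_ford(adj, source):
    """adj: {u: [(v, weight), ...]}; returns shortest distances from source.
    Assumes nonnegative weights, as in routing resource graphs."""
    dist = {u: math.inf for u in adj}
    dist[source] = 0.0
    frontier = {source}
    while frontier:                         # data-driven: only active nodes work
        nxt = set()
        for u in frontier:                  # each u is independent -> one GPU thread
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w   # on a GPU this update is an atomic min
                    nxt.add(v)
        frontier = nxt                      # nodes whose distance improved stay active
    return dist
```

Dijkstra touches fewer nodes but serializes on its priority queue; Bellman-Ford may redo work, yet every iteration is massively parallel, which is the trade the slide describes.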
Subgraph Dynamic Expansion
Subgraph extraction: start from an initial coverage C around the source and sinks
Boundary detection: check whether the search reaches the subgraph boundary
If the route is not feasible, expand the subgraph to D and retry until a feasible route is found
[Figure: the initial coverage C is extracted from the routing graph and dynamically expanded to D until the net from source to sinks routes feasibly]
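The extract-then-expand loop above can be sketched on a grid routing graph. Taking the initial coverage as the bounding box of the net's terminals and growing it by a fixed margin are my illustrative assumptions, not the paper's exact policy; `route_fn` stands in for the SSSP kernel restricted to the subgraph.

```python
def bounding_box(terminals, margin):
    """Axis-aligned box around the net's source and sinks, padded by margin."""
    xs = [x for x, y in terminals]
    ys = [y for x, y in terminals]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)

def route_with_expansion(route_fn, terminals, grid_size, step=1):
    """Grow the search region until route_fn succeeds or it covers the grid."""
    margin = 0
    while True:
        box = bounding_box(terminals, margin)
        path = route_fn(box)                # route only inside the small subgraph
        if path is not None:
            return path
        if (box[0] <= 0 and box[1] <= 0 and
                box[2] >= grid_size - 1 and box[3] >= grid_size - 1):
            return None                     # whole graph searched: unroutable
        margin += step                      # dynamic expansion, then retry
```

Most nets route inside a small initial box, so the SSSP kernel runs on a graph small enough for serial-friendly Bellman-Ford; only congested nets pay for expansion.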
Effectiveness
Runtime-quality comparison
Baseline: VPR router (Dijkstra + large routing graph)
BSDE: GPU-friendly router (Bellman-Ford + small routing subgraph)
Runtime: 10% reduction
Quality: 2.7% degradation
Single-Net Parallel Routing
Static node-level parallelism
[Figure: a 9-node example over time steps L0-L3; every node is assigned a computation at every step, whether or not it is active]
[1] Pingali et al. A Quantitative Study of Irregular Programs on GPUs, in IISWC 2012.
[2] Pingali et al. Data-Driven versus Topology-Driven Irregular Computations on GPUs, in IPDPS 2013.
Single-Net Parallel Routing
Dynamic node-level parallelism
[Figure: the same example; only the active nodes are assigned computations at each step]
Single-Net Parallel Routing
Dynamic edge-level parallelism
Work items are the active edges; otherwise similar to dynamic node-level parallelism
Hybrid approach (nets differ in fanout):
Static node-level parallelism for low-fanout nets
Dynamic edge-level parallelism for high-fanout nets
[1] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU Graph Traversal, in PPoPP 2012.
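The node-level vs edge-level distinction can be shown on a single relaxation step. This is my own illustrative contrast, not the authors' kernels: node-level assigns one worker per active node (which serially scans its edges), while edge-level flattens the frontier into per-edge tasks so that a few high-fanout nodes cannot leave most threads idle.

```python
def relax_node_level(adj, dist, frontier):
    """One worker per active node; each scans its own out-edges serially."""
    nxt = set()
    for u in frontier:                    # one GPU thread per active node
        for v, w in adj[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                nxt.add(v)
    return nxt

def relax_edge_level(adj, dist, frontier):
    """Frontier flattened into (u, v, w) tasks; one worker per active edge."""
    tasks = [(u, v, w) for u in frontier for v, w in adj[u]]
    nxt = set()
    for u, v, w in tasks:                 # one GPU thread per active edge
        if dist[u] + w < dist[v]:
            dist[v] = dist[u] + w
            nxt.add(v)
    return nxt
```

Both compute the same distances; edge-level buys load balance at the cost of building the task list, which is why the hybrid approach reserves it for high-fanout nets.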
Multi-Net Parallel Routing
Parallelization across multiple independent nets
[Figure: independent nets 1-3 are merged through a virtual source; as one net's work is evicted from the worklist, the next net's work fills it, so multi-net parallelism keeps the worklist fuller over time than single-net parallelism]
Multi-Net Parallel Routing
Challenge: dependencies among nets
Solution: extract independent nets for parallelization
[Figure: nets 0-12 are scheduled into independent sets such as S0 and S1; fine-grained parallelism works within a single net, while coarse-grained parallelism routes the independent nets of a set concurrently]
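One common way to extract independent nets, sketched here as an assumption rather than the paper's exact scheduler, is to treat two nets as dependent when their bounding boxes overlap (they may compete for the same routing resources) and greedily pack non-overlapping nets into the same set:

```python
def overlaps(a, b):
    """True if two axis-aligned boxes (x0, y0, x1, y1) intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def independent_sets(boxes):
    """Greedily partition net bounding boxes into conflict-free sets;
    each set's nets can be routed concurrently (coarse-grained parallelism)."""
    sets = []                             # each entry: list of net indices
    for i, box in enumerate(boxes):
        for s in sets:
            if all(not overlaps(box, boxes[j]) for j in s):
                s.append(i)               # no conflict with this set: join it
                break
        else:
            sets.append([i])              # conflicts everywhere: start a new set
    return sets
```

Each set is then routed set-by-set, with all nets in a set filling the shared worklist together, which is what sustains the coarse-grained parallelism in the figure.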
Methodology
Experimental platform:
Intel Xeon E5-2640 CPU + 32GB memory
Tesla K40c GPU + 12GB memory
Experimental setup:
Baseline: VPR 7.0 router
Channel width: 1.3x the minimum channel width
Benchmarks: VTR benchmark suite
Speedup
Single-net parallel routing:
Static node-level parallelism: 4.15x
Dynamic node-level parallelism: 5.82x
Dynamic edge-level parallelism: 9.75x
Hybrid approach: 10.86x
Multi-net parallel routing:
Hybrid approach: 18.72x
Comparison
Compared to recent existing work:
3.43x over a fine-grained parallel router
2.67x over a coarse-grained parallel router
Quality
Wirelength: 2.7% degradation
Scalability
Construct synthetic designs: stretch the locations of the sink and source of each net
FPGA size: 100x100 to 1000x1000
Maintains a similar speedup
Reduces the problem size
Worklist behaves similarly to a queue
Take-Away Points
Search space reduction is very effective at mitigating the time complexity of the GPU-friendly routing algorithm.
More speedup can be achieved with both node-level and net-level parallelization.
Special thanks to the LonestarGPU team!
Thanks! Questions?