Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion

Similar documents
Accelerate FPGA Routing with Parallel Recursive Partitioning

Scalable GPU Graph Traversal!

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin

Parallel Programming in the Age of Ubiquitous Parallelism. Andrew Lenharth Slides: Keshav Pingali The University of Texas at Austin

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

Reducing Power in an FPGA via Computer-Aided Design

On-the-Fly Elimination of Dynamic Irregularities for GPU Computing

GPU Sparse Graph Traversal

AN ACCELERATOR FOR FPGA PLACEMENT

Static and Dynamic Memory Footprint Reduction for FPGA Routing Algorithms

FPGA-based Supercomputing: New Opportunities and Challenges

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin

GPU Sparse Graph Traversal. Duane Merrill

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Memory Footprint Reduction for FPGA Routing Algorithms

GRAph Parallel Actor Language A Programming Language for Parallel Graph Algorithms

Congestion-Driven Regional Re-clustering for Low-Cost FPGAs

Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Deploying Graph Algorithms on GPUs: an Adaptive Solution

Solving Large Complex Problems. Efficient and Smart Solutions for Large Models

A Routing Approach to Reduce Glitches in Low Power FPGAs

Wire Type Assignment for FPGA Routing

Introduction Warp Processors Dynamic HW/SW Partitioning. Introduction Standard binary - Separating Function and Architecture

Applications of Berkeley s Dwarfs on Nvidia GPUs

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Recent Advances in Heterogeneous Computing using Charm++

Mapping-Aware Constrained Scheduling for LUT-Based FPGAs

Red Fox: An Execution Environment for Relational Query Processing on GPUs

ParaDiMe: A Distributed Memory FPGA Router Based on Speculative Parallelism and Path Encoding

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

CuSha: Vertex-Centric Graph Processing on GPUs

Performance analysis of -stepping algorithm on CPU and GPU

GENETIC ALGORITHM BASED FPGA PLACEMENT ON GPU SUNDAR SRINIVASAN SENTHILKUMAR T. R.

THE COARSE-GRAINED / FINE-GRAINED LOGIC INTERFACE IN FPGAS WITH EMBEDDED FLOATING-POINT ARITHMETIC UNITS

Double Patterning-Aware Detailed Routing with Mask Usage Balancing

An LP-based Methodology for Improved Timing-Driven Placement

Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays With Sparse Intracluster Routing Crossbars

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL

Placement Algorithm for FPGA Circuits

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

GAIL The Graph Algorithm Iron Law

Effects of FPGA Architecture on FPGA Routing

Ultra-Fast NoC Emulation on a Single FPGA

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

A Fine-grained Performance-based Decision Model for Virtualization Application Solution

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

MAGMA. Matrix Algebra on GPU and Multicore Architectures

Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays with Sparse Intra-cluster Routing Crossbars

Pactron FPGA Accelerated Computing Solutions

Performance-Preserved Analog Routing Methodology via Wire Load Reduction

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment

Vdd Programmable and Variation Tolerant FPGA Circuits and Architectures

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Red Fox: An Execution Environment for Relational Query Processing on GPUs

An efficient FPGA priority queue implementation with application to the routing problem

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17

ICS 252 Introduction to Computer Design

LDetector: A low overhead data race detector for GPU programs

How Much Logic Should Go in an FPGA Logic Block?

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

MAXIMIZING THE REUSE OF ROUTING RESOURCES IN A RECONFIGURATION-AWARE CONNECTION ROUTER

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU. Robert Strzodka NVAMG Project Lead

On Constructing Lower Power and Robust Clock Tree via Slew Budgeting

A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs

High Performance Ocean Modeling using CUDA

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

PushPull: Short Path Padding for Timing Error Resilient Circuits YU-MING YANG IRIS HUI-RU JIANG SUNG-TING HO. IRIS Lab National Chiao Tung University

14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs

FastPlace 2.0: An Efficient Analytical Placer for Mixed- Mode Designs

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

Introduction. A very important step in physical design cycle. It is the process of arranging a set of modules on the layout surface.

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

Routability-Driven Bump Assignment for Chip-Package Co-Design

SYNTHESIZING CONCURRENT SCHEDULERS FOR IRREGULAR ALGORITHMS. Donald Nguyen, Keshav Pingali

When MPPDB Meets GPU:

Cluster-based 3D Reconstruction of Aerial Video

Routing. Directly Connected IP Networks. Data link layer routing. ifconfig command

An integrated placement and routing approach

Supporting Multithreading in Configurable Soft Processor Cores

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

Chapter 12. Routing and Routing Protocols 12-1

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

CAD Algorithms. Placement and Floorplanning

On the limits of (and opportunities for?) GPU acceleration

Towards Efficient Graph Traversal using a Multi- GPU Cluster

Mapping-aware Logic Synthesis with Parallelized Stochastic Optimization

Transcription:

Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion Minghua Shen and Guojie Luo Peking University FPGA-February 23, 2017 1

Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 2

High Performance FPGA 100x Faster 10,000x Lower Cost 5000x Lower Power Pictures: Three Ages of the FPGAs: Trimberger, S, Proceedings of the IEEE Vol. 103, No. 3, March 2015 3

FPGA in datacenters Increasing interest in use of FPGAs as application accelerators in datacenters Key advantage: Performance/Watt 4

High Performance FPGA 100x Faster 10,000x Lower Cost 5000x Lower Power 10,000x More Logic Pictures: Three Ages of the FPGAs: Trimberger, S, Proceedings of the IEEE Vol. 103, No. 3, March 2015 5

d c FPGA Routing Find a physical path for every net in the circuit Disjoint-path problem and NP-complete The most complex and critical step in CAD flow 1 4 CLB 2 3 a b 12 11 CLB 9 10 2 3 a b 9 10 c d e f e f 5 8 CLB 6 7 g h 16 15 CLB 13 14 8 g h 7 13 16 Challenge : Time-consuming 6

Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 7

PathFinder Routing Algorithm Basis for both Xilinx and Altera commercial routers S1 S2 (2) (1) (4) (1) (2) (3) A B (2) (1) (4) (1) (2) (3) D1 D2 Initial Penalized 2 nd iteration, by present penalized routed with congestion by base history cost 8

Dynamic Parallelism on GPU Operator Active node: node where computation is needed Activity: Application of operator to active element Multiple active nodes can be processed in parallel, subject to neighborhood constraints Algorithm = repeated application of operator to graph [1] Pingali et al.. The Tao of Parallelism in Algorithms, in PLDI 2011. [2] Y. Moctor and P. Brisk. Parallel FPGA routing based on the operator formulation, in DAC 2014. 9

Overall Design Flow The iteration contains two steps: Search space reduction: subgraph dynamic expansion Parallelism exploration: SSSP with dynamic parallelism on GPU Update Cost No Placed Design subgraph dynamic expansion GPU-based SSSP for routing Feasible Yes Routed Design sink 2 source sink 1 10

Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 11

Single-Net Routing Kernel: single source shortest path solver Djikstra -> VPR router Bellman-Ford -> GPU-friendly router Serial Bellman-Ford excels when running on a small graph [1] Parallel Bellman-Ford provides greatest speedup [2] Challenge : routing resource graph is large [1] A. DeHon et al. GraphStep: a system architecture for sparse-graph algorithm, in FCCM 2006. [2] F. Busato et al. An efficient implementation of the bellman-ford algorithm for Kepler GPU architecture, in IEEE Trans. on PDS 12 2016.

Subgraph Dynamic Expansion subgraph extraction sink 2 C C sink 2 C initial coverage source sink 1 source sink 1 C D sink 2 D D detection No Yes subgraph expansion source sink 2 boundary detection sink 1 source sink 1 D No feasible Yes 13

Effectiveness Runtime-Quality Comparison Baseline: VPR router (Dijkstra + large routing graph) BSDE: GPU-friendly router (Bellman-Ford + small routing subgraph) Runtime: 10% reduction Quality: 2.7% degradation 14

Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 15

Time Single-Net Parallel Routing Static node-level parallelism Example L0 L1 L2 L3 4 3 2 0 5 6 Comp 0 Comp 8 Comp 4 Comp 1 Comp 5 Comp 2 Comp 6 Comp 3 Comp 7 1 8 7 [1] Pingali et al. A Quantitative Study of Irregular Programs on GPUs, in IISWC 2012. [2] Pingali et al. Data-Driven versus Topology-Driven Irregular Computations on GPUs, in IPDPS 2013. 16

Single-Net Parallel Routing Dynamic node-level parallelism Example [1] Pingali et al. A Quantitative Study of Irregular Programs on GPUs, in IISWC 2012. [2] Pingali et al. Data-Driven versus Topology-Driven Irregular Computations on GPUs, in IPDPS 2013. 17

Single-Net Parallel Routing Dynamic edge-level parallelism Only for the edge Similar to dynamic node-level parallelism Hybrid approach Net: different sinks. Static node-level parallelism for low-fanout net Dynamic edge-level parallelism for high-fanout net [1] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal, in PPoPP 2012. 18

Multi-Net Parallel Routing Parallelization for multiple independent nets virtual source net 1 sink2 net 2 source sink1 Worklist Size filling net1 multi-net evicting net2 filling net3 evicting source sink3 sink2 net 3 sink1 Time single-net parallelism multi-net parallelism 19

Multi-Net Parallel Routing Challenge: dependency Solution: extracting Independent nets for parallelization net Example: 0 net 0 net 1 net 1 Fine-Grained Parallelism net 2 net 2 S 0 net 0 net 3 net 3 S 1 net 1 net 2 net 3 net 4 net 4 net 5 net 6 net 7 net 8 net 9 net 10 net 11 net 12 Scheduling net 5 net 6 net 7 net 8 net 9 net 10 net 11 net 12 Parallelization net 4 net 5 net 6 net 7 net 8 Coarse-Grained Parallelism net 9 net 10 net 11 net 12 20

Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 21

Methodology Experimental platform Intel Xeon E5-2640 CPU + 32GB memory Tesla K40c GPU + 12GB memory Experimental setup Baseline: VPR 7.0 router Channel width: 1.3x minimal channel width Benchmark: VTR benchmark 22

Speedup Single-net parallel routing Static node-level parallelism: 4.15x Dynamic node-level parallelism: 5.82x Dynamic edge-level parallelism: 9.75x Hybrid approach: 10.86x Multi-net parallel routing Hybrid approach: 18.72x 23

Comparison Compared to the recent existing work 3.43x fine-grained parallel router 2.67x coarse-grained parallel router 24

Quality Wirelength: 2.7% degradation 25

Scalability Construct synthetic designs Stretch the locations of sink and source of each net FPGA size: 100x100 1000x1000 Maintain a similar speedup Reduce the problem size Woklist that behaves similarly to a queue 26

Take-Away Points Search space reduction is very effective to mitigate the time complexity of GPU-friendly routing algorithm. More speedup can be achieved with both node-level and netlevel parallelization. Special thanks to LonestarGPU team! 27

Thanks! Question? 28