Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion

Size: px
Start display at page:

Download "Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion"

Transcription

1 Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion Minghua Shen and Guojie Luo Peking University FPGA-February 23,

2 Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 2

3 High Performance FPGA 100x Faster 10,000x Lower Cost 5000x Lower Power Pictures: Three Ages of the FPGAs: Trimberger, S, Proceedings of the IEEE Vol. 103, No. 3, March

4 FPGA in datacenters Increasing interest in use of FPGAs as application accelerators in datacenters Key advantage: Performance/Watt 4

5 High Performance FPGA 100x Faster 10,000x Lower Cost 5000x Lower Power 10,000x More Logic Pictures: Three Ages of the FPGAs: Trimberger, S, Proceedings of the IEEE Vol. 103, No. 3, March

6 d c FPGA Routing Find a physical path for every net in the circuit Disjoint-path problem and NP-complete The most complex and critical step in CAD flow 1 4 CLB 2 3 a b CLB a b 9 10 c d e f e f 5 8 CLB 6 7 g h CLB g h Challenge : Time-consuming 6

7 Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 7

8 PathFinder Routing Algorithm Basis for both Xilinx and Altera commercial routers S1 S2 (2) (1) (4) (1) (2) (3) A B (2) (1) (4) (1) (2) (3) D1 D2 Initial Penalized 2 nd iteration, by present penalized routed with congestion by base history cost 8

9 Dynamic Parallelism on GPU Operator Active node: node where computation is needed Activity: Application of operator to active element Multiple active nodes can be processed in parallel, subject to neighborhood constraints Algorithm = repeated application of operator to graph [1] Pingali et al.. The Tao of Parallelism in Algorithms, in PLDI [2] Y. Moctor and P. Brisk. Parallel FPGA routing based on the operator formulation, in DAC

10 Overall Design Flow The iteration contains two steps: Search space reduction: subgraph dynamic expansion Parallelism exploration: SSSP with dynamic parallelism on GPU Update Cost No Placed Design subgraph dynamic expansion GPU-based SSSP for routing Feasible Yes Routed Design sink 2 source sink 1 10

11 Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 11

12 Single-Net Routing Kernel: single source shortest path solver Djikstra -> VPR router Bellman-Ford -> GPU-friendly router Serial Bellman-Ford excels when running on a small graph [1] Parallel Bellman-Ford provides greatest speedup [2] Challenge : routing resource graph is large [1] A. DeHon et al. GraphStep: a system architecture for sparse-graph algorithm, in FCCM [2] F. Busato et al. An efficient implementation of the bellman-ford algorithm for Kepler GPU architecture, in IEEE Trans. on PDS

13 Subgraph Dynamic Expansion subgraph extraction sink 2 C C sink 2 C initial coverage source sink 1 source sink 1 C D sink 2 D D detection No Yes subgraph expansion source sink 2 boundary detection sink 1 source sink 1 D No feasible Yes 13

14 Effectiveness Runtime-Quality Comparison Baseline: VPR router (Dijkstra + large routing graph) BSDE: GPU-friendly router (Bellman-Ford + small routing subgraph) Runtime: 10% reduction Quality: 2.7% degradation 14

15 Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 15

16 Time Single-Net Parallel Routing Static node-level parallelism Example L0 L1 L2 L Comp 0 Comp 8 Comp 4 Comp 1 Comp 5 Comp 2 Comp 6 Comp 3 Comp [1] Pingali et al. A Quantitative Study of Irregular Programs on GPUs, in IISWC [2] Pingali et al. Data-Driven versus Topology-Driven Irregular Computations on GPUs, in IPDPS

17 Single-Net Parallel Routing Dynamic node-level parallelism Example [1] Pingali et al. A Quantitative Study of Irregular Programs on GPUs, in IISWC [2] Pingali et al. Data-Driven versus Topology-Driven Irregular Computations on GPUs, in IPDPS

18 Single-Net Parallel Routing Dynamic edge-level parallelism Only for the edge Similar to dynamic node-level parallelism Hybrid approach Net: different sinks. Static node-level parallelism for low-fanout net Dynamic edge-level parallelism for high-fanout net [1] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal, in PPoPP

19 Multi-Net Parallel Routing Parallelization for multiple independent nets virtual source net 1 sink2 net 2 source sink1 Worklist Size filling net1 multi-net evicting net2 filling net3 evicting source sink3 sink2 net 3 sink1 Time single-net parallelism multi-net parallelism 19

20 Multi-Net Parallel Routing Challenge: dependency Solution: extracting Independent nets for parallelization net Example: 0 net 0 net 1 net 1 Fine-Grained Parallelism net 2 net 2 S 0 net 0 net 3 net 3 S 1 net 1 net 2 net 3 net 4 net 4 net 5 net 6 net 7 net 8 net 9 net 10 net 11 net 12 Scheduling net 5 net 6 net 7 net 8 net 9 net 10 net 11 net 12 Parallelization net 4 net 5 net 6 net 7 net 8 Coarse-Grained Parallelism net 9 net 10 net 11 net 12 20

21 Contents Motivation Background Search Space Reduction for Routing Parallelism Exploration for GPU-Based Routing Evaluation and Conclusion 21

22 Methodology Experimental platform Intel Xeon E CPU + 32GB memory Tesla K40c GPU + 12GB memory Experimental setup Baseline: VPR 7.0 router Channel width: 1.3x minimal channel width Benchmark: VTR benchmark 22

23 Speedup Single-net parallel routing Static node-level parallelism: 4.15x Dynamic node-level parallelism: 5.82x Dynamic edge-level parallelism: 9.75x Hybrid approach: 10.86x Multi-net parallel routing Hybrid approach: 18.72x 23

24 Comparison Compared to the recent existing work 3.43x fine-grained parallel router 2.67x coarse-grained parallel router 24

25 Quality Wirelength: 2.7% degradation 25

26 Scalability Construct synthetic designs Stretch the locations of sink and source of each net FPGA size: 100x x1000 Maintain a similar speedup Reduce the problem size Woklist that behaves similarly to a queue 26

27 Take-Away Points Search space reduction is very effective to mitigate the time complexity of GPU-friendly routing algorithm. More speedup can be achieved with both node-level and netlevel parallelization. Special thanks to LonestarGPU team! 27

28 Thanks! Question? 28

Accelerate FPGA Routing with Parallel Recursive Partitioning

Accelerate FPGA Routing with Parallel Recursive Partitioning Accelerate FPGA Routing with Parallel Recursive Partitioning Minghua Shen and Guojie Luo Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China PKU-UCLA Joint

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin Parallel Programming in the Age of Ubiquitous Parallelism Keshav Pingali The University of Texas at Austin Parallelism is everywhere Texas Advanced Computing Center Laptops Cell-phones Parallel programming?

More information

Parallel Programming in the Age of Ubiquitous Parallelism. Andrew Lenharth Slides: Keshav Pingali The University of Texas at Austin

Parallel Programming in the Age of Ubiquitous Parallelism. Andrew Lenharth Slides: Keshav Pingali The University of Texas at Austin Parallel Programming in the Age of Ubiquitous Parallelism Andrew Lenharth Slides: Keshav Pingali The University of Texas at Austin Parallel Programming in the Age of Ubiquitous Parallelism Andrew Lenharth

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

Reducing Power in an FPGA via Computer-Aided Design

Reducing Power in an FPGA via Computer-Aided Design Reducing Power in an FPGA via Computer-Aided Design Steve Wilton University of British Columbia Power Reduction via CAD How to reduce power dissipation in an FPGA: - Create power-aware CAD tools - Create

More information

On-the-Fly Elimination of Dynamic Irregularities for GPU Computing

On-the-Fly Elimination of Dynamic Irregularities for GPU Computing On-the-Fly Elimination of Dynamic Irregularities for GPU Computing Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen Graphic Processing Units (GPU) 2 Graphic Processing Units (GPU) 2 Graphic

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

AN ACCELERATOR FOR FPGA PLACEMENT

AN ACCELERATOR FOR FPGA PLACEMENT AN ACCELERATOR FOR FPGA PLACEMENT Pritha Banerjee and Susmita Sur-Kolay * Abstract In this paper, we propose a constructive heuristic for initial placement of a given netlist of CLBs on a FPGA, in order

More information

Static and Dynamic Memory Footprint Reduction for FPGA Routing Algorithms

Static and Dynamic Memory Footprint Reduction for FPGA Routing Algorithms 18 Static and Dynamic Memory Footprint Reduction for FPGA Routing Algorithms SCOTT Y. L. CHIN and STEVEN J. E. WILTON University of British Columbia This article presents techniques to reduce the static

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental

More information

Memory Footprint Reduction for FPGA Routing Algorithms

Memory Footprint Reduction for FPGA Routing Algorithms Memory Footprint Reduction for FPGA Routing Algorithms Scott Y.L. Chin, and Steven J.E. Wilton Department of Electrical and Computer Engineering University of British Columbia Vancouver, B.C., Canada email:

More information

GRAph Parallel Actor Language A Programming Language for Parallel Graph Algorithms

GRAph Parallel Actor Language A Programming Language for Parallel Graph Algorithms GRAph Parallel Actor Language A Programming Language for Parallel Graph Algorithms Thesis by Michael delorimier In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy California

More information

Congestion-Driven Regional Re-clustering for Low-Cost FPGAs

Congestion-Driven Regional Re-clustering for Low-Cost FPGAs Congestion-Driven Regional Re-clustering for Low-Cost FPGAs Darius Chiu, Guy G.F. Lemieux, Steve Wilton Electrical and Computer Engineering, University of British Columbia British Columbia, Canada dariusc@ece.ubc.ca

More information

Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield

Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield Alternative GPU friendly assignment algorithms Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield Graphics Processing Units (GPUs) Context: GPU Performance Accelerated

More information

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video Amrita Mazumdar Armin Alaghi Jonathan T. Barron David Gallup Luis Ceze Mark Oskin Steven M. Seitz University of Washington Google

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

Deploying Graph Algorithms on GPUs: an Adaptive Solution

Deploying Graph Algorithms on GPUs: an Adaptive Solution Deploying Graph Algorithms on GPUs: an Adaptive Solution Da Li Dept. of Electrical and Computer Engineering University of Missouri - Columbia dlx7f@mail.missouri.edu Abstract Thanks to their massive computational

More information

Solving Large Complex Problems. Efficient and Smart Solutions for Large Models

Solving Large Complex Problems. Efficient and Smart Solutions for Large Models Solving Large Complex Problems Efficient and Smart Solutions for Large Models 1 ANSYS Structural Mechanics Solutions offers several techniques 2 Current trends in simulation show an increased need for

More information

A Routing Approach to Reduce Glitches in Low Power FPGAs

A Routing Approach to Reduce Glitches in Low Power FPGAs A Routing Approach to Reduce Glitches in Low Power FPGAs Quang Dinh, Deming Chen, Martin Wong Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign This research

More information

Wire Type Assignment for FPGA Routing

Wire Type Assignment for FPGA Routing Wire Type Assignment for FPGA Routing Seokjin Lee Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712 seokjin@cs.utexas.edu Hua Xiang, D. F. Wong Department

More information

Introduction Warp Processors Dynamic HW/SW Partitioning. Introduction Standard binary - Separating Function and Architecture

Introduction Warp Processors Dynamic HW/SW Partitioning. Introduction Standard binary - Separating Function and Architecture Roman Lysecky Department of Electrical and Computer Engineering University of Arizona Dynamic HW/SW Partitioning Initially execute application in software only 5 Partitioned application executes faster

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis Abstract: Lower upper (LU) factorization for sparse matrices is the most important computing step for circuit simulation

More information

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16 GPU Multisplit Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens S. Ashkiani (UC Davis) GPU Multisplit GTC 2016 1 / 16 Motivating Simple Example: Compaction Compaction (i.e., binary split) Traditional

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

Mapping-Aware Constrained Scheduling for LUT-Based FPGAs

Mapping-Aware Constrained Scheduling for LUT-Based FPGAs Mapping-Aware Constrained Scheduling for LUT-Based FPGAs Mingxing Tan, Steve Dai, Udit Gupta, Zhiru Zhang School of Electrical and Computer Engineering Cornell University High-Level Synthesis (HLS) for

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

ParaDiMe: A Distributed Memory FPGA Router Based on Speculative Parallelism and Path Encoding

ParaDiMe: A Distributed Memory FPGA Router Based on Speculative Parallelism and Path Encoding ParaDiMe: A Distributed Memory FPGA Router Based on Speculative Parallelism and Path Encoding Chin Hau Hoo Department of Electrical & Computer Engineering National University of Singapore Singapore chinhau.hoo@u.nus.edu

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power

More information

Performance analysis of -stepping algorithm on CPU and GPU

Performance analysis of -stepping algorithm on CPU and GPU Performance analysis of -stepping algorithm on CPU and GPU Dmitry Lesnikov 1 dslesnikov@gmail.com Mikhail Chernoskutov 1,2 mach@imm.uran.ru 1 Ural Federal University (Yekaterinburg, Russia) 2 Krasovskii

More information

GENETIC ALGORITHM BASED FPGA PLACEMENT ON GPU SUNDAR SRINIVASAN SENTHILKUMAR T. R.

GENETIC ALGORITHM BASED FPGA PLACEMENT ON GPU SUNDAR SRINIVASAN SENTHILKUMAR T. R. GENETIC ALGORITHM BASED FPGA PLACEMENT ON GPU SUNDAR SRINIVASAN SENTHILKUMAR T R FPGA PLACEMENT PROBLEM Input A technology mapped netlist of Configurable Logic Blocks (CLB) realizing a given circuit Output

More information

THE COARSE-GRAINED / FINE-GRAINED LOGIC INTERFACE IN FPGAS WITH EMBEDDED FLOATING-POINT ARITHMETIC UNITS

THE COARSE-GRAINED / FINE-GRAINED LOGIC INTERFACE IN FPGAS WITH EMBEDDED FLOATING-POINT ARITHMETIC UNITS THE COARSE-GRAINED / FINE-GRAINED LOGIC INTERFACE IN FPGAS WITH EMBEDDED FLOATING-POINT ARITHMETIC UNITS Chi Wai Yu 1, Julien Lamoureux 2, Steven J.E. Wilton 2, Philip H.W. Leong 3, Wayne Luk 1 1 Dept

More information

Double Patterning-Aware Detailed Routing with Mask Usage Balancing

Double Patterning-Aware Detailed Routing with Mask Usage Balancing Double Patterning-Aware Detailed Routing with Mask Usage Balancing Seong-I Lei Department of Computer Science National Tsing Hua University HsinChu, Taiwan Email: d9762804@oz.nthu.edu.tw Chris Chu Department

More information

An LP-based Methodology for Improved Timing-Driven Placement

An LP-based Methodology for Improved Timing-Driven Placement An LP-based Methodology for Improved Timing-Driven Placement Qingzhou (Ben) Wang, John Lillis and Shubhankar Sanyal Department of Computer Science University of Illinois at Chicago Chicago, IL 60607 {qwang,

More information

Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays With Sparse Intracluster Routing Crossbars

Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays With Sparse Intracluster Routing Crossbars 1928 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 34, NO. 12, DECEMBER 2015 Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays With Sparse

More information

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH

More information

Placement Algorithm for FPGA Circuits

Placement Algorithm for FPGA Circuits Placement Algorithm for FPGA Circuits ZOLTAN BARUCH, OCTAVIAN CREŢ, KALMAN PUSZTAI Computer Science Department, Technical University of Cluj-Napoca, 26, Bariţiu St., 3400 Cluj-Napoca, Romania {Zoltan.Baruch,

More information

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation Ray Browell nvidia Technology Theater SC12 1 2012 ANSYS, Inc. nvidia Technology Theater SC12 HPC Revolution Recent

More information

GAIL The Graph Algorithm Iron Law

GAIL The Graph Algorithm Iron Law GAIL The Graph Algorithm Iron Law Scott Beamer, Krste Asanović, David Patterson GAP Berkeley Electrical Engineering & Computer Sciences gap.cs.berkeley.edu Graph Applications Social Network Analysis Recommendations

More information

Effects of FPGA Architecture on FPGA Routing

Effects of FPGA Architecture on FPGA Routing Effects of FPGA Architecture on FPGA Routing Stephen Trimberger Xilinx, Inc. 2100 Logic Drive San Jose, CA 95124 USA steve@xilinx.com Abstract Although many traditional Mask Programmed Gate Array (MPGA)

More information

Ultra-Fast NoC Emulation on a Single FPGA

Ultra-Fast NoC Emulation on a Single FPGA The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

A Fine-grained Performance-based Decision Model for Virtualization Application Solution

A Fine-grained Performance-based Decision Model for Virtualization Application Solution A Fine-grained Performance-based Decision Model for Virtualization Application Solution Jianhai Chen College of Computer Science Zhejiang University Hangzhou City, Zhejiang Province, China 2011/08/29 Outline

More information

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays with Sparse Intra-cluster Routing Crossbars

Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays with Sparse Intra-cluster Routing Crossbars > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 1 Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays with Sparse Intra-cluster Routing

More information

Pactron FPGA Accelerated Computing Solutions

Pactron FPGA Accelerated Computing Solutions Pactron FPGA Accelerated Computing Solutions Intel Xeon + Altera FPGA 2015 Pactron HJPC Corporation 1 Motivation for Accelerators Enhanced Performance: Accelerators compliment CPU cores to meet market

More information

Performance-Preserved Analog Routing Methodology via Wire Load Reduction

Performance-Preserved Analog Routing Methodology via Wire Load Reduction Electronic Design Automation Laboratory (EDA LAB) Performance-Preserved Analog Routing Methodology via Wire Load Reduction Hao-Yu Chi, Hwa-Yi Tseng, Chien-Nan Jimmy Liu, Hung-Ming Chen 2 Dept. of Electrical

More information

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Xin-Wei Shih, Tzu-Hsuan Hsu, Hsu-Chieh Lee, Yao-Wen Chang, Kai-Yuan Chao 2013.01.24 1 Outline 2 Clock Network Synthesis Clock network

More information

Vdd Programmable and Variation Tolerant FPGA Circuits and Architectures

Vdd Programmable and Variation Tolerant FPGA Circuits and Architectures Vdd Programmable and Variation Tolerant FPGA Circuits and Architectures Prof. Lei He EE Department, UCLA LHE@ee.ucla.edu Partially supported by NSF. Pathway to Power Efficiency and Variation Tolerance

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,

More information

An efficient FPGA priority queue implementation with application to the routing problem

An efficient FPGA priority queue implementation with application to the routing problem An efficient FPGA priority queue implementation with application to the routing problem Joseph Rios Technical Report ucsc-crl-07-01 Department of Computer Engineering School of Engineering University of

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17

Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17 Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17 Tutorial Instructors [James Reinders, Michael J. Voss, Pablo Reble, Rafael Asenjo]

More information

ICS 252 Introduction to Computer Design

ICS 252 Introduction to Computer Design ICS 252 Introduction to Computer Design Lecture 16 Eli Bozorgzadeh Computer Science Department-UCI References and Copyright Textbooks referred (none required) [Mic94] G. De Micheli Synthesis and Optimization

More information

LDetector: A low overhead data race detector for GPU programs

LDetector: A low overhead data race detector for GPU programs LDetector: A low overhead data race detector for GPU programs 1 PENGCHENG LI CHEN DING XIAOYU HU TOLGA SOYATA UNIVERSITY OF ROCHESTER 1 Data races in GPU Introduction & Contribution Impact correctness

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering

More information

MAXIMIZING THE REUSE OF ROUTING RESOURCES IN A RECONFIGURATION-AWARE CONNECTION ROUTER

MAXIMIZING THE REUSE OF ROUTING RESOURCES IN A RECONFIGURATION-AWARE CONNECTION ROUTER MAXIMIZING THE REUSE OF ROUTING RESOURCES IN A RECONFIGURATION-AWARE CONNECTION ROUTER Elias Vansteenkiste, Karel Bruneel and Dirk Stroobandt Department of Electronics and Information Systems Ghent University

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU. Robert Strzodka NVAMG Project Lead

Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU. Robert Strzodka NVAMG Project Lead Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU Robert Strzodka NVAMG Project Lead A Parallel Success Story in Five Steps 2 Step 1: Understand Application ANSYS Fluent Computational Fluid Dynamics

More information

On Constructing Lower Power and Robust Clock Tree via Slew Budgeting

On Constructing Lower Power and Robust Clock Tree via Slew Budgeting 1 On Constructing Lower Power and Robust Clock Tree via Slew Budgeting Yeh-Chi Chang, Chun-Kai Wang and Hung-Ming Chen Dept. of EE, National Chiao Tung University, Taiwan 2012 年 3 月 29 日 Outline 2 Motivation

More information

A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs

A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs Harrys Sidiropoulos, Kostas Siozios and Dimitrios Soudris School of Electrical & Computer Engineering National

More information

High Performance Ocean Modeling using CUDA

High Performance Ocean Modeling using CUDA using CUDA Chris Lupo Computer Science Cal Poly Slide 1 Acknowledgements Dr. Paul Choboter Jason Mak Ian Panzer Spencer Lines Sagiv Sheelo Jake Gardner Slide 2 Background Joint research with Dr. Paul Choboter

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

PushPull: Short Path Padding for Timing Error Resilient Circuits YU-MING YANG IRIS HUI-RU JIANG SUNG-TING HO. IRIS Lab National Chiao Tung University

PushPull: Short Path Padding for Timing Error Resilient Circuits YU-MING YANG IRIS HUI-RU JIANG SUNG-TING HO. IRIS Lab National Chiao Tung University PushPull: Short Path Padding for Timing Error Resilient Circuits YU-MING YANG IRIS HUI-RU JIANG SUNG-TING HO IRIS Lab National Chiao Tung University Outline Introduction Problem Formulation Algorithm -

More information

14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs

14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs 14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs K. Esler, D. Dembeck, K. Mukundakrishnan, V. Natoli, J. Shumway and Y. Zhang Stone Ridge Technology, Bel Air, MD

More information

FastPlace 2.0: An Efficient Analytical Placer for Mixed- Mode Designs

FastPlace 2.0: An Efficient Analytical Placer for Mixed- Mode Designs FastPlace.0: An Efficient Analytical Placer for Mixed- Mode Designs Natarajan Viswanathan Min Pan Chris Chu Iowa State University ASP-DAC 006 Work supported by SRC under Task ID: 106.001 Mixed-Mode Placement

More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Introduction. A very important step in physical design cycle. It is the process of arranging a set of modules on the layout surface.

Introduction. A very important step in physical design cycle. It is the process of arranging a set of modules on the layout surface. Placement Introduction A very important step in physical design cycle. A poor placement requires larger area. Also results in performance degradation. It is the process of arranging a set of modules on

More information

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Steven J.E. Wilton Department of Electrical and Computer Engineering University of British Columbia Vancouver, BC, Canada, V6T

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

Routability-Driven Bump Assignment for Chip-Package Co-Design

Routability-Driven Bump Assignment for Chip-Package Co-Design 1 Routability-Driven Bump Assignment for Chip-Package Co-Design Presenter: Hung-Ming Chen Outline 2 Introduction Motivation Previous works Our contributions Preliminary Problem formulation Bump assignment

More information

SYNTHESIZING CONCURRENT SCHEDULERS FOR IRREGULAR ALGORITHMS. Donald Nguyen, Keshav Pingali

SYNTHESIZING CONCURRENT SCHEDULERS FOR IRREGULAR ALGORITHMS. Donald Nguyen, Keshav Pingali SYNTHESIZING CONCURRENT SCHEDULERS FOR IRREGULAR ALGORITHMS Donald Nguyen, Keshav Pingali TERMINOLOGY irregular algorithm opposite: dense arrays work on pointered data structures like graphs shape of the

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Cluster-based 3D Reconstruction of Aerial Video

Cluster-based 3D Reconstruction of Aerial Video Cluster-based 3D Reconstruction of Aerial Video Scott Sawyer (scott.sawyer@ll.mit.edu) MIT Lincoln Laboratory HPEC 12 12 September 2012 This work is sponsored by the Assistant Secretary of Defense for

More information

Routing. Directly Connected IP Networks. Data link layer routing. ifconfig command

Routing. Directly Connected IP Networks. Data link layer routing. ifconfig command Routing Basic principles dr. C. P. J. Koymans Informatics Institute University of Amsterdam (version 1.1, 2010/02/19 12:21:58) Monday, February 22, 2010 Basic setup Directly connected Not directly connected

More information

An integrated placement and routing approach

An integrated placement and routing approach Retrospective Theses and Dissertations 2006 An integrated placement and routing approach Min Pan Iowa State University Follow this and additional works at: http://lib.dr.iastate.edu/rtd Part of the Electrical

More information

Supporting Multithreading in Configurable Soft Processor Cores

Supporting Multithreading in Configurable Soft Processor Cores Supporting Multithreading in Configurable Soft Processor Cores Roger Moussali, Nabil Ghanem, and Mazen A. R. Saghir Department of Electrical and Computer Engineering American University of Beirut P.O.

More information

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang*, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili* *Computer Architecture and System Lab, Georgia Institute of Technology NVIDIA

More information

Chapter 12. Routing and Routing Protocols 12-1

Chapter 12. Routing and Routing Protocols 12-1 Chapter 12 Routing and Routing Protocols 12-1 Routing in Circuit Switched Network Many connections will need paths through more than one switch Need to find a route Efficiency Resilience Public telephone

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

CAD Algorithms. Placement and Floorplanning

CAD Algorithms. Placement and Floorplanning CAD Algorithms Placement Mohammad Tehranipoor ECE Department 4 November 2008 1 Placement and Floorplanning Layout maps the structural representation of circuit into a physical representation Physical representation:

More information

On the limits of (and opportunities for?) GPU acceleration

On the limits of (and opportunities for?) GPU acceleration On the limits of (and opportunities for?) GPU acceleration Aparna Chandramowlishwaran, Jee Choi, Kenneth Czechowski, Murat (Efe) Guney, Logan Moon, Aashay Shringarpure, Richard (Rich) Vuduc HotPar 10,

More information

Towards Efficient Graph Traversal using a Multi- GPU Cluster

Towards Efficient Graph Traversal using a Multi- GPU Cluster Towards Efficient Graph Traversal using a Multi- GPU Cluster Hina Hameed Systems Research Lab, Department of Computer Science, FAST National University of Computer and Emerging Sciences Karachi, Pakistan

More information

Mapping-aware Logic Synthesis with Parallelized Stochastic Optimization

Mapping-aware Logic Synthesis with Parallelized Stochastic Optimization Mapping-aware Logic Synthesis with Parallelized Stochastic Optimization Zhiru Zhang School of ECE, Cornell University September 29, 2017 @ EPFL A Case Study on Digit Recognition bit6 popcount(bit49 digit)

More information