Parallelizing FPGA Technology Mapping using GPUs. Doris Chen, Deshanand Singh. Aug 31st, 2010

Motivation: Compile Time In the last 12 years there has been a 110x increase in FPGA logic capacity but only a 23x increase in CPU speed, a 4.8x gap. Question: Can we utilize the highly parallel GPU to speed up FPGA CAD algorithms?

What are GPUs? Graphics Processing Units (GPUs) are optimized to maximize throughput. They use a Single-Instruction, Multiple-Data (SIMD) architecture, optimized to run many threads in parallel; each thread performs a fixed set of operations on a small subset of data. Threads are arranged in blocks and executed in warps (32 threads/warp). The device consists of an array of Streaming Multiprocessors (SMs) connected through an interconnection network and L2 cache slices to global memory (DRAM).

GPU Architecture (SM diagram: instruction cache, MT-issue and constant caches, 8 SPs, 2 SFUs, a DP unit, and shared memory.) Each SM operates on 1 warp at a time; warps can be swapped in and out. Each SM contains: an instruction cache; 8 Streaming Processors (SPs), each capable of executing 1 thread at a time, so 4 clock cycles are required to execute one instruction for all 32 threads of a warp; and only 16 KB of shared memory, accessible by all 8 SPs.

Programming GPUs We used Nvidia's CUDA API (Compute Unified Device Architecture). The host program resides on the CPU; it is responsible for setting up memory and launching the GPU kernel. The GPU kernel is the code executed by every thread, each on a different data element. The host program can operate independently while the kernel runs. (Diagram: the host CPU writes data to device memory and launches the kernel on the GPU device.)
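
For context, a minimal CUDA host/kernel pair looks roughly like the following. This is a generic illustration only; the array name, sizes, and the doubling kernel are invented, not the mapper's actual code.

    // Minimal CUDA host/kernel split: the host allocates and copies data,
    // then launches a kernel in which every thread touches one element.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale_kernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            data[i] *= 2.0f;                            // each thread handles one element
    }

    int main() {
        const int n = 1 << 20;
        float *h = new float[n];
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // host writes device memory

        scale_kernel<<<(n + 255) / 256, 256>>>(d, n);                  // host launches the kernel
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // copy results back

        printf("h[0] = %f\n", h[0]);
        cudaFree(d);
        delete[] h;
        return 0;
    }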

GPUs vs. CPUs A CPU minimizes the latency experienced by one thread, using big on-chip caches and sophisticated control logic to extract parallelism. A GPU maximizes the throughput of all threads; it achieves massive parallelism through explicitly parallel code with thousands of lightweight threads. A large latency is incurred for global memory access, but multithreading is used to hide that latency, so the GPU can avoid the big caches of CPUs.

Warp Scheduling Multiple warps can be assigned to the same SM. Whenever a warp stalls, the GPU can swap in a new warp, which hides latency. Some things are not swapped out: shared memory and the register file. (Diagram: active thread warps 1..8, each of 32 threads, feeding the GPU pipeline.)

Limitations of the Kernel Threads in a warp (executing on the same SM) must execute the same instruction; all other threads do nothing. Control structures and data-dependent operations, e.g. Max(array[]) or an if (x == 0) { ... } else { ... }, are therefore inefficient: threads on one side of the branch sit idle while the other side runs. This sequential execution of threads limits GPU parallelism.
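
A small, hypothetical kernel fragment illustrating the divergence problem; path_a and path_b are stand-ins for whatever per-branch work the algorithm does.

    // Stand-ins for whatever per-branch work the algorithm does.
    __device__ int path_a(int v) { return v + 1; }
    __device__ int path_b(int v) { return v * 3; }

    __global__ void divergent(const int *x, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Threads of one warp that disagree on this condition are serialized: the
        // hardware runs the "then" side with the "else" threads masked off (idle),
        // then the "else" side with the "then" threads masked off.
        if (x[i] == 0)
            out[i] = path_a(x[i]);
        else
            out[i] = path_b(x[i]);
    }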

Graph Algorithms on GPU?? GPUs are great at performing the same computation on lots of data, but FPGA CAD algorithms are graph-based: a lot of control structures are inherent in the algorithms, and control structures are not very good for GPUs because they require serialization and lose the benefits of GPU parallelism. Data access is also irregular, since edges between nodes in the graph are essentially random. Objective: prototype this idea by parallelizing one step of the FPGA CAD flow on the GPU.

FPGA CAD Flow: HDL → Logic Synthesis → Technology Mapping → Clustering → Placement → Routing → Placed and Routed Design. Logic Synthesis converts HDL into an optimized netlist of 2-input gates. Technology Mapping maps the netlist into logic blocks that exist on the FPGA (LUTs). Clustering groups logic blocks to satisfy legality constraints and to minimize placement effort. Placement assigns clusters to locations on the FPGA. Routing implements the connections between logic elements using the programmable fabric of the FPGA.

Technology Mapping Steps Convert the netlist into an AND-Inverter Graph (AIG): the graph contains only 2-input AND gates, with inversions carried on the edges. Each node has an associated depth, the distance between this node and its primary inputs. (Diagram: a small AIG over inputs a, b, c, d with internal AND nodes x and y and one inverted edge.)
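
A flat, array-of-structs AIG representation along these lines is common in mappers; this is a generic sketch, not necessarily the data layout used in this work.

    #include <cstdint>
    #include <vector>

    // A fanin reference packs the fanin node index and the "inverted edge" flag
    // into one literal: literal = (node_id << 1) | complemented.
    struct AigNode {
        uint32_t fanin0;   // literal of the first AND input
        uint32_t fanin1;   // literal of the second AND input
        uint16_t depth;    // distance from the primary inputs
        uint16_t is_pi;    // nonzero for primary inputs (fanins unused)
    };

    inline uint32_t make_lit(uint32_t node_id, bool inverted) {
        return (node_id << 1) | (inverted ? 1u : 0u);
    }
    inline uint32_t lit_node(uint32_t lit) { return lit >> 1; }
    inline bool     lit_inv(uint32_t lit)  { return (lit & 1u) != 0; }

    // The whole AIG is then just a flat vector indexed by node id.
    using Aig = std::vector<AigNode>;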

Generate Cuts Generate coverings, or cuts; each cut will result in one Look-Up Table (LUT), and a cut is represented by its inputs. Find the optimal set of cuts that achieves minimal area (the fewer total cuts, the better) and minimal depth (good for timing). (Example diagram: Cut1 = {x, y, z, w} and Cut2 = {r, w, u, v} cover the network over inputs a..c and x..v with a mapped depth of 2 LUTs.)

Cut Computation (PIs to POs) Start from the primary inputs (PIs) and work toward the primary outputs (POs). PIs are input pins or register outputs; POs are output pins or register inputs. For each node: add the node itself to the cutset, enumerate new cuts by taking the cross-product of the cutsets of its input nodes, compute the cost of each new cut, and add the new cut into the sorted cutset. After making all computations, select the best cut for each node, starting from the POs and going backwards. This set of best cuts is the optimal tech-mapped netlist. (Example: node r has cuts {r, pbc, pq, abq, abc}, node p has cuts {p, ab}, node q has cuts {q, bc}, over inputs a, b, c.)
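
The cross-product step can be sketched in plain C++ as follows. This is a simplified sketch assuming 4-input LUTs; a real mapper would also prune dominated cuts, score each cut, and keep the cutset sorted by cost, and the names used here are hypothetical.

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <vector>

    using Cut = std::vector<int>;            // sorted list of leaf (input) node ids
    static const std::size_t K = 4;          // LUT input limit (4-LUTs assumed here)

    // Merge two cuts; fail if the union has more than K leaves.
    static bool merge_cuts(const Cut &a, const Cut &b, Cut &out) {
        out.clear();
        std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
        return out.size() <= K;
    }

    // Cutset of an AND node = the trivial cut {node} plus every feasible
    // merge of one cut from each fanin (the cross product).
    std::vector<Cut> enumerate_cuts(int node,
                                    const std::vector<Cut> &fanin0_cuts,
                                    const std::vector<Cut> &fanin1_cuts) {
        std::vector<Cut> result;
        result.push_back(Cut{node});         // the node itself is always a cut
        Cut merged;
        for (const Cut &c0 : fanin0_cuts)
            for (const Cut &c1 : fanin1_cuts)
                if (merge_cuts(c0, c1, merged))
                    result.push_back(merged);   // real code would also score, prune, and sort here
        return result;
    }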

Iterative Tech Mapping Cut computation is run multiple times, at different stages of the synthesis flow Each iteration may optimize for different metrics Iter 1: Optimize for Delay Iter 2: Optimize for Area Iter 3: Optimize for Delay Iter 5: Optimize for True Area Iter 6: Optimize for True Area

True Area Flow Try to estimate the real area cost required by a cut. We need to recursively un-do the cut and count only the additional LUTs that would have to be created. (Diagram: the original network, the unmapped network, and the true-area remapped network for the same nodes.) Problem: the GPU cannot handle recursion natively.

Choice Nodes Choice nodes give tech mapping more flexibility when it selects cuts. Either a or e could be timing-critical; if a signal is critical, we should shift it closer to its output. We therefore postpone the actual cut selection until we have more accurate timing information, because we don't know what is critical until a node's inputs have been mapped. (Diagram: two alternative implementations x1 and x2, with cuts edcr and abcs respectively, over inputs a, b, c, d, e.)

How to Parallelize Tech Mapping? Levelize the circuit (from PIs to POs) and perform cutset enumeration on all nodes in the same level in parallel; within a level there are no data dependencies. (Example: level 1 holds the inputs a, b, c; level 2 holds p with cuts {p, ab} and q with cuts {q, bc}; level 3 holds r with cuts {r, pbc, pq, abq, abc}.)

Algorithm We created 3 kernels: compute_and implements cut computation, compute_choice performs choice-node operations, and finalize updates data that the next level may need. The driver is:

    Levelize the netlist
    for i = 0 to max_levels do
        launch compute_and kernel
        if choice nodes exist in level i then
            launch compute_choice kernel
        end if
        launch finalize kernel
    end for
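
A hedged CUDA sketch of this driver loop follows; the kernel bodies, level bookkeeping, and netlist layout are placeholders invented for illustration.

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <vector>

    // Hypothetical kernels: each thread handles one node of the given level.
    __global__ void compute_and_kernel(int level_first, int level_size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= level_size) return;
        int node = level_first + i;
        // ... enumerate cuts for `node` from its fanins' cutsets ...
        (void)node;
    }
    __global__ void compute_choice_kernel(int level_first, int level_size) { /* ... */ }
    __global__ void finalize_kernel(int level_first, int level_size) { /* ... */ }

    // Host-side driver: one pass over the levelized netlist. Launches on the
    // default stream execute in order, so each level sees the previous level's results.
    void map_on_gpu(const std::vector<int> &level_first,
                    const std::vector<int> &level_size,
                    const std::vector<bool> &level_has_choices) {
        const int threads = 256;
        for (std::size_t lvl = 0; lvl < level_size.size(); ++lvl) {
            int blocks = (level_size[lvl] + threads - 1) / threads;
            compute_and_kernel<<<blocks, threads>>>(level_first[lvl], level_size[lvl]);
            if (level_has_choices[lvl])
                compute_choice_kernel<<<blocks, threads>>>(level_first[lvl], level_size[lvl]);
            finalize_kernel<<<blocks, threads>>>(level_first[lvl], level_size[lvl]);
        }
        cudaDeviceSynchronize();
    }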

Memory Management Global memory writes and reads are expensive. We copy the netlist to the GPU once (it remains unchanged) and only copy the set of selected cuts back and forth. We also remove all unused fields from the netlist and retain only a bare-bones netlist on the GPU. Data caching is important: consider a simple for loop, for (i = 0; i < netlist->size(); i++). Every time the exit condition is checked, there is a global access penalty.
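
A sketch of that split, with hypothetical buffer names and sizes: the bare-bones netlist is copied to the device once, and after each mapping pass only the selected cut per node travels back.

    #include <cuda_runtime.h>
    #include <cstddef>

    // Copy the bare-bones netlist once; it stays resident and read-only on the GPU.
    void upload_netlist(const void *host_netlist, size_t netlist_bytes, void **dev_netlist) {
        cudaMalloc(dev_netlist, netlist_bytes);
        cudaMemcpy(*dev_netlist, host_netlist, netlist_bytes, cudaMemcpyHostToDevice);
    }

    // After each mapping pass, only the selected cut per node comes back to the host.
    void download_selected_cuts(int *host_best_cut, const int *dev_best_cut, int num_nodes) {
        cudaMemcpy(host_best_cut, dev_best_cut, num_nodes * sizeof(int),
                   cudaMemcpyDeviceToHost);
    }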

Dark Side of Warp Scheduling Multiple warps are assigned to the same SM and can be swapped in and out to hide latency, but some things are not swapped out: the shared memory must be large enough to accommodate all active blocks assigned to the SM, and the register file must be large enough to accommodate all active threads assigned to the SM. Either resource can limit how many warps are assigned per SM; registers became the limitation for us. (Diagram: registers are allocated per-thread, shared memory per-block.)

Register Tradeoffs The more active threads running on an SM, the fewer registers each thread can use. Fewer registers implies a larger number of accesses to global memory, while fewer active threads inhibits the SM's ability to hide latency. (Plot: occupancy and relative runtime versus registers per thread, from 5 to 35 registers/thread.)
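
CUDA exposes knobs for this trade-off; for example, the __launch_bounds__ qualifier (or the -maxrregcount compiler flag) caps registers per thread so that more warps can be resident, at the cost of register spills. A hypothetical fragment:

    // Asking the compiler to keep register usage low enough that at least four
    // 256-thread blocks can be resident per SM; registers beyond the resulting cap
    // are spilled to (slow) local memory, so this is exactly the trade-off above.
    __global__ void __launch_bounds__(256, 4)
    cut_kernel(int level_first, int level_size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= level_size) return;
        // ... per-node cut computation ...
    }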

Implementation of True Area The GPU cannot handle recursion natively, so instead we implement an iterative approach: allocate a queue into which we put the nodes to visit, and if the queue size is exceeded, give up. This is acceptable because true area is just a fix-up stage; in other words, we just stick with the existing solution (the chosen cut). Is this safe?
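
A device-side sketch of that idea, with an explicit fixed-capacity worklist in place of recursion; the fanin arrays, the needs_lut flag, and the queue size are all illustrative assumptions rather than the actual implementation.

    #define MAX_QUEUE 64   // fixed capacity; if it would overflow, we give up on the fixup

    // Iterative replacement for a recursive walk over the nodes a cut would add.
    // Returns -1 on overflow, otherwise the number of extra LUTs this cut needs.
    __device__ int true_area_cost(int root, const int *fanin0, const int *fanin1,
                                  const unsigned char *needs_lut) {
        int queue[MAX_QUEUE];
        int head = 0, tail = 0;
        int extra_luts = 0;

        queue[tail++] = root;
        while (head < tail) {
            int node = queue[head++];
            if (!needs_lut[node]) continue;        // already covered elsewhere, costs nothing
            ++extra_luts;
            if (tail + 2 > MAX_QUEUE) return -1;   // queue full: give up, keep the chosen cut
            queue[tail++] = fanin0[node];          // visit fanins iteratively, not recursively
            queue[tail++] = fanin1[node];          // (a real version would mark visited nodes)
        }
        return extra_luts;
    }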

Is This Safe? We are unmapping/remapping nodes that are in different levels; wouldn't we run the risk of two threads working on the same node? No: true area operates only on a node's MFFC (maximum fanout-free cone), so it is actually SAFE to unmap nodes inside that cone. We are guaranteed that no other thread will be accessing the same node. (Diagram: the MFFC of node t within the example network.)

Runtime Results Implemented an alternate tech mapper in ABC, a state-of-the-art logic synthesis tool, and used a set of large industrial benchmarks (large circuits are needed to see speedups due to GPU overhead). Achieved up to 4X speedup on the parts that are parallelized.

    Circuit              Max # Levels   Avg Gates/Level   ifmap (ms)   gpumap (ms)   Speedup
    beamforming          184            1531              5920         2560          2.2
    netcard              86             1502              2450         1100          2.2
    cc_des25             30             4033              1740         670           2.6
    oc_jpeg              49             2061              2440         910           2.7
    radar20              92             2137              4040         1410          2.9
    sat_solver           77             6071              7610         2480          3.1
    uoft_raytracer       111            4954              8950         2740          3.4
    uoft_stereo_vision   24             7599              4860         1290          3.8
    blowfish_area        104            27121             54310        13200         4.1
    blowfish_pipe        22             134964            57900        13710         4.2
    vga_lcd              40             27435             17490        4070          4.3
    Geometric Average                                                                3.1

QoR Measurements Measured the quality of results by running both mapping algorithms through the Quartus II CAD tool. (Flow: HDL → ABC with ifmap and ABC with gpumap → write BLIF / write Verilog → replace black boxes, annotate and preserve constraints → Quartus II MAP → Quartus II Fitter → Quartus II STA.)

Quality of Result (QoR) Placed and routed the tech-mapped result to verify the correctness of gpumap, using Altera's Quartus II tool to place and route both the ifmap netlist and the gpumap netlist. The results show equivalent QoR to the original algorithm.

    Circuit              if Fmax   gpu Fmax   if Area   gpu Area   if LU%   gpu LU%
    beamforming          149       174        18414     19425      45       46
    netcard                                   164730    168912
    cc_des25             299       299        137750    137750     56       56
    oc_jpeg              191       188        13368     12601      23       22
    radar20              144       165        12218     12471      27       28
    sat_solver           125       124        20855     20249      42       44
    uoft_raytracer       75        75         44566     45432      72       67
    uoft_stereo_vision   129       120        77307     80461      60       61
    blowfish_area        184       193        50191     48854      72       73
    blowfish_pipe        73        62         214067    212716     66       66
    vga_lcd
    Geometric Average    139       141                             48       48

Conclusion We can use GPUs to parallelize graph-based algorithms Memory management is essential to getting good results Can achieve an average of 3.1X speedup by parallelizing Technology Mapping on a GPU Additional speedups may be possible by parallelizing ALL parts of Technology Mapping Timing analysis Truth table computation of the final selected cuts

Thank You! Questions?

Backup Slides

Overview Motivation Introduction to GPUs What is Technology Mapping? Approach Optimizations Results Conclusion

Motivation Field-Programmable Gate Arrays (FPGAs) are programmable logic devices. They have a regular, grid-like structure consisting of configurable Look-Up Tables (LUTs) surrounded by a programmable routing network, along with DSP blocks, LABs, RAM blocks, and I/O blocks. FPGAs offer shorter time-to-market, are excellent for prototyping, and are cheaper than ASICs, but they require a CAD tool to help designers take their HDL design and place and route it on the FPGA.

What are GPUs? GPUs are extremely efficient at performing computations on large quantities of data. They run thousands of lightweight threads in parallel; threads are arranged in blocks and executed in warps (32 threads/warp). (Diagram: each thread has per-thread local memory; each block has per-block shared memory.)

GPU Advantages Extremely efficient at performing computations on large quantities of data, achieving massive parallelism: 32 SMs x 1 warp/SM x 32 threads/warp = 1024 parallel threads. Although optimized for throughput, the GPU is not optimized for latency; to hide latency, warps can be swapped in and out if a stall occurs.

FPGA CAD Flow (HDL → Logic Synthesis → Technology Mapping → Clustering → Placement → Routing → Placed and Routed Design.) Technology Mapping: input is a netlist consisting of 2-input primitive gates; output is a netlist consisting of technology-dependent logic blocks.

The Technology Mapping Problem Take a circuit consisting of 2-input combinational gates, and map it to nodes that exist on the target device. (Diagram: a gate-level netlist over inputs A, B, C, D with internal signals x, y, z and output Z is mapped to a single Look-Up Table (LUT) computing Z from A, B, C, D.) The resulting netlist can then be clustered and placed on the FPGA.

Memory Hierarchy Per-thread registers: fast, single cycle. Per-block shared memory: fast, single cycle. Global (device) memory: slow, 100s of cycles. Register and shared memory accesses happen in a single cycle; global (off-chip) memory can take hundreds of cycles.

Communicating from CPU to GPU (Diagram: the CPU and host memory connect through a bridge and PCIe to the GPU's device memory; transfers use cudaMemcpy().) Copying data back and forth between the GPU and CPU is extremely expensive. Consider the difference in bandwidth: PCIe gen2 peak bandwidth is 6 GB/s (the communication link from CPU to GPU), while GPU load/store DRAM peak bandwidth is 150 GB/s.

Caching of Data Global memory accesses are SLOW. Consider a simple for loop: for (i = 0; i < netlist->size(); i++). The netlist is in global memory, so every time the exit condition is checked there is a global access penalty, and loops are everywhere! Solution: just cache frequently accessed data that does not change in local variables. This reduces global memory accesses. Speedup: 45%.
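
For example, a trivial device-side illustration; DeviceNetlist and num_nodes are hypothetical names, not the actual data structures.

    // Hypothetical netlist handle living in global memory.
    struct DeviceNetlist { int num_nodes; /* ... flat node arrays ... */ };

    __global__ void walk_nodes(const DeviceNetlist *nl) {
        // Reading nl->num_nodes in the loop condition would touch global memory on
        // every iteration; read it once into a register-resident local instead.
        const int n = nl->num_nodes;
        for (int i = 0; i < n; ++i) {
            // ... per-node work ...
        }
    }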

Still More Memory Overhead Because a large number of cuts are generated, space is allocated for them on the GPU, but the CPU does NOT care about most of them; don't copy back useless data. Solution: create a scratch space in GPU global memory to store the cuts. Speedup: 4X improvement.
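
A sketch of that scratch-space allocation, with hypothetical sizes and names: the bulky cut lists stay on the device, and only the small per-node selection ever crosses PCIe.

    #include <cuda_runtime.h>
    #include <cstddef>

    // The full cut lists live only in GPU global memory; the host never sees them.
    // Only the index of each node's selected cut is ever copied back.
    void alloc_cut_scratch(int num_nodes, int max_cuts_per_node, int words_per_cut,
                           int **dev_cut_scratch, int **dev_best_cut) {
        size_t scratch_bytes =
            (size_t)num_nodes * max_cuts_per_node * words_per_cut * sizeof(int);
        cudaMalloc((void **)dev_cut_scratch, scratch_bytes);         // device-only scratch
        cudaMalloc((void **)dev_best_cut, num_nodes * sizeof(int));  // small result buffer
    }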

Register File (Diagram: active thread warps 1..8 share the register file feeding the GPU pipeline.) Threads require exclusive access to registers in the large register file; there must be enough total registers to hold information for all threads in the group of active warps. This allows the streaming multiprocessor to quickly swap between warps, since the register values never leave the SM.

Future Work Our current approach partitions the netlist into disjoint levels and performs mapping on each node of a level in parallel. Results from one level may be needed in the computation of subsequent levels, but they are stored in off-chip global memory, which is SLOW. A better method would involve writing results to shared memory; this is complex to implement because nodes may require results from multiple different shared memories.


How to Measure Implemented an alternate tech mapper in ABC, a state-of-the-art logic synthesis tool. (Flow diagram: ABC modifications — read BLIF, structural hash, insert choices, run ifmap or gpumap, extract covers, write Verilog; gpumap implementation — build the GPU netlist and run Kernel1, Kernel2, and Kernel3 on the GPU.)