Unlocking FPGAs Using High- Level Synthesis Compiler Technologies

Similar documents
SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

Vivado HLx Design Entry. June 2016

Altera SDK for OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL

SDSoC: Session 1

SDAccel Development Environment User Guide

NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES

High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS

Cadence SystemC Design and Verification. NMI FPGA Network Meeting Jan 21, 2015

81920**slide. 1Developing the Accelerator Using HLS

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.

Porting Performance across GPUs and FPGAs

The S6000 Family of Processors

Welcome. Altera Technology Roadshow 2013

COE 561 Digital System Design & Synthesis Introduction

Versal: AI Engine & Programming Environment

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Unit 2: High-Level Synthesis

Early Models in Silicon with SystemC synthesis

World s most advanced data center accelerator for PCIe-based servers

Does FPGA-based prototyping really have to be this difficult?

MOJTABA MAHDAVI Mojtaba Mahdavi DSP Design Course, EIT Department, Lund University, Sweden

Extending the Power of FPGAs

Adaptable Intelligence The Next Computing Era

Tesla GPU Computing A Revolution in High Performance Computing

Experts in Application Acceleration Synective Labs AB

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx

SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou

Lecture 11: OpenCL and Altera OpenCL. James C. Hoe Department of ECE Carnegie Mellon University

40Gbps+ Full Line Rate, Programmable Network Accelerators for Low Latency Applications SAAHPC 19 th July 2011

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

100M Gate Designs in FPGAs

FlexRIO. FPGAs Bringing Custom Functionality to Instruments. Ravichandran Raghavan Technical Marketing Engineer. ni.com

Towards Optimal Custom Instruction Processors

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Ten Reasons to Optimize a Processor

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

Simplifying FPGA Design for SDR with a Network on Chip Architecture

System Level Design with IBM PowerPC Models

LegUp: Accelerating Memcached on Cloud FPGAs

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Design of Transport Triggered Architecture Processor for Discrete Cosine Transform

Pactron FPGA Accelerated Computing Solutions

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Overview of ROCCC 2.0

FPGA Entering the Era of the All Programmable SoC

Programmable Logic Design Grzegorz Budzyń Lecture. 15: Advanced hardware in FPGA structures

Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm

An Overview of a Compiler for Mapping MATLAB Programs onto FPGAs

Upgrading XL Fortran Compilers

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

IBM XL C/C++ V2R1M1 web deliverable for z/os V2R1

Zynq-7000 All Programmable SoC Product Overview

ATS-GPU Real Time Signal Processing Software

Ted N. Booth. DesignLinx Hardware Solutions

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013

General Purpose GPU Computing in Partial Wave Analysis

Accelerating Data Center Workloads with FPGAs

Investigation of High-Level Synthesis tools applicability to data acquisition systems design based on the CMS ECAL Data Concentrator Card example

High-Level Synthesis

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Revolutionizing the Datacenter

SDA: Software-Defined Accelerator for general-purpose big data analysis system

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

Technology for a better society. hetcomp.com

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

ESL design with the Agility Compiler for SystemC

Optimised OpenCL Workgroup Synthesis for Hybrid ARM-FPGA Devices

Profiling & Tuning Applications. CUDA Course István Reguly

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

Did I Just Do That on a Bunch of FPGAs?

Intel HLS Compiler: Fast Design, Coding, and Hardware

CUDA. Matthew Joyner, Jeremy Williams

Custom computing systems

High-level programming tools for FPGA

Lecture 7: Introduction to Co-synthesis Algorithms

Accelerating FPGA/ASIC Design and Verification

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

XPU A Programmable FPGA Accelerator for Diverse Workloads

OpenCL on FPGAs - Creating custom accelerated solutions

Accelerating Vision Processing

Tesla GPU Computing A Revolution in High Performance Computing

Convey Wolverine Application Accelerators. Architectural Overview. Convey White Paper

Addressing Heterogeneity in Manycore Applications

OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch

FPGA architecture and design technology

Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013

Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Fast and Flexible High-Level Synthesis from OpenCL using Reconfiguration Contexts

Co-synthesis and Accelerator based Embedded System Design

FPGA design with National Instuments

S2C K7 Prodigy Logic Module Series

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Transcription:

Unlocking FPGAs Using High- Leel Synthesis Compiler Technologies Fernando Mar*nez Vallina, Henry Styles Xilinx Feb 22, 2015

Why are FPGAs Good Scalable, highly parallel and customizable compute 10s to 1000s of processing elements Processing elements can be fined tuned to specific work loads Bit leel integer and fixed point operations Custom Memory Architectures Memories sized to the application Memory elements placed next to datapath Custom Data Moement and I/O More silicon area dedicated to wires and data transport than any other deice Direct low latency connections between I/O pins and compute Page 2

FPGA Applications for Data Center Application Network Function Virtualization Encryption Dictionary compression (ie Deflate) Search categorization Image classification by machine learning Image transcoding Genome sequencing Key Value Store Efficient bit leel operations FPGA Benefit Customized memory hierarchy and efficient string matching Control flow dominated, thousands of actie state machines Ability to customize data moement to specific working sets Fine grained parallelism with customized memory hierarchy Thousands of parallel bit leel operations Low latency I/O FPGAs Offer Significant Speedup Potential s CPU and GPU Page 3

Key Value Store Acceleration with FPGAs Line-rate maximum achieed by Xilinx FPGA accelerator Up to 36x in performance/watt demonstrated Plus 10-100x lower latency Page 4

Productiity Challenges for Software (Daid Thomas, Imperial College, UK) Relatie Performance 256 64 16 4 Parallelisation Clock Rate FPGA GPU CPU 1 025 Initial Design Hour Day Week Month Year Design-time FPGAs proide large speed-up and power saings at a price! Days or weeks to get an initial ersion working Multiple optimisation and erification cycles to get high performance Page 5

Programming Flow Eolution User interface HW World Hardware Software SW World Traditional flow Viado HLS flow SDAccel flow C/C++ CPU C/C++ CPU C/C++ CPU RTL RTL C/C++ RTL Accel C/C++ Accel C/C++ Accel Multiple languages and tools Single language and tool Page 6

Viado HLS Raising the Abstraction for IP Creation Page 7

Viado High-Leel Synthesis C, C++ or SystemC Algorithmic Specification Micro Architecture Exploration VHDL or Verilog RTL Implementation System IP Integration Comprehensie Integration with the Xilinx Design Enironment Accelerates Algorithmic C to RTL IP integration Page 8

Viado HLS System IP Integration Flow C- based IP CreaAon System IntegraAon C, C++, SystemC Libraries Viado IP Integrator Arbitrary Precision Video Math Linear algebra IP: FFT and FIR IP Catalog VHDL or Verilog Viado RTL System Generator for DSP Viado HLS Integrates into System Flows Page 9

Design Decisions Decisions made by designer Functionality As implicit state machine Performance Latency, throughput Interfaces Storage architecture Memories, registers banks etc Partitioning into modules Decisions made by the tool State machine Structure, encoding Pipelining Pipeline registers allocation Scheduling Memory I/O Interface I/O Functional operations Design Exploration Page 10

Datapath Synthesis Example: y = a*x + b + c; 1 HLS begins by extrac*ng a data flow graph (DFG), a func*onal representa*on a x b c * + + 3 Expression balancing for latency reduc*on Restructuring to op*mize fabric resources a x b c * + + y y 2 Accounts for target Fmax to determine minimum required pipelining (not yet the op*mal implementa*on) a x b c * + + 4 Restructuring for op*mized DSP48 a x b c * + + Structured for DSP inference or leeraging floa*ng point IPs y y Page 11

Interface Synthesis Completing the design C function arguments become RTL interface ports f(int in[20], int out[20]) { int a,b,c,x,y; for(int i = 0; i < 20; i++) { x = in[i]; y = a*x + b + c; out[i] = y;} } in BRA M i a x b c * + + Control logic FSM y i out BRA M The State Machine Automa>cally Adapts to the Design Interface Page 12

Viado HLS Examples Page 13

MGS QRD + Weight back Substitution à STAP This is the Hardest Floating Point Problem for Hardware

MGS QRD+WBS RESULTS

HLS Simple CA- CFAR 32 Cells, 14 Left, 14 Right, 2 Guard

CFAR RESULTS

SDAccel Software Centric View of FPGA Programming Page 18

OpenCL : Technical Goals Royalty Free standard for Parallel Programming Shared IP zones Target Eerything in a Heterogeneous Platform CPU + GPU + ASIC accelerators + DSP + FPGA Cross-Vendor Functional Portability No proprietary APIs + languages to learn Enable Architecture Specific Optimization Refactor code to optimize Vendor extensions Page 19

Introducing SDAccel Page 20

Breakthrough Solution for Application Deelopers Libraries OpenCL built-ins Video, DSP, Linear Algebra OpenCV BLAS Aailability Included Included Proided by Auiz Systems Readily Aailable Boards Page 21

Virtex-7 OpenCL Platform Virtex-7 690T Half-length, low profile 2x 8GB DDR3 1600Mb/s, ECC Gen 3 x8 PCIe Up to 287 GFlops/W 2x SFP+ modules http://wwwalpha-datacom/pdfs/adm-pcie-73pdf OpenCL COTS Platforms from Deelopment to Production Page 22 Copyright 2014 Xilinx Page 22

Kintex-7 OpenCL Platform M-505-K325T Kintex-7 325T 8GB DDR3 Gen 2 X8 PCI-e ISO 13485 http://picocomputingcom/products/embeddedmodules/m-505-k325t-embedded-2 Scalable Customizable Acceleration Page 23 Copyright 2014 Xilinx Page 23

Virtex7 OpenCL Platform Wolerine WX690/WX2000 WX690: V7 690T WX2000: V7 2000T 16/32/64 GB DDR3 Gen3 X16 PCI-e http://wwwconeycomputercom/products/wolerine/ Highest performance acceleration at the lowest power enelope Page 24 Copyright 2014 Xilinx Page 24

Accelerated OpenCL Programming and Deployment CPU/GPU like programming enironment ü X86 deelopment and debug ü Cycle accurate simulation ü Platform acceleration Page 25 Xilinx Internal

SDAccel Kernel Compiler for FPGAs C, C++ OpenCL Leeraging Viado HLS for ü Efficiently Optimizes a Variety of Input Types ü Optimized Architectural Parallelizing and Pipelining ü Broadest Range of Compiler Optimization for Memory, Dataflow, & Loop Pipelining Performance and Resource Efficiency Page 26

Pipelining OpenCL C Kernel C/C++ Kernel kernel oid foo() { attribute ((xcl_pipeline_loop)) for (int i=0; i<3; i++) { int idx = get_global_id(0)*3 + i; op_read(idx); op_compute(idx); op_write(idx); } } oid foo() { for (int i=0; i<3; i++) { #pragma HLS pipeline int idx = get_global_id(0)*3 + i; op_read(idx); op_compute(idx); op_write(idx); } } execution time of loop 9 cycles loop iteration 5 cycles Page 27

Loop Unrolling /* ector multiply */ kernel oid mult(local int* a, local int* b, local int* c) { int tid = get_global_id(0); } attribute ((unroll_loop_hint(2))) for (int i=0; i<4; i++) { int idx = tid*4 + i; a[idx] = b[idx] * c[idx]; } /* ector multiply */ kernel oid mult(local int* a, local int* b, local int* c) { int tid = get_global_id(0); } attribute ((unroll_loop_hint)) for (int i=0; i<4; i++) { int idx = tid*4 + i; a[idx] = b[idx] * c[idx]; } iter0 iter1 iter2 iter3 iter0 iter1 iter0 Automatic with heuristics User directed OpenCL attribute Expose more concurrency to the compiler s scheduler Similar to SIMD ectorization Page 28

Memory Specialization OpenCL C Kernel local int buffer[16] ; C/C++ Kernel int buffer[16] ; 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 buffer Buffer implemented in 1 physical memory Buffer can sustain 2 concurrent transactions Reading all alues of buffer can take 8 clock cycles Page 29

Memory Specialization OpenCL C Kernel local int buffer[16] attribute ((xcl_array_partition(block,4,1))); C/C++ Kernel int buffer[16]; #pragma HLS ARRAY_PARTITION ariable=buffer block factor=4 dim=1 Buffer implemented in 4 physical memories Buffer can sustain 8 concurrent transactions Compiler handles all address translations to physical memories Reading all alues of buffer can take 2 clock cycles Page 30

Memory Specialization OpenCL C Kernel local int buffer[16] attribute ((xcl_array_partition(complete,1))); C/C++ Kernel int buffer[16]; #pragma HLS ARRAY_PARTITION ariable=buffer complete dim=1 Buffer implemented in 16 registers Buffer can sustain 16 concurrent transactions Compiler handles all address translations to physical memories Reading all alues of buffer can take 1 clock cycles Page 31

SDAccel Platform Model CPU Host UltraScale FPGAs PCIe Scalable Host and Base During Updates DDR Memory Software Runtime Experience ü On-demand loadable acceleration units ü Always on interfaces (Memory, Ethernet PCIe, Video) ü Minimizes compilation time Page 32

SDAccel Applications Page 33

Virtex-7 Monte-Carlo Options Pricing PCIe3 x8 DMA Interconnect DDR DDR MC Black Scholes OpenCL Compute Unit Local memory Local memory

http://auizsystemscom/products/auizc/ FPGA Optimized Libraries Page 35 Copyright 2014 Xilinx Page 35

SDAccel in Action

SDAccel Performance/Watt Adantage Metric SDAccel with AuizCV library Nidia K20 with CUDA SDAccel Adantage HD Sobel Filter Frames/watt 80 11 7x HD Image Downscaling Frames/watt 36 5 7x Application *Data from Auiz Systems Page 37 Copyright 2014 Xilinx Page 37

Start Designing with SDAccel Today Page 38

Thank You Page 39