Unlocking FPGAs Using High-Level Synthesis Compiler Technologies

Fernando Martinez Vallina, Henry Styles
Xilinx
Feb 22, 2015

Why are FPGAs Good
- Scalable, highly parallel and customizable compute
  - 10s to 1000s of processing elements
  - Processing elements can be fine-tuned to specific workloads
  - Bit-level integer and fixed-point operations
- Custom memory architectures
  - Memories sized to the application
  - Memory elements placed next to the datapath
- Custom data movement and I/O
  - More silicon area dedicated to wires and data transport than any other device
  - Direct low-latency connections between I/O pins and compute

FPGA Applications for Data Center
Applications:
- Network Function Virtualization
- Encryption
- Dictionary compression (i.e. Deflate)
- Search categorization
- Image classification by machine learning
- Image transcoding
- Genome sequencing
- Key Value Store
FPGA benefits:
- Efficient bit-level operations
- Customized memory hierarchy and efficient string matching
- Control flow dominated, thousands of active state machines
- Ability to customize data movement to specific working sets
- Fine-grained parallelism with customized memory hierarchy
- Thousands of parallel bit-level operations
- Low-latency I/O
FPGAs offer significant speedup potential vs CPU and GPU

Key Value Store Acceleration with FPGAs
- Line-rate maximum achieved by Xilinx FPGA accelerator
- Up to 36x in performance/watt demonstrated
- Plus 10-100x lower latency

Productivity Challenges for Software
[Figure (David Thomas, Imperial College, UK): relative performance (axis ticks 0.25, 1, 4, 16, 64, 256) against design time (hour, day, week, month, year) for CPU, GPU, and FPGA, with stages marked Initial Design, Parallelisation, and Clock Rate]
- FPGAs provide large speed-up and power savings, at a price!
- Days or weeks to get an initial version working
- Multiple optimisation and verification cycles to get high performance

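Programming Flow Evolution
[Figure: three flows placed along a user-interface axis running from the HW world to the SW world]
- Traditional flow: C/C++ for the CPU, RTL for the accelerator
- Vivado HLS flow: C/C++ for the CPU, C/C++ compiled to RTL for the accelerator
- SDAccel flow: C/C++ for the CPU, C/C++ for the accelerator
The flows range from multiple languages and tools (traditional) to a single language and tool (SDAccel)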

Vivado HLS: Raising the Abstraction for IP Creation

Vivado High-Level Synthesis
- Algorithmic specification: C, C++ or SystemC
- Micro-architecture exploration
- RTL implementation: VHDL or Verilog
- System IP integration
- Comprehensive integration with the Xilinx design environment
Accelerates algorithmic C to RTL IP integration

Vivado HLS System IP Integration Flow
C-based IP creation:
- C, C++, SystemC
- Libraries: arbitrary precision, video, math, linear algebra
- IP: FFT and FIR
- VHDL or Verilog
System integration:
- Vivado IP Integrator
- IP Catalog
- Vivado RTL
- System Generator for DSP
Vivado HLS integrates into system flows

Design Decisions
Decisions made by the designer:
- Functionality: as implicit state machine
- Performance: latency, throughput
- Interfaces
- Storage architecture: memories, register banks, etc.
- Partitioning into modules
Decisions made by the tool:
- State machine: structure, encoding
- Pipelining: pipeline register allocation
- Scheduling: memory I/O, interface I/O, functional operations
Design exploration

Datapath Synthesis
Example: y = a*x + b + c;
1. HLS begins by extracting a data flow graph (DFG), a functional representation
2. Accounts for target Fmax to determine minimum required pipelining (not yet the optimal implementation)
3. Expression balancing for latency reduction; restructuring to optimize fabric resources
4. Restructuring for optimized DSP48; structured for DSP inference or leveraging floating-point IPs
[Figure: each step is drawn as a graph in which inputs a, x, b, c feed one multiplier and two adders to produce y]
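
The expression balancing named in step 3 can be illustrated in plain C++. This is a minimal sketch, not tool output: the function names are hypothetical, and the depth figures in the comments assume one level per operator.

```cpp
#include <cassert>
#include <cstdint>

// Chained form, as written in the example: ((a*x) + b) + c.
// Each operation waits on the previous one, so the graph is 3 levels deep.
int32_t mac_chained(int32_t a, int32_t x, int32_t b, int32_t c) {
    int32_t p = a * x;
    int32_t s = p + b;
    return s + c;
}

// Balanced form: (a*x) + (b + c).
// b + c does not depend on the multiply, so both can be evaluated side by
// side and the graph is only 2 levels deep.
int32_t mac_balanced(int32_t a, int32_t x, int32_t b, int32_t c) {
    int32_t p = a * x;
    int32_t q = b + c;
    return p + q;
}

// Compares both forms over a small input range (no overflow occurs here).
void check_equivalence() {
    for (int32_t a = -3; a <= 3; ++a)
        for (int32_t x = -3; x <= 3; ++x)
            for (int32_t b = -3; b <= 3; ++b)
                for (int32_t c = -3; c <= 3; ++c)
                    assert(mac_chained(a, x, b, c) == mac_balanced(a, x, b, c));
}
```

For integers the two forms agree whenever no overflow occurs. For floating point they can round differently, so such a rewrite changes results slightly.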

Interface Synthesis
Completing the design: C function arguments become RTL interface ports
    void f(int in[20], int out[20]) {
      int a, b, c, x, y;  /* a, b, c are left unset on the slide */
      for (int i = 0; i < 20; i++) {
        x = in[i];
        y = a*x + b + c;
        out[i] = y;
      }
    }
[Figure: an "in" BRAM indexed by i feeds the a*x + b + c datapath; the result y is written to an "out" BRAM indexed by i; a control-logic FSM sequences the accesses]
The state machine automatically adapts to the design interface
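
A compilable variant of the function on this slide is sketched below. Passing a, b, and c as scalar arguments is an assumption made here so that the coefficients have defined values. The constant N and the helper run_testbench are hypothetical names added for illustration.

```cpp
#include <cassert>

const int N = 20;  // array length used on the slide

// Applies y = a*x + b + c to every element of in[] and stores it in out[].
// The two arrays are the arguments that the slide maps to BRAM ports.
void f(const int in[N], int out[N], int a, int b, int c) {
    assert(in != nullptr && out != nullptr);
    for (int i = 0; i < N; i++) {
        int x = in[i];
        out[i] = a * x + b + c;
    }
}

// Software testbench: drives f() with in[i] = i and a=2, b=3, c=4,
// then checks every output against 2*i + 7.
bool run_testbench() {
    int in[N];
    int out[N];
    for (int i = 0; i < N; i++) in[i] = i;
    f(in, out, 2, 3, 4);
    for (int i = 0; i < N; i++) {
        if (out[i] != 2 * i + 7) return false;
    }
    return true;
}
```

Because the function is ordinary C++, it can be checked in software before any synthesis run.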

Vivado HLS Examples

MGS QRD + Weight Back Substitution → STAP
This is the hardest floating-point problem for hardware

MGS QRD+WBS RESULTS

HLS Simple CA-CFAR: 32 Cells, 14 Left, 14 Right, 2 Guard
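
The slide gives only the window parameters, so the following is a generic cell-averaging CFAR sketch rather than the presenters' design. It assumes the 2 guard cells are split one per side of the cell under test. The threshold factor alpha and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t REF_PER_SIDE = 14;   // from the slide: 14 left, 14 right
const std::size_t GUARD_PER_SIDE = 1;  // assumption: 2 guard cells, one per side

// Returns one flag per input cell: true when the cell under test exceeds
// alpha times the mean of its 28 reference cells. Cells too close to either
// edge to have a full window are left false.
std::vector<bool> ca_cfar(const std::vector<float>& power, float alpha) {
    assert(alpha > 0.0f);
    const std::size_t half = REF_PER_SIDE + GUARD_PER_SIDE;
    std::vector<bool> detect(power.size(), false);
    for (std::size_t cut = half; cut + half < power.size(); ++cut) {
        float sum = 0.0f;
        // Skip the guard cells next to the cell under test, then sum the
        // reference cells on both sides.
        for (std::size_t k = GUARD_PER_SIDE + 1; k <= half; ++k) {
            sum += power[cut - k] + power[cut + k];
        }
        const float mean = sum / static_cast<float>(2 * REF_PER_SIDE);
        detect[cut] = power[cut] > alpha * mean;
    }
    return detect;
}
```

The inner loop has a fixed trip count and independent reads, which is the kind of structure that the pipelining and memory partitioning directives later in the deck target.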

CFAR RESULTS

SDAccel: Software Centric View of FPGA Programming

OpenCL: Technical Goals
- Royalty-free standard for parallel programming
  - Shared IP zones
- Target everything in a heterogeneous platform
  - CPU + GPU + ASIC accelerators + DSP + FPGA
- Cross-vendor functional portability
  - No proprietary APIs + languages to learn
- Enable architecture-specific optimization
  - Refactor code to optimize
  - Vendor extensions

Introducing SDAccel

Breakthrough Solution for Application Developers
Libraries                    Availability
OpenCL built-ins             Included
Video, DSP, Linear Algebra   Included
OpenCV, BLAS                 Provided by Auviz Systems
Readily available boards

Virtex-7 OpenCL Platform
- Virtex-7 690T
- Half-length, low profile
- 2x 8GB DDR3 1600Mb/s, ECC
- Gen 3 x8 PCIe
- Up to 287 GFlops/W
- 2x SFP+ modules
- http://www.alpha-data.com/pdfs/adm-pcie-7v3.pdf
OpenCL COTS platforms from development to production
Copyright 2014 Xilinx

Kintex-7 OpenCL Platform
- M-505-K325T
- Kintex-7 325T
- 8GB DDR3
- Gen 2 x8 PCIe
- ISO 13485
- http://picocomputing.com/products/embeddedmodules/m-505-k325t-embedded-2
Scalable customizable acceleration

Virtex-7 OpenCL Platform
- Wolverine WX690/WX2000
- WX690: V7 690T
- WX2000: V7 2000T
- 16/32/64 GB DDR3
- Gen 3 x16 PCIe
- http://www.conveycomputer.com/products/wolverine/
Highest performance acceleration at the lowest power envelope

Accelerated OpenCL Programming and Deployment
CPU/GPU-like programming environment
- x86 development and debug
- Cycle-accurate simulation
- Platform acceleration

SDAccel Kernel Compiler for FPGAs
Inputs: C, C++, OpenCL
Leveraging Vivado HLS for performance and resource efficiency
- Efficiently optimizes a variety of input types
- Optimized architectural parallelizing and pipelining
- Broadest range of compiler optimization for memory, dataflow, and loop pipelining

Pipelining
OpenCL C kernel:
    kernel void foo() {
      __attribute__((xcl_pipeline_loop))
      for (int i=0; i<3; i++) {
        int idx = get_global_id(0)*3 + i;
        op_read(idx);
        op_compute(idx);
        op_write(idx);
      }
    }
C/C++ kernel:
    void foo() {
      for (int i=0; i<3; i++) {
    #pragma HLS pipeline
        int idx = get_global_id(0)*3 + i;
        op_read(idx);
        op_compute(idx);
        op_write(idx);
      }
    }
[Figure: timing diagrams of the loop iterations; the annotated execution times of the loop are 9 cycles and 5 cycles]
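
The 9-cycle and 5-cycle figures on this slide are consistent with the usual pipeline arithmetic. The sketch below assumes that each of the three operations takes one cycle and that the pipelined loop starts a new iteration every cycle. The slide does not state either assumption, and the function names are hypothetical.

```cpp
#include <cassert>

// Cycles for a loop whose iterations run strictly one after another.
int sequential_cycles(int iterations, int stages) {
    assert(iterations >= 0 && stages > 0);
    return iterations * stages;
}

// Cycles for a pipelined loop: the first iteration needs `stages` cycles,
// and each further iteration finishes `ii` cycles after the previous one
// (ii is the initiation interval).
int pipelined_cycles(int iterations, int stages, int ii) {
    assert(iterations >= 0 && stages > 0 && ii > 0);
    if (iterations == 0) return 0;
    return stages + (iterations - 1) * ii;
}
```

With 3 iterations of 3 stages, the sequential form gives 3 * 3 = 9 cycles and the pipelined form with an initiation interval of 1 gives 3 + 2 = 5 cycles.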

Loop Unrolling
With an unroll factor of 2:
    /* vector multiply */
    kernel void mult(local int* a, local int* b, local int* c) {
      int tid = get_global_id(0);
      __attribute__((unroll_loop_hint(2)))
      for (int i=0; i<4; i++) {
        int idx = tid*4 + i;
        a[idx] = b[idx] * c[idx];
      }
    }
Without an explicit factor:
    /* vector multiply */
    kernel void mult(local int* a, local int* b, local int* c) {
      int tid = get_global_id(0);
      __attribute__((unroll_loop_hint))
      for (int i=0; i<4; i++) {
        int idx = tid*4 + i;
        a[idx] = b[idx] * c[idx];
      }
    }
[Figure: iteration schedules (iter0 to iter3) for different amounts of unrolling]
- Automatic with heuristics, or user directed through an OpenCL attribute
- Exposes more concurrency to the compiler's scheduler
- Similar to SIMD vectorization
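
What an unroll factor of 2 does to the loop body can be written out by hand in C++. This sketch is only meant to show the transformation and is not generated by any tool. The function names and the constant N are hypothetical.

```cpp
#include <cassert>

const int N = 4;  // trip count used on the slide

// Rolled form: one multiply per loop iteration.
void mult_rolled(int a[N], const int b[N], const int c[N]) {
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
    }
}

// Unrolled by a factor of 2: each iteration carries two independent
// multiplies, which a scheduler is free to place in the same cycle.
void mult_unrolled2(int a[N], const int b[N], const int c[N]) {
    static_assert(N % 2 == 0, "this sketch assumes an even trip count");
    for (int i = 0; i < N; i += 2) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
    }
}

// Checks that both forms produce identical results on sample data.
bool unrolled_matches_rolled() {
    const int b[N] = {1, 2, 3, 4};
    const int c[N] = {5, 6, 7, 8};
    int r[N];
    int u[N];
    mult_rolled(r, b, c);
    mult_unrolled2(u, b, c);
    for (int i = 0; i < N; i++) {
        assert(r[i] == b[i] * c[i]);
        if (r[i] != u[i]) return false;
    }
    return true;
}
```

The unrolled loop runs 2 iterations instead of 4. Whether the two multiplies actually execute together depends on the resources the compiler allocates.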

Memory Specialization
OpenCL C kernel:
    local int buffer[16];
C/C++ kernel:
    int buffer[16];
[Figure: buffer elements 0 to 15 stored in a single memory]
- Buffer implemented in 1 physical memory
- Buffer can sustain 2 concurrent transactions
- Reading all values of buffer can take 8 clock cycles

Memory Specialization
OpenCL C kernel:
    local int buffer[16] __attribute__((xcl_array_partition(block,4,1)));
C/C++ kernel:
    int buffer[16];
    #pragma HLS ARRAY_PARTITION variable=buffer block factor=4 dim=1
- Buffer implemented in 4 physical memories
- Buffer can sustain 8 concurrent transactions
- Compiler handles all address translations to physical memories
- Reading all values of buffer can take 2 clock cycles
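
The address translation mentioned on this slide can be sketched for block partitioning as follows. The mapping shown is the conventional one for equal-sized contiguous blocks and is an assumption about what the compiler generates. BankAddr and block_partition are hypothetical names.

```cpp
#include <cassert>

// Location of one array element after partitioning.
struct BankAddr {
    int bank;    // which physical memory holds the element
    int offset;  // position of the element inside that memory
};

// Block partitioning keeps consecutive elements together: a `size`-element
// array split by `factor` yields blocks of size/factor elements each.
BankAddr block_partition(int idx, int size, int factor) {
    assert(factor > 0 && size % factor == 0);
    assert(idx >= 0 && idx < size);
    const int block = size / factor;
    return BankAddr{idx / block, idx % block};
}
```

For the 16-element buffer with factor 4, elements 0 to 3 land in memory 0, elements 4 to 7 in memory 1, and so on.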

Memory Specialization
OpenCL C kernel:
    local int buffer[16] __attribute__((xcl_array_partition(complete,1)));
C/C++ kernel:
    int buffer[16];
    #pragma HLS ARRAY_PARTITION variable=buffer complete dim=1
- Buffer implemented in 16 registers
- Buffer can sustain 16 concurrent transactions
- Compiler handles all address translations to physical memories
- Reading all values of buffer can take 1 clock cycle
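
The cycle counts on the three Memory Specialization slides follow from dividing the buffer size by the number of accesses available per cycle. The sketch below assumes 2 ports per physical memory and 1 access per register, which matches the transaction counts the slides quote. The function name is hypothetical.

```cpp
#include <cassert>

// Minimum cycles needed to read every element once, given how many storage
// units hold the buffer and how many accesses each unit accepts per cycle.
int cycles_to_read_all(int size, int units, int ports_per_unit) {
    assert(size >= 0 && units > 0 && ports_per_unit > 0);
    const int per_cycle = units * ports_per_unit;
    return (size + per_cycle - 1) / per_cycle;  // ceiling division
}
```

One memory gives 16 / 2 = 8 cycles, four memories give 16 / 8 = 2 cycles, and 16 registers give 16 / 16 = 1 cycle.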

SDAccel Platform Model
[Figure: CPU host connected over PCIe to UltraScale FPGAs with DDR memory; label: Scalable Host and Base During Updates]
Software runtime experience
- On-demand loadable acceleration units
- Always-on interfaces (Memory, Ethernet, PCIe, Video)
- Minimizes compilation time

SDAccel Applications

Virtex-7 Monte-Carlo Options Pricing
[Figure: block diagram with PCIe3 x8, DMA, interconnect, two DDR memories, and an MC Black Scholes OpenCL compute unit with local memories]
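
The slide names the application but does not show the kernel. The following is therefore a generic textbook sketch of Monte Carlo pricing for a European call under Black-Scholes dynamics, not the code that ran on the board. All names and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// Monte Carlo estimate of a European call price. Each path draws one
// standard normal sample, computes the terminal price under geometric
// Brownian motion, and contributes its discounted payoff to the average.
double mc_call_price(double s0, double strike, double rate, double sigma,
                     double maturity, int paths, unsigned seed) {
    assert(paths > 0 && sigma >= 0.0 && maturity >= 0.0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    const double drift = (rate - 0.5 * sigma * sigma) * maturity;
    const double vol = sigma * std::sqrt(maturity);
    double payoff_sum = 0.0;
    for (int i = 0; i < paths; ++i) {
        const double st = s0 * std::exp(drift + vol * gauss(rng));
        payoff_sum += std::max(st - strike, 0.0);
    }
    return std::exp(-rate * maturity) * payoff_sum / paths;
}
```

The paths are independent of one another, so the loop can be split across many processing elements, each accumulating a partial sum.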

SDAccel in Action

SDAccel Performance/Watt Advantage
Application            Metric        SDAccel with AuvizCV library   Nvidia K20 with CUDA   SDAccel Advantage
HD Sobel Filter        Frames/watt   80                             11                     7x
HD Image Downscaling   Frames/watt   36                             5                      7x
*Data from Auviz Systems

Start Designing with SDAccel Today

Thank You