Unlocking FPGAs Using High- Level Synthesis Compiler Technologies

Size: px

Start display at page:

Download "Unlocking FPGAs Using High- Level Synthesis Compiler Technologies"

Cameron Smith
6 years ago
Views:

1 Unlocking FPGAs Using High- Leel Synthesis Compiler Technologies Fernando Mar*nez Vallina, Henry Styles Xilinx Feb 22, 2015

2 Why are FPGAs Good Scalable, highly parallel and customizable compute 10s to 1000s of processing elements Processing elements can be fined tuned to specific work loads Bit leel integer and fixed point operations Custom Memory Architectures Memories sized to the application Memory elements placed next to datapath Custom Data Moement and I/O More silicon area dedicated to wires and data transport than any other deice Direct low latency connections between I/O pins and compute Page 2

3 FPGA Applications for Data Center Application Network Function Virtualization Encryption Dictionary compression (ie Deflate) Search categorization Image classification by machine learning Image transcoding Genome sequencing Key Value Store Efficient bit leel operations FPGA Benefit Customized memory hierarchy and efficient string matching Control flow dominated, thousands of actie state machines Ability to customize data moement to specific working sets Fine grained parallelism with customized memory hierarchy Thousands of parallel bit leel operations Low latency I/O FPGAs Offer Significant Speedup Potential s CPU and GPU Page 3

4 Key Value Store Acceleration with FPGAs Line-rate maximum achieed by Xilinx FPGA accelerator Up to 36x in performance/watt demonstrated Plus x lower latency Page 4

5 Productiity Challenges for Software (Daid Thomas, Imperial College, UK) Relatie Performance Parallelisation Clock Rate FPGA GPU CPU Initial Design Hour Day Week Month Year Design-time FPGAs proide large speed-up and power saings at a price! Days or weeks to get an initial ersion working Multiple optimisation and erification cycles to get high performance Page 5

6 Programming Flow Eolution User interface HW World Hardware Software SW World Traditional flow Viado HLS flow SDAccel flow C/C++ CPU C/C++ CPU C/C++ CPU RTL RTL C/C++ RTL Accel C/C++ Accel C/C++ Accel Multiple languages and tools Single language and tool Page 6

7 Viado HLS Raising the Abstraction for IP Creation Page 7

Implementation System IP Integration Comprehensie Integration with

8 Viado High-Leel Synthesis C, C++ or SystemC Algorithmic Specification Micro Architecture Exploration VHDL or Verilog RTL Implementation System IP Integration Comprehensie Integration with the Xilinx Design Enironment Accelerates Algorithmic C to RTL IP integration Page 8

9 Viado HLS System IP Integration Flow C- based IP CreaAon System IntegraAon C, C++, SystemC Libraries Viado IP Integrator Arbitrary Precision Video Math Linear algebra IP: FFT and FIR IP Catalog VHDL or Verilog Viado RTL System Generator for DSP Viado HLS Integrates into System Flows Page 9

Design Decisions Decisions made by designer Functionality As implicit state machine Performance Latency, throughput Interfaces Storage architecture Memories, registers banks etc Partitioning

10 Design Decisions Decisions made by designer Functionality As implicit state machine Performance Latency, throughput Interfaces Storage architecture Memories, registers banks etc Partitioning into modules Decisions made by the tool State machine Structure, encoding Pipelining Pipeline registers allocation Scheduling Memory I/O Interface I/O Functional operations Design Exploration Page 10

Datapath Synthesis Example: y = a*x + b + c; 1 HLS begins by extrac*ng a data flow graph (DFG), a func*onal representa*on a x b c * + + 3 Expression balancing for latency reduc*on Restructuring to

11 Datapath Synthesis Example: y = a*x + b + c; 1 HLS begins by extrac*ng a data flow graph (DFG), a func*onal representa*on a x b c * Expression balancing for latency reduc*on Restructuring to op*mize fabric resources a x b c * + + y y 2 Accounts for target Fmax to determine minimum required pipelining (not yet the op*mal implementa*on) a x b c * Restructuring for op*mized DSP48 a x b c * + + Structured for DSP inference or leeraging floa*ng point IPs y y Page 11

12 Interface Synthesis Completing the design C function arguments become RTL interface ports f(int in[20], int out[20]) { int a,b,c,x,y; for(int i = 0; i < 20; i++) { x = in[i]; y = a*x + b + c; out[i] = y;} } in BRA M i a x b c * + + Control logic FSM y i out BRA M The State Machine Automa>cally Adapts to the Design Interface Page 12

13 Viado HLS Examples Page 13

14 MGS QRD + Weight back Substitution à STAP This is the Hardest Floating Point Problem for Hardware

15 MGS QRD+WBS RESULTS

16 HLS Simple CA- CFAR 32 Cells, 14 Left, 14 Right, 2 Guard

17 CFAR RESULTS

18 SDAccel Software Centric View of FPGA Programming Page 18

OpenCL : Technical Goals Royalty Free standard for Parallel Programming Shared IP zones Target Eerything in a Heterogeneous Platform CPU + GPU + ASIC accelerators + DSP +

19 OpenCL : Technical Goals Royalty Free standard for Parallel Programming Shared IP zones Target Eerything in a Heterogeneous Platform CPU + GPU + ASIC accelerators + DSP + FPGA Cross-Vendor Functional Portability No proprietary APIs + languages to learn Enable Architecture Specific Optimization Refactor code to optimize Vendor extensions Page 19

20 Introducing SDAccel Page 20

21 Breakthrough Solution for Application Deelopers Libraries OpenCL built-ins Video, DSP, Linear Algebra OpenCV BLAS Aailability Included Included Proided by Auiz Systems Readily Aailable Boards Page 21

22 Virtex-7 OpenCL Platform Virtex-7 690T Half-length, low profile 2x 8GB DDR3 1600Mb/s, ECC Gen 3 x8 PCIe Up to 287 GFlops/W 2x SFP+ modules OpenCL COTS Platforms from Deelopment to Production Page 22 Copyright 2014 Xilinx Page 22

Kintex-7 OpenCL Platform M-505-K325T Kintex-7 325T 8GB

24 Virtex7 OpenCL Platform Wolerine WX690/WX2000 WX690: V7 690T WX2000: V7 2000T 16/32/64 GB DDR3 Gen3 X16 PCI-e Highest performance acceleration at the lowest power enelope Page 24 Copyright 2014 Xilinx Page 24

25 Accelerated OpenCL Programming and Deployment CPU/GPU like programming enironment ü X86 deelopment and debug ü Cycle accurate simulation ü Platform acceleration Page 25 Xilinx Internal

26 SDAccel Kernel Compiler for FPGAs C, C++ OpenCL Leeraging Viado HLS for ü Efficiently Optimizes a Variety of Input Types ü Optimized Architectural Parallelizing and Pipelining ü Broadest Range of Compiler Optimization for Memory, Dataflow, & Loop Pipelining Performance and Resource Efficiency Page 26

$Pipelining OpenCL C Kernel C/C++ Kernel kernel oid foo() { attribute ((xcl_pipeline_loop)) for (int$ $oid foo() { for (int i=0; i<3; i++) { #pragma HLS pipeline int idx = get_global_id(0)*3 + i;$

27 Pipelining OpenCL C Kernel C/C++ Kernel kernel oid foo() { attribute ((xcl_pipeline_loop)) for (int i=0; i<3; i++) { int idx = get_global_id(0)*3 + i; op_read(idx); op_compute(idx); op_write(idx); } } oid foo() { for (int i=0; i<3; i++) { #pragma HLS pipeline int idx = get_global_id(0)*3 + i; op_read(idx); op_compute(idx); op_write(idx); } } execution time of loop 9 cycles loop iteration 5 cycles Page 27

$Loop Unrolling /* ector multiply */ kernel oid mult(local int* a, local int* b, local int* c) { int tid = get_global_id(0); } attribute ((unroll_loop_hint(2)))$ $tid = get_global_id(0); } attribute ((unroll_loop_hint)) for (int i=0; i<4; i++) { int idx = tid*4 + i; a[idx] = b[idx] * c[idx]; } iter0 iter1 iter2 iter3$

28 Loop Unrolling /* ector multiply */ kernel oid mult(local int* a, local int* b, local int* c) { int tid = get_global_id(0); } attribute ((unroll_loop_hint(2))) for (int i=0; i<4; i++) { int idx = tid*4 + i; a[idx] = b[idx] * c[idx]; } /* ector multiply */ kernel oid mult(local int* a, local int* b, local int* c) { int tid = get_global_id(0); } attribute ((unroll_loop_hint)) for (int i=0; i<4; i++) { int idx = tid*4 + i; a[idx] = b[idx] * c[idx]; } iter0 iter1 iter2 iter3 iter0 iter1 iter0 Automatic with heuristics User directed OpenCL attribute Expose more concurrency to the compiler s scheduler Similar to SIMD ectorization Page 28

29 Memory Specialization OpenCL C Kernel local int buffer[16] ; C/C++ Kernel int buffer[16] ; buffer Buffer implemented in 1 physical memory Buffer can sustain 2 concurrent transactions Reading all alues of buffer can take 8 clock cycles Page 29

30 Memory Specialization OpenCL C Kernel local int buffer[16] attribute ((xcl_array_partition(block,4,1))); C/C++ Kernel int buffer[16]; #pragma HLS ARRAY_PARTITION ariable=buffer block factor=4 dim=1 Buffer implemented in 4 physical memories Buffer can sustain 8 concurrent transactions Compiler handles all address translations to physical memories Reading all alues of buffer can take 2 clock cycles Page 30

31 Memory Specialization OpenCL C Kernel local int buffer[16] attribute ((xcl_array_partition(complete,1))); C/C++ Kernel int buffer[16]; #pragma HLS ARRAY_PARTITION ariable=buffer complete dim=1 Buffer implemented in 16 registers Buffer can sustain 16 concurrent transactions Compiler handles all address translations to physical memories Reading all alues of buffer can take 1 clock cycles Page 31

SDAccel Platform Model CPU Host UltraScale

acceleration units ü Always on interfaces

32 SDAccel Platform Model CPU Host UltraScale FPGAs PCIe Scalable Host and Base During Updates DDR Memory Software Runtime Experience ü On-demand loadable acceleration units ü Always on interfaces (Memory, Ethernet PCIe, Video) ü Minimizes compilation time Page 32

33 SDAccel Applications Page 33

34 Virtex-7 Monte-Carlo Options Pricing PCIe3 x8 DMA Interconnect DDR DDR MC Black Scholes OpenCL Compute Unit Local memory Local memory

36 SDAccel in Action

Frames/watt 80 11 7x HD Image Downscaling Frames/watt 36 5 7x

37 SDAccel Performance/Watt Adantage Metric SDAccel with AuizCV library Nidia K20 with CUDA SDAccel Adantage HD Sobel Filter Frames/watt x HD Image Downscaling Frames/watt x Application *Data from Auiz Systems Page 37 Copyright 2014 Xilinx Page 37

38 Start Designing with SDAccel Today Page 38

39 Thank You Page 39

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently