Xilinx DNN Processor An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs



1 Xilinx DNN Processor: An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs
Rahul Nimaiyar, Brian Sun, Victor Wu, Thomas Branca, Yi Wang, Justin Oo, Elliott Delaye, Aaron Ng, Paolo D'Alberto, Sean Settle, Bill Teng, Manasa Bollavaram, Chaithanya Dudha, Hanh Hoang, Swati Gupta, Alina Huang, Ephrem Wu, Samuel Bayliss, Phil James-Roxby, Ralph Wittig, Ashish Sirasao, Ravi Sunkavalli
21st August 2018

2 Xilinx Inference Engine: DNN Processor + Compiler
Flow: Trained Model -> Compiler + Runtime -> Xilinx DNN Processor
[Block diagram labels: Execution Controller, Spill / Restore DMA Controller, Image Queue, Instruction Buffer, DMA Controller, Bias, ReLU, Pooling, Pooling / EWA, Cross Bar]
- Low Latency, High Throughput
- Batch
- % Efficiency
- No FPGA Expertise Required

3 Virtex UltraScale+ VU9P FPGA
16nm TSMC FF+ FPGA
- 2.5M System Logic Cells
- 6840 DSP Blocks (18x27 MACs)
- 382 Mbit On-Die SRAM
- 4 DDR4-2400 x72 Channels
Virtex UltraScale+ FPGA VCU1525 Developer Board and Reference Design
- VU9P Virtex UltraScale+ FPGA
- 21 TOPS (INT8)
- 382 Mbit on-chip SRAM
- 64 GByte on-board DRAM
- 75W

4 Motivation for Deep Learning on FPGAs
- Data Parallel
- 2D Array of MACs
- Flexible on-chip memory access
- High Bandwidth, Multiple Access Ports
- Data Reuse
- Near Memory Compute
- Programmable routing for data & filter reuse
- Compression & Sparsity
- Flexible Data Types: FP32/16, INT16/8/4/2, Binary/Ternary
- Sparsity friendly compute

5 Virtex UltraScale+ Full Spectrum of Memory
Memory tier                               Capacity   Bandwidth
Distributed RAM (10s of megabits)         36 Mb      675 Tb/s
Block RAM (10s of megabits)               77 Mb      216 Tb/s
UltraRAM (100s of megabits)               270 Mb     80 Tb/s
High Bandwidth Memory (multi-gigabyte)    8 GB       460 GB/s
External Memory, DDR4-2666 (multi-GB)     64 GB      85 GB/s
The on-chip figures are for the VU9P; the High Bandwidth Memory figures are for the VU37P.
[Diagram label: Gap in Memory Hierarchy]
- 5 Tiers of Memory -> Build custom memory hierarchy
- 500 Mb of On-chip Memory and Tb/s of On-chip Memory Bandwidth

6 Xilinx DNN Processor (xDNN)
- Configurable Overlay Processor
- DNN Specific Instruction Set: Convolution, Max Pool etc.
- Any Network, Any Image Size
- High Frequency & High Compute Efficiency
- Compile and run new networks
[Block diagram labels: Execution Controller, Spill / Restore DMA Controller, Image Queue, Instruction Buffer, DMA Controller, Systolic Array, Bias, ReLU, Pooling, Pooling / EWA, Cross Bar]

7 xDNN Channel Parallel Systolic Array
- 2D Channel Parallel Datapath & Distributed Buffers
- Micro-Architecture Optimized for underlying UltraScale+ FPGA Fabric

8 xDNN Processor Tensor Memory
- Write ports (W): From Parallel Processing, From Systolic Array, From DDR DMA
- Read ports (R): To Systolic Array, To DDR DMA, To Parallel Processing
- Channel Parallel Concurrent Access

9 Efficient Memory Utilization
[Figure: Tensor Memory contents at successive time steps Tn through Tn+5. Labels: Previous Layer Output, 3x3 Conv Reduce, 5x5 Conv Reduce, 3x3 Conv, 5x5 Conv, Concatenated Output, Input for Next Layer.]

10 xDNN Compiler + Runtime
- Deep Learning Frameworks: MxNet
- Frontend: Framework Tensor Graph to Xilinx Tensor Graph
- Tensor Graph Optimization
- Compiler
- Quantizer: Model, Calibration Set
- Runtime: Image, CPU Layers, FPGA Layers
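The Quantizer stage above takes a trained model and a calibration set. The slide does not say which scheme xDNN uses, so what follows is only a minimal sketch of one common approach, symmetric per-tensor INT8 quantization. The function names `calibrate_scale`, `quantize` and `dequantize` and the sample values are hypothetical and are not part of any Xilinx tool.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Choose a scale so the largest calibration magnitude maps to the top code."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer code and clamp to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


# Hypothetical activations observed while running the calibration set.
calibration_activations = [-1.27, 0.0, 0.5, 1.0]
scale = calibrate_scale(calibration_activations)

# 3.0 lies outside the calibrated range, so it saturates at the top code.
codes = quantize([0.5, -1.27, 3.0], scale)
print(codes)  # [50, -127, 127]
```

Values outside the calibrated range saturate, which is why the choice of calibration images matters in a flow like this.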

11 Graph Partitioning
- One time loading of sub-graph instructions
- Data Flow Execution
- Each subgraph runs on FPGA or CPU
- 1 Core -> Multi-Core -> Multi-Chip
[Diagram labels: Pre-Processing, Subgraph 1, Parallel Subgraphs, Post-Processing, FPGA, CPU]
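The partitioning idea on this slide can be sketched for the simplest case, a linear chain of layers. This is an illustration under stated assumptions, not the xDNN compiler's algorithm: the set `FPGA_OPS`, the function `partition` and the example network are hypothetical, and a real graph would also contain branches.

```python
from itertools import groupby

# Assumption: operators treated as FPGA-capable for this example only.
FPGA_OPS = {"conv", "max_pool", "avg_pool", "relu", "eltwise_add"}


def device_for(op):
    """Return the device an operator is assigned to."""
    return "FPGA" if op in FPGA_OPS else "CPU"


def partition(layers):
    """Split an ordered layer list into maximal runs that stay on one device."""
    return [(dev, list(run)) for dev, run in groupby(layers, key=device_for)]


network = ["resize", "conv", "relu", "max_pool", "conv", "relu", "softmax"]
subgraphs = partition(network)
for dev, ops in subgraphs:
    print(dev, ops)
# CPU ['resize']
# FPGA ['conv', 'relu', 'max_pool', 'conv', 'relu']
# CPU ['softmax']
```

Grouping consecutive FPGA layers into one subgraph is what would allow its instructions to be loaded once and then reused for every image.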

12 Fusing of Operations
- Fuse operations as pipelined stages
- Access activations only once
- Pipelined Operators (in-line, no RAM access)
[Diagram labels: Conv, Batch Norm, ReLU, Bias, Conv, Max Pool]
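A fusion pass of this kind can be sketched as a single scan over a linear operator list, attaching each in-line operator to the convolution that precedes it. The list `FUSABLE_AFTER_CONV` and the function `fuse` are hypothetical names chosen for this sketch; the slide does not describe how the actual compiler decides what to fuse.

```python
# Assumption: operators that can run in-line on a convolution's output stream.
FUSABLE_AFTER_CONV = ("bias", "batch_norm", "relu", "max_pool")


def fuse(ops):
    """Group each conv with the fusable operators that directly follow it."""
    fused = []
    for op in ops:
        if fused and fused[-1][0] == "conv" and op in FUSABLE_AFTER_CONV:
            fused[-1].append(op)   # extend the current conv pipeline
        else:
            fused.append([op])     # start a new, separate instruction
    return fused


ops = ["conv", "batch_norm", "relu", "conv", "bias", "max_pool", "softmax"]
groups = fuse(ops)
print(groups)
# [['conv', 'batch_norm', 'relu'], ['conv', 'bias', 'max_pool'], ['softmax']]
```

Each inner list stands for one pipelined instruction, so the intermediate activations inside a group never need to be written back to memory.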

13 Instruction Level Parallelism
[Figure: operations scheduled along a Time axis with Parallel Execution. Labels: Previous Layer Output, 3x3 Conv Reduce, 5x5 Conv Reduce, 3x3 Max Pool, 3x3 Conv, 5x5 Conv, Concatenated Output.]

14 Automatic Intra Layer Tiling
- Tile when Feature Map size exceeds on-chip memory
- Work on full feature map depth
- Any Feature Map Size
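The tiling rule above can be illustrated with a short sketch that splits a feature map into bands of rows, keeping the full channel depth in every tile. It rests on several assumptions that are not in the slides: tiling is done along the height only, the overlap rows a convolution kernel would need at tile borders are ignored, and the name `row_tiles` and the 1 MiB budget are made up for the example.

```python
def row_tiles(height, width, channels, budget_bytes, bytes_per_value=1):
    """Return (start_row, end_row) bands whose full-depth data fits the budget."""
    row_bytes = width * channels * bytes_per_value  # one row at full depth
    rows_per_tile = budget_bytes // row_bytes
    if rows_per_tile < 1:
        raise ValueError("budget too small for a single full-depth row")
    rows_per_tile = min(rows_per_tile, height)      # no tiling needed if it fits
    return [(start, min(start + rows_per_tile, height))
            for start in range(0, height, rows_per_tile)]


# A 224x224 feature map with 64 INT8 channels against a hypothetical 1 MiB budget.
tiles = row_tiles(224, 224, 64, budget_bytes=1024 * 1024)
print(tiles)  # [(0, 73), (73, 146), (146, 219), (219, 224)]
```

A feature map that already fits the budget comes back as a single tile, so the same code path covers any feature map size.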

15 xDNN Key Functions
Feature                Details
Convolution            NxM, Stride 1,2,4,8; N,M = 1-15
Max Pool               NxM, Stride 1,2,4,8; N,M = 1-15
Avg Pool               NxM, Stride 1,2,4,8; N,M = 1-15
Dilated Convolution    Factor 1,2,4
De-Convolution         NxM, Stride 1,2,4,8; N,M = 1-15
Up-sampling            Factor 2,4,8,16
Activation             ReLU, PReLU
Elementwise Addition   Any Square, Rectangular
Precision              Int8, Int16
Networks               Classification: e.g. ResNet; Object Detection: e.g. YOLO v2; Segmentation: e.g. MaskRCNN

16 xDNN Implementation on VU9P
Resource   Count   VU9P Utilization
LUT        612k    52%
DSP        5493    80%
BRAM       228     38%
URAM       864     92%
- Systolic Array at 800 MHz, 90% of device Fmax
- 3 Large 96x16 PEs, 1 in each SLR
- Systolic Array at 800 MHz; Rest of logic at 400 MHz

17 xDNN Compute Efficiency
- 60-80% Efficiency Across Networks
- Full Benchmark Data at Xilinx Developer Forum, Oct 1, 2018
[Chart labels: xDNN 74%, Other Architectures]

18 Custom Deep Learning Pipelines
- Integrate Custom Applications with xDNN
- Lower end-to-end latency
[Diagram labels: Video Decode, Processing, xDNN, Video Processing, Encode, Video + ML, Genomics + ML, Risk Modelling + ML, Database + ML, Network IPS + ML, Storage + ML]

19 Xilinx DNN Processor - Summary
xDNN Performance
- 60-80% EFFICIENCY
- 90% FREQUENCY
No FPGA Expertise Needed
- Compile & Run Trained Networks
- Deploy on AWS F1, other Cloud Platforms
- Deploy in PCIe Cards

20 Additional Information
Connect. Learn. Share.
XDF connects software developers and system designers to the deep expertise of Xilinx engineers, partners, and industry leaders.
- Silicon Valley: October 1-2
- Beijing: October 6th
- Frankfurt: December 10
