Xilinx ML Suite Overview

1 Xilinx ML Suite Overview Jim Heaton Sr. FAE

2 Deep Learning is Re-defining Many Applications
Deep learning explores the study of algorithms that can learn from and make predictions on data.
Application areas: Cloud Acceleration, Security, Ecommerce, Social, Financial, Surveillance, Industrial IoT, Medical, Bioinformatics, Autonomous Vehicles

3 Accelerating AI Inference into Your Cloud Applications
Featuring the most powerful FPGA in the cloud
Deep learning applications
Devices: Virtex UltraScale+ VU9P, Zynq UltraScale+ MPSoC
Deployment targets: Cloud, On Premises, Edge

4 Want to try out the Xilinx ML Suite?

5 Xilinx ML Suite - AWS Marketplace
Supported frameworks: Caffe, MxNet, Tensorflow, Keras, Python Support, Darknet
Jupyter Notebooks available:
  Image Classification with Caffe
  Using the xfdnn Compiler w/ a Caffe Model
  Using the xfdnn Quantizer w/ a Caffe Model
Pre-trained models:
  Caffe 8/16-bit: GoogLeNet v1, ResNet50, Flowers102, Places365
  Python 8/16-bit: Yolov2
  MxNet 8/16-bit: GoogLeNet v1
xfdnn tools: Compiler, Quantizer

6 Unified Simple User Experience from Cloud to Alveo
Stages: Develop, Publish, Deploy
Resources: User Guides, Tutorials, Examples
User choice: Launch Instance or Download
Provided: xfdnn Apps, xdnn Binaries

7 Deep Learning Models
Multi-Layer Perceptron: Classification, Universal Function Approximator, Autoencoder
Convolutional Neural Network: Feature Extraction, Object Detection, Image Segmentation
Recurrent Neural Network: Sequence and Temporal Data, Speech to Text, Language Translation
[Figure: classification, object detection, and segmentation examples on a dog image]

8 Overlay Architecture: Custom Processors Exploiting Xilinx FPGA Flexibility
Customized overlays with ISA architecture for optimized implementation
Easy plug and play with software stack
MLP Engine: Scalable sparse and dense implementation
xdnn: CNN engine for large 16 nm Xilinx devices
Deephi DPU: Flexible CNN engine with embedded focus
Deephi ESE: LSTM speech-to-text engine
Available on AWS
Random Forest: Configurable RF classification

9 Rapid Feature and Performance Improvement
xdnn-v1 (Q4CY17), 500 MHz
  URAM for feature maps without caching
  Array of accumulators with 16 bit (batch 1), 8 bit (batch 2)
  Instructions: Convolution, Relu, MaxPool, AveragePool, Elementwise
  Flexible kernel size (square) and strides
  Programmable scaling
xdnn-v2 (Q2CY18), 500 MHz
  All xdnn-v1 features
  DDR caching: larger images, CNN networks
  Instructions: Depth-wise Convolution, Deconvolution, Convolution Transpose, Upsampling
  Rectangular kernels
xdnn-v3 (Q4CY18), 700 MHz
  Feature compatible with xdnn-v2
  New systolic array implementation: 50% higher FMAX and 2.2x lower latency
  Batch of 1 for 8 bit implementation
  Non-blocking caching and pooling

10 Break Through on Peak Performance
GPU: Introduce new architectures and silicon
Xilinx: Adapt the breakthrough of emerging domain knowledge
[Chart: GPU deep learning peak power efficiency (GOPS/W) across Kepler, Maxwell, Pascal, Volta; annotations: mixed FP16 for TensorCore, native INT8 operation]
[Chart: FPGA deep learning peak power efficiency (GOPS/W) across UltraScale, UltraScale+, ACAP; annotations: INT8 optimization, adaptable precision, ACAP architectures]

11 Seamless Deployment with Open Source Software
xfdnn Middleware, Tools and Runtime
xdnn Processing Engine
[Diagram legend: From Xilinx, From Community]

12 xfdnn Flow
Frontend: CNTK, Caffe2, PyTorch, Tensorflow, MxNet, Caffe; ONNX
Framework Tensor Graph to Xilinx Tensor Graph
xfdnn Tensor Graph Optimization
xfdnn Compiler
xfdnn Compression (inputs: Model Weights, Calibration Set)
xfdnn Runtime (python API): Image in; CPU Layers, FPGA Layers
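The flow above ends with the runtime handling some layers on the CPU and others on the FPGA. The following is a minimal sketch of how such a split could look, not the xfdnn partitioner itself. It assumes a purely linear layer list, and the set `FPGA_OPS` is a hypothetical stand-in built from the xdnn instruction names listed earlier in this deck.

```python
from itertools import groupby

# Assumption: op names borrowed from the xdnn instruction list in this deck.
FPGA_OPS = {"Convolution", "Relu", "MaxPool", "AveragePool", "Elementwise"}

def partition(layers):
    """Group a linear list of (name, op_type) layers into contiguous
    segments, each tagged 'fpga' or 'cpu'."""
    def target(layer):
        return "fpga" if layer[1] in FPGA_OPS else "cpu"
    return [(device, [name for name, _ in group])
            for device, group in groupby(layers, key=target)]

# Hypothetical network: three FPGA-capable layers followed by two CPU layers.
layers = [
    ("conv1", "Convolution"),
    ("relu1", "Relu"),
    ("pool1", "MaxPool"),
    ("fc", "InnerProduct"),
    ("prob", "Softmax"),
]
segments = partition(layers)
print(segments)
```

Keeping segments contiguous means data crosses the CPU/FPGA boundary only where the supported-op status changes.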

13 xfdnn Inference Toolbox
Graph Compiler: Python tools to quickly compile networks from common frameworks (Caffe, MxNet, and Tensorflow)
Network Optimization: Automatic network optimizations for lower latency by fusing layers and buffering on-chip memory
xfdnn Quantizer: Quickly reduce precision of trained models for deployment; maintains 32-bit accuracy at 8 bit within 2%

14 xfdnn Graph Compiler
Pass in a network; microcode for xdnn is produced.
Sample microcode (instruction index, opcode, layer name, operands):
0 XNConv conv1 7 2 16 26 2 1 1 0x1c0000 224 3 0x0 112 64
2 XNMaxPool pool1 3 2 0 0x0 112 64 0x1c0000 56
3 XNConv res2a_branch1 1 ...
Instruction sequence on the slide (operands not preserved):
XNConv conv1; XNMaxPool pool1
XNConv res2a_branch1, res2a_branch2a, res2a_branch2b, res2a_branch2c; XNEltWise; XNUpload
XNConv res2b_branch2a, res2b_branch2b, res2b_branch2c; XNEltWise; XNUpload
XNConv res2c_branch2a, res2c_branch2b, res2c_branch2c; XNEltWise; XNUpload
XNConv res3a_branch1, res3a_branch2a, res3a_branch2b, res3a_branch2c; XNEltWise; XNUpload
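To inspect compiler output like the conv1 instruction above, a small reader can split each line into its parts. This is a sketch under assumptions: the deck does not document the operand layout, so operands are kept as raw integers, and the function assumes the "index, opcode, layer name, operands" shape seen on XNConv and XNMaxPool lines. `parse_microcode` is a hypothetical helper, not part of xfdnn.

```python
def parse_microcode(line):
    """Split one microcode line into index, opcode, layer name, and operands.

    Operand meanings are not documented in the deck, so they stay as a
    plain list of integers.
    """
    tokens = line.split()
    return {
        "index": int(tokens[0]),
        "opcode": tokens[1],
        "layer": tokens[2],
        # Base 0 lets int() accept both decimal and 0x-prefixed tokens.
        "operands": [int(tok, 0) for tok in tokens[3:]],
    }

inst = parse_microcode(
    "0 XNConv conv1 7 2 16 26 2 1 1 0x1c0000 224 3 0x0 112 64"
)
print(inst["opcode"], inst["layer"], len(inst["operands"]))
```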

15 xfdnn Network Optimization: Layer to Layer
[Diagram: an unoptimized model (separate Conv, Bias, Relu, and Pool layers between Previous and Next, DDR buffered) compared with xfdnn intelligently fused layers (fused Conv + Bias + Relu, Pool) and with streaming optimized for on-chip URAM buffering]

16 xfdnn Network Deployment
Fused Layer Optimizations
  Compiler can merge nodes: (Conv or EltWise) + Relu; Conv + Batch Norm
  Compiler can split nodes: Conv 1x1 stride 2 -> Maxpool + Conv 1x1 stride 1
On-Chip Buffering
  Reduces latency and increases throughput
  xfdnn analyzes network memory needs and optimizes the scheduler for fused and One Shot deployment
One Shot Network Deployment
  One Shot deploys the entire network to the FPGA
  Optimized for fast, low latency inference
  Entire network, schedule, and weights loaded only once to the FPGA
[Diagrams: layer stacks from Input to Output with HW in-line operations and Pool between Previous and Next layers]
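The Conv + Batch Norm merge mentioned above relies on a standard identity: batch normalization at inference time is a per-channel scale and shift, so it can be folded into the convolution's weights and bias. The sketch below shows that arithmetic for a 1x1 convolution at a single pixel. It is a generic illustration with made-up numbers, not the xfdnn compiler's code, and all function names are hypothetical.

```python
import math

def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batch-norm parameters into conv weights and bias.
    weights[c] is the flat list of weights for output channel c."""
    folded_w, folded_b = [], []
    for c, channel_w in enumerate(weights):
        scale = gamma[c] / math.sqrt(var[c] + eps)
        folded_w.append([w * scale for w in channel_w])
        folded_b.append((bias[c] - mean[c]) * scale + beta[c])
    return folded_w, folded_b

def conv1x1(x, weights, bias):
    """1x1 convolution at one pixel; x is the input channel vector."""
    return [sum(w * xi for w, xi in zip(channel_w, x)) + b
            for channel_w, b in zip(weights, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm applied per output channel."""
    return [(v - m) / math.sqrt(s2 + eps) * g + b
            for v, g, b, m, s2 in zip(y, gamma, beta, mean, var)]

# Made-up parameters: 3 input channels, 2 output channels.
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [2.0, 0.5]
x = [1.0, 2.0, 3.0]

reference = batch_norm(conv1x1(x, weights, bias), gamma, beta, mean, var)
fused_w, fused_b = fold_batch_norm(weights, bias, gamma, beta, mean, var)
fused = conv1x1(x, fused_w, fused_b)
max_diff = max(abs(a - b) for a, b in zip(reference, fused))
print(max_diff)  # tiny: only floating-point rounding separates the two paths
```

After folding, the batch-norm node disappears from the graph, leaving one fewer operation to schedule.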

17 xfdnn Quantizer: FP to Fixed-Point Quantization
Problem: Nearly all trained models are in 32-bit floating-point. Available Caffe and TensorFlow quantization tools take hours and produce inefficient models.
Introducing the xfdnn Quantizer: a customer-friendly toolkit that automatically analyzes floating-point ranges layer by layer and produces the fixed-point encoding that loses the least amount of information.
Quantizes GoogleNet in under a minute
Quantizes 8-bit fixed-point networks within 1-3% accuracy of 32-bit floating-point networks
Extensible toolkit to maximize performance by searching for minimal viable bitwidths and pruning sparse networks
[Chart: xfdnn Quantized Model Accuracy, 0% to 100%, for resnet-50, resnet-101, resnet-152, and Googlenet-v1; series: fp32 top-1, 8 bit top-1, fp32 top-5, 8 bit top-5 accuracy]

18 xfdnn Quantizer: Fast and Easy
1) Provide FP32 network and model, e.g., prototxt and caffemodel
2) Provide a small sample set, no labels required: 16 to 512 images
3) Specify desired precision: quantizes to <8 bits to match Xilinx's DSP
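The three steps above amount to calibrating a value range on a small sample set and then mapping floats to fixed-point integers. The following sketch shows one common way to do that, symmetric linear quantization with a scale taken from the largest absolute calibration value. It is an assumption-laden illustration, not the xfdnn Quantizer's algorithm, and the sample values are invented.

```python
def calibrate_scale(samples, bits=8):
    """Pick a scale so the largest calibration magnitude maps to the
    largest representable signed integer."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in samples)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, bits=8):
    """Round to the nearest integer step and clip to the signed range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

# Invented calibration activations and test values.
calibration = [-1.27, 0.5, 0.02, 1.0]
scale = calibrate_scale(calibration)      # about 0.01 per integer step
values = [0.123, -0.777]
q = quantize(values, scale)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(values, restored))
print(q, max_err)
```

For values inside the calibrated range, the rounding error is bounded by half a step, which is why a representative sample set matters more than labels.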

19 Xilinx ML Processing Engine xdnn: Supported Operations
Programmable feature set; tensor-level instructions; 700+ MHz DSP frequency (VU9P); custom network acceleration

Feature: Description
Convolution / Deconvolution / Convolution Transpose
  Kernel Sizes: W: 1-15; H: 1-15
  Strides: W: 1,2,4,8; H: 1,2,4,8
  Padding: Same, Valid
  Dilation: Factor: 1,2,4
  Activation: ReLU
  Bias: Value Per Channel
  Scaling: Scale & Shift Value Per Channel
Max Pooling
  Kernel Sizes: W: 1-15; H: 1-15
  Strides: W: 1,2,4,8; H: 1,2,4,8
  Padding: Same, Valid
Avg Pooling
  Kernel Sizes: W: 1-15; H: 1-15
  Strides: W: 1,2,4,8; H: 1,2,4,8
  Padding: Same, Valid
Element-wise Add: Width & Height must match; Depth can mismatch
Memory Support: On-Chip Buffering, DDR Caching
Expanded set of image sizes: Square, Rectangular
Upsampling: Strides Factor: 2,4,8,16
Miscellaneous: Data width 16-bit or 8-bit
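The convolution limits in this table are easy to encode as a pre-flight check before handing a layer to the accelerator. The sketch below does that for kernel size, stride, and dilation. `check_conv` is a hypothetical helper written for illustration; only the numeric limits come from the table.

```python
# Limits copied from the convolution row of the supported-operations table.
KERNEL_RANGE = range(1, 16)   # W and H: 1-15
STRIDES = {1, 2, 4, 8}
DILATIONS = {1, 2, 4}

def check_conv(kernel_w, kernel_h, stride_w, stride_h, dilation=1):
    """Return the reasons a layer falls outside the table; empty if it fits."""
    problems = []
    if kernel_w not in KERNEL_RANGE or kernel_h not in KERNEL_RANGE:
        problems.append(f"kernel {kernel_w}x{kernel_h} outside 1-15")
    if stride_w not in STRIDES or stride_h not in STRIDES:
        problems.append(f"stride {stride_w}x{stride_h} not in 1,2,4,8")
    if dilation not in DILATIONS:
        problems.append(f"dilation {dilation} not in 1,2,4")
    return problems

print(check_conv(7, 7, 2, 2))   # 7x7 kernel with stride 2 fits the table
print(check_conv(3, 3, 3, 3))   # stride 3 is not listed
```

A layer that fails the check would have to run elsewhere or be rewritten by the compiler.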

20 Seamless Deployment with Open Source Software
Deploy
xfdnn Middleware, Tools and Runtime
xdnn CNN Processing Engine
[Diagram legend: From Xilinx, From Community]
*TensorFlow Q4 2017

21 ML Suite Overlays with xdnn Processing Engines
Adaptable: AI algorithms are changing rapidly; adjacent acceleration opportunities
Realtime: 10x lower latency than CPU and GPU; data flow processing
Efficient: Performance/watt; low power
[Diagram: FPGA platform with four xdnn PEs, DDR, and CPU]

22 xdnn PEs Optimized for Your Cloud Applications
Throughput, multi-network optimized: four xdnn PEs on the FPGA platform infrastructure
Latency, high-res optimized: two xdnn PEs on the FPGA platform infrastructure

Overlay Name  DSP Array  #PEs  Cache  Precision  GOP/s  Optimized For                      Example Networks
Overlay_0     28x32      4     4 MB   Int16      896    Multi-Network, Maximum Throughput  ResNet50 (224x224)
Overlay_1     28x32      4     4 MB   Int8       1,792  Multi-Network, Maximum Throughput  ResNet50 (224x224)
Overlay_2     56x?       ?     ? MB   Int16      1,702  Lowest Latency                     Yolov2 (224x224)
Overlay_3     56x?       ?     ? MB   Int8       3,405  Lowest Latency                     Yolov2 (224x224)

23 Real-time Inference
Inference with batches
  Requires a batch of input data to improve data reuse and instruction synchronization
  High throughput depends on a large batch size
  High and unstable latency
  Low compute efficiency while the batch is not fully filled or at lower batch sizes
  [Diagram: Inputs 1-4 pass through a Prepare Batch stage before the inference processor; Results 1-4 arrive with latencies Latency1 to Latency4]
Real-time inference
  No requirement for batch input data
  Throughput less related to batch size
  Low and deterministic latency
  Consistent compute efficiency
  [Diagram: Inputs 1-4 go straight to the inference processor; Results 1-4 arrive with latencies Latency1 to Latency4]
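The latency contrast described above can be reproduced with a toy timing model. In this sketch, inputs arrive at a fixed interval; the batched processor waits until a full batch has arrived, while the real-time processor handles each input on arrival. All numbers are invented for illustration, and the model assumes the input count is a multiple of the batch size and that the processor always keeps up with arrivals.

```python
def batched_latencies(n_inputs, interval_ms, batch_size, batch_time_ms):
    """Latency per input when processing starts only once the batch is full.
    Each input waits for the last arrival in its batch, then for the batch."""
    latencies = []
    for i in range(n_inputs):
        last_in_batch = (i // batch_size + 1) * batch_size - 1
        wait_ms = (last_in_batch - i) * interval_ms
        latencies.append(wait_ms + batch_time_ms)
    return latencies

def realtime_latencies(n_inputs, per_input_ms):
    """Latency per input when each one is processed as soon as it arrives."""
    return [per_input_ms] * n_inputs

# Invented numbers: 4 inputs, one every 10 ms; a batch of 4 takes 8 ms,
# a single real-time inference takes 3 ms.
batched = batched_latencies(4, 10, 4, 8)
realtime = realtime_latencies(4, 3)
print(batched)    # [38, 28, 18, 8]
print(realtime)   # [3, 3, 3, 3]
```

The first input in a batch pays for every later arrival, which is the "high and unstable latency" in the batched case; the real-time path gives every input the same latency.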

24 Xilinx: High Throughput at Real-Time
Consistent high throughput at low latency
[Chart: GoogLeNet V1 performance, throughput (images/sec, axis up to 3500) vs. latency (ms); series: Alveo U200 v3 XDNN, P4 with TensorRT 4.0, P4 with TensorRT 3.0; annotation: degradation of throughput with lower latency]

25 Fast: Advantages in Machine Learning Inference
Increase real-time machine learning* throughput by 20x
[Chart: images/s (axis up to 4500) for Intel Xeon Skylake, Intel Arria 10 PAC, Nvidia V100, Xilinx U200, Xilinx U250]
* Source: Accelerating DNNs with Xilinx Alveo Accelerator Cards White Paper

26 ML Suite Performance Roadmap
[Chart: Img/sec, GoogleNet v1, batch 4, from Mar 18 to Jun 19; Alveo U200, U250, and U280 with xdnnv2 INT8, xdnnv3 INT8, and xdnnv3 FP7*, compared against K80, P4, and V100]

27 Alveo Overview Jim Heaton Sr. FAE

28 Alveo: Breathe New Life into Your Data Center
16nm UltraScale Architecture
Off-chip memory support: max capacity 64GB; max bandwidth 77GB/s
Internal SRAM: max capacity 54MB; max bandwidth 38TB/s
PCIe Gen3x16
Cloud deployed: cloud and on-premise mobility
Ecosystem of applications: many available today, more on the way
Server OEM support: major OEMs in qualification
Accelerate any application: IDE for compiling, debugging, profiling; supports C/C++, RTL, and OpenCL

29 Alveo Accelerator Card Value Proposition
Fast (Highest Performance): faster than CPUs & GPUs; latency advantage over GPUs
Adaptable (Accelerate Any Workload): optimize for any workload; adapt to changing algorithms
Accessible (Cloud, On-Premises Mobility): deploy in the cloud or on-premises; applications available now

30 Accelerator Cards That Fit Your Performance Needs

31 Expanding Accelerator Card Portfolio
16nm UltraScale+: U200 (available now), U250 (available now), U280
7nm Versal
SmartNIC
Broad portfolio of acceleration cards planned
[Chart axis: Performance & Capability]

32 Accessible: Unified Simple User Experience from Cloud to Alveo
Stages: Develop, Publish, Deploy
Resources: User Guides, Tutorials, Examples
User choice: Launch Instance or Download
Provided: xfdnn Apps, xdnn Binaries

33 Adaptable: On-chip Memory
The critical asset on-chip to assist and feed the compute: store parameters, buffer intermediate activations, and move data around
Alveo: highest on-chip memory capacity
Alveo: most adaptable memory architecture
[Chart: on-chip memory (MB) for Xilinx Alveo U200, Xilinx Alveo U250, Nvidia Tesla T4]
[Diagram: memory access of Alveo; Kernels A, B, and C backed by LUTRAM, BRAM, UltraRAM, and global memory (if needed)]

34 Adaptable: Neural Network Inference Efficiency
[Chart: GoogLeNet v1 peak efficiency (0% to 60%) for Xilinx Alveo U200, Xilinx Alveo U250, Nvidia Tesla T4]
*T4 efficiency assumes 2x power efficiency improvement vs. P4 as claimed in Nvidia whitepaper:

35 Fast: Real-time Performance
[Chart: GoogLeNet v1 real-time (batch = 1) performance (images/sec) and efficiency (0% to 60%) for Xilinx Alveo U200, Xilinx Alveo U250, Nvidia Tesla T4]
*T4 performance and efficiency assume 2x power efficiency improvement vs. P4 as claimed in Nvidia whitepaper:

36 Xilinx Performance Roadmap
[Chart: Img/sec (axis 1,000 to 100,000), GoogleNet v1, batch 1; today: K80, P4, V100, T4, Xilinx Alveo U200, Xilinx Alveo U250; projected gains from model compression, low and custom precision arithmetic, and silicon improvement (Versal); time axis from Oct 18 to Oct 19]

37 Beyond Machine Learning

38 Fast Advantages Across Many Workload Types Database Financial Machine Learning Video HPC & Life Sciences

39 It's All About the Applications
Database: Elasticsearch; Low Latency KVS; ETL, Streaming, SQL Analytics; Deepgreen DB; Spark-MLlib; Hyperion 10G RegEx
Machine Learning: ML Suite for Inference; Image Classification; Text Analysis
Video and Financial: ABR Transcoding; Real-Time Risk Dashboard; Experience Engine; SIMM Calculation; Image Processing
HPC & Life Sciences: Accelerated Genomics Pipelines
Application ecosystem continues to grow

40 Infuse Machine Learning with other accelerations Database Machine Learning HPC & Life Sciences Video Financial

41 Solution Stack
Accelerated Solutions: Machine Learning, Video, Database, Financial, HPC & Life Sciences
  End user: Framework, API, Python/Java/C++ programmability
Developer Package: Xilinx ML Suite; RTL, C, C++, OpenCL
  Developers: 100% growth of published applications; hundreds of developers trained
Solutions: Xilinx, ISVs
Platforms: FPGA as a Service (FaaS) platform; cloud; on-premise

42 Xilinx is Qualifying with Major Server OEMs
Server Support & Qualification Strategy
Qualification levels: Interop Level Qual, Stringent Qual, Very Stringent Qual
Xilinx Validated (COMPLETE): Dell R730, Dell R740, SuperMicro SYS-4028GR-TR, SuperMicro SYS-4029GP-TRT, SuperMicro SYS-7049GP-TRT, HPE ProLiant DL380 G10
3rd Party Option (3PO) Qualified: IN PROGRESS
OEM Factory Qualified: IN PROGRESS

43 Ordering Information
Part number format: A U### P 64G PQ G
  Alveo kit name: U200, U250
  Cooling (two options): P: Passive (no fan); A: Active (built-in fan)
  Memory: 64G: 64GB
  Solution qualification: ESx: Engineering Sample; PQ: Production Qualified
  RoHS indicator: G: RoHS 6/6

Alveo Accelerator Card  Part Numbers                        SRP 1pc
U200                    A-U200-A64G-PQ-G, A-U200-P64G-PQ-G  $8,995
U250                    A-U250-A64G-PQ-G, A-U250-P64G-PQ-G  $12,995

Buy via standard quote/PO process from Xilinx or Avnet
Buy via ecom (at SRP) from Xilinx, Avnet, DigiKey, EBV, or Premier Farnell
Developer discount available; requires qualification through Accelerator Program
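The part number scheme above can be decoded mechanically. The sketch below splits a part number such as A-U200-A64G-PQ-G into the fields named on the slide. `parse_part_number` is a hypothetical helper; the field codes come from the slide, while the leading "A" is kept as an uninterpreted prefix because the slide does not define it.

```python
COOLING = {"A": "Active", "P": "Passive"}

def parse_part_number(part_number):
    """Decode an Alveo part number of the form A-U###-<cooling><memory>-<qual>-G."""
    prefix, card, cooling_memory, qual, rohs = part_number.split("-")
    cooling_code, memory = cooling_memory[0], cooling_memory[1:]
    if cooling_code not in COOLING:
        raise ValueError(f"unknown cooling code: {cooling_code}")
    if qual == "PQ":
        qualification = "Production Qualified"
    elif qual.startswith("ES"):
        qualification = "Engineering Sample"
    else:
        raise ValueError(f"unknown qualification code: {qual}")
    if rohs != "G":
        raise ValueError(f"unknown RoHS indicator: {rohs}")
    return {
        "prefix": prefix,            # meaning not stated on the slide
        "card": card,
        "cooling": COOLING[cooling_code],
        "memory": memory,            # "64G" means 64GB
        "qualification": qualification,
        "rohs": "RoHS 6/6",
    }

info = parse_part_number("A-U250-P64G-PQ-G")
print(info["card"], info["cooling"], info["memory"], info["qualification"])
```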

44 More Information Available on Xilinx.com
https://www.xilinx.com/alveo
Resources: Product Brief, Product Selection Guide, Getting Started Guide, Data Sheet, ML Solution Brief, SDAccel Solution Brief, ABR Transcoding Solution Brief, Accelerating DNNs with Alveo White Paper, Applications Directory

45 Adaptable. Intelligent.
