Xilinx ML Suite Overview

Xilinx ML Suite Overview. Yao Fu, System Architect, Data Center Acceleration

Xilinx Accelerated Computing Workloads:
- Machine Learning Inference: image classification and object detection (10x)
- Video Streaming: frame rate for HEVC & VP9 encoding (10x)
- Genomics: 20 min vs. 33 hours for whole-genome analysis (100x)
- Big Data Analytics: 40 min vs. 60 hours for logfile query (90x)

Deep learning is the study of algorithms that can learn from and make predictions on data. Deep learning is re-defining many applications: cloud acceleration, security, e-commerce, social, financial, surveillance, industrial IoT, medical, bioinformatics, and autonomous vehicles.

Accelerating AI Inference into Your Cloud Applications. Featuring the most powerful FPGA in the cloud: deep learning applications run on the Virtex UltraScale+ VU9P (cloud and on-premises) and the Zynq UltraScale+ MPSoC (edge).

Xilinx ML Suite - AWS Marketplace
- Supported frameworks: Caffe, MxNet, TensorFlow; Python support; Darknet
- Jupyter notebooks available: Image Classification with Caffe; Using the xfdnn Compiler with a Caffe Model; Using the xfdnn Quantizer with a Caffe Model
- Pre-trained models: Caffe 8/16-bit (GoogLeNet v1, ResNet50, Flowers102, Places365); Python 8/16-bit (Yolov2); MxNet 8/16-bit (GoogLeNet v1)
- xfdnn tools: Compiler, Quantizer
https://aws.amazon.com/marketplace/pp/b077fm2jns

Unified, Simple User Experience from Cloud to XBB. [Diagram: Develop -> Publish -> Deploy, with user guides, tutorials, and examples at each stage; users either launch a cloud instance or download, by choice, the xfdnn apps and xdnn binaries.]

Overlay Architecture: Custom Processors Exploiting Xilinx FPGA Flexibility. Customized overlays with an ISA architecture for optimized implementation; easy plug-and-play with the software stack.
- MLP Engine: scalable sparse and dense implementation
- xdnn: CNN engine for large 16 nm Xilinx devices
- DeePhi DPU: flexible CNN engine with embedded focus
- CHaiDNN: HLS-based open-source offering
- DeePhi ESE: LSTM speech-to-text engine
- Random Forest: configurable RF classification

Deep Learning Models:
- Multi-Layer Perceptron: classification; universal function approximator; autoencoder
- Convolutional Neural Network: feature extraction; object detection; image segmentation
- Recurrent Neural Network: sequence and temporal data; speech-to-text; language translation
[Figure: classification, object detection, and segmentation examples on a dog image.]

Rapid Feature and Performance Improvement:
- xdnn-v1 (Q4CY17), 500 MHz: URAM for feature maps without caching; array of accumulators with 16-bit (batch 1) and 8-bit (batch 2) modes; instructions: Convolution, ReLU, MaxPool, AveragePool, Elementwise; flexible (square) kernel sizes and strides; programmable scaling.
- xdnn-v2 (Q2CY18), 500 MHz: all xdnn-v1 features; DDR caching for larger images and CNN networks; instructions: depthwise convolution, deconvolution, convolution transpose, upsampling; rectangular kernels.
- xdnn-v3 (Q4CY18), 700 MHz: feature-compatible with xdnn-v2; new systolic array implementation with 50% higher FMAX and 2.2x lower latency; batch of 1 for the 8-bit implementation; non-blocking caching and pooling.

Break Through on Peak Performance. [Two charts of deep learning peak power efficiency, GOPS/W from 0 to 1000, over 2012-2021.] GPU: introduce new architectures and silicon (Kepler, Maxwell, Pascal, Volta), with annotated breakthroughs at native INT8 operation and mixed FP16 for Tensor Cores. Xilinx: adapt the breakthroughs of emerging domain knowledge (UltraScale, UltraScale+, ACAP), with annotated breakthroughs at INT8 optimization, adaptable precision, and ACAP architectures.

Seamless Deployment with Open Source Software: xfdnn middleware, tools, and runtime from Xilinx and from the community sit above the xdnn processing engine.

xfdnn flow: framework front end (CNTK, Caffe2, PyTorch, TensorFlow, MxNet, Caffe, or an ONNX model) -> xfdnn Compiler (framework tensor graph to Xilinx tensor graph) -> xfdnn tensor graph optimization -> xfdnn compression (driven by the model weights and a calibration image set) -> xfdnn runtime (Python API), dispatching CPU layers and FPGA layers. https://github.com/xilinx/ml-suite

xfdnn Inference Toolbox:
- Graph Compiler: Python tools to quickly compile networks from common frameworks (Caffe, MxNet, and TensorFlow)
- Network Optimization: automatic network optimizations for lower latency by fusing layers and buffering in on-chip memory
- xfdnn Quantizer: quickly reduces the precision of trained models for deployment; maintains 8-bit accuracy within 2% of the 32-bit model

xfdnn Graph Compiler: pass in a network, and microcode for xdnn is produced. For example, the start of a ResNet-50-style network compiles to instructions like these (instruction index, opcode, layer name, kernel/stride/precision parameters, then source and destination addresses with tensor dimensions):

0  XNConv conv1 7 2 16 26 2 1 1 0x1c0000 224 3 0x0 112 64
2  XNMaxPool pool1 3 2 0 0x0 112 64 0x1c0000 56
3  XNConv res2a_branch1 1 1 16 26 2 0 1 0x1c0000 56 64 0x0 56 256
4  XNConv res2a_branch2a 1 1 16 26 2 1 1 0x1c0000 56 64 0x230000 56 64
6  XNConv res2a_branch2b 3 1 16 26 2 1 1 0x230000 56 64 0x1c0000 56 64
8  XNConv res2a_branch2c 1 1 16 26 2 0 1 0x1c0000 56 64 0x230000 56 256
9  XNEltWise 1 0 1 0x0 0x230000 56 256 0x0
9  XNUpload 0x0 56 256
11 XNConv res2b_branch2a 1 1 16 26 2 1 1 0x0 56 256 0x1c0000 56 64
13 XNConv res2b_branch2b 3 1 16 26 2 1 1 0x1c0000 56 64 0x3f0000 56 64
15 XNConv res2b_branch2c 1 1 16 26 2 0 1 0x3f0000 56 64 0x1c0000 56 256
16 XNEltWise 1 0 1 0x0 0x1c0000 56 256 0x0
16 XNUpload 0x0 56 256
18 XNConv res2c_branch2a 1 1 16 26 2 1 1 0x0 56 256 0x3f0000 56 64
20 XNConv res2c_branch2b 3 1 16 26 2 1 1 0x3f0000 56 64 0x460000 56 64
22 XNConv res2c_branch2c 1 1 16 26 2 0 1 0x460000 56 64 0x1c0000 56 256
23 XNEltWise 1 0 1 0x0 0x1c0000 56 256 0x0
23 XNUpload 0x0 56 256
25 XNConv res3a_branch1 1 2 16 26 2 0 1 0x0 56 256 0x1c0000 28 512
26 XNConv res3a_branch2a 1 2 16 26 2 1 1 0x0 56 256 0x3f0000 28 128
28 XNConv res3a_branch2b 3 1 16 26 2 1 1 0x3f0000 28 128 0x428000 28 128
30 XNConv res3a_branch2c 1 1 16 26 2 0 1 0x428000 28 128 0x2a0000 28 512
31 XNEltWise 1 0 1 0x1c0000 0x2a0000 28 512 0x1c0000
31 XNUpload 0x1c0000 28 512

xfdnn Network Optimization, layer to layer. [Diagram: an unoptimized model with separate Conv, Bias, ReLU, and Pool nodes buffered through DDR, versus the same model after xfdnn intelligently fuses layers into [Conv+Bias+ReLU] and pooling stages, streaming through on-chip URAM buffers.]

xfdnn Network Deployment.
Fused layer optimizations: the compiler can merge nodes, e.g. (Conv or EltWise)+ReLU and Conv+BatchNorm, and can split nodes, e.g. a Conv 1x1 stride 2 into MaxPool + Conv 1x1 stride 1.
One Shot network deployment: on-chip buffering reduces latency and increases throughput; xfdnn analyzes the network's memory needs and optimizes the scheduler for fused and One Shot deployment. One Shot deploys the entire network to the FPGA, optimized for fast, low-latency inference: the entire network, schedule, and weights are loaded to the FPGA only once.
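To make the Conv+BatchNorm merge concrete, here is a minimal NumPy sketch of the standard folding identity (illustrative only, not Xilinx's implementation): a BatchNorm with statistics (mean, var) and affine parameters (gamma, beta) following a convolution can be absorbed into the convolution's weights and bias.

import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    # Fold y = gamma*(conv(x,W)+b - mean)/sqrt(var+eps) + beta into an
    # equivalent conv with weights W' and bias b'.
    # W: (out_ch, in_ch, kh, kw); all BN parameters: (out_ch,)
    scale = gamma / np.sqrt(var + eps)          # per-output-channel multiplier
    W_folded = W * scale[:, None, None, None]   # scale each output channel's filter
    b_folded = (b - mean) * scale + beta        # fold the shift into the bias
    return W_folded, b_folded

# Usage: fold the BN of a 64-channel 3x3 conv, then delete the BN node.
W = np.random.randn(64, 64, 3, 3); b = np.zeros(64)
gamma = np.ones(64); beta = np.zeros(64); mean = np.zeros(64); var = np.ones(64)
W2, b2 = fold_batchnorm(W, b, gamma, beta, mean, var)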

xfdnn Quantizer: Floating-Point to Fixed-Point Quantization.
Problem: nearly all trained models are in 32-bit floating-point, and the available Caffe and TensorFlow quantization tools take hours and produce inefficient models.
Introducing the xfdnn Quantizer: a customer-friendly toolkit that automatically analyzes floating-point ranges layer by layer and produces the fixed-point encoding that loses the least information. It quantizes GoogLeNet in under a minute, produces 8-bit fixed-point networks within 1-3% accuracy of the 32-bit floating-point networks, and is extensible to maximize performance by searching for minimal viable bitwidths and pruning sparse networks.
[Chart: quantized model accuracy, 0-100%: fp32 vs. 8-bit top-1 and top-5 accuracy for resnet-50, resnet-101, resnet-152, and Googlenet-v1.]

xfdnn Quantizer: Fast and Easy.
1) Provide the FP32 network and model, e.g. prototxt and caffemodel.
2) Provide a small sample set, no labels required: 16 to 512 images.
3) Specify the desired precision: quantizes to 8-bit or lower to match Xilinx's DSPs.
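For intuition, here is a generic post-training calibration sketch in NumPy (the general technique, not the actual xfdnn tool): run a small unlabeled sample set through each layer, record the per-layer absolute range, and derive a scale that maps that range onto the integer grid.

import numpy as np

def calibrate_layer_scale(activations, bits=8):
    # Pick a fixed-point format for one layer from sample activations.
    # Returns (scale, int_min, int_max) for symmetric signed quantization.
    max_abs = max(np.max(np.abs(a)) for a in activations)
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8-bit
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return scale, -qmax - 1, qmax

def quantize(x, scale, lo, hi):
    # Round to the integer grid and back, as deployment would.
    return np.clip(np.round(x / scale), lo, hi) * scale

# Usage: feed 16-512 calibration images through the network, collect each
# layer's outputs, then fix that layer's scale for deployment.
samples = [np.random.randn(1, 64, 56, 56) for _ in range(16)]  # stand-in data
scale, lo, hi = calibrate_layer_scale(samples)
err = np.abs(quantize(samples[0], scale, lo, hi) - samples[0]).mean()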

Seamless Deployment with Open Source Software (continued): xfdnn middleware, tools, and runtime from Xilinx and from the community sit above the xdnn processing engine.

Xilinx ML Processing Engine (xdnn) supported operations:
- Convolution / Deconvolution / Convolution Transpose: kernel sizes W 1-15, H 1-15; strides W 1,2,4,8, H 1,2,4,8; padding same or valid; dilation factor 1, 2, 4; activation ReLU; bias value per channel; scaling (scale & shift value per channel)
- Max Pooling: kernel sizes W 1-15, H 1-15; strides W 1,2,4,8, H 1,2,4,8; padding same or valid
- Avg Pooling: kernel sizes W 1-15, H 1-15; strides W 1,2,4,8, H 1,2,4,8; padding same or valid
- Element-wise Add: width and height must match; depth can mismatch
- Upsampling: stride factors 2, 4, 8, 16
- Memory support: on-chip buffering, DDR caching
- Miscellaneous: data width 16-bit or 8-bit
Features: programmable feature set, tensor-level instructions, 500+ MHz DSP frequency (VU9P), custom network acceleration, expanded set of image sizes (square, rectangular).
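As a quick way to read these constraints, the sketch below checks a convolution layer description against them; it is an illustrative validator written from the table above, not a Xilinx tool.

# Illustrative check of a conv layer against the xdnn constraints above.
SUPPORTED_STRIDES = {1, 2, 4, 8}
SUPPORTED_DILATIONS = {1, 2, 4}

def conv_is_supported(kh, kw, sh, sw, padding, dilation=1):
    return (1 <= kh <= 15 and 1 <= kw <= 15
            and sh in SUPPORTED_STRIDES and sw in SUPPORTED_STRIDES
            and padding in ("same", "valid")
            and dilation in SUPPORTED_DILATIONS)

print(conv_is_supported(7, 7, 2, 2, "same"))  # ResNet-50 conv1 -> True
print(conv_is_supported(3, 3, 3, 3, "same"))  # stride 3 -> False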

Alveo: Breathe New Life into Your Data Center. 16nm UltraScale architecture, cloud deployed with mobility between cloud and on-premises. Off-chip memory: max capacity 64 GB, max bandwidth 77 GB/s. Internal SRAM: max capacity 54 MB, max bandwidth 38 TB/s. PCIe Gen3 x16. Server OEM support: major OEMs in qualification. Ecosystem of applications: many available today, more on the way. Accelerate any application: IDE for compiling, debugging, and profiling; supports C/C++, RTL, and OpenCL.

ML Suite Overlays with xdnn Processing Engines.
- Adaptable: AI algorithms are changing rapidly; adjacent acceleration opportunities.
- Real-time: 10x lower latency than CPU and GPU; data-flow processing.
- Efficient: high performance/watt; low power.
[Diagram: four xdnn PEs on an FPGA platform with DDR, alongside the CPU.]

xdnn PEs Optimized for Your Cloud Applications: throughput/multi-network optimized (four small PEs per FPGA platform) or latency/high-resolution optimized (one large PE per FPGA platform).

Overlay     DSP Array  #PEs  Cache  Precision  GOP/s  Optimized For                       Example Networks
Overlay_0   28x32      4     4 MB   Int16        896  Multi-network, maximum throughput   ResNet50 (224x224)
Overlay_1   28x32      4     4 MB   Int8       1,792  Multi-network, maximum throughput   ResNet50 (224x224)
Overlay_2   56x32      1     5 MB   Int16      1,702  Lowest latency                      Yolov2 (224x224)
Overlay_3   56x32      1     5 MB   Int8       3,405  Lowest latency                      Yolov2 (224x224)
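The peak GOP/s figures are consistent with simple MAC arithmetic; the sketch below reproduces them under assumptions the slide does not state: each DSP performs one multiply-accumulate (2 ops) per cycle, the 28x32 arrays clock at 500 MHz, the 56x32 arrays at roughly 475 MHz, and Int8 doubles the rate.

# Peak GOP/s = rows * cols * 2 ops per MAC * clock in GHz, doubled for Int8.
def peak_gops(rows, cols, clock_ghz, int8=False):
    gops = rows * cols * 2 * clock_ghz
    return gops * 2 if int8 else gops

print(peak_gops(28, 32, 0.500))             # ~896  (Overlay_0, Int16)
print(peak_gops(28, 32, 0.500, int8=True))  # ~1792 (Overlay_1, Int8)
print(peak_gops(56, 32, 0.475))             # ~1702 (Overlay_2, Int16)
print(peak_gops(56, 32, 0.475, int8=True))  # ~3405 (Overlay_3, Int8)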

Real-Time Inference.
Inference with batches: requires batches of input data to improve data reuse and instruction synchronization; high throughput depends on a large batch size; latency is high and unstable, since early inputs wait for the batch to fill; compute efficiency is low while the batch is not fully filled or at small batch sizes.
Real-time inference: no requirement to batch input data; throughput is largely independent of batch size; latency is low and deterministic; compute efficiency is consistent.
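To see why batching trades latency for throughput, here is a toy latency model with illustrative numbers, not measurements: under batching, each input waits until its whole batch has arrived before anything runs, while a streaming processor starts each input immediately.

# Toy model: per-input latency under batched vs. streaming execution.
def batched_latencies(n_inputs, batch, arrival_ms, exec_ms):
    # Input i waits until its batch is full, then the batch runs together.
    out = []
    for i in range(n_inputs):
        batch_full = ((i // batch) + 1) * batch * arrival_ms  # last arrival in batch
        my_arrival = (i + 1) * arrival_ms
        out.append(batch_full - my_arrival + exec_ms)
    return out

def streaming_latencies(n_inputs, exec_ms):
    # Each input is processed as soon as it arrives.
    return [exec_ms] * n_inputs

print(batched_latencies(4, 4, arrival_ms=5, exec_ms=10))  # [25, 20, 15, 10]
print(streaming_latencies(4, exec_ms=10))                 # [10, 10, 10, 10]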

Xilinx: High Throughput at Real-Time. [Chart: GoogLeNet v1 throughput (images/sec, 0-3500) vs. latency (0-60 ms) for VCU1525 with xdnn-v3*, VCU1525 with xdnn-v2, P4 with TensorRT 4.0, and P4 with TensorRT 3.0. The Xilinx parts sustain consistent high throughput at low latency; the GPU's throughput degrades as latency is constrained.]

Fast Advantages in Machine Learning Inference: increase real-time machine learning* throughput by 20x. [Bar chart: images/sec, 0-4500, for Intel Xeon Skylake, Intel Arria 10 PAC, Nvidia V100, Xilinx U200, and Xilinx U250; the Xilinx cards show a 20x advantage.] * Source: "Accelerating DNNs with Xilinx Alveo Accelerator Cards" white paper.

ML Suite Performance Roadmap. [Chart: GoogLeNet v1 images/sec at batch 4, 0-9000, from Q1CY18 (Mar 18) through Q2CY19 (Jun 19): K80 and P4 GPU baselines, then V100; Alveo U200/U250 with xdnn-v2 INT8; Alveo U200/U250 with xdnn-v3 INT8; and Alveo U250/U280 with xdnn-v3 FP7*.]

Visit Xilinx.com/ML for more information: https://www.xilinx.com/applications/megatrends/machine-learning.html

Adaptable. Intelligent.