Project BrainWave: A Scalable FPGA-powered DNN Serving Platform


Transcription:

Deep neural networks have enabled major advances in machine learning and AI: computer vision, language translation, speech recognition, question answering, and more. Problem: DNNs are challenging to serve and deploy in large-scale interactive services. [Figure: an unrolled Recurrent Neural Network (inputs x_{t-1}, x_t, x_{t+1}; hidden states h_{t-1}, h_t, h_{t+1}; outputs y_{t-1}, y_t, y_{t+1}) alongside a Convolutional Neural Network] 2
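The recurrence in the RNN sketch above is what makes these models hard to serve: each time step is a fresh matrix-vector multiply that depends on the previous hidden state. As a minimal illustration, a vanilla RNN step in NumPy (the weight names W_xh, W_hh, W_hy and the sizes are illustrative, not from the talk):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla-RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Unroll over a short sequence (x_{t-1}, x_t, x_{t+1}) with illustrative sizes.
rng = np.random.default_rng(0)
d_in, d_h, d_out = 16, 32, 8
W_xh = 0.1 * rng.standard_normal((d_h, d_in))
W_hh = 0.1 * rng.standard_normal((d_h, d_h))
W_hy = 0.1 * rng.standard_normal((d_out, d_h))
b_h, b_y = np.zeros(d_h), np.zeros(d_out)

h = np.zeros(d_h)
for x in rng.standard_normal((3, d_in)):
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)  # matrix-vector multiplies at every step
```

At batch 1, each of those matrix-vector products reads every weight exactly once, which is the memory-bound serving pattern the rest of the talk addresses.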

DNN Processing Units span a spectrum from flexibility to efficiency: CPUs (control unit, registers, ALU), GPUs, Soft DPUs (FPGA), and Hard DPUs (ASICs). Soft DPU examples: BrainWave, Baidu SDA, DeePhi Tech, ESE, TeraDeep, etc. Hard DPU examples: Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc. 3

A Scalable FPGA-powered DNN Serving Platform. [Figure: a pretrained DNN model (in CNTK, etc.) is mapped onto a scalable DNN hardware microservice of FPGAs attached to the L0/L1 network switches; each BrainWave Soft DPU contains an instruction decoder & control block and a neural functional unit] 4

[Figure: datacenter organization with a CPU compute layer and a reconfigurable compute layer (FPGA) joined by a converged network]

Sub-millisecond FPGA compute latencies at batch 1 6



Project BrainWave:
- A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs
- An adaptive ISA for narrow-precision DNN inference, flexible and extensible to support fast-changing AI algorithms
- The BrainWave Soft DPU microarchitecture, highly optimized for narrow precision and low batch
- Model parameters persisted entirely in FPGA on-chip memories, with large models supported by scaling across many FPGAs
- Intel FPGAs deployed at scale with HW microservices [MICRO 16] 12


A Cloud-Scale Acceleration Architecture [MICRO 16] 14

Interconnected FPGAs form a separate plane of computation that can be managed and used independently from the CPUs. [Figure: a hardware acceleration plane running deep neural networks, SQL, web search ranking, and SDN offload, layered over the traditional software (CPU) server plane; in each server the two CPUs are linked by QPI, and the FPGA connects to the ToR switch through 40 Gb/s QSFP links] 15


[Figure: BrainWave's federated compiler flow. Frontends import Caffe, CNTK, and TensorFlow models into a portable IR; a graph splitter and optimizer partitions the dataflow graph (e.g., a 1000-dim vector split across 500x500 MatMul, Add, Sigmoid, and Concat nodes) into transformed IRs for FPGA0, FPGA1, and the CPU; per-target compilers (CPU-CNTK, FPGA, CPU-Caffe) emit a deployment package that runs as an FPGA HW microservice] 17
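A toy sketch of the graph-splitting step shown in the figure (op names and target labels are invented for illustration; the actual BrainWave tool flow is not public at this level of detail):

```python
from collections import defaultdict

# A toy "portable IR": (op name, assigned target). Names are purely illustrative.
graph = [
    ("split",   "cpu"),
    ("matmul0", "fpga0"), ("add0", "fpga0"), ("sigmoid0", "fpga0"),
    ("matmul1", "fpga1"), ("add1", "fpga1"), ("sigmoid1", "fpga1"),
    ("concat",  "cpu"),
]

def split_by_target(nodes):
    """Group IR nodes by target device so each partition goes to its own target compiler."""
    parts = defaultdict(list)
    for op, target in nodes:
        parts[target].append(op)
    return dict(parts)

print(split_by_target(graph))
# e.g. {'cpu': ['split', 'concat'], 'fpga0': [...], 'fpga1': [...]}
```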

[Figure: compute-to-data ratios. A convolutional layer with N weight kernels: O(N^3) data but O(N^4 K^2) compute, so each weight is reused many times; a dense matrix-vector multiply (output pre-activation = weight matrix times input activation): O(N^2) data and only O(N^2) compute, so each weight is touched roughly once per output] 19
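A back-of-the-envelope arithmetic-intensity comparison makes that contrast concrete (illustrative sizes, 16-bit weights, activation traffic ignored; these are not figures from the talk):

```python
def matvec_intensity(n, bytes_per_weight=2):
    """Dense n x n matrix-vector multiply: ~2*n^2 FLOPs over ~n^2 weights, each read once."""
    flops = 2 * n * n
    weight_bytes = n * n * bytes_per_weight
    return flops / weight_bytes

def conv_intensity(n_ch, k, hw, bytes_per_weight=2):
    """n_ch x n_ch channels, k x k kernels, applied at every position of an hw x hw feature map."""
    flops = 2 * n_ch * n_ch * k * k * hw * hw
    weight_bytes = n_ch * n_ch * k * k * bytes_per_weight  # each weight reused hw*hw times
    return flops / weight_bytes

print(matvec_intensity(2400))      # ~1 FLOP per weight byte: DRAM-bandwidth-bound at batch 1
print(conv_intensity(256, 3, 56))  # thousands of FLOPs per weight byte: compute-bound
```

This is the motivation for persisting RNN/MLP weights in FPGA on-chip memory: once the weights no longer stream from DRAM, batch-1 matrix-vector multiplies stop being bandwidth-bound.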

[Figure: baseline serving node (2x CPU + FPGA) with model parameters initialized in DRAM] 21

[Figure: hardware utilization (%) and 99th-percentile latency versus batch size, against a maximum allowed latency.] Batching improves HW utilization but increases latency. 24
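A toy latency/utilization model of that tradeoff (purely illustrative numbers, not measurements from the talk): when weights stream from DRAM, their load cost is amortized over the batch, so utilization rises with batch size while the earliest request in the batch waits longer.

```python
def serve(batch_size, weight_load_ms=1.0, compute_ms_per_request=0.05, arrival_gap_ms=0.2):
    """Toy model: weights stream from DRAM once per batch; requests queue until the batch fills."""
    batch_time = weight_load_ms + batch_size * compute_ms_per_request
    utilization = (batch_size * compute_ms_per_request) / batch_time
    wait_for_batch = (batch_size - 1) * arrival_gap_ms   # earliest request waits for the rest
    worst_case_latency = wait_for_batch + batch_time
    return utilization, worst_case_latency

for b in (1, 4, 16, 64):
    util, lat = serve(b)
    print(f"batch={b:3d}  utilization={util:5.1%}  latency={lat:6.2f} ms")
```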

[Figure series: observations on the baseline 2x CPU + FPGA serving node] 28


Core features: a proprietary, parameterizable narrow-precision format wrapped in float16 interfaces, and a Matrix-Vector Unit on the FPGA. 32
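The ms-fp formats themselves are proprietary, so the details are not public. As a loose analogy only, a generic block floating-point quantizer (one shared exponent per vector, narrow mantissas, float16 at the interfaces) can be sketched as follows:

```python
import numpy as np

def block_fp_quantize(x, mantissa_bits=5):
    """Generic block floating point (NOT the actual, proprietary ms-fp format):
    one shared exponent per vector, each element stored as a narrow signed mantissa,
    with float16 at the input/output interfaces."""
    x = np.asarray(x, dtype=np.float16).astype(np.float32)    # float16 interface in
    shared_exp = np.floor(np.log2(np.max(np.abs(x)) + 1e-30))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(x / scale), lo, hi)
    return (mantissas * scale).astype(np.float16)              # float16 interface out

v = np.random.randn(16).astype(np.float32)
print(np.max(np.abs(v - block_fp_quantize(v).astype(np.float32))))  # quantization error
```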

[Figure: the BrainWave Neural Functional Unit. A control processor and instruction decoder drive a Matrix-Vector Unit built from multiple matrix-vector-multiply kernels, each with matrix register files (Matrix RF) and vector register files (VRF); inputs are converted to msft-fp on entry and back to float16 on exit. Chained Multifunction Units provide crossbar (xbar), activation (A), multiply (x), and add/sub (+) stages with their own VRFs. Tensor managers, matrix and vector memory managers, and input/output message processors connect the unit to DRAM and the network. TA = tensor arbiter] 33

[Figure: inside the Matrix-Vector Unit, each matrix row (Matrix Row 1 through Matrix Row N) feeds a parallel dot-product lane whose partial products are reduced through an adder tree; input and output tensors are float16] 34
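A minimal software analogue of that structure, reducing each row's partial products pairwise the way an adder tree would (illustrative of the dataflow only, not the actual RTL):

```python
import numpy as np

def adder_tree_dot(row, x):
    """Dot product reduced pairwise, level by level, like a hardware adder tree."""
    partial = row.astype(np.float32) * x.astype(np.float32)   # one multiplier per element
    while partial.size > 1:
        if partial.size % 2:
            partial = np.append(partial, np.float32(0.0))     # pad odd lengths
        partial = partial[0::2] + partial[1::2]               # one adder-tree level
    return partial[0]

W = np.random.randn(4, 8).astype(np.float16)                  # float16 input tensors
x = np.random.randn(8).astype(np.float16)
y = np.array([adder_tree_dot(W[i], x) for i in range(W.shape[0])],
             dtype=np.float16)                                # float16 output tensor
```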


FPGA Performance vs. Data Type (Tera-Operations/sec):

  Data type     Stratix V D5 @ 225 MHz    Stratix 10 280 @ 500 MHz
  16-bit int    1.4                       12
  8-bit int     2.0                       31
  ms-fp9        2.7                       65
  ms-fp8        4.5                       90

Impact of Narrow Precision on Accuracy: [Figure: accuracy (y-axis 0.50 to 1.00) of Model 1 (GRU-based), Model 2 (LSTM-based), and Model 3 (LSTM-based) under float32, ms-fp9, and ms-fp9 with retraining] 42

Project BrainWave is a powerful platform for an accelerated AI cloud. It runs on Microsoft's hyperscale infrastructure with FPGAs, achieves excellent performance at low batch sizes via persistency and narrow precision, and is adaptable to changes in precision and in future AI algorithms. BrainWave running on Hardware Microservices will push the boundary of what is possible to deploy in the cloud: deeper and larger CNNs for more accurate computer vision, higher-dimensional RNNs toward human-like natural language processing, state-of-the-art speech, and much more. Stay tuned for announcements about external availability.