Deep neural networks have enabled major advances in machine learning and AI:
- Computer vision
- Language translation
- Speech recognition
- Question answering
- And more

Problem: DNNs are challenging to serve and deploy in large-scale interactive services.

[Figure: a recurrent neural network unrolled over time steps t-1, t, t+1, and a convolutional neural network]
DNN Processing Units

A spectrum from flexibility to efficiency:
- CPUs: control unit (CU), registers, arithmetic logic unit (ALU)
- GPUs
- Soft DPUs (FPGA): BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.
- Hard DPUs (ASICs): Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc.
A Scalable FPGA-Powered DNN Serving Platform

A pretrained DNN model (in CNTK, etc.) is compiled onto the BrainWave soft DPU, an instruction decoder and control unit driving a neural functional unit, and exposed as a scalable DNN hardware microservice.

[Figure: FPGAs attached to L0/L1 network switches, forming the scalable hardware microservice]
- CPU compute layer
- Reconfigurable compute layer (FPGA)
- Converged network
Sub-millisecond FPGA compute latencies at batch 1.
Project BrainWave:
- A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs
- An adaptive ISA for narrow-precision DNN inference, flexible and extensible to support fast-changing AI algorithms
- The BrainWave soft DPU microarchitecture, highly optimized for narrow precision and low batch
- Model parameters persisted entirely in FPGA on-chip memories; large models supported by scaling across many FPGAs
- Intel FPGAs deployed at scale with hardware microservices [MICRO'16]
A Cloud-Scale Acceleration Architecture [MICRO'16]
Interconnected FPGAs form a separate plane of computation that can be managed and used independently from the CPUs.

- Hardware acceleration plane: FPGAs running deep neural networks, web search ranking, SQL, and SDN offload
- Traditional software plane: CPU servers
- Each FPGA connects to its host CPUs and to the top-of-rack (ToR) switch over 40 Gb/s QSFP links

[Figure: routers and ToR switches joining the FPGA acceleration plane above the CPU server plane]
Compilation flow:
- Frontends import pretrained Caffe, CNTK, and TensorFlow models into a portable IR
- A graph splitter and optimizer partitions the IR into transformed sub-IRs
- Target-specific compilers (CPU-CNTK, CPU-Caffe, FPGA) lower each sub-IR
- The results are assembled into a deployment package running as an FPGA hardware microservice (see the sketch below)

[Figure: a 1000-dim dataflow graph of 500x500 MatMul, Add, and Sigmoid nodes, split across FPGA0 and FPGA1 via Split and Concat]
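The figure's split of a 1000-dim graph across two FPGAs hints at what the graph splitter does. Below is a minimal sketch of such a pass; the `Node` IR and `split_matmul` helper are hypothetical stand-ins, not the actual BrainWave compiler API.

```python
# Minimal sketch of a graph-splitting pass, loosely modeled on the
# figure: one large matrix-vector product partitioned across two FPGAs.
# All names here are illustrative, not the real BrainWave compiler API.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                     # e.g. "MatMul", "Add", "Split"
    inputs: list                # upstream Node objects
    device: str = "unassigned"  # e.g. "FPGA0", "FPGA1", "CPU"
    attrs: dict = field(default_factory=dict)

def split_matmul(matmul, devices):
    """Column-partition y = W @ x across devices: Split x into halves,
    multiply each half by the matching block of W on its own device,
    then Add the partial results. (A row-wise partition would instead
    broadcast x and Concat the partial outputs, matching the figure's
    Concat node.)"""
    x = matmul.inputs[0]
    split = Node("Split", [x], devices[0], {"parts": len(devices)})
    partials = [Node("MatMul", [split], d,
                     {"cols": matmul.attrs["dim"] // len(devices)})
                for d in devices]
    return Node("Add", partials, devices[-1])

# Usage: a 1000x1000 weight matrix becomes two 1000x500 blocks.
x = Node("Input", [], "CPU", {"dim": 1000})
y = split_matmul(Node("MatMul", [x], attrs={"dim": 1000}), ["FPGA0", "FPGA1"])
print(y.op, [p.device for p in y.inputs])   # Add ['FPGA0', 'FPGA1']
```

Partitioning halves each device's weight footprint at the cost of an extra combine step, which matters when parameters must fit in on-chip memory.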
Compute vs. data requirements:
- CNN: N weight kernels convolved over the input activation to produce the output pre-activation; O(N^3) data but O(N^4 K^2) compute, so each weight is reused many times and the layer is compute-bound
- RNN/MLP (matrix-vector multiply at batch 1): O(N^2) data and only O(N^2) compute, so each weight is used once per inference and the layer is memory-bandwidth-bound

[Figure: weight kernels applied to an input activation to produce the output pre-activation, for the convolutional and fully-connected cases]
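This asymmetry is the crux of the serving problem. A quick back-of-envelope, with illustrative layer sizes rather than numbers from the talk, shows the gap in arithmetic intensity:

```python
# Back-of-envelope ops per weight byte for the two regimes above.
# Sizes are illustrative, not from the talk.
N = 2048               # hidden dimension of an RNN/MLP layer
bytes_per_weight = 2   # float16

# Dense matrix-vector (RNN/MLP at batch 1): every weight used once.
mv_ops = 2 * N * N                     # one multiply-add per weight
mv_bytes = N * N * bytes_per_weight
print("matrix-vector:", mv_ops / mv_bytes, "ops/byte")    # 1.0

# Convolution: each KxK kernel weight is reused at every output pixel.
K, H, W, C = 3, 224, 224, 64           # kernel, image, channels
conv_ops = 2 * K * K * C * C * H * W
conv_bytes = K * K * C * C * bytes_per_weight
print("convolution: ", conv_ops / conv_bytes, "ops/byte")  # ~50,000
```

At one op per byte, a batch-1 RNN runs at whatever speed DRAM can stream weights, no matter how much compute the chip has.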
Model parameters initialized in DRAM.

[Figure: a 2xCPU server paired with an FPGA; model parameters resident in the FPGA's DRAM]
Batching improves HW utilization but increases latency.

[Charts: hardware utilization (%) vs. batch size, and 99th-percentile latency vs. batch size against the maximum allowed latency]
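A toy analytical model reproduces the shape of both curves, assuming a DRAM-resident model where one pass streams all weights and every request in the batch shares that pass; all constants below are illustrative, not measurements:

```python
# Toy model of the batching trade-off for a DRAM-resident model:
# each pass streams all weights from DRAM once, and the whole batch
# shares that pass. Illustrative numbers, not measurements.
weights_gb  = 0.5     # model size
dram_gb_s   = 20.0    # DRAM bandwidth
peak_tops   = 4.0     # peak compute throughput
ops_per_req = 0.5e9   # operations per inference

for batch in (1, 16, 256, 1024):
    stream_s  = weights_gb / dram_gb_s              # fixed weight-streaming time
    compute_s = batch * ops_per_req / (peak_tops * 1e12)
    latency_s = max(stream_s, compute_s)            # the batch finishes together
    util = compute_s / latency_s
    print(f"batch {batch:4d}: latency {latency_s*1e3:7.2f} ms, "
          f"utilization {util:6.1%}")
```

Utilization climbs with batch size because the fixed weight-streaming time is amortized, but once compute dominates, tail latency grows linearly with the batch.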
Observations

[Build slides: 2xCPU server with FPGA]
Core Features
- Proprietary parameterizable narrow-precision format wrapped in float16 interfaces
- Matrix-vector unit on the FPGA
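The ms-fp formats themselves are proprietary and parameterizable, so the sketch below assumes a generic block floating-point layout of the kind commonly used for narrow-precision inference: one shared power-of-two exponent per block, with a sign and a few mantissa bits per element. It illustrates the idea, not the actual ms-fp definition.

```python
import numpy as np

def quantize_block_fp(x, mantissa_bits=5):
    """Quantize a block to shared-exponent block floating point:
    one power-of-two scale for the whole block, a sign and a narrow
    mantissa per element. A plausible stand-in for an ms-fp-style
    format; the real ms-fp definition is proprietary."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0.0:
        return np.zeros_like(x)
    shared_exp = np.floor(np.log2(max_abs)) + 1     # one exponent per block
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lim = 2 ** (mantissa_bits - 1)
    mantissa = np.clip(np.round(x / scale), -lim, lim - 1)
    return mantissa * scale                         # dequantized view

# float16 at the interfaces, narrow precision inside:
x = np.random.randn(16).astype(np.float16)
xq = quantize_block_fp(x.astype(np.float32))
print("max abs error:", np.max(np.abs(x.astype(np.float32) - xq)))
```

Sharing one exponent across a block is what makes the per-element multipliers cheap enough to replicate by the tens of thousands in FPGA soft logic.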
BrainWave Neural Functional Unit

[Figure: the NFU datapath. An instruction decoder and control processor drive a matrix-vector unit containing multiple matrix-vector-multiply kernels, each with its own matrix register file and vector register file (VRF), converting to msft-fp on the way in and back to float16 on the way out. Two multifunction units follow, each a crossbar joining activation, multiply, and add/sub units with VRFs. Tensor arbiters (TA), a tensor manager, and matrix/vector memory managers move tensor data between DRAM, the network (via input/output message processors), and the datapath. Legend: tensor data, instructions, commands; A = activation, x = multiply, + = add/sub.]
[Figure: matrix-vector multiply datapath. Each matrix row (1 through N) is multiplied elementwise against the float16 input tensor and reduced through an adder tree; the per-row results form the float16 output tensor.]
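Functionally, each output element is a dot product whose partial products are reduced pairwise, so the critical path is log2(N) adder levels rather than a serial chain of N-1 adds. A minimal Python model of that reduction (the behavior, not the RTL):

```python
# Functional sketch of the adder-tree reduction in the figure: partial
# products are summed pairwise in log2(N) levels. Pure-Python model.
def adder_tree_dot(row, vec):
    partial = [a * b for a, b in zip(row, vec)]   # one multiplier per lane
    while len(partial) > 1:                       # one tree level per pass
        if len(partial) % 2:                      # pad an odd-width level
            partial.append(0.0)
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

def matvec(matrix, vec):
    # One adder tree per matrix row; all rows independent (parallel in HW).
    return [adder_tree_dot(row, vec) for row in matrix]

print(matvec([[1, 2, 3], [4, 5, 6]], [1, 0, 1]))  # [4, 10]
```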
FPGA Performance vs. Data Type

Tera-operations/sec by data type:

  Data type    Stratix V D5 @ 225 MHz    Stratix 10 280 @ 500 MHz
  16-bit int    1.4                       12
  8-bit int     2.0                       31
  ms-fp9        2.7                       65
  ms-fp8        4.5                       90

Impact of Narrow Precision on Accuracy

[Chart: accuracy (0.5 to 1.0) of Model 1 (GRU-based), Model 2 (LSTM-based), and Model 3 (LSTM-based) under float32, ms-fp9, and ms-fp9 with retraining]
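A sanity check on the headline number: 90 tera-operations/sec at 500 MHz, counting a multiply and an add as two operations, implies roughly 90,000 concurrent multiply-accumulates per cycle, a level of parallelism that the narrow ms-fp8 datapath makes affordable in FPGA soft logic:

```python
# Back-of-envelope: parallelism implied by 90 TOPS at 500 MHz
# (both numbers from the chart above).
tops, freq_ghz = 90.0, 0.5
ops_per_mac = 2                    # one multiply + one add
macs_per_cycle = tops * 1e12 / (freq_ghz * 1e9) / ops_per_mac
print(f"{macs_per_cycle:,.0f} parallel multiply-accumulates per cycle")
# -> 90,000
```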
Project BrainWave is a powerful platform for an accelerated AI cloud:
- Runs on Microsoft's hyperscale infrastructure with FPGAs
- Achieves excellent performance at low batch sizes via persistency and narrow precision
- Adaptable to changes in precision and in future AI algorithms

BrainWave running on hardware microservices will push the boundary of what is possible to deploy in the cloud:
- Deeper and larger CNNs for more accurate computer vision
- Higher-dimensional RNNs toward human-like natural language processing
- State-of-the-art speech recognition
- And much more

Stay tuned for announcements about external availability.