Deep neural networks have enabled major advances in machine learning and AI:
- Computer vision
- Language translation
- Speech recognition
- Question answering
- And more

Problem: DNNs are challenging to serve and deploy in large-scale interactive services.

[Figure: a recurrent neural network unrolled over time steps t-1, t, t+1, and a convolutional neural network]
DNN Processing Units

A spectrum from flexibility to efficiency:
- CPUs: control unit (CU), registers, arithmetic logic unit (ALU)
- GPUs
- Soft DPUs (FPGA): BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.
- Hard DPUs (ASICs): Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc.
A Scalable FPGA-Powered DNN Serving Platform

A pretrained DNN model (in CNTK, etc.) is compiled onto the BrainWave soft DPU, an instruction decoder and control unit driving a neural functional unit, and exposed as a scalable DNN hardware microservice.

[Figure: FPGAs attached to L0/L1 network switches, forming the scalable hardware microservice]
- CPU compute layer
- Reconfigurable compute layer (FPGA)
- Converged network
Sub-millisecond FPGA compute latencies at batch 1.
Project BrainWave:
- A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs
- An adaptive ISA for narrow-precision DNN inference, flexible and extensible to support fast-changing AI algorithms
- The BrainWave soft DPU microarchitecture, highly optimized for narrow precision and low batch
- Model parameters persisted entirely in FPGA on-chip memories; large models supported by scaling across many FPGAs
- Intel FPGAs deployed at scale with hardware microservices [MICRO'16]
A Cloud-Scale Acceleration Architecture [MICRO'16]
Interconnected FPGAs form a separate plane of computation that can be managed and used independently from the CPUs.

- Hardware acceleration plane: FPGAs running deep neural networks, web search ranking, SQL, and SDN offload
- Traditional software plane: CPU servers
- Each FPGA connects to its host CPUs and to the top-of-rack (ToR) switch over 40 Gb/s QSFP links

[Figure: routers and ToR switches joining the FPGA acceleration plane above the CPU server plane]
Compilation flow:
- Frontends import pretrained Caffe, CNTK, and TensorFlow models into a portable IR
- A graph splitter and optimizer partitions the IR into transformed sub-IRs
- Target-specific compilers (CPU-CNTK, CPU-Caffe, FPGA) lower each sub-IR
- The results are assembled into a deployment package running as an FPGA hardware microservice (see the sketch below)

[Figure: a 1000-dim dataflow graph of 500x500 MatMul, Add, and Sigmoid nodes, split across FPGA0 and FPGA1 via Split and Concat]
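The figure's split of a 1000-dim graph across two FPGAs hints at what the graph splitter does. Below is a minimal sketch of such a pass; the `Node` IR and `split_matmul` helper are hypothetical stand-ins, not the actual BrainWave compiler API.

```python
# Minimal sketch of a graph-splitting pass, loosely modeled on the
# figure: one large matrix-vector product partitioned across two FPGAs.
# All names here are illustrative, not the real BrainWave compiler API.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                     # e.g. "MatMul", "Add", "Split"
    inputs: list                # upstream Node objects
    device: str = "unassigned"  # e.g. "FPGA0", "FPGA1", "CPU"
    attrs: dict = field(default_factory=dict)

def split_matmul(matmul, devices):
    """Column-partition y = W @ x across devices: Split x into halves,
    multiply each half by the matching block of W on its own device,
    then Add the partial results. (A row-wise partition would instead
    broadcast x and Concat the partial outputs, matching the figure's
    Concat node.)"""
    x = matmul.inputs[0]
    split = Node("Split", [x], devices[0], {"parts": len(devices)})
    partials = [Node("MatMul", [split], d,
                     {"cols": matmul.attrs["dim"] // len(devices)})
                for d in devices]
    return Node("Add", partials, devices[-1])

# Usage: a 1000x1000 weight matrix becomes two 1000x500 blocks.
x = Node("Input", [], "CPU", {"dim": 1000})
y = split_matmul(Node("MatMul", [x], attrs={"dim": 1000}), ["FPGA0", "FPGA1"])
print(y.op, [p.device for p in y.inputs])   # Add ['FPGA0', 'FPGA1']
```

Partitioning halves each device's weight footprint at the cost of an extra combine step, which matters when parameters must fit in on-chip memory.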
Compute vs. data requirements:
- CNN: N weight kernels convolved over the input activation to produce the output pre-activation; O(N^3) data but O(N^4 K^2) compute, so each weight is reused many times and the layer is compute-bound
- RNN/MLP (matrix-vector multiply at batch 1): O(N^2) data and only O(N^2) compute, so each weight is used once per inference and the layer is memory-bandwidth-bound

[Figure: weight kernels applied to an input activation to produce the output pre-activation, for the convolutional and fully-connected cases]
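This asymmetry is the crux of the serving problem. A quick back-of-envelope, with illustrative layer sizes rather than numbers from the talk, shows the gap in arithmetic intensity:

```python
# Back-of-envelope ops per weight byte for the two regimes above.
# Sizes are illustrative, not from the talk.
N = 2048               # hidden dimension of an RNN/MLP layer
bytes_per_weight = 2   # float16

# Dense matrix-vector (RNN/MLP at batch 1): every weight used once.
mv_ops = 2 * N * N                     # one multiply-add per weight
mv_bytes = N * N * bytes_per_weight
print("matrix-vector:", mv_ops / mv_bytes, "ops/byte")    # 1.0

# Convolution: each KxK kernel weight is reused at every output pixel.
K, H, W, C = 3, 224, 224, 64           # kernel, image, channels
conv_ops = 2 * K * K * C * C * H * W
conv_bytes = K * K * C * C * bytes_per_weight
print("convolution: ", conv_ops / conv_bytes, "ops/byte")  # ~50,000
```

At one op per byte, a batch-1 RNN runs at whatever speed DRAM can stream weights, no matter how much compute the chip has.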
Model parameters initialized in DRAM.

[Figure: a 2xCPU server paired with an FPGA; model parameters resident in the FPGA's DRAM]
Batching improves HW utilization but increases latency.

[Charts: hardware utilization (%) vs. batch size, and 99th-percentile latency vs. batch size against the maximum allowed latency]
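A toy analytical model reproduces the shape of both curves, assuming a DRAM-resident model where one pass streams all weights and every request in the batch shares that pass; all constants below are illustrative, not measurements:

```python
# Toy model of the batching trade-off for a DRAM-resident model:
# each pass streams all weights from DRAM once, and the whole batch
# shares that pass. Illustrative numbers, not measurements.
weights_gb  = 0.5     # model size
dram_gb_s   = 20.0    # DRAM bandwidth
peak_tops   = 4.0     # peak compute throughput
ops_per_req = 0.5e9   # operations per inference

for batch in (1, 16, 256, 1024):
    stream_s  = weights_gb / dram_gb_s              # fixed weight-streaming time
    compute_s = batch * ops_per_req / (peak_tops * 1e12)
    latency_s = max(stream_s, compute_s)            # the batch finishes together
    util = compute_s / latency_s
    print(f"batch {batch:4d}: latency {latency_s*1e3:7.2f} ms, "
          f"utilization {util:6.1%}")
```

Utilization climbs with batch size because the fixed weight-streaming time is amortized, but once compute dominates, tail latency grows linearly with the batch.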
Observations

[Build slides: 2xCPU server with FPGA]
Core Features
- Proprietary parameterizable narrow-precision format wrapped in float16 interfaces
- Matrix-vector unit on the FPGA
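The ms-fp formats themselves are proprietary and parameterizable, so the sketch below assumes a generic block floating-point layout of the kind commonly used for narrow-precision inference: one shared power-of-two exponent per block, with a sign and a few mantissa bits per element. It illustrates the idea, not the actual ms-fp definition.

```python
import numpy as np

def quantize_block_fp(x, mantissa_bits=5):
    """Quantize a block to shared-exponent block floating point:
    one power-of-two scale for the whole block, a sign and a narrow
    mantissa per element. A plausible stand-in for an ms-fp-style
    format; the real ms-fp definition is proprietary."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0.0:
        return np.zeros_like(x)
    shared_exp = np.floor(np.log2(max_abs)) + 1     # one exponent per block
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lim = 2 ** (mantissa_bits - 1)
    mantissa = np.clip(np.round(x / scale), -lim, lim - 1)
    return mantissa * scale                         # dequantized view

# float16 at the interfaces, narrow precision inside:
x = np.random.randn(16).astype(np.float16)
xq = quantize_block_fp(x.astype(np.float32))
print("max abs error:", np.max(np.abs(x.astype(np.float32) - xq)))
```

Sharing one exponent across a block is what makes the per-element multipliers cheap enough to replicate by the tens of thousands in FPGA soft logic.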
BrainWave Neural Functional Unit

[Figure: the NFU datapath. An instruction decoder and control processor drive a matrix-vector unit containing multiple matrix-vector-multiply kernels, each with its own matrix register file and vector register file (VRF), converting to msft-fp on the way in and back to float16 on the way out. Two multifunction units follow, each a crossbar joining activation, multiply, and add/sub units with VRFs. Tensor arbiters (TA), a tensor manager, and matrix/vector memory managers move tensor data between DRAM, the network (via input/output message processors), and the datapath. Legend: tensor data, instructions, commands; A = activation, x = multiply, + = add/sub.]
[Figure: matrix-vector multiply datapath. Each matrix row (1 through N) is multiplied elementwise against the float16 input tensor and reduced through an adder tree; the per-row results form the float16 output tensor.]
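Functionally, each output element is a dot product whose partial products are reduced pairwise, so the critical path is log2(N) adder levels rather than a serial chain of N-1 adds. A minimal Python model of that reduction (the behavior, not the RTL):

```python
# Functional sketch of the adder-tree reduction in the figure: partial
# products are summed pairwise in log2(N) levels. Pure-Python model.
def adder_tree_dot(row, vec):
    partial = [a * b for a, b in zip(row, vec)]   # one multiplier per lane
    while len(partial) > 1:                       # one tree level per pass
        if len(partial) % 2:                      # pad an odd-width level
            partial.append(0.0)
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

def matvec(matrix, vec):
    # One adder tree per matrix row; all rows independent (parallel in HW).
    return [adder_tree_dot(row, vec) for row in matrix]

print(matvec([[1, 2, 3], [4, 5, 6]], [1, 0, 1]))  # [4, 10]
```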
FPGA Performance vs. Data Type

Tera-operations/sec by data type:

  Data type    Stratix V D5 @ 225 MHz    Stratix 10 280 @ 500 MHz
  16-bit int    1.4                       12
  8-bit int     2.0                       31
  ms-fp9        2.7                       65
  ms-fp8        4.5                       90

Impact of Narrow Precision on Accuracy

[Chart: accuracy (0.5 to 1.0) of Model 1 (GRU-based), Model 2 (LSTM-based), and Model 3 (LSTM-based) under float32, ms-fp9, and ms-fp9 with retraining]
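A sanity check on the headline number: 90 tera-operations/sec at 500 MHz, counting a multiply and an add as two operations, implies roughly 90,000 concurrent multiply-accumulates per cycle, a level of parallelism that the narrow ms-fp8 datapath makes affordable in FPGA soft logic:

```python
# Back-of-envelope: parallelism implied by 90 TOPS at 500 MHz
# (both numbers from the chart above).
tops, freq_ghz = 90.0, 0.5
ops_per_mac = 2                    # one multiply + one add
macs_per_cycle = tops * 1e12 / (freq_ghz * 1e9) / ops_per_mac
print(f"{macs_per_cycle:,.0f} parallel multiply-accumulates per cycle")
# -> 90,000
```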
Project BrainWave is a powerful platform for an accelerated AI cloud:
- Runs on Microsoft's hyperscale infrastructure with FPGAs
- Achieves excellent performance at low batch sizes via persistency and narrow precision
- Adaptable to changes in precision and in future AI algorithms

BrainWave running on hardware microservices will push the boundary of what is possible to deploy in the cloud:
- Deeper and larger CNNs for more accurate computer vision
- Higher-dimensional RNNs toward human-like natural language processing
- State-of-the-art speech recognition
- And much more

Stay tuned for announcements about external availability.