Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Size: px

Start display at page:

Download "Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks"

Gerald Willis
6 years ago
Views:

2 Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and deploy in large-scale interactive services Recurrent Neural Networks y t-1 y t y t+1 h t-1 h t h t+1 h t-1 h t h t+1 x t-1 x t x t+1 Convolutional Neural Networks 2

EFFICIENCY BrainWave Baidu SDA Deephi Tech ESE Teradeep Etc.

3 DNN Processing Units Contro l Unit (CU) CPUs Registers Arithmeti c Logic Unit (ALU) GPUs Soft DPU (FPGA) Hard DPU ASICs FLEXIBILITY EFFICIENCY BrainWave Baidu SDA Deephi Tech ESE Teradeep Etc. Cerebras Google TPU Graphcore Groq Intel Nervana Movidius Wave Computing Etc. 3

4 A Scalable FPGA-powered DNN Serving Platform L0 F F F L1 FPGAs Network switches L0 F F F Instr Decoder & Control Neural FU Pretrained DNN Model in CNTK, etc. Scalable DNN Hardware Microservice BrainWave Soft DPU 4

5 CPU compute layer Reconfigurable compute layer (FPGA) Converged network

6 Sub-millisecond FPGA compute latencies at batch 1 6

7 7

8 A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs 8

9 A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs Adaptive ISA for narrow precision DNN inference Flexible and extensible to support fast-changing AI algorithms 9

10 A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs Adaptive ISA for narrow precision DNN inference Flexible and extensible to support fast-changing AI algorithms BrainWave Soft DPU microarchitecture Highly optimized for narrow precision and low batch 10

11 A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs Adaptive ISA for narrow precision DNN inference Flexible and extensible to support fast-changing AI algorithms BrainWave Soft DPU microarchitecture Highly optimized for narrow precision and low batch Persist model parameters entirely in FPGA on-chip memories Support large models by scaling across many FPGAs 11

12 A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs Adaptive ISA for narrow precision DNN inference Flexible and extensible to support fast-changing AI algorithms BrainWave Soft DPU microarchitecture Highly optimized for narrow precision and low batch Persist model parameters entirely in FPGA on-chip memories Support large models by scaling across many FPGAs Intel FPGAs deployed at scale with HW microservices [MICRO 16] 12

13 13

14 A Cloud-Scale Acceleration Architecture [MICRO 16] 14

15 Interconnected FPGAs form a separate plane of computation Routers Can be managed and used independently from the CPU Hardware acceleration plane Deep neural networks SQL CPU QPI CPU FPGAs Web search ranking SDN offload Web search ranking FPGA CPUs Traditional software (CPU) server plane QSFP 40Gb/s QSFP QSFP 40Gb/s ToR 15

16 16

17 500x500 Matrix MatMul x500 Matrix Add Sigmoid500 Add dim Vector Split 500x500 Matrix MatMul500 MatMul500 MatMul500 Sigmoid Concat Add x500 Matrix Add500 Split 1000-dim Vector Caffe Model CNTK Model Tensorflow Model Frontends Portable IR Graph Splitter and Optimizer FPGA0 FPGA1 Transformed IRs Target compiler Target compiler Target compiler CPU-CNTK FPGA CPU-Caffe Deployment Package FPGA HW Microservice 17

18 N weight kernels Output pre-activation Input activation = = O(N 3 ) data O(N 4 K 2 ) compute O(N 2 ) data O(N 2 ) compute 18

19 Output pre-activation Input activation = = O(N 3 ) data O(N 4 K 2 ) compute O(N 2 ) data O(N 2 ) compute 19

20 Model Parameters Initialized in DRAM 2xCPU FPGA 20

21 Model Parameters Initialized in DRAM 2xCPU FPGA 21

22 Hardware Utilization (%) Batch Size FPGA 22

23 Hardware Utilization (%) Latency at 99th Maximum Allowed Latency Batch Size Batch Size Batching improves HW utilization but increases latency 23

24 Hardware Utilization (%) Latency at 99th Maximum Allowed Latency Batch Size Batch Size Batching improves HW utilization but increases latency 24

25 2xCPU FPGA 25

26 Observations 2xCPU 26

27 2xCPU 27

28 2xCPU 28

29 29

30 30

31 31

32 Core Features Proprietary parameterizable narrow precision format wrapped in float16 interfaces Matrix Vector Unit FPGA 32

33 Matrix-Vector Unit Convert to msft-fp Neural Functional Unit Matrix RF VRF Kernel Matrix Vector Multiply Matrix RF VRF Kernel Matrix Vector Multiply... Matrix RF VRF Kernel Matrix Vector Multiply Instruction Decoder Control Processor TA + Convert to float16 TA VRF A Multifunction Unit xbar x VRF TA TA Tensor Manager Matrix Memory Manager Vector Memory Manager Input Message Processor DRAM Output Message Processor Network IFC + VRF xbar A Multifunction Unit x VRF + VRF TA Legend Tensor data Instructions Commands A x + Activation Multiply Add/Sub Memory TA Tensor Arbiter 33

34 Features Matrix Row 1 Matrix Row Matrix Row N Float16 Input Tensor Float16 Output Tensor 34

35 FPGA Performance vs. Data Type 5.0 Stratix V 225MHz Tera-Operations/sec bit int 8-bit int ms-fp9 ms-fp8 35

36 36

37 Tera-Operations/sec Tera-Operations/sec FPGA Performance vs. Data Type Stratix Stratix V D5 225MHz 225MHz Stratix MHz bit 16-bit int int 8-bit 8-bit int int ms-fp9 ms-fp8 37

38 Tera-Operations/sec Tera-Operations/sec FPGA Performance vs. Data Type Stratix Stratix V D5 225MHz 225MHz Stratix MHz bit 16-bit int int 8-bit 8-bit int int ms-fp9 ms-fp8 38

39 Tera-Operations/sec Tera-Operations/sec FPGA Performance vs. Data Type Stratix Stratix V D5 225MHz 225MHz Stratix MHz bit 16-bit int int 8-bit 8-bit int int ms-fp9 ms-fp8 39

40 Tera-Operations/sec Tera-Operations/sec FPGA Performance vs. Data Type Stratix Stratix V D5 225MHz 225MHz Stratix MHz bit 16-bit int int 8-bit 8-bit int int ms-fp9 ms-fp8 40

41 Tera-Operations/sec Tera-Operations/sec FPGA Performance vs. Data Type Stratix Stratix V D5 225MHz 225MHz 90 Stratix MHz bit 16-bit int int 8-bit 8-bit int int ms-fp9 ms-fp8 41

42 FPGA Performance vs. Data Type Tera-Operations/sec Tera-Operations/sec Stratix Stratix V D5 225MHz 225MHz 90 Stratix MHz bit 16-bit int int 8-bit 8-bit int int ms-fp9 ms-fp8 Accuracy Impact of Narrow Precison on Accuracy Model 1 (GRU-based) Model 2 (LSTM-based) float32 ms-fp9 ms-fp9 retrain Model 3 (LSTM-based) 42

43 Project BrainWave is a powerful platform for an accelerated AI cloud Runs on Microsoft s hyperscale infrastructure with FPGAs Achieves excellent performance at low batch sizes via persistency and narrow precision Adaptable to precision and changes in future AI algorithms BrainWave running on Hardware Microservices will push the boundary of what is possible to deploy in the cloud Stay tuned for announcements about external availability. Deeper/larger CNNs for more accurate computer vision Higher dimensional RNNs toward human-like natural language processing State-of-the-art speech And much more

Deep Learning Accelerators

Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction