Revolutionizing the Datacenter


1 Power-Efficient Machine Learning using FPGAs on POWER Systems. Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx. Revolutionizing the Datacenter. Join the Conversation: #OpenPOWERSummit

2 Image Classification: Top-5 Accuracy, ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*). [Chart annotations: ** Super Human; Humans: ~95%***.] * http://image-net. ** pg 10 *** Russakovsky, et al 2014,

3 Image Classification: Top-5 Accuracy, ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*). [Chart annotations: ** Super Human; Humans: ~95%***.] CNNs far outperform non-AI methods. CNNs deliver super-human accuracy. * ** pg 10 *** Russakovsky, et al 2014,

4 CNNs Explained

5 The Computation

6 The Computation

7 Convolution. Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights. [Figure: Input, Kernel Weights, Output]

8 Convolution. Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights. [Figure: Input, Kernel Weights, Output]

9 Convolution. Continue along the row... [Figure: Input, Kernel Weights, Output]

10 Convolution. Before moving down to the next row. [Figure: Input, Kernel Weights, Output]

11 Convolution. The first output feature map is complete. [Figure: Input, Kernel Weights, Output]

12 Convolution. Move on to the next output feature map by switching weights, and repeat. [Figure: Input, Kernel Weights, Output]

13 Convolution. The pattern repeats as before: same input volumes, different weights. [Figure: Input, Kernel Weights, Output]

14 Convolution. Complete the second output feature map plane. [Figure: Input, Kernel Weights, Output]

15 Convolution. Finally, after 256 weight sets have been used, the full set of 256 output feature maps is complete. [Figure: Input, Kernel Weights, Output]
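
The walk-through in slides 7 to 15 amounts to three nested loops: over weight sets, over rows, and over pixels along a row. A minimal NumPy sketch follows, assuming stride 1 and no padding (the slides state neither). The name `conv_layer`, the 6x6 input and the use of 4 weight sets instead of 256 are illustrative choices that keep the demo fast.

```python
import numpy as np

def conv_layer(inp, weights):
    """Direct convolution. inp: (H, W, C); weights: (K, kh, kw, C).

    Assumes stride 1 and no padding, so the output is (H-kh+1, W-kw+1, K).
    """
    H, W, C = inp.shape
    K, kh, kw, _ = weights.shape
    out = np.zeros((H - kh + 1, W - kw + 1, K))
    for k in range(K):                      # switch weights: next output feature map
        for r in range(out.shape[0]):       # move down to the next row
            for c in range(out.shape[1]):   # continue along the row
                sub_volume = inp[r:r + kh, c:c + kw, :]   # overlapping 3x3xC input
                out[r, c, k] = np.sum(sub_volume * weights[k])
    return out

rng = np.random.default_rng(0)
inp = rng.standard_normal((6, 6, 384))          # small spatial size for the demo
weights = rng.standard_normal((4, 3, 3, 384))   # 4 weight sets here, 256 on the slides
out = conv_layer(inp, weights)

macs_per_pixel = weights[0].size                # 3 * 3 * 384 multiply-accumulates
print(out.shape, macs_per_pixel)
```

Every output pixel reuses the same weight set, and neighbouring pixels share most of their input sub-volume, which is the reuse the later hardware slides exploit.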

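16 Fully Connected Layers

17 Fully Connected Layers. [Figure: three columns of activations labelled fc6 ($a_{0,0}, a_{0,1}, \dots, a_{0,4095}$), fc7 ($a_{1,0}, a_{1,1}, \dots, a_{1,4095}$) and fc8 ($a_{2,0}, a_{2,1}, \dots, a_{2,999}$), joined by weights labelled $w_{0,0,0}$, $w_{0,0,1}$, $w_{0,0,4095}$, $w_{1,4095,0}$, $w_{1,4095,1}$, $w_{1,4095,999}$.] Each activation is a function $f$ of a weighted sum over all activations in the previous column: $a_{1,0} = f\left(\sum_{i=0}^{4095} a_{0,i} \cdot w_{0,0,i}\right)$ and $a_{2,0} = f\left(\sum_{i=0}^{4095} a_{1,i} \cdot w_{1,0,i}\right)$.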
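
The weighted sums on slide 17 are matrix-vector products followed by $f$. A minimal sketch, assuming $f$ is ReLU (the slide does not name the function) and using scaled-down layer widths of 64, 64 and 10 in place of 4096, 4096 and 1000 so the demo stays small; `fully_connected` and the weight layout `w[j, i]` (output j, input i) are illustrative choices.

```python
import numpy as np

def relu(x):
    # Assumed activation function f; the slide only writes f(...)
    return np.maximum(x, 0.0)

def fully_connected(a_prev, w, f=relu):
    """Compute a_next[j] = f(sum_i a_prev[i] * w[j, i]) for every output j."""
    return f(w @ a_prev)

rng = np.random.default_rng(0)
n0, n1, n2 = 64, 64, 10                    # stand-ins for 4096, 4096, 1000
a0 = rng.standard_normal(n0)
w0 = rng.standard_normal((n1, n0)) * 0.1   # weights between the first two columns
w1 = rng.standard_normal((n2, n1)) * 0.1   # weights between the last two columns

a1 = fully_connected(a0, w0)
a2 = fully_connected(a1, w1)
print(a1.shape, a2.shape)
```

Each weight is used exactly once per input vector, which is why these layers are limited by memory bandwidth, not by arithmetic.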

18 CNN Properties. Compute: dominated by convolution (CONV) layers. Memory BW: dominated by fully-connected (FC) layers. [Charts: compute in GOPs per layer and memory access in G reads per layer, for CONV1 to CONV5 and FC6 to FC8, across CaffeNet, ZF, VGG11, VGG16 and VGG19.] Source: Yu Wang, Tsinghua University, Feb 2016
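
A rough count shows why compute and memory traffic concentrate in different layer types. The sketch below compares the 3x3x384, 256-output convolution from the earlier slides with a 4096x4096 fully-connected layer. The 13x13 output size is an assumption not stated in the slides, and the helpers `conv_cost` and `fc_cost` are hypothetical names.

```python
def conv_cost(out_h, out_w, k, c_in, c_out):
    """Return (weights, multiply-accumulates) for one convolution layer."""
    weights = k * k * c_in * c_out
    macs = weights * out_h * out_w      # every weight is reused at each output pixel
    return weights, macs

def fc_cost(n_in, n_out):
    """Return (weights, multiply-accumulates) for one fully-connected layer."""
    weights = n_in * n_out
    return weights, weights             # every weight is used exactly once

conv_w, conv_macs = conv_cost(13, 13, 3, 384, 256)   # 13x13 output is assumed
fc_w, fc_macs = fc_cost(4096, 4096)

print("CONV: weights", conv_w, "MACs", conv_macs, "MACs per weight", conv_macs // conv_w)
print("FC:   weights", fc_w, "MACs", fc_macs, "MACs per weight", fc_macs // fc_w)
```

Under these assumptions the convolution layer performs 169 operations per weight fetched, while the fully-connected layer performs one, so the first stresses the multipliers and the second stresses memory.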

19 Humans vs Machines*. Humans are six orders of magnitude more efficient. * IBM Watson, ca 2012. Source: Yu Wang, Tsinghua University, Feb 2016

20 Cost of Computation. Source: William Dally, High Performance Hardware for Machine Learning, Cadence ENN Summit, 2/9/2016.

21 Cost of Computation. Stay in on-chip memory (1/100x power). Use smaller multipliers (8 bits vs 32 bits: 1/16x power). Fixed-point vs float (don't waste bits on dynamic range). Source: William Dally, High Performance Hardware for Machine Learning, Cadence ENN Summit, 2/9/

22 Improving Machine Efficiency. Model pruning. Right-sizing precision. Custom CNN processor architecture.

23 Pruning Elements. Flow: Train Connectivity -> Prune Connections -> Train Weights. Remove low-contribution weights (synapses), then retrain the remaining weights to recover accuracy. Source: Han, et al, Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
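
The prune-then-retrain flow on slide 23 can be sketched with a binary mask. This is a minimal illustration, assuming "low contribution" means small absolute value and using a random stand-in gradient; `prune_by_magnitude` and `retrain_step` are hypothetical helper names, not functions from the cited work.

```python
import numpy as np

def prune_by_magnitude(weights, fraction):
    """Return a 0/1 mask that drops the `fraction` of weights with smallest |w|."""
    k = int(fraction * weights.size)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

def retrain_step(weights, grad, mask, lr=0.01):
    """One gradient step that keeps pruned connections at zero."""
    return (weights - lr * grad) * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
mask = prune_by_magnitude(w, 0.75)       # remove 75% of the connections
w_pruned = w * mask

grad = rng.standard_normal(w.shape)      # stand-in for a real training gradient
w_retrained = retrain_step(w_pruned, grad, mask)
print(int(mask.sum()), "of", mask.size, "weights kept")
```

Applying the mask after every update is what lets retraining adjust the surviving weights without reviving removed connections.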

24 Pruning Results: AlexNet. 9x reduction in number of weights. Most of the reduction is in the FC layers. Source: Han, et al, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,

25 Pruning Results: AlexNet. < 0.1% accuracy loss. Source: Han, et al, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,

26 Inference with Integer Quantization

27 Right-Sizing Precision. Network: VGG16.
Data Bits: Single-float [...]
Weight Bits: Single-float [...] or 4
Data Precision: N/A [...] /2-1, Dynamic
Weight Precision: N/A [...] Dynamic
Top-1 Accuracy: 68.1%, 68.0%, 53.0%, 28.2%, 67.0%
Top-5 Accuracy: 88.0%, 87.9%, 76.6%, 49.7%, 87.6%
Dynamic: variable-format fixed-point (per layer). < 1% accuracy loss.
Source: Yu Wang, Tsinghua University, Feb 2016
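
"Dynamic" on slide 27 means each layer gets its own fixed-point format. One way to picture this is to choose, per layer, the number of fractional bits that minimises quantization error. The sketch below does that for two synthetic tensors with very different ranges; the selection rule (squared error over a candidate list) and all names are assumptions for illustration, not the method behind the slide's numbers.

```python
import numpy as np

def quantize(x, total_bits, frac_bits):
    """Round x to signed fixed-point with `frac_bits` fractional bits, saturating."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1))
    hi = 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

def squared_error(x, total_bits, frac_bits):
    return float(np.sum((x - quantize(x, total_bits, frac_bits)) ** 2))

def best_frac_bits(x, total_bits, candidates=range(0, 16)):
    """Pick the fractional bit count with the lowest squared error for this tensor."""
    return min(candidates, key=lambda n: squared_error(x, total_bits, n))

rng = np.random.default_rng(0)
layers = {
    "small_range": rng.normal(0.0, 0.05, 1000),   # e.g. weights near zero
    "large_range": rng.normal(0.0, 20.0, 1000),   # e.g. wide-ranging activations
}

chosen = {name: best_frac_bits(x, 8) for name, x in layers.items()}
dynamic_err = sum(squared_error(x, 8, chosen[name]) for name, x in layers.items())
static_err = sum(squared_error(x, 8, 4) for x in layers.values())  # one shared format
print(chosen, dynamic_err, static_err)
```

A single shared format either clips the wide tensor or wastes resolution on the narrow one, which is consistent with the large accuracy drops in the middle columns of the table.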

28 Right-Sizing Precision. Fixed-point is sufficient for deployment (INT16, INT8). No significant loss in accuracy (< 1%). >10x energy efficiency in OPs/J (INT8 vs FP32). 4x memory energy efficiency in Tx/J (INT8 vs FP32).

29 Improving Machine Efficiency. Flow: CNN Model -> Model pruning -> Pruned Floating-Point Model -> Data/weight quantization -> Pruned Fixed-Point Model -> Compilation -> Instructions -> Run on FPGA-Based Neural Network Processor. Modified from: Yu Wang, Tsinghua University, Feb 2016

30 Xilinx Kintex UltraScale KU115 (20nm). 5520 DSP cores, up to 500 MHz. 5.5 TOPs int16 (peak). 4 GB DDR & 38 GB/s. 55 W TDP & 100 GOPs/W. Single slot, low profile FF. AlphaData ADM-PCIE-8K5. OpenPOWER CAPI.
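
The peak figures on slide 30 follow from the DSP count and clock rate. The short calculation below reproduces them, assuming each DSP core completes one multiply and one add per cycle and that these count as two operations (the slide does not spell out how OPs are counted).

```python
dsp_cores = 5520
clock_hz = 500e6
ops_per_dsp_per_cycle = 2     # assumption: multiply + add counted as two operations
tdp_watts = 55

peak_ops = dsp_cores * clock_hz * ops_per_dsp_per_cycle
gops_per_watt = peak_ops / 1e9 / tdp_watts

print(f"peak: {peak_ops / 1e12:.2f} TOPs")
print(f"efficiency: {gops_per_watt:.0f} GOPs/W")
```

This is a peak number: sustaining it requires keeping every DSP fed on every cycle, which is the goal of the dataflow architecture on the following slides.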

31 FPGA Architecture. [Figure: repeating columns of RAM, CLB, DSP, CLB, RAM.] 2D array architecture (scales with Moore's Law). Memory-proximate computing (minimize data moves). Broadcast-capable interconnect (data sharing/reuse).

32 FPGA Arithmetic & Memory Resources. [Figure: custom-width memory (INT4, INT8, INT16, INT32, FP16, FP32) supplies $D_j$ and $W_{ij}$ to a 16-bit multiplier and 48-bit accumulator producing $O_i$; custom quantization formats Q8.8, Q2.14, Qm.n.] Native 16-bit multiplier (or reduced-power 8-bit). On-chip RAMs store INT4, INT8, INT16, [...]. Custom quantization formatting (Qm.n).
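
The Qm.n formats and the multiplier/accumulator pairing on slide 32 can be illustrated with plain integers. The sketch assumes Qm.n means a signed value with m integer bits (including the sign) and n fractional bits, so Q8.8 and Q2.14 are both 16 bits wide; conventions differ on whether m counts the sign bit. The helpers `to_q`, `from_q` and `mac` are illustrative names.

```python
def to_q(x, m, n):
    """Encode x as a signed Qm.n integer (m + n bits in total), saturating."""
    raw = round(x * (1 << n))
    lo = -(1 << (m + n - 1))
    hi = (1 << (m + n - 1)) - 1
    return max(lo, min(hi, raw))

def from_q(raw, n):
    """Decode an integer with n fractional bits back to a float."""
    return raw / (1 << n)

def mac(data_q, weight_q, acc_bits=48):
    """Multiply-accumulate 16-bit operands into a wide accumulator."""
    acc = 0
    for d, w in zip(data_q, weight_q):
        acc += d * w                                   # 16 x 16 -> 32-bit product
        assert -(1 << (acc_bits - 1)) <= acc < (1 << (acc_bits - 1)), "overflow"
    return acc

data = [1.5, -2.25, 3.0]
weights = [0.5, 0.25, -1.0]
data_q = [to_q(d, 8, 8) for d in data]         # data in Q8.8
weight_q = [to_q(w, 2, 14) for w in weights]   # weights in Q2.14

acc = mac(data_q, weight_q)
result = from_q(acc, 8 + 14)                   # product has 8 + 14 fractional bits
print(result)
```

The accumulator is much wider than the 32-bit product so that many products can be summed without overflow, and the fractional-bit counts of the operands add in the result.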

33 Convolver Unit. [Figure: input data passes through n-delay and m-delay lines and multiplexers (MUX) into a data buffer that presents 9 data inputs; input weights enter a weight buffer that presents 9 weight inputs; nine multipliers feed an adder tree that produces the output data.] Source: Yu Wang, Tsinghua University, Feb 2016

34 Convolver Unit. [Same figure, annotated: Serial to Parallel (data buffer); Data Reuse: 8/; Serial to Parallel, Ping/Pong (weight buffer); Memory Proximate Compute; 2D Parallel Memory; 2D Operator Array; INT16 multipliers.] Source: Yu Wang, Tsinghua University, Feb 2016
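
The convolver unit turns a serial pixel stream into 3x3 windows using delay lines, then applies nine multipliers and an adder tree. A behavioural sketch in Python follows, assuming a row-major stream of one image plane, stride 1 and a delay line of two image rows plus three taps. It models the dataflow only, not the actual RTL, and all names are illustrative.

```python
from collections import deque

def adder_tree(values):
    """Sum values pairwise, level by level, as a balanced adder tree would."""
    values = list(values)
    while len(values) > 1:
        nxt = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            nxt.append(values[-1])     # odd element passes through to the next level
        values = nxt
    return values[0]

def convolver(pixels, width, weights):
    """Stream pixels in row-major order; emit one output per valid 3x3 window."""
    delay = deque(maxlen=2 * width + 3)        # two line delays plus three taps
    outputs = []
    for idx, p in enumerate(pixels):
        delay.append(p)
        row, col = divmod(idx, width)
        if row >= 2 and col >= 2:              # a full 3x3 window is available
            buf = list(delay)
            window = [buf[r * width + c] for r in range(3) for c in range(3)]
            products = [d * w for d, w in zip(window, weights)]   # 9 multipliers
            outputs.append(adder_tree(products))
    return outputs

image = list(range(16))                        # 4x4 plane, pixels 0..15
result = convolver(image, width=4, weights=[1] * 9)
print(result)
```

Each pixel is read from memory once and then reused from the delay line for every window that covers it, which is the point of the serial-to-parallel buffering in the figure.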

35 Processing Engine (PE). [Figure: an input buffer supplies data and weights to a convolver complex of convolvers (C); an adder tree combines their outputs with intermediate data and bias, with bias shift and data shift stages; the result passes through NL and Pool stages to the output buffer; a controller sequences the PE.] Source: Yu Wang, Tsinghua University, Feb 2016

36 Processing Engine (PE). [Same figure, annotated: Memory Sharing (data); Broadcast Weights; Custom Quantization.] Source: Yu Wang, Tsinghua University, Feb 2016
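
The PE datapath can be read as: several convolvers in parallel, an adder tree across them, bias addition, a nonlinearity, then pooling. A small functional sketch follows, assuming one convolver per input plane, ReLU for the NL stage and 2x2 max pooling (the slide names the stages but not their exact functions); the bias and data shift stages are omitted, and all names are illustrative.

```python
import numpy as np

def conv2d_valid(plane, kernel):
    """One convolver: 2D convolution of a single plane, stride 1, no padding."""
    kh, kw = kernel.shape
    out = np.zeros((plane.shape[0] - kh + 1, plane.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(plane[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool_2x2(x):
    """Assumed Pool stage: non-overlapping 2x2 max pooling."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def processing_engine(planes, kernels, bias, intermediate=None):
    # Convolver complex plus adder tree: sum the per-plane results
    partial = sum(conv2d_valid(p, k) for p, k in zip(planes, kernels))
    if intermediate is not None:
        partial = partial + intermediate       # accumulate earlier partial sums
    activated = np.maximum(partial + bias, 0.0)   # bias, then assumed ReLU
    return max_pool_2x2(activated)

planes = np.ones((2, 6, 6))      # two input planes
kernels = np.ones((2, 3, 3))     # one 3x3 kernel per plane
pe_out = processing_engine(planes, kernels, bias=-10.0)
print(pe_out)
```

The `intermediate` argument mirrors the intermediate-data path in the figure: when a layer has more input planes than convolvers, partial sums are fed back and accumulated over several passes.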

37 Top Level. [Figure: Processing System with Power CPU and external memory; Programmable Logic with DMA w/ compression, controller, FIFO, input buffer, output buffer and a computing complex of PEs; linked by a config bus and a data & instruction bus.] Source: Yu Wang, Tsinghua University, Feb 2016

38 Top Level. [Same figure, annotated.] SW-scheduled dataflow. Decompress weights on the fly. Ping-pong buffers: transfers overlap with compute. Multiple PEs: block-level parallelism. Source: Yu Wang, Tsinghua University, Feb 2016
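
The benefit of ping-pong buffers can be seen with a simple timing model: while the PEs compute on one buffer, the DMA fills the other. The sketch below compares that with a strictly serial schedule. The tile times are made-up illustrative values, and the model ignores output transfers and bus contention.

```python
def serial_time(load, compute):
    """Each tile is loaded, then computed, with no overlap."""
    return sum(load) + sum(compute)

def ping_pong_time(load, compute):
    """Loading tile i+1 overlaps with computing tile i (two buffers)."""
    total = load[0]                      # the first tile must be loaded up front
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        total += max(compute[i], next_load)
    return total

load = [2, 2, 2, 2]       # illustrative transfer time per tile
compute = [3, 3, 3, 3]    # illustrative compute time per tile

t_serial = serial_time(load, compute)
t_overlap = ping_pong_time(load, compute)
print(t_serial, t_overlap)
```

When compute per tile is at least as long as the transfer, the transfers are hidden after the first load, so the pre-scheduled dataflow keeps the PEs busy.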

39 FPGA Neural Net Processor. Tiled architecture (parallelism & scaling). Semi-static dataflow (pre-scheduled data transfers). Memory reuse (data sharing across convolvers).

40 OpenPOWER CAPI. [Figure labels: POWER8, CAP UNIT, CAP, PSL.] Shared virtual memory. System-wide memory coherency. Low-latency control messages. Peer programming model and interaction efficiency.

41 OpenPOWER CAPI. [Figure labels: POWER8, CAP UNIT, CAP, PSL.] On Power: Caffe, TensorFlow, etc. load the CNN model and call the AuvizDNN library. On the Xilinx FPGA: AuvizDNN kernel. Scalable & fully parameterized, plug-and-play library.

42 OpenPOWER CAPI. [Figure labels: POWER8, CAP UNIT, CAP, PSL.] 14 images/s/W (AlexNet). Batch size 1. Low-profile TDP.

43 Takeaways. FPGA: ideal dataflow CNN processor. POWER/CAPI: elevates accelerators as peers to CPUs. FPGA CNN libraries.

44 Thank You! 4/11/
