Bandwidth-Efficient Deep Learning


1 Bandwidth-Efficient Deep Learning: from Compression to Acceleration. Song Han, Assistant Professor, EECS, Massachusetts Institute of Technology

2 AI is Changing Our Lives: Self-Driving Cars, Machine Translation, AlphaGo, Smart Robots

3 Models are Getting Larger. IMAGE RECOGNITION (16X model): 2012 AlexNet, 8 layers, 1.4 GFLOP, ~16% error => 2015 ResNet, 152 layers, 22.6 GFLOP, ~3.5% error. SPEECH RECOGNITION (10X training ops): 2014 Deep Speech, 80 GFLOP of training ops, 7,000 hrs of data, ~8% error => Deep Speech 2, 465 GFLOP, 12,000 hrs of data, ~5% error. Dally, NIPS 2016 workshop on Efficient Methods for Deep Neural Networks

4 The First Challenge: Model Size. Hard to distribute large models through over-the-air updates.

5 The Second Challenge: Speed
Network | Error rate | Training time
ResNet18 | 10.76% | 2.5 days
ResNet50 | 7.02% | 5 days
ResNet101 | 6.21% | 1 week
ResNet152 | 6.16% | 1.5 weeks
Such long training times limit ML researchers' productivity. Training time benchmarked with fb.resnet.torch using four M40 GPUs. 5

6 The Third Challenge: Energy Efficiency. AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game. On mobile: drains the battery; in the data center: increases TCO.

7 Where is the Energy Consumed? larger model => more memory references => more energy

8 Where is the Energy Consumed? larger model => more memory references => more energy

Operation (45 nm) | Energy [pJ]
32-bit int ADD | 0.1
32-bit float ADD | 0.9
32-bit Register File | 1
32-bit int MULT | 3.1
32-bit float MULT | 3.7
32-bit SRAM Cache | 5
32-bit DRAM Memory | 640

The relative energy cost spans more than three orders of magnitude: one DRAM access costs over 100x an SRAM access and thousands of times a simple arithmetic op.
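Using those numbers, a rough back-of-envelope sketch in Python shows why memory references dominate. The assumptions here are illustrative: an AlexNet-scale model with 60M parameters and exactly one 32-bit weight fetch per parameter per inference, with no reuse.

```python
# Approximate 45 nm energy costs from the table above (pJ per 32-bit access).
DRAM_PJ, SRAM_PJ = 640, 5

params = 60e6  # AlexNet-scale model, one 32-bit weight fetch per parameter
dram_energy_mj = params * DRAM_PJ * 1e-12 * 1e3   # pJ -> mJ
sram_energy_mj = params * SRAM_PJ * 1e-12 * 1e3

print(f"weights from DRAM: {dram_energy_mj:.1f} mJ per inference")   # ~38.4 mJ
print(f"weights from SRAM: {sram_energy_mj:.2f} mJ per inference")   # ~0.30 mJ
```

A model small enough to sit in on-chip SRAM is two orders of magnitude cheaper to feed, which is the case for compression.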

9 Where is the Energy Consumed? larger model => more memory references => more energy (energy table as on the previous slide). So: how do we make deep learning more efficient?

10 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

11 Application as a Black Box: the algorithm is treated as a fixed spec (SPEC 2006) and the hardware is a general-purpose CPU.

12 Open the Box before Hardware Design: algorithm? hardware ?PU. Breaks the boundary between algorithm and hardware.

13 Proposed Paradigm

14 How to compress a deep learning model? Compression Acceleration Regularization 14

15 Learning both Weights and Connections for Efficient Neural Networks Han et al. NIPS 2015 Compression Acceleration Regularization 15

16 Pruning Neural Networks Pruning Trained Quantization Huffman Coding 16

17 Pruning Happens in the Human Brain: 50 trillion synapses at birth => 1000 trillion synapses at 1 year old => 500 trillion synapses in adolescence. Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013. Pruning Trained Quantization Huffman Coding 17

18 Pruning AlexNet [Han et al. NIPS 15]: CONV layers pruned 3x, FC layers pruned 10x. Pruning Trained Quantization Huffman Coding 18

19 [Han et al. NIPS 15] Pruning Neural Networks: 60 million => 6M, 10x fewer connections. Pruning Trained Quantization Huffman Coding 19

20 Pruning Neural Networks [Han et al. NIPS 15] [Chart: accuracy loss (0.5% down to -4.5%) vs. parameters pruned away (40% to 100%)] Pruning Trained Quantization Huffman Coding 20

21 Pruning Neural Networks [Han et al. NIPS 15] [Chart adds the pruning-only curve: accuracy degrades as more parameters are pruned away] Pruning Trained Quantization Huffman Coding 21

22 Retrain to Recover Accuracy [Han et al. NIPS 15] [Chart adds the pruning+retraining curve: retraining recovers most of the lost accuracy] Pruning Trained Quantization Huffman Coding 22

23 [Han et al. NIPS 15] Iteratively Retrain to Recover Accuracy [Chart adds the iterative pruning and retraining curve: roughly 90% of parameters can be pruned away without accuracy loss] Pruning Trained Quantization Huffman Coding 23
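A minimal Python sketch of this iterative magnitude-pruning loop. The `train_step` update is a hypothetical stand-in for the user's own training code, not something defined in the talk:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights;
    return the pruned weights and the binary mask of survivors."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

w = np.random.randn(1024, 1024).astype(np.float32)
for sparsity in (0.5, 0.7, 0.9):        # prune gradually, not in one shot
    w, mask = magnitude_prune(w, sparsity)
    # Retrain the surviving weights, re-applying the mask after each
    # update so pruned connections stay at zero (train_step is a
    # hypothetical stand-in for the user's own SGD update):
    # for _ in range(retrain_steps):
    #     w = train_step(w) * mask
print(f"final sparsity: {(w == 0).mean():.0%}")
```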

24 Pruning RNN and LSTM [Han et al. NIPS 15] *Karpathy et al. "Deep Visual-Semantic Alignments for Generating Image Descriptions" Pruning Trained Quantization Huffman Coding 24

25 Pruning RNN and LSTM [Han et al. NIPS 15]
Pruned 90% - Original: "a basketball player in a white uniform is playing with a ball"; Pruned: "a basketball player in a white uniform is playing with a basketball"
Pruned 90% - Original: "a brown dog is running through a grassy field"; Pruned: "a brown dog is running through a grassy area"
Pruned 90% - Original: "a man is riding a surfboard on a wave"; Pruned: "a man in a wetsuit is riding a wave on a beach"
Pruned 95% - Original: "a soccer player in red is running in the field"; Pruned: "a man in a red shirt and black and white black shirt is running through a field"
Pruning Trained Quantization Huffman Coding 25

26 Exploring the Granularity of Sparsity that is Hardware-Friendly. 4 types of pruning granularity: irregular sparsity => regular sparsity => more regular sparsity => fully-dense model. [Han et al, NIPS 15] [Molchanov et al, ICLR 17]
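A small NumPy sketch of what those granularities mean for a convolutional weight tensor. The 50% sparsity target and the tensor shape are illustrative, not values from the talk:

```python
import numpy as np

w = np.random.randn(64, 32, 3, 3)       # conv weights: (out, in, kh, kw)
sparsity = 0.5

def keep_mask(score, sparsity):
    """Keep entries whose score exceeds the sparsity-quantile."""
    return score > np.quantile(score, sparsity)

# irregular sparsity: prune individual weights (fine-grained)
m_fine = keep_mask(np.abs(w), sparsity)

# regular sparsity: prune whole 2-D kernels, scored by their L1 norm
m_kernel = np.broadcast_to(
    keep_mask(np.abs(w).sum(axis=(2, 3)), sparsity)[:, :, None, None], w.shape)

# more regular sparsity: prune whole output filters (channel pruning)
m_filter = np.broadcast_to(
    keep_mask(np.abs(w).sum(axis=(1, 2, 3)), sparsity)[:, None, None, None], w.shape)

for name, m in [("fine", m_fine), ("kernel", m_kernel), ("filter", m_filter)]:
    print(name, f"{1 - m.mean():.0%} of weights pruned")
```

The coarser the mask, the easier it is for hardware to exploit, which is the trade-off the slide names.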

27 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Han et al. ICLR 2016 Best Paper Pruning Trained Quantization Huffman Coding 28

28 [Han et al. ICLR 16] Trained Quantization: nearby weights such as 2.09, 2.12, 1.92, ... are replaced by one shared centroid (here ~2.00). Pruning Trained Quantization Huffman Coding 29

29 [Han et al. ICLR 16] Trained Quantization: each weight (32-bit float) is replaced by a cluster index (2-bit uint) into a small table of centroids. During fine-tuning, the weight gradients are grouped by cluster index and reduced (summed) to update the centroids. Pruning Trained Quantization Huffman Coding 30
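A minimal NumPy sketch of this weight sharing via k-means, using the linear centroid initialization the paper favors; the layer size and iteration count are illustrative:

```python
import numpy as np

def kmeans_quantize(w, bits=2, iters=20):
    """Cluster the weights into 2**bits shared values: store one small
    codebook of centroids plus a per-weight cluster index."""
    k = 2 ** bits
    flat = w.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)   # linear init
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    return centroids, idx.reshape(w.shape)

w = np.random.randn(256, 256).astype(np.float32)
codebook, idx = kmeans_quantize(w)
w_quant = codebook[idx]                  # reconstructed discrete weights
# Fine-tuning updates the *centroids*: the gradient for centroid c is the
# sum (group-by + reduce) of dL/dw over all weights assigned to cluster c.
```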

30 After Trained Quantization: Discrete Weights [Han et al. ICLR 16] [Histogram: count vs. weight value; all weights collapse onto a few discrete centroid values] Pruning Trained Quantization Huffman Coding 31

31 After Trained Quantization: Discrete Weights after Training [Han et al. ICLR 16] [Histogram: count vs. weight value; the centroids shift during fine-tuning but the weights stay discrete] Pruning Trained Quantization Huffman Coding 32

32 [Han et al. ICLR 16] How Many Bits do We Need? Pruning Trained Quantization Huffman Coding 33

33 [Han et al. ICLR 16] How Many Bits do We Need? [Chart: top-1 accuracy (73%-77%) vs. number of bits per weight (2, 4, 6, 8) for ResNet-50 after fine-tuning; non-uniform (trained) quantization holds the full-precision top-1 accuracy of 76.15% down to about 4 bits, while uniform quantization needs more bits] Pruning Trained Quantization Huffman Coding 34
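The last Deep Compression stage, Huffman coding, exploits the skewed distribution of the quantized cluster indices: frequent indices get short codes. A self-contained Python sketch; the toy index distribution below is made up for illustration:

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    freq = Counter(symbols)
    # Heap entries: (count, tie-breaker, {symbol: depth-so-far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Skewed 2-bit cluster indices compress below the flat 2 bits/weight.
idx = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
lengths = huffman_code_lengths(idx)
avg = sum(lengths[s] * n for s, n in Counter(idx).items()) / len(idx)
print(f"average code length: {avg:.2f} bits/weight")   # 1.45 here
```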

35 More Aggressive Compression: Ternary Quantization. Pipeline: full-precision weight => normalized full-precision weight => intermediate ternary weight (quantize with threshold +/-t) => final ternary weight (scale by trained coefficients Wp, Wn). In the feed-forward pass the ternary weights are used; in back-propagation, gradients flow both to the latent full-precision weights and to the scaling coefficients, so only the ternary weights and two scaling factors per layer are needed at inference time. [Chart: the learned ternary weight values differ per layer, e.g. res1.0/conv1 through res3.2/conv2 and the final linear layer.]
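A minimal NumPy sketch of the ternarization step. Here the threshold t and the scales wp, wn are fixed constants for illustration; in trained ternary quantization they are learned by back-propagation:

```python
import numpy as np

def ternarize(w, t=0.05, wp=1.0, wn=1.0):
    """Normalize, threshold at +/-t, then scale: positives become +wp,
    negatives become -wn, the rest become 0."""
    w_norm = w / np.max(np.abs(w))          # normalize to [-1, 1]
    q = np.zeros_like(w)
    q[w_norm > t] = wp
    q[w_norm < -t] = -wn
    return q

w = np.random.randn(128, 128)
print(np.unique(ternarize(w)))              # three values: [-wn, 0, wp]
```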

35 [Han et al. ICLR 16] Results: Compression Ratio

Network | Original Size | Compressed Size | Ratio | Original Accuracy | Compressed Accuracy
LeNet-300-100 | 1070KB | 27KB | 40x | 98.36% | 98.42%
LeNet-5 | 1720KB | 44KB | 39x | 99.20% | 99.26%
AlexNet | 240MB | 6.9MB | 35x | 80.27% | 80.30%
VGGNet | 550MB | 11.3MB | 49x | 88.68% | 89.09%
Inception-V3 | 91MB | 4.2MB | 22x | 93.56% | 93.67%
ResNet-50 | 97MB | 5.8MB | 17x | 92.87% | 93.04%

Can we make compact models to begin with? Compression Acceleration Regularization 36

36 SqueezeNet. Fire module: input => 1x1 "squeeze" conv => parallel 1x1 and 3x3 "expand" convs => concat => output (vs. a vanilla conv block). Iandola et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016. Compression Acceleration Regularization 37
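Why the Fire module saves parameters, as a quick worked example in Python. The 128-channel and 16-squeeze-channel sizes are illustrative, not the exact SqueezeNet configuration:

```python
# Parameter count of a Fire module vs. a plain 3x3 conv, both mapping
# 128 input channels to 128 output channels (bias terms ignored).
c_in, c_out, s = 128, 128, 16           # s = squeeze channels (assumed)

plain_3x3 = c_in * c_out * 3 * 3                      # 147,456
squeeze   = c_in * s * 1 * 1                          # 1x1 squeeze conv
expand    = s * (c_out // 2) * 1 * 1 \
          + s * (c_out // 2) * 3 * 3                  # 1x1 + 3x3 expand convs
fire      = squeeze + expand                          # 12,288

print(plain_3x3, fire, round(plain_3x3 / fire, 1))    # ~12x fewer parameters
```

The squeeze layer shrinks the channel count feeding the 3x3 filters, and half the expand filters are cheap 1x1 convs, which is where the savings come from.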

37 Compressing SqueezeNet

Network | Approach | Size | Ratio | Top-1 Accuracy | Top-5 Accuracy
AlexNet | - | 240MB | 1x | 57.2% | 80.3%
AlexNet | SVD | 48MB | 5x | 56.0% | 79.4%
AlexNet | Deep Compression | 6.9MB | 35x | 57.2% | 80.3%
SqueezeNet | - | 4.8MB | 50x | 57.5% | 80.3%
SqueezeNet | Deep Compression | 0.47MB | 510x | 57.5% | 80.3%

Compression Acceleration Regularization 38

39 Results: Speedup. [Chart: frames per second at batch sizes 1, 8, and 32; pruning (FLOPs 17.5G => 8.9G, parameters 6.0M => 2.5M) gives a 1.6x speedup, e.g. 20fps => 33fps and 5fps => 8fps]

39 Deep Compression Applied to Industry. Compression Acceleration Regularization 40

41 A Facebook AR prototype backed by Deep Compression: 8x model size reduction by Deep Compression, runs on mobile. Source: Facebook F8 '17

41 EIE: Efficient Inference Engine on Compressed Deep Neural Network. Han et al. ISCA 2016

43 Deep Learning Accelerators. First wave: compute (NeuFlow). Second wave: memory (DianNao family). Third wave: algorithm/hardware co-design (EIE). Google TPU: "This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs."

43 EIE: the First DNN Accelerator for Sparse, Compressed Models [Han et al. ISCA 16]. Sparse weight (0 * A = 0): 90% static sparsity => 10x less computation, 5x less memory footprint. Sparse activation (W * 0 = 0): 70% dynamic sparsity => 3x less computation. Weight sharing (2.09, 1.92 => 2): 4-bit weights => 8x less memory footprint. Compression Acceleration Regularization 44

44 EIE: Parallelization on Sparsity [Han et al. ISCA 16] [Figure: an 8x8 sparse weight matrix times a sparse activation vector a, producing output b through ReLU; the matrix rows are interleaved across four processing elements PE0-PE3] Compression Acceleration Regularization 45

45 EIE: Parallelization on Sparsity [Han et al. ISCA 16] [Figure: a 4x4 array of 16 PEs with central control, each PE holding its slice of the same sparse matrix-vector example] Compression Acceleration Regularization 46

46 [Han et al. ISCA 16] Dataflow [Figure: the same sparse matrix-vector example]. Rule of thumb: 0 * A = 0, W * 0 = 0. Compression Acceleration Regularization 47
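A minimal NumPy/SciPy sketch of that dataflow for one layer: store only the nonzero weights (CSC format), skip zero activations entirely, and apply ReLU at the end. The real EIE additionally uses 4-bit shared weights and interleaves rows across PEs; both are omitted here:

```python
import numpy as np
from scipy.sparse import csc_matrix

def sparse_matvec_act_skip(W_csc, a):
    """b = ReLU(W @ a), visiting only nonzero activations (W * 0 = 0)
    and only the stored nonzero weights per column (0 * A = 0 entries
    are never stored in CSC)."""
    b = np.zeros(W_csc.shape[0], dtype=a.dtype)
    for j in np.nonzero(a)[0]:                      # skip zero activations
        start, end = W_csc.indptr[j], W_csc.indptr[j + 1]
        rows = W_csc.indices[start:end]             # nonzero rows of column j
        b[rows] += W_csc.data[start:end] * a[j]
    return np.maximum(b, 0)                         # ReLU

W = csc_matrix(np.random.randn(8, 8) * (np.random.rand(8, 8) > 0.9))
a = np.array([0, 1.2, 0, -0.5, 0, 0, 0, 0.0])
print(sparse_matvec_act_skip(W, a))
```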

47 [Han et al. ISCA 16] EIE Architecture. Weight decode: the compressed DNN model is stored in sparse format, each entry being a 4-bit virtual weight plus a 4-bit relative index. The virtual weight is looked up in a table to yield a 16-bit real weight; the relative index is accumulated into an absolute address. A 16-bit ALU accumulates into output memory to produce the prediction result. Rule of thumb: 0 * A = 0, W * 0 = 0, 2.09, 1.92 => 2. Compression Acceleration Regularization 48
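A Python sketch of that decode step. The assumption that each relative index counts the zeros skipped since the previous stored entry follows the paper's CSC encoding; the codebook values below are made up:

```python
import numpy as np

def decode_eie_column(virt_weights, rel_indices, codebook):
    """Decode one stored column: each entry is a 4-bit 'virtual weight'
    (index into a 16-entry codebook) plus a 4-bit relative row index
    (zeros skipped since the previous entry). Returns (rows, weights)."""
    abs_rows = np.cumsum(np.asarray(rel_indices) + 1) - 1   # accumulate index
    real = codebook[np.asarray(virt_weights)]               # table look-up
    return abs_rows, real

codebook = np.linspace(-1, 1, 16).astype(np.float32)   # 16 shared weights
rows, w = decode_eie_column([3, 15, 0], [0, 2, 0])     # -> rows 0, 3, 4
print(rows, w)
```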

48 [Han et al. ISCA 16] Post-Layout Results of EIE

Technology: 40 nm | # PEs: 64 | On-chip SRAM: 8 MB | Max model size: 84 million parameters | Static sparsity: 10x | Dynamic sparsity: 3x | Quantization: 4-bit | ALU width: 16-bit | Area: 40.8 mm^2 | MxV throughput: 81,967 layers/s | Power: 586 mW

1. Post-layout result. 2. Throughput measured on AlexNet FC-7. Compression Acceleration Regularization 49

49 [Han et al. ISCA 16] Speedup on EIE. [Chart: speedups of CPU compressed, GPU dense/compressed, mobile GPU dense/compressed, and EIE over a CPU dense baseline on nine layers from compressed AlexNet, VGG-16, and NeuralTalk; no batching in any case. EIE geometric-mean speedup: 189x over CPU dense. One PE laid out in TSMC 45nm with a 1.15 ns critical path.] Compression Acceleration Regularization 50

50 [Han et al. ISCA 16] Energy Efficiency on EIE. [Chart: energy efficiency of the same platforms relative to the CPU dense baseline, no batching; EIE geometric mean: 24,207x. Benchmarks: compressed AlexNet and VGG-16 for large-scale image classification and object detection, compressed NeuralTalk (RNN+LSTM) for automatic image captioning.] Compression Acceleration Regularization 51

51 [Han et al. ISCA 16] Comparison: Throughput. [Chart: throughput in layers/s, log scale, for Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE (45nm ASIC, 64 PEs), and EIE (28nm ASIC, 256 PEs).] Compression Acceleration Regularization 52

52 Comparison: Energy Efficiency [Han et al. ISCA 16]. [Chart: energy efficiency in layers/J, log scale, for the same platforms: Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE (45nm ASIC, 64 PEs), and EIE (28nm ASIC, 256 PEs).] Compression Acceleration Regularization 53

53 Feedforward => Recurrent neural network? Compression Acceleration Regularization 54

54 [Han et al. FPGA 17] ESE Architecture. [Figure: a CPU with external memory feeds an FPGA over PCIE; on the FPGA, multiple channels each contain multiple PEs. Each PE has an activation queue (FIFO), pointer and weight buffers read in sparse format (PtrRead, SpmatRead), a sparse matrix-vector unit with accumulator (SpMV Accu), and activation buffers. An ESE controller assembles the outputs, and element-wise multiply, adder-tree, and sigmoid/tanh units implement the LSTM gates.] Compression Acceleration Regularization 55

55 ESE is now on AWS Marketplace with Xilinx VU9P FPGA

58 Thank you!
Smart Training: Dense-Sparse-Dense (pruning with a sparsity constraint, then re-dense to increase model capacity). Han et al. ICLR 17.
Compression: pruning and quantization. Han et al. NIPS 15; Han et al. ICLR 16 (Best Paper Award).
Accelerated Inference: fast and efficient. Han et al. ISCA 16; Han et al. FPGA 17 (Best Paper Award). 59
