Bandwidth-Efficient Deep Learning
|
|
- Theodora Hancock
- 5 years ago
- Views:
Transcription
1 1 Bandwidth-Efficient Deep Learning from Compression to Acceleration Song Han Assistant Professor, EECS Massachusetts Institute of Technology
2 2 AI is Changing Our Lives Self-Driving Car Machine Translation 2 AlphaGo Smart Robots
3 3 Models are Getting Larger IMAGE RECOGNITION SPEECH RECOGNITION 16X Model 8 layers 1.4 GFLOP ~16% Error 152 layers 22.6 GFLOP ~3.5% error 10X Training Ops 80 GFLOP 7,000 hrs of Data ~8% Error 465 GFLOP 12,000 hrs of Data ~5% Error 2012 AlexNet 2015 ResNet 2014 Deep Speech Deep Speech 2 Dally, NIPS 2016 workshop on Efficient Methods for Deep Neural Networks
4 4 The first Challenge: Model Size Hard to distribute large models through over-the-air update
5 The Second Challenge: Speed ResNet18: ResNet50: ResNet101: ResNet152: Error rate 10.76% 7.02% 6.21% 6.16% Training time 2.5 days 5 days 1 week 1.5 weeks Such long training time limits ML researcher s productivity Training time benchmarked with fb.resnet.torch using four M40 GPUs 5
6 6 The Third Challenge: Energy Efficiency AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game on mobile: drains battery on data-center: increases TCO
7 7 Where is the Energy Consumed? larger model => more memory reference => more energy
8 8 Where is the Energy Consumed? larger model => more memory reference => more energy Operation 32 bit int ADD bit float ADD bit Register File 1 32 bit int MULT bit float MULT bit SRAM Cache 5 32 bit DRAM Memory 640 Energy [pj] Relative Energy Cost = 1000
9 9 Where is the Energy Consumed? larger model => more memory reference => more energy Operation 32 bit int ADD bit float ADD bit Register File 1 32 bit int MULT bit float MULT bit SRAM Cache 5 32 bit DRAM Memory 640 Energy [pj] Relative Energy Cost how to make deep learning more efficient?
10 10 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
11 11 Application as a Black Box Algorithm Spec 2006 Hardware CPU
12 12 Open the Box before Hardware Design Algorithm? Hardware?PU Breaks the boundary between algorithm and hardware
13 Proposed Paradigm
14 How to compress a deep learning model? Compression Acceleration Regularization 14
15 Learning both Weights and Connections for Efficient Neural Networks Han et al. NIPS 2015 Compression Acceleration Regularization 15
16 Pruning Neural Networks Pruning Trained Quantization Huffman Coding 16
17 Pruning Happens in Human Brain 1000 Trillion Synapses 50 Trillion Synapses 500 Trillion Synapses Newborn 1 year old Adolescent Christopher A Walsh. Peter Huttenlocher ( ). Nature, 502(7470): , Pruning Trained Quantization Huffman Coding 17
18 Pruning AlexNet [Han et al. NIPS 15] CONV Layer: 3x FC Layer: 10x Pruning Trained Quantization Huffman Coding 18
19 [Han et al. NIPS 15] Pruning Neural Networks x+1 6M 60 Million 10x less connections Pruning Trained Quantization Huffman Coding 19
20 Pruning Neural Networks [Han et al. NIPS 15] Accuracy Loss 0.5% 0.0% -0.5% -1.0% -1.5% -2.0% -2.5% -3.0% -3.5% -4.0% -4.5% 40% 50% 60% 70% 80% 90% 100% Parameters Pruned Away Pruning Trained Quantization Huffman Coding 20
21 Pruning Neural Networks [Han et al. NIPS 15] Accuracy Loss Pruning 0.5% 0.0% -0.5% -1.0% -1.5% -2.0% -2.5% -3.0% -3.5% -4.0% -4.5% 40% 50% 60% 70% 80% 90% 100% Parameters Pruned Away Pruning Trained Quantization Huffman Coding 21
22 Retrain to Recover Accuracy [Han et al. NIPS 15] Accuracy Loss Pruning Pruning+Retraining 0.5% 0.0% -0.5% -1.0% -1.5% -2.0% -2.5% -3.0% -3.5% -4.0% -4.5% 40% 50% 60% 70% 80% 90% 100% Parameters Pruned Away Pruning Trained Quantization Huffman Coding 22
23 [Han et al. NIPS 15] Iteratively Retrain to Recover Accuracy Accuracy Loss Pruning Pruning+Retraining Iterative Pruning and Retraining 0.5% 0.0% -0.5% -1.0% -1.5% -2.0% -2.5% -3.0% -3.5% -4.0% -4.5% 40% 50% 60% 70% 80% 90% 100% Parameters Pruned Away Pruning Trained Quantization Huffman Coding 23
24 Pruning RNN and LSTM [Han et al. NIPS 15] *Karpathy et al "Deep Visual- Semantic Alignments for Generating Image Descriptions" Pruning Trained Quantization Huffman Coding 24
25 Pruning RNN and LSTM [Han et al. NIPS 15] 90% Original: a basketball player in a white uniform is playing with a ball Pruned 90%: a basketball player in a white uniform is playing with a basketball 90% Original : a brown dog is running through a grassy field Pruned 90%: a brown dog is running through a grassy area 90% 95% Original : a man is riding a surfboard on a wave Pruned 90%: a man in a wetsuit is riding a wave on a beach Original : a soccer player in red is running in the field Pruned 95%: a man in a red shirt and black and white black shirt is running through a field Pruning Trained Quantization Huffman Coding 25
26 26 Exploring the Granularity of Sparsity that is Hardware-friendly 4 types of pruning granularity irregular sparsity regular sparsity more regular sparsity fully-dense model => => => [Han et al, NIPS 15] [Molchanov et al, ICLR 17]
27 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Han et al. ICLR 2016 Best Paper Pruning Trained Quantization Huffman Coding 28
28 [Han et al. ICLR 16] Trained Quantization 2.09, 2.12, 1.92, Pruning Trained Quantization Huffman Coding 29
29 [Han et al. ICLR 16] Trained Quantization weights cluster index (32 bit float) (2 bit uint) centroids : cluster : : : gradient group by reduce Pruning Trained Quantization Huffman Coding 30
30 After Trained Quantization: Discrete Weight [Han et al. ICLR 16] Count Weight Value Pruning Trained Quantization Huffman Coding 31
31 After Trained Quantization: Discrete Weight after Training [Han et al. ICLR 16] Count Weight Value Pruning Trained Quantization Huffman Coding 32
32 [Han et al. ICLR 16] How Many Bits do We Need? Pruning Trained Quantization Huffman Coding 33
33 [Han et al. ICLR 16] How Many Bits do We Need? 77% Non-uniform quantization Uniform quantization Top-1 Accuracy 76% 75% 74% full-precision top-1 accuracy: 76.15% 73% 2bits 4bits 6bits 8bits Number of bits per weight for ResNet-50 after fine-tuning Pruning Trained Quantization Huffman Coding 34
34 35 More Aggressive Compression: Ternary Quantization Full Precision Weight Normalized Full Precision Weight Intermediate Ternary Weight Trained Quantization Final Ternary Weight Normalize Quantize Wn Scale Wp t 0 t Wn 0 Wp Feed Forward Back Propagate Inference Time gradient1 gradient2 Loss 3 res1.0/conv1/wn res1.0/conv1/wp res3.2/conv2/wn res3.2/conv2/wp linear/wn linear/wp Ternary Weight Value
35 [Han et al. ICLR 16] Results: Compression Ratio Network Original Size Compressed Size Compression Ratio Original Accuracy Compressed Accuracy LeNet KB 27KB 40x 98.36% 98.42% LeNet KB 44KB 39x 99.20% 99.26% AlexNet 240MB 6.9MB 35x 80.27% 80.30% VGGNet 550MB 11.3MB 49x 88.68% 89.09% Inception- V3 91MB 4.2MB 22x 93.56% 93.67% ResNet-50 97MB 5.8MB 17x 92.87% 93.04% Can we make compact models to begin with? Compression Acceleration Regularization 36
36 SqueezeNet Input 1 Conv Squeeze Conv Expand x3 Conv Expand Output Concat/Eltwise 128 Vanilla Fire module Iandola et al, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arxiv 2016 Compression Acceleration Regularization 37
37 Compressing SqueezeNet Network Approach Size Ratio Top-1 Accuracy Top-5 Accuracy AlexNet - 240MB 57.2% 80.3% AlexNet SVD 48MB 5x 56.0% 79.4% AlexNet Deep Compression 6.9MB 35x 57.2% 80.3% SqueezeNet - 4.8MB 50x 57.5% 80.3% SqueezeNet Deep Compression 0.47MB 510x 57.5% 80.3% Compression Acceleration Regularization 38
38 39 Results: Speedup 40fps 30fps 20fps 10fps 0fps 1.6x speedup 20fps 33fps 1.6x speedup 5fps 8fps Baseline: map = / / FLOP = 17.5G # Parameters = 6.0M Pruned: map = / / FLOP = 8.9G # Parameters = 2.5M 1fps Batch = 1 Batch = 8 Batch = 32 2fps
39 Deep Compression Applied to Industry Deep Compression Compression Acceleration Regularization 40
40 41 A Facebook AR prototype backed by Deep Compression - 8x model size reduction by Deep Compression - Run on mobile - Source: Facebook F8 17
41 EIE: Efficient Inference Engine on Compressed Deep Neural Network Han et al. ISCA
42 43 Deep Learning Accelerators First Wave: Compute (Neu Flow) Second Wave: Memory (Diannao family) Third Wave: Algorithm / Hardware Co-Design (EIE) Google TPU: This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs
43 EIE: the First DNN Accelerator for Sparse, Compressed Model [Han et al. ISCA 16] 0 * A = 0 W * 0 = , 1.92=> 2 Sparse Weight 90% static sparsity Sparse Activation 70% dynamic sparsity Weight Sharing 4-bit weights 10x less computation 3x less computation 5x less memory footprint 8x less memory footprint Compression Acceleration Regularization 44
44 EIE: Parallelization on Sparsity [Han et al. ISCA 16] a ( ) 0 a 1 0 a 3 b PE0 w 0,0 w 0,1 0 w 0,3 b 0 PE1 0 0 w 1,2 0 b 1 b 1 PE2 0 w 2,1 0 w 2,3 b 2 0 PE b 3 ReLU b 3 = 0 0 w 4,2 w 4,3 b 4 0 w 5, b w 6,3 b 6 b 6 0 w 7,1 0 0 b 7 0 b 0 b 5 Compression Acceleration Regularization 45
45 EIE: Parallelization on Sparsity [Han et al. ISCA 16] PE PE PE PE PE PE PE PE Central Control PE PE PE PE PE PE PE PE a ( ) 0 a 1 0 a 3 b PE0 w 0,0 w 0,1 0 w 0,3 b 0 PE1 0 0 w 1,2 0 b 1 b 1 PE2 0 w 2,1 0 w 2,3 b 2 0 PE b 3 ReLU b 3 = 0 0 w 4,2 w 4,3 b 4 0 w 5, b w 6,3 b 6 b 6 0 w 7,1 0 0 b 7 0 b 0 b 5 Compression Acceleration Regularization 46
46 [Han et al. ISCA 16] Dataflow a ( ) 0 a 1 0 a 3 b PE0 w 0,0 w 0,1 0 w 0,3 b 0 PE1 0 0 w 1,2 0 b 1 b 1 PE2 0 w 2,1 0 w 2,3 b 2 0 PE b 3 ReLU b 3 = 0 0 w 4,2 w 4,3 b 4 0 w 5, b w 6,3 b 6 b 6 0 w 7,1 0 0 b 7 0 b 0 b 5 rule of thumb: 0 * A = 0 W * 0 = 0 Compression Acceleration Regularization 47
47 [Han et al. ISCA 16] EIE Architecture Weight decode Compressed DNN Model Input Image Encoded Weight Relative Index Sparse Format 4-bit Virtual weight 4-bit Relative Index Weight Look-up Index Accum 16-bit Real weight ALU 16-bit Mem Absolute Index Prediction Result Address Accumulate rule of thumb: 0 * A = 0 W * 0 = , 1.92=> 2 Compression Acceleration Regularization 48
48 [Han et al. ISCA 16] Post Layout Result of EIE Technology 40 nm # PEs 64 on-chip SRAM Max Model Size Static Sparsity Dynamic Sparsity Quantization ALU Width Area MxV Throughput Power 8 MB 84 Million 10x 3x 4-bit 16-bit 40.8 mm^2 81,967 layers/s 586 mw 1. Post layout result 2. Throughput measured on AlexNet FC-7 Compression Acceleration Regularization 49
49 5. [Han et al. ISCA 16] Speedup on EIE SpMat CPU Dense (Baseline) 1000x CPU Compressed GPU Dense 507x 248x mgpu Compressed Ptr_Even Act_1 Arithm SpMat Ptr_Odd Speedup Act_0 100x 25x 14x 10x 5x 3x 2x 0.6x 24x 9x 2 14x 5x 0.5x x 1.0x EIE 92x 22x 10x 8x 9x mgpu Compressed EIE 1.0x 0.5x 48x 25x 9x 3x 3x 2x 60x 33x 15x 9x 0.3x 189x 98x 63x 34x 16x 10x 2x 0.5x 2x 15x 3x 0.5x 3x 0.6x 0. Alex-6 Layout of one PE in EIE under TSMC 45nm process. Table II Figure 6. Alex-7 Alex-8 VGG-6 VGG-7 VGG-8 ry network er national cell ueue ad tread munit W cell Power (mw) (59.15%) (20.46%) (11.20%) (9.18%) (1.23%) (19.73%) (54.11%) (12.68%) (12.25%) (%) Area (µm2 ) 638, , ,465 8,946 23, , ,412 3,110 18,934 23,961 Energy Efficiency 98x (%) CPU Dense (Baseline) x (93.22%) 10000x (0.14%) (1.48%) 1000x (1.40%) (3.76%) 100x (0.12%) (19.10%) 10x (73.57%) (0.49%) (2.97%) (3.76%) Figure 7. 2x NT-We NT-Wd NT-LSTM Geo Mean Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases. 189x MPLEMENTATION RESULTS OF ONE PE IN EIE AND THE OWN BY COMPONENT TYPE ( LINE 3-7), BY MODULE ( LINE 8-13). T HE CRITICAL PATH OF EIE IS 1.15 NS x mgpu Dense 618x 210x 115x 94x 56x GPU Compressed 1018x 60x 37x 9x12x 26x 37x 5x 7x 10x Alex-6 GPU Dense 119,797x 61,533x 34,522x 25x 9x CPU Compressed 48x 14,826x 15x Alex-7 15x 78x 10 59x 7x10x 3x 18x 7x Alex-8 GPU Compressed 17x 10x 13x VGG-6 3x mgpu Dense mgpu Compressed EIE 76,784x 6 20x 10x 11,828x 24,207x 10,904x 9,485x 8,053x 102x 14x VGG-7 2x 5x 25x 14x 8x 39x 14x 6x 6x 6x 6x 8x 5x VGG-8 NT-We 3x 25x 7x 15x 20x 4x 5x 7x 23x 6x 7x 36x 9x Geo Mean NT-Wd NT-LSTM Geo Mean Energy efficiency of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases. corner. We placed and routed the PE using the Synopsys IC B 0.6x 0.5x compiler (ICC). We used Cacti [25] to get SRAM area and Layer nit: I/O and Computing. In the I/O mode, all of re idle while the activations and weights in every e accessed by a DMA connected with the Central s is one time cost. In the Computing mode, the eatedly collects a non-zero value from the LNZD and broadcasts this value to all PEs. This process until the input length is exceeded. By setting the gth and starting address of pointer array, EIE is to execute different layers. Table III ENCHMARK FROM STATE - OF - THE - ART Size Weight% Act% energy numbers. We annotated the toggle rate from the RTL 9216, Alex-6 9% 4096 simulation to the gate-level netlist, which was dumped to 4096, switching activity interchange format (SAIF), and estimated Alex-7 9% V. E VALUATION M ETHODOLOGY 4096 power using Prime-Time PX. tor, RTL and Layout. We the implemented a custom 4096, Alex-8 25% urate C++ simulator for the accelerator aimed to 1000 Comparison Baseline. We compare EIE with three dife RTL behavior of synchronous circuits. Each EIE CPU GPU mgpu 25088, module is abstracted as an ferent object thatoff-the-shelf implements VGG-6 4% computing units: CPU, GPU and mobile 4096 act methods: propagate and update, corresponding 4096, ation logic and the flip-flop GPU. in RTL. The simulator VGG-7 4% or design space exploration. It also serves as a ) CPU. We use Intel Core i k CPU, a Haswell-E or RTL verification. 4096, asure the area, power and critical delay, we that has been used in NVIDIA Digits Deep VGG-8 23% classpath processor, 1000 ted the RTL of EIE in Verilog. The RTL is verified Dev Box as aacceleration CPU baseline. To run the benchmarkregularization 4096, e cycle-accurate simulator.learning Then we synthesized Compression NT-LSTM g the Synopsys Design Compiler (DC) under the Geo Mean NT-We % DNN MODELS FLOP% 35.1% 3% 35.3% 3% 37.5% 10% 18.3% 1% 37.5% 2% 41.1% 9% 100% 10% Description Compressed AlexNet [1] for large scale image classification Compressed VGG-16 [3] for large scale image classification and object detection Compressed NeuralTalk [7] 50
50 CPU Dense (Baseline) Wd NT-LSTM Speedup 1000x 25x 14x GPU Compressed mgpu Dense Geo Mean 24x 9x 2 14x mgpu Compressed 135x 92x 22x 10x 34x 16x 10x 60x 33x 15x 9x 2x 0.6x 5x 0.5x x 1.0x 0.3x 25x 9x 3x 3x 2x 2x 0.5x 0.5x 48x 15x 2x 0.5x 3x 3x 0.6x 0. Alex-6 Figure 6. Act_0 Ptr_Even Act_1 Arithm Alex-7 Alex-8 CPU Dense (Baseline) Ptr_Odd VGG-6 VGG x CPU Compressed GPU Dense 119,797x 61,533x 34,522x 100x Figure 7. 37x 9x12x 26x 37x 5x 7x 10x 10x Alex-6 78x 10 59x 15x 7x10x 3x Alex-7 18x 7x Alex-8 (%) mgpu Dense 11,828x 17x 10x 13x VGG x 10x 102x 14x VGG-7 NT-LSTM Geo Mean mgpu Compressed EIE EIE 2x 5x 8x 25x 14x 5x VGG-8 10,904x 9,485x 39x 14x 6x 6x 6x 6x 8x NT-We 25x 7x NT-Wd 24,207x 8,053x 15x 20x 4x 5x 7x NT-LSTM 23x 6x 7x 36x 9x Geo Mean Energy efficiency of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases. 10,904x corner. We placed and routed8,053x the PE using the Synopsys IC Area (µm2 ) 638, , ,465 8,946 23, , ,412 3,110 18,934 23,961 NT-Wd 76,784x 1000x MPLEMENTATION RESULTS OF ONE PE IN EIE AND THE OWN BY COMPONENT TYPE ( LINE 3-7), BY MODULE ( LINE 8-13). T HE CRITICAL PATH OF EIE IS 1.15 NS Power (mw) GPU Compressed 10000x Layout of one PE in EIE under TSMC 45nm process. Table II NT-We 14,826x mgpu Compressed SpMat VGG-8 Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases. SpMat Energy Efficiency 5. 3x 5x 9x [Han et al. ISCA 16] 189x 98x 63x Energy Efficiency on EIE There is no batching in all cases. 10x 8x EIE 618x 210x 115x 94x 56x GPU Dense 1018x 507x 248x 100x CPU Compressed (%) 24,207x Table III B ENCHMARK FROM STATE - OF - THE - ART DNN MODELS compiler(93.22%) (ICC). We used Cacti [25] to get SRAM area and Layer Size Weight% Act% FLOP% Description (0.14%) energy numbers. We annotated the toggle rate from the RTL 9216, (1.48%) Alex-6 9% 35.1% 3% (1.40%) 4096 Compressed simulation to the gate-level netlist, which was dumped to (3.76%) 4096, AlexNet [1] for (1.23%) (0.12%) activity interchange format (SAIF), and estimated Alex-7 9% 35.3% 3% (19.73%)switching (19.10%) 4096 large scale image (54.11%) (73.57%) the power using Prime-Time PX (12.68%) (0.49%) 4096, classification Alex-8 25% 37.5% 10% (12.25%) (2.97%) 1000 Comparison Baseline. We compare EIE with three dif(3.76%) 25088, VGG-6 4% 18.3% 1% Compressed ferent off-the-shelf computing units: CPU, GPU and mobile 4096 nit: I/O and Computing. In the I/O mode, all of VGG-16 [3] for GPU. 36x re idle while the activations and weights in every 4096, VGG-7 4% 37.5% 2% large scale image 25x 23x e accessed by a DMA connected with the Central x 1) CPU. We use Intel Core i k CPU, a Haswell-E classification and 15x s is one time cost. In the Computing mode, the 4096, eatedly collects a non-zero value from the LNZD VGG-8 23% 41.1% 9% object detection class processor, that has been used in NVIDIA Digits 7x Deep x and broadcasts this value to all PEs. This process 5x Dev the Box as a CPU baseline. To run the benchmark 4096, Compressed 4x until the input length is Learning exceeded. By setting NT-We 10% 100% 10% gth and starting address on of pointer array, EIE is 600 NeuralTalk [7] CPU, we used MKL CBLAS GEMV to implement the to execute different layers. 9x 600, with RNN and 7xMKL SPBLAS CSRMV 7xV. EVALUATION METHODOLOGY dense model and for the NT-Wd 11% 100% 11% original 8791 LSTM for compressed sparse model. CPU socket and DRAM power 1201, automatic tor, RTL and Layout. We implemented a custom NTLSTM 10% 100% 11% 2400 image captioning urate C++ simulator for the aimed toby the pcm-power utility provided by Intel. areaccelerator as reported e RTL behavior of synchronous circuits. Each GPU mgpu EIE CPUX GPU, 2) GPU. We use NVIDIA GeForce GTX Titan module is abstracted as an object that implements act methods: propagate and update, corresponding The uncompressed DNN model is obtained from Caffe a state-of-the-art GPU for deep learning as our baseline ation logic and the flip-flop in RTL. The simulator or design space exploration. It alsonvidia-smi serves as a model zoo [28] and NeuralTalk model zoo [7]; The comusing utility to report the power. To run or RTL verification. pressed DNN model is produced as described in [16], [23]. benchmark, asure the area, power andthe critical path delay, we we used cublas GEMV to implement ted the RTL of EIE in Verilog. The RTL is verified The benchmark networks have 9 layers in total obtained thethen original dense layer. For the compressed sparse layer, e cycle-accurate simulator. we synthesized Compression Acceleration Regularization from AlexNet, VGGNet, and NeuralTalk. We use the Imagewe stored thethesparse matrix in in CSR format, and used g the Synopsys Design Compiler (DC) under ry network er national cell ueue ad tread munit W cell Wd (59.15%) (20.46%) (11.20%) (9.18%) Geo Mean NT-LSTM Geo Mean 51
51 [Han et al. ISCA 16] Comparison: Throughput EIE 1E+06 Throughput (Layers/s in log scale) 1E+05 1E+04 1E+03 GPU ASIC ASIC ASIC ASIC 1E+02 1E+01 CPU mgpu FPGA 1E+00 Core-i7 5930k 22nm CPU TitanX 28nm GPU Tegra K1 28nm mgpu A-Eye 28nm FPGA DaDianNao 28nm ASIC TrueNorth 28nm ASIC EIE 45nm ASIC 64PEs EIE 28nm ASIC 256PEs Compression Acceleration Regularization 52
52 Comparison: Energy Efficiency [Han et al. ISCA 16] EIE 1E+06 Energy Efficiency (Layers/J in log scale) 1E+05 ASIC ASIC 1E+04 ASIC ASIC 1E+03 1E+02 1E+01 GPU mgpu 1E+00 CPU Core-i7 5930k 22nm CPU TitanX 28nm GPU Tegra K1 28nm mgpu FPGA A-Eye 28nm FPGA DaDianNao 28nm ASIC TrueNorth 28nm ASIC EIE 45nm ASIC 64PEs EIE 28nm ASIC 256PEs Compression Acceleration Regularization 53
53 Feedforward => Recurrent neural network? Compression Acceleration Regularization 54
54 [Han et al. FPGA 17] ESE Architecture Software Program CPU MEM External Memory x/y t-1 m t ActQueue FIFO FIFO FIFO FIFO Channel with multiple PEs FPGA PCIE Controller MEM Controller Memory DATA BUS Input Buffer Output Buffer PtrRead PtrRead PtrRead Pointer Buffer PtrRead Pointer Buffer Pointer Buf Buffer Buf Pointer Buf Buffer Buf Buf Buf Buf Buf SpMM SpmatRead SpmatRead SpMV Accu SpmatRead Weight Buffer SpMV Accu SpmatRead Weight Buffer SpMV Accu Weight Buf Buffer Buf SpMV Accu Weight Buf Buffer Buf Buf Buf Buf Buf PE N PE k PE 1 PE Act 0 Buffer Act Buffer Act Buffer Buf Act Buffer Buf Buf Buf Buf Buf Assemble y t ESE Controller Channel 0 PE PE PE ESE Accelerator Channel 1 PE PE PE Channel N PE PE PE W c /c t-1 c t ElemMul Wx t /Wy t-1 Adder Tree Elt-wise Sigmoid ElemMul /Tanh H t Buffer Compression Acceleration Regularization 55
55 ESE is now on AWS Marketplace with Xilinx VU9P FPGA
56 ESE is now on AWS Marketplace with Xilinx VU9P FPGA
57 58 ESE is now on AWS Marketplace with Xilinx VU9P FPGA
58 Thank you! Smart Training Han et al. ICLR 17 Dense Sparse Dense Pruning Re-Dense Sparsity Constraint Increase Model Capacity Compression Pruning Quantization Han et al. NIPS 15 Han et al. ICLR 16 (Best paper award) Accelerated Inference Han et al. ISCA 16 Han et al. FPGA 17 (Best paper award) SpMat Fast Ptr_Even Act_0 Act_1 Arithm Ptr_Odd SpMat Efficient 59
Efficient Methods for Deep Learning
Efficient Methods for Deep Learning Song Han Stanford University Sep 2016 Background: Deep Learning for Everything Source: Brody Huval et al., An Empirical Evaluation, arxiv:1504.01716 Source: leon A.
More informationarxiv: v2 [cs.cv] 3 May 2016
EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han Xingyu Liu Huizi Mao Jing Pu Ardavan Pedram Mark A. Horowitz William J. Dally Stanford University, NVIDIA {songhan,xyl,huizi,jingpu,perdavan,horowitz,dally}@stanford.edu
More informationESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA
ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Song Han 1,2, Junlong Kang 2, Huizi Mao 1, Yiming Hu 3, Xin Li 2, Yubin Li 2, Dongliang Xie 2, Hong Luo 2, Song Yao 2, Yu Wang 2,3, Huazhong
More informationHardware for Deep Learning
Hardware for Deep Learning Bill Dally Stanford and NVIDIA Stanford Platform Lab Retreat June 3, 2016 HARDWARE AND DATA ENABLE DNNS 2 THE NEED FOR SPEED Larger data sets and models lead to better accuracy
More informationHigh-Performance Hardware for Machine Learning
High-Performance Hardware for Machine Learning Cadence ENN Summit 2/9/2016 Prof. William Dally Stanford University NVIDIA Corporation Hardware and Data enable DNNs The Need for Speed Larger data sets and
More informationBandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design
Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationRevolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5
More informationSOFTWARE HARDWARE CODESIGN ACCELERATION FOR EFFICIENT NEURAL NETWORK. ...Deep learning and neural
... SOFTWARE HARDWARE CODESIGN FOR EFFICIENT NEURAL NETWORK ACCELERATION... Kaiyuan Guo Tsinghua University and DeePhi Song Han Stanford University and DeePhi Song Yao DeePhi Yu Wang Tsinghua University
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationHow to Estimate the Energy Consumption of Deep Neural Networks
How to Estimate the Energy Consumption of Deep Neural Networks Tien-Ju Yang, Yu-Hsin Chen, Joel Emer, Vivienne Sze MIT 1 Problem of DNNs Recognition Smart Drone AI Computation DNN 15k 300k OP/Px DPM 0.1k
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationScalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism
Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM
More informationXilinx ML Suite Overview
Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame
More informationResearch Faculty Summit Systems Fueling future disruptions
Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationTETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural
More informationIndex. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,
Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110
More informationUnified Deep Learning with CPU, GPU, and FPGA Technologies
Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine
More informationScaling Neural Network Acceleration using Coarse-Grained Parallelism
Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)
More informationRecurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks
Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and
More informationDeep Learning on Modern Architectures. Keren Zhou 4/17/2017
Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others
More informationCNN optimization. Rassadin A
CNN optimization Rassadin A. 01.2017-02.2017 What to optimize? Training stage time consumption (CPU / GPU) Inference stage time consumption (CPU / GPU) Training stage memory consumption Inference stage
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationarxiv: v1 [cs.cv] 11 Feb 2018
arxiv:8.8v [cs.cv] Feb 8 - Partitioning of Deep Neural Networks with Feature Space Encoding for Resource-Constrained Internet-of-Things Platforms ABSTRACT Jong Hwan Ko, Taesik Na, Mohammad Faisal Amir,
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationXilinx Machine Learning Strategies For Edge
Xilinx Machine Learning Strategies For Edge Presented By Alvin Clark, Sr. FAE, Northwest The Hottest Research: AI / Machine Learning Nick s ML Model Nick s ML Framework copyright sources: Gospel Coalition
More informationBHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques
BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques Jingyang Zhu 1, Zhiliang Qian 2*, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and
More informationImplementing Long-term Recurrent Convolutional Network Using HLS on POWER System
Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign
More informationResearch Faculty Summit Systems Fueling future disruptions
Research Faculty Summit 2018 Systems Fueling future disruptions Efficient Edge Computing for Deep Neural Networks and Beyond Vivienne Sze In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac
More informationEFFICIENT INFERENCE WITH TENSORRT. Han Vanholder
EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60
More informationLecture 12: Model Serving. CSE599W: Spring 2018
Lecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications That drink will get you to 2800 calories for today I last saw your keys in the store room Remind Tom of the party You re on page
More informationIn Live Computer Vision
EVA 2 : Exploiting Temporal Redundancy In Live Computer Vision Mark Buckler, Philip Bedoukian, Suren Jayasuriya, Adrian Sampson International Symposium on Computer Architecture (ISCA) Tuesday June 5, 2018
More informationA Method to Estimate the Energy Consumption of Deep Neural Networks
A Method to Estimate the Consumption of Deep Neural Networks Tien-Ju Yang, Yu-Hsin Chen, Joel Emer, Vivienne Sze Massachusetts Institute of Technology, Cambridge, MA, USA {tjy, yhchen, jsemer, sze}@mit.edu
More informationMulti-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung
More informationThe Explosion in Neural Network Hardware
1 The Explosion in Neural Network Hardware Arm Summit, Cambridge, September 17 th, 2018 Trevor Mudge Bredt Family Professor of Computer Science and Engineering The, Ann Arbor 1 What Just Happened? 2 For
More informationCharacterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager
Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance
More informationDEEP NEURAL NETWORKS AND GPUS. Julie Bernauer
DEEP NEURAL NETWORKS AND GPUS Julie Bernauer GPU Computing GPU Computing Run Computations on GPUs x86 CUDA Framework to Program NVIDIA GPUs A simple sum of two vectors (arrays) in C void vector_add(int
More informationDEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017
DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE Dennis Lui August 2017 THE RISE OF GPU COMPUTING APPLICATIONS 10 7 10 6 GPU-Computing perf 1.5X per year 1000X by 2025 ALGORITHMS 10 5 1.1X
More informationThe Explosion in Neural Network Hardware
1 The Explosion in Neural Network Hardware USC Friday 19 th April Trevor Mudge Bredt Family Professor of Computer Science and Engineering The, Ann Arbor 1 What Just Happened? 2 For years the common wisdom
More informationTHE NVIDIA DEEP LEARNING ACCELERATOR
THE NVIDIA DEEP LEARNING ACCELERATOR INTRODUCTION NVDLA NVIDIA Deep Learning Accelerator Developed as part of Xavier NVIDIA s SOC for autonomous driving applications Optimized for Convolutional Neural
More informationInference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA
Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos
More informationTwo FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters
Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed
More informationDeep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Jongsoo Park Facebook AI System SW/HW Co-design Team Sep-21 2018 Team Introduction
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationPERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices
PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices Chunhua Deng + City University of New York chunhua.deng@rutgers.edu Keshab K. Parhi University of Minnesota, Twin Cities parhi@umn.edu
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationBinary Convolutional Neural Network on RRAM
Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua
More informationDNN Accelerator Architectures
DNN Accelerator Architectures ISCA Tutorial (2017) Website: http://eyeriss.mit.edu/tutorial.html Joel Emer, Vivienne Sze, Yu-Hsin Chen 1 2 Highly-Parallel Compute Paradigms Temporal Architecture (SIMD/SIMT)
More informationMIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius
MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety
More informationDeepLearning on FPGAs
DeepLearning on FPGAs Introduction to FPGAs Sebastian Buschjäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 24, 2017 1 Recap: Convolution Observation 1 Even smaller images
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationArm s First-Generation Machine Learning Processor
Arm s First-Generation Machine Learning Processor Ian Bratt 2018 Arm Limited Introducing the Arm Machine Learning (ML) Processor Optimized ground-up architecture for machine learning processing Massive
More informationScaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research
Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationBrainchip OCTOBER
Brainchip OCTOBER 2017 1 Agenda Neuromorphic computing background Akida Neuromorphic System-on-Chip (NSoC) Brainchip OCTOBER 2017 2 Neuromorphic Computing Background Brainchip OCTOBER 2017 3 A Brief History
More informationComputer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal
Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs
More informationHigh Performance Computing
High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason
More informationDeep Learning Compiler
Deep Learning Compiler AWS AI Acknowledgement Amazon Sagemaker Neo Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge Hardware targets Intel CPU,
More informationArchitecting new deep neural networks for embedded applications
Architecting new deep neural networks for embedded applications Forrest Iandola 1 Machine Learning in 2012 Sentiment Analysis LDA Object Detection Deformable Parts Model Word Prediction Linear Interpolation
More informationAccelerating your Embedded Vision / Machine Learning design with the revision Stack. Giles Peckham, Xilinx
Accelerating your Embedded Vision / Machine Learning design with the revision Stack Giles Peckham, Xilinx Xilinx Foundation at the Edge Vision Customers Using Xilinx >80 ADAS Models From 23 Makers >80
More informationarxiv: v1 [cs.cv] 20 May 2016
arxiv:1605.06402v1 [cs.cv] 20 May 2016 Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks By Philipp Matthias Gysel B.S. (Bern University of Applied Sciences, Switzerland) 2012
More informationarxiv: v2 [cs.ar] 15 May 2018
[DL] A Survey of FPGA Based Neural Network Accelerator arxiv:1712.08934v2 [cs.ar] 15 May 2018 KAIYUAN GUO, SHULIN ZENG, JINCHENG YU, YU WANG AND HUAZHONG YANG, Tsinghua University, China Recent researches
More informationGPU-Accelerated Deep Learning
GPU-Accelerated Deep Learning July 6 th, 2016. Greg Heinrich. Credits: Alison B. Lowndes, Julie Bernauer, Leo K. Tam. PRACTICAL DEEP LEARNING EXAMPLES Image Classification, Object Detection, Localization,
More informationASIC Design of Shared Vector Accelerators for Multicore Processors
26 th International Symposium on Computer Architecture and High Performance Computing 2014 ASIC Design of Shared Vector Accelerators for Multicore Processors Spiridon F. Beldianu & Sotirios G. Ziavras
More informationSparseNN: An Energy-Efficient Neural Network Accelerator Exploiting Input and Output Sparsity
SparseNN: An Energy-Efficient Neural Network Accelerator Exploiting Input and Output Sparsity Jingyang Zhu 1, Jingbo Jiang 1, Xizi Chen 1 and Chi-Ying Tsui 2 Department of Electronic and Computer Engineering,
More informationGPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13
GPU FOR DEEP LEARNING chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 Why Deep Learning Boost Today? Nvidia SDK for Deep Learning? Agenda CUDA 8.0 cudnn TensorRT (GIE) NCCL DIGITS 2 Why Deep Learning
More informationFPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Date of publication 2018 00, 0000, date of current version 2018 00, 0000. Digital Object Identifier 10.1109/ACCESS.2018.2890150.DOI arxiv:1901.00121v1 [cs.ne] 1 Jan 2019 FPGA-based Accelerators of Deep
More informationDEEP LEARNING AND DIGITS DEEP LEARNING GPU TRAINING SYSTEM
DEEP LEARNING AND DIGITS DEEP LEARNING GPU TRAINING SYSTEM AGENDA 1 Introduction to Deep Learning 2 What is DIGITS 3 How to use DIGITS Practical DEEP LEARNING Examples Image Classification, Object Detection,
More informationModel Compression. Girish Varma IIIT Hyderabad
Model Compression Girish Varma IIIT Hyderabad http://bit.ly/2tpy1wu Big Huge Neural Network! AlexNet - 60 Million Parameters = 240 MB & the Humble Mobile Phone 1 GB RAM 1/2 Billion FLOPs NOT SO BAD! But
More informationFei-Fei Li & Justin Johnson & Serena Yeung
Lecture 9-1 Administrative A2 due Wed May 2 Midterm: In-class Tue May 8. Covers material through Lecture 10 (Thu May 3). Sample midterm released on piazza. Midterm review session: Fri May 4 discussion
More informationConvolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech
Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:
More informationCapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse
Accepted for publication at Design, Automation and Test in Europe (DATE 2019). Florence, Italy CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Reuse Alberto Marchisio, Muhammad Abdullah
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),
More informationOptimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms
Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk
More informationarxiv: v1 [cs.ne] 20 Nov 2017
E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks arxiv:1711.748v1 [cs.ne] 2 Nov 217 ABSTRACT Franyell Silfa, Gem Dot, Jose-Maria Arnau, Antonio Gonzalez Computer Architecture Deparment,
More informationStrober: Fast and Accurate Sample-Based Energy Simulation Framework for Arbitrary RTL
Strober: Fast and Accurate Sample-Based Energy Simulation Framework for Arbitrary RTL Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, Krste Asanović
More informationFABRICATION TECHNOLOGIES
FABRICATION TECHNOLOGIES DSP Processor Design Approaches Full custom Standard cell** higher performance lower energy (power) lower per-part cost Gate array* FPGA* Programmable DSP Programmable general
More informationEfficient Processing for Deep Learning: Challenges and Opportuni:es
Efficient Processing for Deep Learning: Challenges and Opportuni:es Vivienne Sze Massachuse@s Ins:tute of Technology Contact Info email: sze@mit.edu website: www.rle.mit.edu/eems In collabora*on with Yu-Hsin
More informationReal-time convolutional networks for sonar image classification in low-power embedded systems
Real-time convolutional networks for sonar image classification in low-power embedded systems Matias Valdenegro-Toro Ocean Systems Laboratory - School of Engineering & Physical Sciences Heriot-Watt University,
More informationEyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks Yu-Hsin Chen 1, Joel Emer 1, 2, Vivienne Sze 1 1 MIT 2 NVIDIA 1 Contributions of This Work A novel energy-efficient
More informationdirect hardware mapping of cnns on fpga-based smart cameras
direct hardware mapping of cnns on fpga-based smart cameras Workshop on Architecture of Smart Cameras Kamel ABDELOUAHAB, Francois BERRY, Maxime PELCAT, Jocelyn SEROT, Jean-Charles QUINTON Cordoba, June
More informationLayer-wise Performance Bottleneck Analysis of Deep Neural Networks
Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William
More informationAccelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC
Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh and Debbie Marr Accelerator Architecture Lab,
More informationFast Hardware For AI
Fast Hardware For AI Karl Freund karl@moorinsightsstrategy.com Sr. Analyst, AI and HPC Moor Insights & Strategy Follow my blogs covering Machine Learning Hardware on Forbes: http://www.forbes.com/sites/moorinsights
More informationConvolutional Neural Network Layer Reordering for Acceleration
R1-15 SASIMI 2016 Proceedings Convolutional Neural Network Layer Reordering for Acceleration Vijay Daultani Subhajit Chaudhury Kazuhisa Ishizaka System Platform Labs Value Co-creation Center System Platform
More informationRedundancy-reduced MobileNet Acceleration on Reconfigurable Logic For ImageNet Classification
Redundancy-reduced MobileNet Acceleration on Reconfigurable Logic For ImageNet Classification Jiang Su, Julian Faraone, Junyi Liu, Yiren Zhao, David B. Thomas, Philip H. W. Leong, and Peter Y. K. Cheung
More informationA Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models
A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge,
More informationDistributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability
Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,
More informationSCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks Angshuman Parashar Minsoo Rhu Anurag Mukkara Antonio Puglielli Rangharajan Venkatesan Brucek Khailany Joel Emer Stephen W. Keckler
More informationShortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators
Shortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators Arash Azizimazreah, and Lizhong Chen School of Electrical Engineering and Computer Science Oregon State University, USA {azizimaa,
More informationNvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018
Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks
More informationDeep Neural Network Evaluation
Lecture 8: Deep Neural Network Evaluation Visual Computing Systems Training/evaluating deep neural networks Technique leading to many high-profile AI advances in recent years Speech recognition/natural
More informationImplementing Deep Learning for Video Analytics on Tegra X1.
Implementing Deep Learning for Video Analytics on Tegra X1 research@hertasecurity.com Index Who we are, what we do Video analytics pipeline Video decoding Facial detection and preprocessing DNN: learning
More informationSpeculations about Computer Architecture in Next Three Years. Jan. 20, 2018
Speculations about Computer Architecture in Next Three Years shuchang.zhou@gmail.com Jan. 20, 2018 About me https://zsc.github.io/ Source-to-source transformation Cache simulation Compiler Optimization
More informationNVIDIA FOR DEEP LEARNING. Bill Veenhuis
NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA
More informationScale-Out Acceleration for Machine Learning
Scale-Out Acceleration for Machine Learning Jongse Park Hardik Sharma Divya Mahajan Joon Kyung Kim Preston Olds Hadi Esmaeilzadeh Alternative Computing Technologies (ACT) Lab Georgia Institute of Technology
More information