Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal


1 Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal

2 Agenda
- Introduction to Deep Learning
- Challenges
- Architectural Solutions
- Hardware Architectures: CPUs, GPUs, Accelerators, FPGAs, SoCs, ASICs
- Significance and Trade-Offs
- References
- Questions

3 Basics of Artificial Neural Networks
- Artificial Neural Networks (ANNs) model how neurons in the brain work
- ANNs have an input layer, a hidden layer, and an output layer
- A network must be trained before it can be used, which requires processing large amounts of data


4 Basics of Deep Learning Neural Networks
- A type of ANN
- Multiple hidden layers between the input and output
- The network is trained to predict outputs based on inputs
- Weights are assigned within the network based on a cost function
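The idea of assigning weights based on a cost function can be sketched in a few lines. This is only an illustration under simplifying assumptions: a single linear neuron instead of a deep network, mean squared error as the cost, and plain gradient descent. The function names, the learning rate, and the toy data set are all hypothetical and do not come from the slides.

```python
def predict(weights, bias, inputs):
    # Weighted sum of the inputs plus a bias term.
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def mse_cost(weights, bias, samples):
    # Mean squared error between predictions and targets.
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in samples) / len(samples)


def train(samples, learning_rate=0.1, epochs=500):
    # Gradient descent: move each weight against the gradient of the cost.
    n = len(samples[0][0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in samples:
            err = predict(weights, bias, x) - y
            for i in range(n):
                grad_w[i] += 2 * err * x[i] / len(samples)
            grad_b += 2 * err / len(samples)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Toy data generated from y = 2*x0 - 1*x1 + 1 (an assumed target function).
samples = [([0.0, 0.0], 1.0), ([1.0, 0.0], 3.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 2.0)]
initial_cost = mse_cost([0.0, 0.0], 0.0, samples)
weights, bias = train(samples)
final_cost = mse_cost(weights, bias, samples)
```

The cost drops as training proceeds, and the learned weights approach the values used to generate the data. A deep network applies the same principle across many layers, which is where the large data-processing cost comes from.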

5 Deep Learning Applications
- Speech recognition
- Image recognition
- Healthcare
- Advertising
- Self-driving cars
- Language translation

6 Deep Learning Challenges
- Accuracy: large data sets
- Energy: energy per operation
- Throughput/Latency: GOPS, frame rate, delay
- Cost: area (memory and logic size), monetary


7 Solutions to Deep Learning Challenges
- DNN accelerator architectures: temporal architecture and spatial architecture
- The bottleneck is in memory access
- Solutions
  - Convolutional reuse: reuse activations and filter weights
  - Feature map reuse: reuse activations
  - Filter reuse: reuse filter weights
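To get a feel for how much reuse is available, the three reuse types can be counted for one convolutional layer. The sketch below assumes stride 1, no padding, and the usual layer-shape parameters (input height and width, filter height and width, channels, number of filters, batch size). The function name and the example shape are hypothetical.

```python
def conv_reuse(height, width, filt_h, filt_w, channels, num_filters, batch):
    # Output feature map size for stride 1 and no padding (assumed).
    out_h = height - filt_h + 1
    out_w = width - filt_w + 1
    macs = batch * num_filters * channels * out_h * out_w * filt_h * filt_w
    return {
        "macs": macs,
        "unique_weights": num_filters * channels * filt_h * filt_w,
        # Convolutional reuse: a weight is applied at every output position
        # of one 2-D plane as the filter slides over the input.
        "convolutional_reuse_per_weight": out_h * out_w,
        # Feature map reuse: an input activation is shared by all filters.
        "feature_map_reuse_per_activation": num_filters,
        # Filter reuse: a weight is shared by every input in the batch.
        "filter_reuse_per_weight": batch,
    }


stats = conv_reuse(height=5, width=5, filt_h=3, filt_w=3,
                   channels=1, num_filters=2, batch=4)
```

Every MAC that is served from a reused value is one fewer fetch from a more distant memory, which is why these reuse patterns address the memory-access bottleneck.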

8 Parallelism in DNN
- Parallelism is available across data inputs, across filters and convolutions, and across elements within a filter
- Multiplies within a layer are independent
- Sums are reductions
- Only layers are dependent on each other
- Operations are not data dependent, so they can be statically scheduled
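These properties can be illustrated with a small fully connected network. In this sketch the list of multiplies is generated from the layer shape alone, before any data is seen, which is what makes static scheduling possible. The multiplies are then executed in an arbitrary order, followed by a reduction per output. All function names and the example network are hypothetical.

```python
def static_schedule(num_outputs, num_inputs):
    # The set of multiplies depends only on the layer shape, not on the data.
    return [(o, i) for o in range(num_outputs) for i in range(num_inputs)]


def run_layer(weights, inputs, schedule=None):
    num_outputs, num_inputs = len(weights), len(inputs)
    if schedule is None:
        schedule = static_schedule(num_outputs, num_inputs)
    # Independent multiplies: any execution order gives the same products.
    products = {(o, i): weights[o][i] * inputs[i] for (o, i) in schedule}
    # Reduction: sum the products that belong to each output.
    return [sum(products[(o, i)] for i in range(num_inputs))
            for o in range(num_outputs)]


def run_network(layers, inputs):
    activations = inputs
    # The only sequential dependency is from one layer to the next.
    for weights in layers:
        activations = run_layer(weights, activations)
    return activations


layers = [[[1, 2], [3, 4]], [[1, 1]]]
result = run_network(layers, [1, 1])
reordered = run_layer(layers[0], [1, 1],
                      schedule=list(reversed(static_schedule(2, 2))))
```

On real hardware the independent multiplies would be spread across many processing elements rather than executed one after another as they are here.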

9 CPUs for Deep Learning
- Intel Knights Landing (2016)
- 7 TFLOPS
- 14 nm process

10 GPUs for Deep Learning
- NVIDIA Pascal GP100: 10/20 TFLOPS, 16 nm process
- NVIDIA DGX: [...] TFLOPS, 8 Tesla P100s and dual Xeon
- Same or better prediction accuracy
- Faster results
- Smaller footprint
- Lower power

11 FPGAs for Deep Learning
- Intel Stratix: [...] TFLOPS
- Xilinx Virtex UltraScale: 16 nm process
- Faster and more efficient for special DNNs

12 FPGA for Deep Learning

13 ASICs - DianNao
- Improved CNN computation efficiency
- Dedicated functional units and memory buffers optimized for the CNN workload
- Low-level, fine-grained ISA
- Multiplier, adder tree, shifter, and non-linear lookup
- Weights in off-chip DRAM
- 452 GOP/s, 3.02 mm^2, and 485 mW
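The three DianNao figures can be combined into the efficiency metrics listed under the deep learning challenges. The arithmetic below assumes that the throughput and power numbers refer to the same operating point. The variable names are illustrative only.

```python
# Figures quoted on the slide for DianNao.
throughput_ops_per_s = 452e9   # 452 GOP/s
area_mm2 = 3.02                # 3.02 mm^2
power_w = 0.485                # 485 mW

# Energy per operation = power / throughput (joules per operation).
energy_per_op_pj = power_w / throughput_ops_per_s * 1e12

# Area efficiency = throughput / area (GOP/s per mm^2).
gops_per_mm2 = throughput_ops_per_s / 1e9 / area_mm2

# Power efficiency = throughput / power (GOP/s per watt).
gops_per_watt = throughput_ops_per_s / 1e9 / power_w
```

Under that assumption the result is roughly 1.07 pJ per operation and about 150 GOP/s per mm^2.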

14 DianNao NFU Pipeline
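The multiplier, adder tree, and non-linear lookup listed for DianNao can be modeled as a three-stage software pipeline. This is a behavioral sketch only, not the actual NFU design: the split into three stages, the use of a sigmoid, the table range, and the 16-segment interpolation are all assumptions made for illustration, and the shifter is not modeled.

```python
import math


def nfu_multiply(inputs, weights):
    # Stage 1: one multiplier per input/weight pair.
    return [x * w for x, w in zip(inputs, weights)]


def nfu_adder_tree(values):
    # Stage 2: pairwise additions, halving the number of values per level.
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad odd-sized levels with a zero
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0.0


# Stage 3: lookup table for the activation function (sigmoid assumed).
LO, HI, SEGMENTS = -8.0, 8.0, 16
STEP = (HI - LO) / SEGMENTS
TABLE = [1.0 / (1.0 + math.exp(-(LO + k * STEP))) for k in range(SEGMENTS + 1)]


def nfu_nonlinear_lookup(x):
    # Piecewise-linear interpolation between neighboring table entries.
    if x <= LO:
        return TABLE[0]
    if x >= HI:
        return TABLE[-1]
    position = (x - LO) / STEP
    k = int(position)
    return TABLE[k] + (position - k) * (TABLE[k + 1] - TABLE[k])


def neuron(inputs, weights):
    return nfu_nonlinear_lookup(nfu_adder_tree(nfu_multiply(inputs, weights)))


output = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.5])
```

A table lookup replaces an expensive exponential with one subtraction, one multiply, and one add, at the cost of a small approximation error.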

15 Performance of ASICs

16 Accelerators
- Minimize data movement
- Optimize accesses to different areas of memory
- Temporal structure characteristics
  - 0.5-1 kB of shared memory for each processing unit
  - 100-500 kB of shared memory for the global buffer


17 Accelerators
- DNN dataflows and their relation to computer architecture
  - Weight Stationary (WS)
    - Weights are put in the register file at the processing element (PE) and remain stationary
    - Minimizes the movement cost of weights
  - Output Stationary (OS)
    - Outputs are put in the register file at the PE
    - Minimizes the movement cost of partial sums
  - No Local Reuse (NLR)
    - No local storage; all space is allocated to the global buffer to increase capacity
  - Row Stationary (RS)
    - A row of the convolution filter is stored in the PE
    - 1.4x to 2.5x more energy-efficient than the other dataflows
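The difference between weight stationary and output stationary comes down to loop order. The sketch below computes the same 1-D convolution both ways on a single modeled PE and counts how often weights and partial sums cross the PE boundary. The single-PE model, the access counters, and the function names are simplifying assumptions for illustration and do not describe any specific accelerator.

```python
def conv1d_weight_stationary(inputs, weights):
    num_outputs = len(inputs) - len(weights) + 1
    psums = [0] * num_outputs      # partial sums held outside the PE (assumed)
    weight_loads = 0
    psum_accesses = 0
    for k, w in enumerate(weights):
        weight_loads += 1          # weight loaded once, then stays in the PE
        for o in range(num_outputs):
            psums[o] += w * inputs[o + k]
            psum_accesses += 1     # each update touches an external partial sum
    return psums, weight_loads, psum_accesses


def conv1d_output_stationary(inputs, weights):
    num_outputs = len(inputs) - len(weights) + 1
    outputs = []
    weight_loads = 0
    psum_accesses = 0
    for o in range(num_outputs):
        acc = 0                    # partial sum stays in the PE register file
        for k, w in enumerate(weights):
            weight_loads += 1      # weight streamed in for every MAC
            acc += w * inputs[o + k]
        outputs.append(acc)
        psum_accesses += 1         # one write-back per finished output
    return outputs, weight_loads, psum_accesses


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
ws_result = conv1d_weight_stationary(signal, kernel)
os_result = conv1d_output_stationary(signal, kernel)
```

Both orders produce identical outputs. What changes is which data type is moved least, so the better choice depends on the relative cost of moving weights versus partial sums.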

18 Accelerators
- Precision reduction
  - Most GPUs use 32 and 64 bits
  - 16 bits can be used without impacting accuracy
  - An object detection algorithm requires 9 bits per dimension
  - Some DNNs use 8-bit integer ops
  - Savings of 2.56x to 2.24x
- Sparsity
  - SVM (supervised models)
  - Inputs can be made sparse (by pre-processing) to allow for power reduction
    - e.g., an input image made sparse can reduce power consumption by 24%
  - Weights are pruned to minimize power cost
    - Specifically, cut expensive inputs
  - Hardware can be used to exploit sparse weights
    - Units designed to skip reads and MACs when inputs are zero, resulting in a 45% energy reduction
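Both techniques can be sketched in software. The first function below shows one possible way to reduce precision to 8-bit integers, assuming a symmetric linear scheme with a single scale factor. The second skips the multiply-accumulate whenever the input activation is zero and counts the MACs actually performed. The function names and the quantization scheme are assumptions, not something the slides specify.

```python
def quantize_int8(values):
    # Symmetric linear quantization to the range [-127, 127] (assumed scheme).
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    # Map the integers back to approximate real values.
    return [q * scale for q in quantized]


def zero_skipping_dot(activations, weights):
    total = 0
    macs_done = 0
    for a, w in zip(activations, weights):
        if a == 0:
            continue               # skip the weight read and the MAC
        total += a * w
        macs_done += 1
    return total, macs_done


original = [0.3, -1.0, 0.8, 0.0]
quantized, scale = quantize_int8(original)
restored = dequantize(quantized, scale)
dot, macs_done = zero_skipping_dot([0, 3, 0, 2], [5, 1, 7, -4])
```

In this example half of the activations are zero, so only two of the four MACs are executed. In hardware the saving comes from gating the memory read and the multiplier rather than from a branch.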

19 Significance and Trade-Offs
- Low power (training large amounts of data, portable devices): CPU, FPGA, ASIC
- High performance (throughput/latency crucial, e.g. self-driving cars): ASIC, GPU
- Low cost (consumer electronics): CPU, GPU


21 Questions?
