Computer Architectures for Deep Learning
Ethan Dell and Daniyal Iqbal


Agenda
- Introduction to Deep Learning
- Challenges
- Architectural Solutions
- Hardware Architectures: CPUs, GPUs, Accelerators, FPGAs, SoCs, ASICs
- Significance and Trade-Offs
- References
- Questions

Basics of Artificial Neural Networks
- Artificial Neural Networks (ANNs) model how neurons in the brain work
- ANNs have an input layer, hidden layers, and an output layer
- Networks must be trained before use, which requires processing large amounts of data
Image Source: https://goo.gl/bup9sf

Basics of Deep Learning Neural Networks
- A type of ANN with multiple hidden layers between the input and output
- The network is trained to predict outputs based on inputs
- Weights are assigned within the network based on a cost function
Image Source: https://goo.gl/mwfpbk
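To make the layered structure concrete, below is a minimal NumPy sketch of a forward pass through a small network with two hidden layers. The layer sizes, random initialization, and use of ReLU at every layer are illustrative assumptions, not details from the slides.

```python
import numpy as np

def relu(x):
    # Simple nonlinearity applied after each weighted sum.
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    # Each layer computes a weighted sum of the previous layer's
    # activations, then applies the nonlinearity.
    a = x
    for W, b in zip(weights, biases):
        a = relu(a @ W + b)
    return a

# Toy network: 4 inputs -> two hidden layers of 8 -> 2 outputs.
rng = np.random.default_rng(0)
dims = [4, 8, 8, 2]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(dims, dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]
print(forward(rng.standard_normal(4), weights, biases))
```

Training would adjust `weights` and `biases` to reduce the cost function over a large data set; only the forward pass is sketched here.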

Deep Learning Applications
- Speech recognition
- Image recognition
- Healthcare
- Advertising
- Self-driving cars
- Language translation

Deep Learning Challenges
- Accuracy: requires large data sets
- Energy: energy per operation
- Throughput/Latency: GOPS, frame rate, delay
- Cost: area (memory and logic size), monetary

Solutions to Deep Learning Challenges
- DNN accelerator architectures: temporal architectures and spatial architectures
- The bottleneck is in memory access
- Solutions exploit data reuse (see the loop sketch below):
  - Convolutional reuse: reuse activations and filter weights
  - Feature map reuse: reuse activations
  - Filter reuse: reuse filter weights
Image Source: https://goo.gl/ftc3cr
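As a rough illustration of where that reuse comes from, the direct convolution loop nest below touches each filter weight and each input pixel many times. The array shapes and names are hypothetical, chosen only to make the reuse patterns visible.

```python
import numpy as np

def conv2d(ifmap, filt):
    # Direct 2D convolution written as an explicit loop nest.
    H, W = ifmap.shape
    R, S = filt.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            for r in range(R):
                for s in range(S):
                    # Convolutional reuse: filt[r, s] is read at every
                    # output position (y, x), and each ifmap pixel is
                    # shared by all the sliding windows that cover it.
                    out[y, x] += ifmap[y + r, x + s] * filt[r, s]
    return out

# Filter reuse: the same filt serves every image in a batch.
# Feature map reuse: the same ifmap serves every filter in a layer.
print(conv2d(np.random.rand(6, 6), np.random.rand(3, 3)).shape)  # (4, 4)
```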

Parallelism in DNNs
- Across data inputs
- Across filters and convolutions
- Across elements within a filter
- Multiplies within a layer are independent; sums are reductions
- Only the layers are dependent on one another
- Operations are not data dependent, so they can be statically scheduled (see below)
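A small NumPy sketch of that independence for one fully connected layer (sizes are arbitrary): every multiply can run in parallel, every output is a reduction, and nothing in the schedule depends on the data values.

```python
import numpy as np

x = np.random.rand(256)           # layer input
W = np.random.rand(256, 128)      # layer weights

# 256 * 128 independent multiplies: no product depends on another,
# so all of them can be scheduled in parallel ahead of time.
products = x[:, None] * W

# 128 independent reductions: each output sums its own column.
y = products.sum(axis=0)

assert np.allclose(y, x @ W)      # same result as the fused matmul
```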

CPUs for Deep Learning
- Intel Knights Landing (2016)
- 7 TFLOPS
- 14 nm process

GPUs for Deep Learning
- NVIDIA Pascal GP100: 10/20 TFLOPS (FP32/FP16), 16 nm process
- NVIDIA DGX-1: 170 TFLOPS, 8 Tesla P100s and dual Xeons
- Same or better prediction accuracy, faster results, smaller footprint, lower power

FPGAs for Deep Learning
- Intel Stratix 10: 10 TFLOPS
- Xilinx Virtex UltraScale: 16 nm process
- Faster and more efficient for specialized DNNs

FPGA for Deep Learning

ASICs: DianNao
- Improved CNN computation efficiency
- Dedicated functional units and memory buffers optimized for the CNN workload
- Low-level, fine-grained ISA
- Multipliers, adder tree, shifter, and nonlinear lookup
- Weights stored in off-chip DRAM
- 452 GOP/s in 3.02 mm^2 at 485 mW
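The NFU pipeline shown on the following slide computes in three stages; the toy model below mirrors that structure. The stage names follow the DianNao paper, but the vector width and the exact nonlinearity are illustrative assumptions (the real NFU-3 approximates nonlinear functions with a piecewise-linear lookup table).

```python
import numpy as np

def nfu(inputs, weights):
    # Behavioral sketch of DianNao's three-stage NFU pipeline.
    products = inputs * weights                 # NFU-1: parallel multipliers
    partial_sum = products.sum()                # NFU-2: adder tree reduction
    return 1.0 / (1.0 + np.exp(-partial_sum))   # NFU-3: nonlinear stage

print(nfu(np.random.rand(16), np.random.rand(16)))
```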

DianNao NFU Pipeline

Performance of ASICs

Accelerators
- Minimize data movement
- Optimize accesses to different areas of memory
- Temporal structure characteristics:
  - 0.5-1 kB of memory for each processing unit
  - 100-500 kB of shared memory in the global buffer
Image Source: https://goo.gl/hdv2dq

Accelerators: DNN Dataflows and Their Relation to Computer Architecture
- Weight Stationary (WS): weights are placed in the register file at each processing element (PE) and remain stationary, minimizing the movement cost of the weights
- Output Stationary (OS): outputs are placed in the register file at the PE, minimizing the movement cost of partial sums
- No Local Reuse (NLR): no local storage; all space is allocated to the global buffer to increase capacity
- Row Stationary (RS): a row of the convolution filter is stored in the PE; 1.4x to 2.5x more energy-efficient than the other dataflows (see the loop sketch below)
Image Source: https://goo.gl/hdv2dq
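To make WS and OS concrete, here is a loop-order sketch for a small fully connected layer. The sizes and variable names are hypothetical, and the "register file" is just a Python local, but the loop structures mirror the two dataflows.

```python
import numpy as np

I, O, B = 4, 3, 8                 # input dim, output dim, batch size
x = np.random.rand(B, I)          # activations
W = np.random.rand(I, O)          # weights

# Weight Stationary: pin one weight in the PE's register file and
# stream every activation that needs it past the PE.
out_ws = np.zeros((B, O))
for i in range(I):
    for o in range(O):
        w = W[i, o]               # fetched from the global buffer once
        for b in range(B):
            out_ws[b, o] += x[b, i] * w   # w reused B times in place

# Output Stationary: pin one partial sum in the PE and stream in every
# (activation, weight) pair that contributes to it.
out_os = np.zeros((B, O))
for b in range(B):
    for o in range(O):
        acc = 0.0                 # partial sum never leaves the PE
        for i in range(I):
            acc += x[b, i] * W[i, o]
        out_os[b, o] = acc

assert np.allclose(out_ws, out_os)   # same math, different data movement
```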

Accelerators: Precision Reduction and Sparsity
- Precision reduction (see the quantization sketch below):
  - Most GPUs use 32- and 64-bit operations
  - 16 bits can be used without impacting accuracy
  - An object detection algorithm requires only 9 bits per dimension
  - Some DNNs use 8-bit integer operations, for savings of 2.24x to 2.56x
- Sparsity:
  - In supervised models such as SVMs, inputs can be made sparse by pre-processing to allow for power reduction; e.g., a sparsified input image can cut power consumption by 24%
  - Weights are pruned to minimize power cost, specifically cutting expensive inputs
  - Hardware can exploit sparse weights: units designed to skip reads and MACs when inputs are zero achieve a 45% energy reduction
Image Source: https://goo.gl/hdv2dq
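As a sketch of the precision-reduction idea, here is a simple symmetric 8-bit quantizer in NumPy. The per-tensor scaling scheme is one common choice and is an assumption here, not the specific method the slides refer to.

```python
import numpy as np

def quantize_int8(x):
    # Map floats onto [-128, 127] with a single per-tensor scale.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64).astype(np.float32)
q, s = quantize_int8(w)
w_hat = q.astype(np.float32) * s    # dequantized approximation
print(np.abs(w - w_hat).max())      # small error at a quarter of the bits
```

The MACs can then run on 8-bit integers, with a single float rescale at the output, which is where the energy and area savings come from.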

Significance and Trade-Offs
- Low power (training on large amounts of data, portable devices): CPU, FPGA, ASIC
- High performance (throughput/latency crucial, e.g., self-driving cars): ASIC, GPU
- Low cost (consumer electronics): CPU, GPU

References
[1] J. Emer, V. Sze, and Y.-H. Chen, "Hardware Architectures for Deep Neural Networks" (tutorial), pp. 1-32, 2017.
[2] Y.-H. Chen, J. Emer, A. Suleiman, and Z. Zhang, "Hardware for Machine Learning: Challenges and Opportunities," pp. 1-8, Oct. 2017.
[3] R. Raicea, "Want to know how Deep Learning works? Here's a quick guide for everyone," Oct. 2017.

Questions?