Computer Architectures for Deep Learning
Ethan Dell and Daniyal Iqbal


Agenda
- Introduction to Deep Learning
- Challenges
- Architectural Solutions
- Hardware Architectures: CPUs, GPUs, Accelerators, FPGAs, SoCs, ASICs
- Significance and Trade-Offs
- References
- Questions

Basics of Artificial Neural Networks
- Artificial Neural Networks (ANNs) model how neurons in the brain work
- ANNs have an input layer, hidden layers, and an output layer
- Networks must be trained before use, which requires processing large amounts of data
Image Source: https://goo.gl/bup9sf

Basics of Deep Learning Neural Networks
- A type of ANN with multiple hidden layers between the input and output
- The network is trained to predict outputs based on inputs
- Weights are assigned within the network based on a cost function
Image Source: https://goo.gl/mwfpbk
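To make the layered structure concrete, below is a minimal NumPy sketch of a forward pass through a small network with two hidden layers. The layer sizes, random initialization, and use of ReLU at every layer are illustrative assumptions, not details from the slides.

```python
import numpy as np

def relu(x):
    # Simple nonlinearity applied after each weighted sum.
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    # Each layer computes a weighted sum of the previous layer's
    # activations, then applies the nonlinearity.
    a = x
    for W, b in zip(weights, biases):
        a = relu(a @ W + b)
    return a

# Toy network: 4 inputs -> two hidden layers of 8 -> 2 outputs.
rng = np.random.default_rng(0)
dims = [4, 8, 8, 2]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(dims, dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]
print(forward(rng.standard_normal(4), weights, biases))
```

Training would adjust `weights` and `biases` to reduce the cost function over a large data set; only the forward pass is sketched here.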

Deep Learning Applications
- Speech recognition
- Image recognition
- Healthcare
- Advertising
- Self-driving cars
- Language translation

Deep Learning Challenges
- Accuracy: requires large data sets
- Energy: energy per operation
- Throughput/Latency: GOPS, frame rate, delay
- Cost: area (memory and logic size), monetary

Solutions to Deep Learning Challenges
- DNN accelerator architectures: temporal architectures and spatial architectures
- The bottleneck is in memory access
- Solutions exploit data reuse (see the loop sketch below):
  - Convolutional reuse: reuse activations and filter weights
  - Feature map reuse: reuse activations
  - Filter reuse: reuse filter weights
Image Source: https://goo.gl/ftc3cr
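As a rough illustration of where that reuse comes from, the direct convolution loop nest below touches each filter weight and each input pixel many times. The array shapes and names are hypothetical, chosen only to make the reuse patterns visible.

```python
import numpy as np

def conv2d(ifmap, filt):
    # Direct 2D convolution written as an explicit loop nest.
    H, W = ifmap.shape
    R, S = filt.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            for r in range(R):
                for s in range(S):
                    # Convolutional reuse: filt[r, s] is read at every
                    # output position (y, x), and each ifmap pixel is
                    # shared by all the sliding windows that cover it.
                    out[y, x] += ifmap[y + r, x + s] * filt[r, s]
    return out

# Filter reuse: the same filt serves every image in a batch.
# Feature map reuse: the same ifmap serves every filter in a layer.
print(conv2d(np.random.rand(6, 6), np.random.rand(3, 3)).shape)  # (4, 4)
```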

Parallelism in DNNs
- Across data inputs
- Across filters and convolutions
- Across elements within a filter
- Multiplies within a layer are independent; sums are reductions
- Only the layers are dependent on one another
- Operations are not data dependent, so they can be statically scheduled (see below)
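A small NumPy sketch of that independence for one fully connected layer (sizes are arbitrary): every multiply can run in parallel, every output is a reduction, and nothing in the schedule depends on the data values.

```python
import numpy as np

x = np.random.rand(256)           # layer input
W = np.random.rand(256, 128)      # layer weights

# 256 * 128 independent multiplies: no product depends on another,
# so all of them can be scheduled in parallel ahead of time.
products = x[:, None] * W

# 128 independent reductions: each output sums its own column.
y = products.sum(axis=0)

assert np.allclose(y, x @ W)      # same result as the fused matmul
```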

CPUs for Deep Learning
- Intel Knights Landing (2016)
- 7 TFLOPS
- 14 nm process

GPUs for Deep Learning
- NVIDIA Pascal GP100: 10/20 TFLOPS (FP32/FP16), 16 nm process
- NVIDIA DGX-1: 170 TFLOPS, 8 Tesla P100s and dual Xeons
- Same or better prediction accuracy, faster results, smaller footprint, lower power

FPGAs for Deep Learning
- Intel Stratix 10: 10 TFLOPS
- Xilinx Virtex UltraScale: 16 nm process
- Faster and more efficient for specialized DNNs

FPGA for Deep Learning

ASICs: DianNao
- Improved CNN computation efficiency
- Dedicated functional units and memory buffers optimized for the CNN workload
- Low-level, fine-grained ISA
- Multipliers, adder tree, shifter, and nonlinear lookup
- Weights stored in off-chip DRAM
- 452 GOP/s in 3.02 mm^2 at 485 mW
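The NFU pipeline shown on the following slide computes in three stages; the toy model below mirrors that structure. The stage names follow the DianNao paper, but the vector width and the exact nonlinearity are illustrative assumptions (the real NFU-3 approximates nonlinear functions with a piecewise-linear lookup table).

```python
import numpy as np

def nfu(inputs, weights):
    # Behavioral sketch of DianNao's three-stage NFU pipeline.
    products = inputs * weights                 # NFU-1: parallel multipliers
    partial_sum = products.sum()                # NFU-2: adder tree reduction
    return 1.0 / (1.0 + np.exp(-partial_sum))   # NFU-3: nonlinear stage

print(nfu(np.random.rand(16), np.random.rand(16)))
```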

DianNao NFU Pipeline

Performance of ASICs

Accelerators
- Minimize data movement
- Optimize accesses to different areas of memory
- Temporal structure characteristics:
  - 0.5-1 kB of memory for each processing unit
  - 100-500 kB of shared memory in the global buffer
Image Source: https://goo.gl/hdv2dq

Accelerators: DNN Dataflows and Their Relation to Computer Architecture
- Weight Stationary (WS): weights are placed in the register file at each processing element (PE) and remain stationary, minimizing the movement cost of the weights
- Output Stationary (OS): outputs are placed in the register file at the PE, minimizing the movement cost of partial sums
- No Local Reuse (NLR): no local storage; all space is allocated to the global buffer to increase capacity
- Row Stationary (RS): a row of the convolution filter is stored in the PE; 1.4x to 2.5x more energy-efficient than the other dataflows (see the loop sketch below)
Image Source: https://goo.gl/hdv2dq
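To make WS and OS concrete, here is a loop-order sketch for a small fully connected layer. The sizes and variable names are hypothetical, and the "register file" is just a Python local, but the loop structures mirror the two dataflows.

```python
import numpy as np

I, O, B = 4, 3, 8                 # input dim, output dim, batch size
x = np.random.rand(B, I)          # activations
W = np.random.rand(I, O)          # weights

# Weight Stationary: pin one weight in the PE's register file and
# stream every activation that needs it past the PE.
out_ws = np.zeros((B, O))
for i in range(I):
    for o in range(O):
        w = W[i, o]               # fetched from the global buffer once
        for b in range(B):
            out_ws[b, o] += x[b, i] * w   # w reused B times in place

# Output Stationary: pin one partial sum in the PE and stream in every
# (activation, weight) pair that contributes to it.
out_os = np.zeros((B, O))
for b in range(B):
    for o in range(O):
        acc = 0.0                 # partial sum never leaves the PE
        for i in range(I):
            acc += x[b, i] * W[i, o]
        out_os[b, o] = acc

assert np.allclose(out_ws, out_os)   # same math, different data movement
```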

Accelerators: Precision Reduction and Sparsity
- Precision reduction (see the quantization sketch below):
  - Most GPUs use 32- and 64-bit operations
  - 16 bits can be used without impacting accuracy
  - An object detection algorithm requires only 9 bits per dimension
  - Some DNNs use 8-bit integer operations, for savings of 2.24x to 2.56x
- Sparsity:
  - In supervised models such as SVMs, inputs can be made sparse by pre-processing to allow for power reduction; e.g., a sparsified input image can cut power consumption by 24%
  - Weights are pruned to minimize power cost, specifically cutting expensive inputs
  - Hardware can exploit sparse weights: units designed to skip reads and MACs when inputs are zero achieve a 45% energy reduction
Image Source: https://goo.gl/hdv2dq
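As a sketch of the precision-reduction idea, here is a simple symmetric 8-bit quantizer in NumPy. The per-tensor scaling scheme is one common choice and is an assumption here, not the specific method the slides refer to.

```python
import numpy as np

def quantize_int8(x):
    # Map floats onto [-128, 127] with a single per-tensor scale.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64).astype(np.float32)
q, s = quantize_int8(w)
w_hat = q.astype(np.float32) * s    # dequantized approximation
print(np.abs(w - w_hat).max())      # small error at a quarter of the bits
```

The MACs can then run on 8-bit integers, with a single float rescale at the output, which is where the energy and area savings come from.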

Significance and Trade-Offs
- Low power (training on large amounts of data, portable devices): CPU, FPGA, ASIC
- High performance (throughput/latency crucial, e.g., self-driving cars): ASIC, GPU
- Low cost (consumer electronics): CPU, GPU

References
[1] J. Emer, V. Sze, and Y.-H. Chen, "Hardware Architectures for Deep Neural Networks" (tutorial), pp. 1-32, 2017.
[2] Y.-H. Chen, J. Emer, A. Suleiman, and Z. Zhang, "Hardware for Machine Learning: Challenges and Opportunities," pp. 1-8, Oct. 2017.
[3] R. Raicea, "Want to know how Deep Learning works? Here's a quick guide for everyone," Oct. 2017.

Questions?