Deep Learning Accelerators

Deep Learning Accelerators Abhishek Srivastava (as29), Samarth Kulshreshtha (samarth5), University of Illinois, Urbana-Champaign. Submitted as a requirement for the CS 433 graduate student project.

Outline: Introduction, What is Deep Learning?, Why do we need Deep Learning Accelerators?, A Primer on Neural Networks, Tensor Processing Unit (Google), TPU Architecture, Evaluation, Nvidia Tesla V100, Cloud TPU, Eyeriss (MIT), Convolutional Neural Networks (CNNs), Dataflow Taxonomy, Eyeriss dataflow, Evaluation, How do Eyeriss and TPU compare?, Many more DL accelerators, References

Introduction

What is Deep Learning? Image Source: Google Images

Why do we need DL accelerators? DL models essentially consist of compute-intensive operations such as matrix multiplication, convolution, and FFT, and their input data is usually on the order of GBs, so they perform a large amount of computation over massive amounts of data. CPUs support computations spanning all kinds of applications, so they are bound to be slower than application-specific hardware; they are also sophisticated because they must optimize control flow (branch prediction, speculation, etc.), while deep learning has barely any control flow. Energy consumption can be minimized with specialization. The scale of the data: 350k tweets per minute (Twitter), 350M images per day (Facebook), 300 hours of video uploaded per minute (YouTube).

A Primer on Neural Networks Matrix Multiplication
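
To make the slide's point concrete, here is a minimal NumPy sketch (sizes are illustrative, not taken from the slides) showing that a fully connected layer is just a matrix multiply followed by a nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 784))    # a batch of 32 input vectors
W = rng.standard_normal((784, 128))   # learned weights of one layer
b = rng.standard_normal(128)          # learned bias

z = x @ W + b                         # the matrix multiply the slide refers to
a = np.maximum(z, 0.0)                # ReLU nonlinearity
print(a.shape)                        # (32, 128)
```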

Tensor Processing Unit (Google)

Tensor Processing Unit (TPU) Developed by Google to accelerate neural network computations; a production-ready co-processor connected to the host via PCIe. Powers many of Google's services, such as Translate, Search, Photos, and Gmail. Why not GPUs? GPUs don't meet the latency requirements for performing inference, tend to be underutilized for inference due to small batch sizes, and are still relatively general-purpose. The host sends instructions to the TPU rather than the TPU fetching them itself, making the TPU closer in spirit to a floating-point unit than to a GPU.

TPU Architecture The host sends instructions over the PCIe bus into the instruction buffer. Matrix Multiply Unit (MMU): the heart of the TPU, a 256x256 array of 8-bit MACs. Accumulators: aggregate partial sums. Weight Memory (WM): off-chip DRAM, 8 GB. Weight FIFO (WFIFO): on-chip fetcher that reads weights from the WM. Unified Buffer (UB): on-chip storage for intermediate values.

The MMU is implemented as a systolic array; the figure shows how an input vector is multiplied by a weight matrix with a systolic array.
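
A toy cycle-level simulation may help illustrate the idea; this is only a sketch of a generic weight-stationary systolic array, not the actual TPU microarchitecture. Weights stay in place, activations flow right, partial sums flow down, and one result per column emerges each cycle once the pipeline is full.

```python
import numpy as np

def systolic_matvec(W, x):
    """Toy simulation of y = x @ W on a weight-stationary systolic array:
    cell (i, j) holds W[i, j]; activation x[i] enters row i from the left
    (skewed by one cycle per row) and moves right, partial sums move down,
    and finished dot products leave the bottom of each column."""
    n_rows, n_cols = W.shape
    a = np.zeros((n_rows, n_cols))   # activation latched in each cell
    p = np.zeros((n_rows, n_cols))   # partial sum latched in each cell
    y = np.zeros(n_cols)
    for t in range(n_rows + n_cols):                  # enough cycles to drain the array
        new_a, new_p = np.zeros_like(a), np.zeros_like(p)
        for i in range(n_rows):
            for j in range(n_cols):
                if j == 0:
                    a_in = x[i] if t == i else 0.0    # skewed input feed
                else:
                    a_in = a[i, j - 1]                # activation from the left
                p_in = p[i - 1, j] if i > 0 else 0.0  # partial sum from above
                new_a[i, j] = a_in
                new_p[i, j] = p_in + W[i, j] * a_in
        a, p = new_a, new_p
        for j in range(n_cols):
            if t == (n_rows - 1) + j:                 # column j drains at this cycle
                y[j] = p[n_rows - 1, j]
    return y

W = np.arange(9.0).reshape(3, 3)
x = np.array([1.0, 2.0, 3.0])
print(systolic_matvec(W, x), x @ W)   # both give [24. 30. 36.]
```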

TPU ISA CISC-style instructions (average CPI of 10 to 20 cycles), 12 instructions in total. Read_Host_Memory: reads data from host memory into the Unified Buffer. Read_Weights: reads weights from Weight Memory into the Weight FIFO. MatrixMultiply/Convolve: performs a matmul/convolution on data from the UB and weights from the WFIFO and stores the results in the Accumulators; a B x 256 input times a 256 x 256 weight matrix gives a B x 256 output in B cycles (pipelined). Activate: applies the activation function to inputs from the Accumulators and stores the results in the Unified Buffer. Write_Host_Memory: writes data from the Unified Buffer into host memory. Software stack: application code to be run on the TPU is written in TensorFlow and compiled into an API that can run on the TPU (or even a GPU).
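
As a purely hypothetical illustration of how a host might sequence the five key instructions above for one fully connected layer (the tpu.* helper calls are invented stand-ins, not a real driver API):

```python
def run_fc_layer(tpu, host_inputs, layer_weights, batch_size=256):
    # Stage inputs and weights on the chip.
    tpu.read_host_memory(dst="unified_buffer", src=host_inputs)        # Read_Host_Memory
    tpu.read_weights(dst="weight_fifo", src=layer_weights)             # Read_Weights
    # B x 256 inputs times 256 x 256 weights -> B x 256 partial sums,
    # one result row per cycle once the systolic pipeline is full.
    tpu.matrix_multiply(src="unified_buffer", dst="accumulators",
                        batch=batch_size)                              # MatrixMultiply
    tpu.activate(fn="relu", src="accumulators", dst="unified_buffer")  # Activate
    return tpu.write_host_memory(src="unified_buffer")                 # Write_Host_Memory
```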

Evaluation Performance comparison based on predictions per second on common DL workloads: massively outperforms GPUs for CNNs, performs reasonably well relative to GPUs for MLPs, and is close to GPUs for LSTMs. Good: programmable, production-ready. Bad: converts convolution into matmul, which may not be the most optimal approach, and has no direct support for sparsity.

Nvidia Tesla V100 Tensor cores: programmable matrix-multiply-and-accumulate units, 8 cores per SM for a total of 640 cores. Inputs are 4x4 matrices; A and B must be FP16, while C and D can be FP16/FP32. Exposed as a warp-level matmul operation in CUDA 9, with specialized matrix load/multiply/accumulate/store operations. Part of a multi-GPU system optimized using the NVLink interconnect and High Bandwidth Memory.
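
The tensor-core primitive is D = A x B + C on small tiles. A rough NumPy emulation of the mixed-precision semantics (FP16 inputs, FP32 accumulation) is sketched below; on the real hardware this is a single warp-level operation (exposed through CUDA 9's WMMA API), not a loop of scalar FLOPs:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)   # FP16 operand
B = rng.standard_normal((4, 4)).astype(np.float16)   # FP16 operand
C = np.zeros((4, 4), dtype=np.float32)               # FP16 or FP32 accumulator

# Products are formed from FP16 inputs but summed into an FP32 accumulator.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype, D.shape)   # float32 (4, 4)
```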

Cloud TPU Part of Google Cloud. Each node comprises 4 chips with 2 tensor cores per chip; each core has scalar, vector, and matrix (MXU) units and 8/16 GB of HBM, giving 8 cores per Cloud TPU node coupled by a high-bandwidth interconnect. TPU Estimator APIs are used to generate a TensorFlow computation graph, which is sent over gRPC and just-in-time compiled onto the Cloud TPU node. The figure shows the TPU chip (v2 and v3) as part of a Cloud TPU node.
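
For reference, a minimal sketch of driving a Cloud TPU node from TensorFlow; note this uses the newer tf.distribute.TPUStrategy path rather than the TPUEstimator API named on the slide, and the TPU node name is hypothetical:

```python
import tensorflow as tf

# "my-cloud-tpu" is a hypothetical TPU node name.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-cloud-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)      # one replica per TPU core

with strategy.scope():
    # Variables created here are placed and replicated across the node's cores.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Dummy data; the global batch is sharded across the cores by the strategy.
x = tf.random.normal((1024, 784))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=128, epochs=1)
```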

Eyeriss (MIT)

Convolutional Neural Networks Each convolution layer identifies certain fine-grained features from the input image, aggregating over features from previous layers. Very often there are optional layers between CONV layers, such as NORM/POOL layers, to reduce the range/size of the values. Convolutions account for more than 90% of the overall computation, dominating runtime and energy consumption.

2D Convolution operation A 2D convolution is a set of multiply-and-accumulate operations between the kernel matrix (also known as the filter) and the input image feature map, performed by sliding the filter over the image. Image Source: Understanding Convolutional Layers in Convolutional Neural Networks (CNNs)
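
A direct NumPy sketch of that sliding-window operation (implemented, as in most DL frameworks, as a cross-correlation, i.e. without flipping the kernel; sizes are illustrative):

```python
import numpy as np

def conv2d(fmap, kernel):
    H, W = fmap.shape
    R, S = kernel.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Multiply-and-accumulate the kernel against one window of the fmap.
            out[y, x] = np.sum(fmap[y:y + R, x:x + S] * kernel)
    return out

print(conv2d(np.arange(25.0).reshape(5, 5), np.ones((3, 3))))   # 3x3 output
```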

Multi-channel input with multi-channel filters Each filter and fmap has C channels; applying a filter to an input fmap across its C channels produces one cell of the output fmap. The remaining cells are obtained by sliding the filter over the input fmap, producing one channel of the output fmap. Applying M such filters results in an output fmap with M channels, one per filter. These steps are batched over multiple input fmaps, producing multiple output fmaps.
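
Spelled out as the full loop nest the slide implies (stride 1, no padding; written with explicit loops to expose the reuse opportunities the next slides discuss):

```python
import numpy as np

def conv_layer(fmaps, filters):
    N, C, H, W = fmaps.shape       # batch size, input channels, height, width
    M, _, R, S = filters.shape     # M filters, each with the same C channels
    E, F = H - R + 1, W - S + 1    # output height and width
    out = np.zeros((N, M, E, F))
    for n in range(N):                      # batch of input fmaps
        for m in range(M):                  # one output channel per filter
            for y in range(E):
                for x in range(F):
                    for c in range(C):      # accumulate across input channels
                        for i in range(R):
                            for j in range(S):
                                out[n, m, y, x] += (fmaps[n, c, y + i, x + j]
                                                    * filters[m, c, i, j])
    return out

out = conv_layer(np.random.rand(2, 3, 8, 8), np.random.rand(4, 3, 3, 3))
print(out.shape)   # (2, 4, 6, 6)
```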

Things to note The operations exhibit high parallelism, so high throughput is possible; memory access is the bottleneck, but there is a lot of scope for data reuse. A DRAM access costs roughly 200x the energy of a single MAC operation (1x). Worst case: if all memory reads/writes were DRAM accesses, each MAC would read a filter weight, an fmap activation, and a partial sum and write the partial sum back, so AlexNet [NIPS 2012] with 724M MACs would require 2896M DRAM accesses.

Memory access is the bottleneck Opportunities: 1. Reuse filters/fmaps to reduce DRAM reads. 2. Partial sum accumulation does not have to access DRAM.

Types of data reuse in DNN

Spatial Architecture for DNN Efficient data reuse through distributed local storage (RF), inter-PE communication, and sharing among regions of PEs.

Data movement is expensive. How can data reuse and local accumulation be exploited with limited low-cost local storage? This requires a specialized processing dataflow!

Dataflow Taxonomy Weight Stationary (WS): reduce the movement of filter weights. Output Stationary (OS): reduce the movement of partial sums. No Local Reuse (NLR): no local storage at the PE; use a larger global buffer instead. (A toy sketch of the WS and OS loop orders follows the examples below.)

Weight Stationary Examples: Chakradhar [ISCA 2010], Origami [GLSVLSI 2015]

Output Stationary Examples: Gupta [ICML 2015], ShiDianNao [ISCA 2015]

No Local Reuse Examples: DaDianNao [MICRO 2014], Zhang [FPGA 2015]
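
The difference between the first two dataflows is essentially which operand the loop nest keeps in local storage. A toy 1D-convolution sketch (illustrative only, not any particular chip's implementation):

```python
def conv1d_weight_stationary(inp, w):
    out = [0.0] * (len(inp) - len(w) + 1)
    for j, weight in enumerate(w):          # weight held stationary in the "register"
        for x in range(len(out)):
            out[x] += weight * inp[x + j]   # psums and inputs stream past it
    return out

def conv1d_output_stationary(inp, w):
    out = []
    for x in range(len(inp) - len(w) + 1):
        psum = 0.0                          # partial sum held stationary
        for j, weight in enumerate(w):
            psum += weight * inp[x + j]     # weights and inputs stream past it
        out.append(psum)
    return out

print(conv1d_weight_stationary([1, 2, 3, 4], [1, 1]))  # [3.0, 5.0, 7.0]
print(conv1d_output_stationary([1, 2, 3, 4], [1, 1]))  # [3.0, 5.0, 7.0]
```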

Eyeriss dataflow: Row Stationary Previous approaches only optimize for certain types of data reuse, which may lead to performance degradation when input dimensions vary. Eyeriss maximizes reuse and accumulation at the RF and optimizes for overall energy efficiency rather than for only one data type (input fmaps, filters, psums). Eyeriss breaks a high-dimensional convolution into 1D convolutional primitives, each operating on one row of filter weights and one row of the input feature map to generate one row of partial sums => Row Stationary.

1D Row Convolution in PE Maximize row convolutional reuse in the RF: keep a filter row and a sliding window of the fmap row in the RF, and maximize row psum accumulation in the RF.
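
A sketch of that 1D primitive, modeling the PE's register file as a filter row, a sliding fmap window, and a locally accumulated psum (illustrative only):

```python
from collections import deque

def pe_row_conv(fmap_row, filter_row):
    S = len(filter_row)
    window = deque(fmap_row[:S], maxlen=S)     # fmap sliding window held in the RF
    psums = []
    for x in range(len(fmap_row) - S + 1):
        psum = 0.0                             # accumulated locally in the RF
        for w, a in zip(filter_row, window):
            psum += w * a
        psums.append(psum)
        if x + S < len(fmap_row):
            window.append(fmap_row[x + S])     # slide: one new activation enters
    return psums

print(pe_row_conv([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2.0, -2.0, -2.0]
```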

2D convolution in a PE array

Convolutional Reuse Maximized Filter rows are reused across PEs horizontally

Convolutional Reuse Maximized Fmap rows are reused across PEs diagonally

Convolutional Reuse Maximized Partial sums accumulated across PEs vertically
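
Putting the last three slides together, a compact sketch of the row-stationary mapping: PE(i, j) convolves filter row i with fmap row i + j, and each array column accumulates its psums vertically into one output row, so filter rows repeat across a PE row and fmap rows along diagonals. Here np.convolve with a reversed kernel row stands in for the PE's 1D primitive:

```python
import numpy as np

def row_conv(row, krow):
    # 1D row-convolution primitive (what a single PE computes).
    return np.convolve(row, krow[::-1], mode="valid")

def row_stationary_conv2d(fmap, kernel):
    R = kernel.shape[0]                  # filter rows = PE-array rows
    E = fmap.shape[0] - R + 1            # output rows  = PE-array columns
    out = []
    for j in range(E):                   # PE column j produces output row j
        # PE(i, j) convolves filter row i with fmap row i + j (diagonal fmap reuse);
        # the column then accumulates its psum rows vertically.
        col_psums = [row_conv(fmap[i + j], kernel[i]) for i in range(R)]
        out.append(np.sum(col_psums, axis=0))
    return np.array(out)

fmap = np.arange(25.0).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)
print(row_stationary_conv2d(fmap, kernel))   # same result as a direct 2D convolution
```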

DNN Processing - The Full Picture

Mapping DNN to the PEs

Eyeriss Deep CNN Accelerator

Evaluation

How do Eyeriss and TPU compare? Programmability/usability: the TPU is far more programmable than Eyeriss. Applications: the TPU is relatively more general-purpose and is being pushed towards training workloads, while Eyeriss is highly optimized for CNN inference. Memory hierarchy: Eyeriss's memory hierarchy also includes inter-PE communication, while the TPU's does not explicitly. Other axes to compare: energy, chip size, and cost.

Many more DL accelerators... State-of-the-art neural networks (AlexNet, ResNet, LeNet, etc.) are large, have high power consumption due to memory accesses, and are difficult to deploy on embedded devices. End-to-end deployment solution (Han et al.): use deep compression to make the network fit into SRAM, then deploy it on EIE (Energy-efficient Inference Engine), which accelerates the resulting sparse vector-matrix multiplication on the compressed network (sketched below). Accelerators for other DL models: Generative Adversarial Networks - GANAX (Yazdanbakhsh et al.); RNNs/LSTMs - FPGA-based accelerators such as ESE (Han et al.). Mobile phone SoCs: Google Pixel 2 - Pixel Visual Core, iPhone X - Neural Engine, Samsung Exynos - NPU.
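
A sketch of the sparse matrix-vector product that EIE accelerates after deep compression: only non-zero weights are stored (here in a simple CSR layout) and only non-zero multiplications are performed.

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        for k in range(row_ptr[r], row_ptr[r + 1]):   # non-zeros of row r only
            y[r] += values[k] * x[col_idx[k]]
    return y

# A 3x4 weight matrix that is mostly zero after pruning:
# [[0, 2, 0, 0],
#  [1, 0, 0, 3],
#  [0, 0, 0, 0]]
values  = np.array([2.0, 1.0, 3.0])
col_idx = np.array([1, 0, 3])
row_ptr = np.array([0, 1, 3, 3])
x = np.array([1.0, 1.0, 1.0, 1.0])
print(csr_matvec(values, col_idx, row_ptr, x))   # [2. 4. 0.]
```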

References
An in-depth look at Google's first Tensor Processing Unit (TPU)
In-Datacenter Performance Analysis of a Tensor Processing Unit
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
Eyeriss tutorial, http://eyeriss.mit.edu (images and some content pertaining to the Eyeriss architecture have been taken as-is from this tutorial)
EIE: Efficient Inference Engine on Compressed Deep Neural Network
GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

:)