Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Transcription:

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Stony Brook University

Toward Efficiency of Cloud Servers: The number of servers grows rapidly; the user base is increasing exponentially; new services appear daily; there is a constant need for more servers. Many costs: HW, space, power. We want maximum server efficiency.

Achieving Server Efficiency: Must target two classes of workloads: server software and machine learning.

Talk Outline Machine Learning Accelerators Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Resource Partitioning [FPL 16][ISCA 17] Server Workloads CloudSuite Benchmarks [ASPLOS 12][TopPicks 14][ISPASS 16] Proactive Instruction Fetch [MICRO 08][MICRO 11] Hardware Memoization [ASPLOS 15] Scale-Out Processors [ISCA 12] Reactive NUCA, Cuckoo Directories [ISCA 09][TopPicks 10][HPCA 10] Cache Bursts, Spatio-Temporal Streaming [MICRO 08][HPCA 09][TopPicks 10]

Deep Convolutional Neural Networks: Revolutionizing machine learning; need fast and efficient evaluation mechanisms. Lots of computation, a good target for accelerators: GPUs, FPGAs, ASICs.

CNN Accelerator Efficiency. What we want: all compute units do useful work all the time. Main challenges: (1) off-chip data transfer, which starves compute units and has an expensive energy cost; (2) mapping computation to compute units: an accelerator has thousands of compute units, so how to make sure no compute unit is left idle?

Our Contributions Reduce off-chip data transfer: Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Increase Compute Unit utilization: Resource Partitioning [ISCA 17]

Feature Map Data in CNN Acceleration: Layers are computed one after another, using external memory at each layer. The input and output are small, but inter-layer data are large. (Figure: Conv. Layer 0 through Conv. Layer X pass feature maps through external memory.)
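As a rough illustration of why the inter-layer feature maps dominate traffic, here is a minimal Python sketch (not from the talk; the layer shapes and 16-bit element size are assumptions) that tallies the DRAM traffic when every intermediate feature map is written out and read back between layers.

```python
# Minimal sketch (not the talk's implementation): layer-by-layer CNN evaluation
# where every intermediate feature map round-trips through external memory.
# Layer shapes and the 16-bit element size are assumptions for illustration.

def fmap_bytes(height, width, channels, elem_bytes=2):
    """Size of one feature map in bytes."""
    return height * width * channels * elem_bytes

# (out_height, out_width, out_channels) of a few early VGG-like layers
layers = [(224, 224, 64), (224, 224, 64), (112, 112, 128), (112, 112, 128)]

# Every feature map except the final output is written to DRAM by one layer
# and read back by the next, so it crosses the chip boundary twice.
intermediate = layers[:-1]
dram_traffic = sum(2 * fmap_bytes(h, w, c) for h, w, c in intermediate)
print(f"inter-layer DRAM traffic: {dram_traffic / 2**20:.1f} MiB per image")
```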

Feature Map Transfer Challenge: A large amount of data is transferred on and off chip, e.g., for VGGNet-E (includes pooling layers). (Figure: input, output, and weight data size [MB] for convolutional layers 1-16.) MBs of data are transferred on and off chip per image.

Fused-Layer CNNs: Demonstrate the existence of inter-layer data locality; re-order computation to cache intermediate data; trade on-chip buffer for reduced off-chip transfer. (Figure: Conv. Layer 0 through Conv. Layer X, with intermediate data kept on chip instead of in external memory.)

Background: CNN Evaluation: Each layer's output becomes the input for the next layer; data goes to external memory in between layers.

Layer Fusion: Save memory bandwidth by processing multiple layers together. Challenges: intermediate data is too large to store on chip, and a layer's output order is different than the next layer's input order. Approach: re-arrange CNN evaluation operations to exploit inter-layer locality.

Inter-Layer Locality: Sliding Pyramid. (Figure: input feature maps (N) feed Layer 1, producing intermediate feature maps (M), which feed Layer 2, producing output feature maps (P). A tile of the input yields one output pixel; sliding to output pixel 2 uses tile 2, which overlaps tile 1 and requires only a small amount of new data. The overlapping computation is avoided by storing intermediate values on chip.)
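To make the pyramid concrete, the following is a small sketch (my own illustration, assuming two 3x3, stride-1 layers with no padding) of how the input window needed for one layer-2 output pixel overlaps the window for the next pixel, so only a thin slice of new data enters the pyramid as it slides.

```python
# Minimal sketch of the "sliding pyramid" for two 3x3 convolution layers
# (stride 1, no padding assumed): the input-column range (pyramid base) needed
# for one layer-2 output column, and the overlap between adjacent pyramids.

K1, K2 = 3, 3  # kernel widths of layer 1 and layer 2 (assumed)

def pyramid_base(out_col):
    """Half-open input-column range needed for layer-2 output column out_col."""
    # layer-2 output column out_col needs layer-1 output columns [out_col, out_col + K2)
    mid_lo, mid_hi = out_col, out_col + K2
    # each layer-1 output column m needs input columns [m, m + K1)
    return mid_lo, mid_hi + K1 - 1

lo0, hi0 = pyramid_base(0)
lo1, hi1 = pyramid_base(1)
print(f"pyramid for output col 0 covers input cols [{lo0}, {hi0})")
print(f"pyramid for output col 1 covers input cols [{lo1}, {hi1})")
print(f"new input columns when sliding by one: {hi1 - hi0}")
# Keeping the overlapping layer-1 results on chip means only the new column of
# input (and the corresponding new layer-1 column) is needed per sliding step.
```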

Layer Fusion Exploration Results (example: VGGNet-E, layers 1-5). (Figure: DRAM transfer [MB] vs. extra on-chip storage required [KB].) No fused layers: 0 KB storage, 86 MB transfer. An intermediate design point: 118 KB storage, 25 MB transfer. All layers fused: 362 KB storage, 3.6 MB transfer.

Fused-Layer Benefits: Validated the concept with a prototype FPGA, comparing fused-layer against FPGA'15 [Zhang et al.]: 95% transfer reduction (on 5 layers of VGGNet-E) with just 20% extra on-chip memory cost. CPUs experience ~6x speedup from fusing layers when targeting a buffer size of roughly the L1 cache size. GPUs can similarly do fused-layer evaluation (bandwidth-limited mobile/embedded GPUs), though implementing this can be challenging. Massive bandwidth reduction; works on FPGA, GPU, and CPU!

Our Contributions Reduce off-chip data transfer: Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Increase Compute Unit utilization: Resource Partitioning [ISCA 17]

Feature Map Transfer Challenge (recap): A large amount of data is transferred on and off chip, e.g., for VGGNet-E (includes pooling layers). (Figure: input, output, and weight data size [MB] for convolutional layers 1-16.) MBs of data are transferred on and off chip per image.

Weight Data Transfer Challenge: A large amount of data is transferred on and off chip, e.g., for VGGNet-E (includes pooling layers). (Figure: input, output, and weight data size [MB] for convolutional layers 1-16 and fully-connected layers F1-F3; weight size dominates in the fully-connected layers.) MBs of data are transferred on and off chip per image.

Weight Bandwidth and Batching: Weight transfer is a limiter of throughput. (Figure: required weight bandwidth [GB/s] vs. throughput [images/s] for VGGNet-E and AlexNet, compared against DDR3 (800 MHz) bandwidth.) Batching reduces weight transfer: read a weight onto the chip once and reuse it for a batch of images. Batching is limited in existing accelerators: the required on-chip buffer capacity is the number of images times the per-image buffer size. Optimized batching is needed to minimize bandwidth.
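A back-of-the-envelope Python sketch of the batching arithmetic behind this chart (the ~144M 32-bit VGGNet-E weights, the 12.8 GB/s DDR3 channel, and the target throughput are my illustrative assumptions, not figures from the talk):

```python
# Back-of-the-envelope sketch of weight bandwidth vs. batching (illustrative
# numbers, assumed rather than taken from the talk).

VGG_WEIGHT_GB = 144e6 * 4 / 1e9    # ~144M parameters, 32-bit floats (approx.)
DDR3_PEAK_GBS = 12.8               # one 64-bit DDR3 channel at an 800 MHz clock

def weight_bandwidth_gbs(images_per_s, batch_size):
    """Required DRAM bandwidth for weights when they are read once per batch."""
    return VGG_WEIGHT_GB * images_per_s / batch_size

for g in (1, 4, 16, 64):
    bw = weight_bandwidth_gbs(images_per_s=60, batch_size=g)
    verdict = "fits within" if bw < DDR3_PEAK_GBS else "exceeds"
    print(f"batch {g:3d}: {bw:5.1f} GB/s for weights ({verdict} DDR3 peak)")
```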

Fully-Connected Layers. (Figure: an input vector and a set of weight vectors produce an output vector; each output element combines the input vector with one weight vector.)
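For reference, a minimal numpy sketch of the fully-connected computation and of how batching reuses the weights (the dimensions and batch size here are arbitrary):

```python
# Minimal numpy sketch of a fully-connected layer: each output element is the
# dot product of the input vector with one weight vector. Batching G inputs
# turns the matrix-vector product into a matrix-matrix product, so each weight
# is read once per batch instead of once per image. Dimensions are arbitrary.
import numpy as np

N, M, G = 4096, 1000, 8                        # inputs, outputs, batch (assumed)
W = np.random.rand(M, N).astype(np.float32)    # one weight vector per output

x = np.random.rand(N).astype(np.float32)
y = W @ x                                      # single image
assert y.shape == (M,)

X = np.random.rand(N, G).astype(np.float32)
Y = W @ X                                      # batch of G images, same weights
assert Y.shape == (M, G)
```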

Output Buffer Size and Input Retransfer. (Figure: input vectors are accumulated against the weight vectors into on-chip output buffers, accumulating the 1st input, the 2nd input, and so on until all inputs are accumulated.) Buffer the whole output vector and the input vector is read once; buffer half of the output vector and the input vector is read twice. A smaller output buffer means more input vector retransfer.

Input vs. Weight Transfer. (Figure: MB transferred per image vs. batch size G, for output-buffer budgets of 160 KB, 320 KB, 640 KB, and 1280 KB; small batches are weight-dominant and large batches are input-dominant. First fully-connected layer of VGGNet-E, 32-bit floating point.) Different budgets have different optimal points, and different layers also have different optimal points; flexible buffering is needed to stay near the optimal points.
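The following sketch models this trade-off in a simplified way (my reading of the slide, not the paper's exact cost model): weights are amortized over the batch, the input vector is re-read once per on-chip output slice, and a small sweep finds the best batch size for each output-buffer budget.

```python
# Simplified transfer model (my reading of the slide, not the paper's exact
# model): with an output-buffer budget shared by a batch of G images, only a
# slice of each output vector fits on chip, so the input vector is re-read
# once per slice, while weights are read once per batch.
import math

def mb_per_image(n_in, n_out, batch, outbuf_kb, elem_bytes=4):
    out_elems_per_image = max(1, (outbuf_kb * 1024) // (elem_bytes * batch))
    passes = math.ceil(n_out / out_elems_per_image)     # input re-reads
    input_bytes = n_in * elem_bytes * passes
    weight_bytes = n_in * n_out * elem_bytes / batch    # amortized over the batch
    return (input_bytes + weight_bytes) / 2**20

# First fully-connected layer of VGGNet-E: 25088 inputs, 4096 outputs, fp32.
for budget_kb in (160, 320, 640, 1280):
    best_g = min(range(1, 1001), key=lambda g: mb_per_image(25088, 4096, g, budget_kb))
    print(f"outbuf {budget_kb:4d} KB: best batch ~{best_g:3d}, "
          f"{mb_per_image(25088, 4096, best_g, budget_kb):5.1f} MB/image")
```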

Optimized Batching Contributions: Batch size affects input feature map retransfer, an important trade-off overlooked by existing designs. Conv layers also need batching; existing designs only consider batching FC layers. New accelerator design with flexible buffering: Escher. Different layers can use different optimized batch sizes, and batching is supported in both Conv and FC layers. Escher vs. existing batching: reduces bandwidth by up to 2.4x; batching in Conv layers reduces bandwidth by up to 1.7x. Flexible buffering enables optimized batching.

Our Contributions Reduce off-chip data transfer: Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Increase Compute Unit utilization: Resource Partitioning [ISCA 17]

CNN Accelerator Underutilization: The state of the art puts all compute in one processor, a Single-CLP (Convolutional Layer Processor). Dimension mismatch causes underutilization: CNN layers each have different dimensions (N, R, C, M), while the traditional approach uses one fixed-dimension CLP. (Figure: part of the CLP goes unutilized when mapped to a convolutional layer.) One size fits all leads to poor compute utilization.

Compute-Unit Underutilization Problem: Average utilization is 56%. (Figure: CLP datapath with input and output buffers.) Dimension mismatch leads to poor dynamic utilization.

Our Multi-CLP Solution: Replace a single large CLP with multiple smaller CLPs, each optimized for a subset of layers so that CLP dimensions fit the CNN layers. (Figure: a Single-CLP accelerator runs one CLP for layers 1 to 5; a Multi-CLP accelerator runs CLP0 for layer 1, CLP1 for layer 3, and CLP2 for layers 2, 4, and 5.) Use specialization to increase dynamic utilization.
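A toy utilization model (my own sketch with AlexNet-like channel counts, not the paper's analysis) shows how a single fixed-dimension CLP wastes cycles on layers whose channel counts do not match its parallelism, while per-layer-sized CLPs do not:

```python
# Toy utilization model (not the paper's analysis): a CLP with parallelism
# (Tn, Tm) over input/output channels pads each layer up to multiples of
# Tn and Tm, wasting cycles when the layer's channel counts N, M don't match.
import math

def utilization(N, M, Tn, Tm):
    cycles_used = math.ceil(N / Tn) * math.ceil(M / Tm)    # with padding
    cycles_ideal = (N / Tn) * (M / Tm)                     # perfect packing
    return cycles_ideal / cycles_used

layers = [(3, 64), (64, 192), (192, 384), (384, 256), (256, 256)]  # AlexNet-like

# Single CLP sized for the large middle layers:
single = [utilization(N, M, 48, 64) for N, M in layers]
print("Single-CLP per-layer utilization:", [f"{u:.2f}" for u in single])

# Multi-CLP: the tiny first layer gets its own small CLP; the rest share one
# whose dimensions divide their channel counts evenly.
multi = [utilization(3, 64, 3, 16)] + [utilization(N, M, 32, 64) for N, M in layers[1:]]
print("Multi-CLP per-layer utilization: ", [f"{u:.2f}" for u in multi])
```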

Throughput Improvement (Virtex-7 690T, 16-bit fixed-point): Throughput is proportional to compute utilization. (Figure: relative throughput of Single-CLP vs. Multi-CLP for AlexNet, VGGNet-E, SqueezeNet, and GoogLeNet.) Multi-CLP significantly increases throughput.

Scalability Projection (AlexNet, 32-bit floating point, 100 MHz). (Figure: throughput [images/s] vs. DSP slices for Single-CLP and Multi-CLP, with the V7-485T, V7-690T, VU9P, and VU11P devices marked.) The Multi-CLP benefit scales with FPGA size.

CNN Accelerator Conclusions: Fusing convolutional layers demonstrated new locality in CNN evaluation, using a small on-chip memory to save off-chip data transfer. Batch size trades input transfer against weight transfer; finding the optimal batch size per layer benefits batching of both Conv and FC layers. A Single-CLP suffers dynamic underutilization; a Multi-CLP design has more flexibility, with each CLP optimized for a subset of layers.

Debugging Systems with FPGAs is Difficult: It involves HW, SW, and the OS at the same time, a painfully slow FPGA compilation process, tedious host rebooting due to freezes, and poor visibility for debugging.

State of the Art: Testbenches for HW exclude SW and the OS; existing full-system simulation targets SoC ASICs; HW-SW co-simulation excludes the OS; debugging on hardware requires synthesis, place and route, and reboot.

The VM-HDL Co-Simulation Framework. (Figure: a virtual machine runs the host system's software and operating system, while an HDL simulator runs the FPGA hardware design; the two sides are connected by a simulated PCIe link with PCIe queue and bridge blocks. GDB can be attached and waveforms can be saved.)

Run Time Comparison
                   Real System (sec.)    Co-Simulation (sec.)
Compilation        -                     167
Synthesis          1617                  -
Place and Route    2672                  -
Reboot             120                   -
Execution          0.000032              6.02
Total              4409 (1h:13m:29s)     173 (2m:53s)

Talk Outline Machine Learning Accelerators Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Resource Partitioning [FPL 16][ISCA 17] Server Workloads CloudSuite Benchmarks [ASPLOS 12][TopPicks 14][ISPASS 16] Proactive Instruction Fetch [MICRO 08][MICRO 11] Hardware Memoization [ASPLOS 15] Scale-Out Processors [ISCA 12] Reactive NUCA, Cuckoo Directories [ISCA 09][TopPicks 10][HPCA 10] Cache Bursts, Spatio-Temporal Streaming [MICRO 08][HPCA 09][TopPicks 10]

Thanks! Happy to take any questions and comments: now, after the talk, or via email at mferdman@cs.stonybrook.edu.