PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory


Transcription:

Scalable and Energy-Efficient Architecture Lab (SEAL)
PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory
Ping Chi*, Shuangchen Li*, Tao Zhang, Cong Xu, Jishen Zhao δ, Yu Wang #, Yongpan Liu #, Yuan Xie*
* Electrical and Computer Engineering Department, University of California, Santa Barbara; Nvidia; HP Labs; δ University of California, Santa Cruz; # Tsinghua University, Beijing, China
UCSB - Scalable Energy-Efficient Architecture Lab

Motivation
Challenges: data movement is expensive, and applications demand large memory bandwidth.
Processing-in-memory (PIM): minimize data movement by placing computation near data or in memory.
3D stacking revives PIM: embrace the large internal data-transfer bandwidth and reduce the overheads of data movement. (Micron, Hybrid Memory Cube, Hot Chips 2011)

Motivation
Neural networks (NN) and deep learning (DL) provide solutions to various applications, but acceleration requires high memory bandwidth, so PIM is a promising solution. [Deng et al., "Reduced-Precision Memory Value Approximation for Deep Learning," HPL Report, 2015]
The size of NNs keeps increasing, e.g., 1.32 GB of synaptic weights for YouTube video object recognition.
NN acceleration platforms: GPU, FPGA, ASIC, ReRAM crossbar.

Motivation
Resistive Random Access Memory (ReRAM):
Data storage: an alternative to DRAM and flash.
Computation: matrix-vector multiplication (the core NN operation).
Hu et al., "Dot-Product Engine (DPE) for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication," DAC'16: using the DPE to accelerate pattern recognition on MNIST shows no accuracy degradation vs. a software approach (99% accuracy) with only 4-bit DAC and ADC requirements, and a 1,000x-10,000x speed-efficiency product vs. a custom digital ASIC.
Shafiee et al., "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars," ISCA'16.

Key idea
PRIME: processing in ReRAM-based main memory, built on an existing ReRAM main memory design [1].
Memory mode: the crossbar stores ordinary data. Computation mode: the same crossbar stores synaptic weights w_{i,j} and computes outputs (b_1, b_2) from inputs (a_1, a_2, a_3).
[1] Xu et al., "Overcoming the challenges of crossbar resistive memory architectures," HPCA'15.
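
The dual mode can be pictured with a minimal Python sketch (names and sizes here are illustrative, not PRIME's actual interface): the same cell array either answers ordinary reads in memory mode or, in computation mode, holds a weight matrix and returns weighted sums of the inputs driven on its wordlines.

    import numpy as np

    class DualModeCrossbar:
        """Toy model of one PRIME-style ReRAM array (illustrative only)."""
        def __init__(self, rows, cols):
            self.cells = np.zeros((rows, cols))   # cell contents: data or weights
            self.mode = "memory"

        def write(self, row, values):
            self.cells[row, :] = values           # ordinary memory write

        def read(self, row):
            assert self.mode == "memory"
            return self.cells[row, :]             # memory mode: return stored data

        def compute(self, inputs):
            assert self.mode == "compute"
            # computation mode: inputs drive the wordlines and each bitline
            # accumulates input * weight, i.e. b = a @ W
            return inputs @ self.cells

    xbar = DualModeCrossbar(3, 2)
    for r, w in enumerate([[0.1, 0.4], [0.3, 0.2], [0.5, 0.6]]):
        xbar.write(r, np.array(w))                # store the weight matrix w_{i,j}
    xbar.mode = "compute"
    print(xbar.compute(np.array([1.0, 0.5, 0.25])))   # outputs b_1, b_2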

ReRAM Basics
A ReRAM cell is a metal-oxide layer between a top electrode and a bottom electrode; the low-resistance state (LRS) encodes "1" and the high-resistance state (HRS) encodes "0", switched by SET and RESET voltages applied across the wordline.
[Figure: (a) conceptual view of a ReRAM cell; (b) I-V curve of bipolar switching; (c) schematic view of a crossbar architecture]

ReRAM-Based NN Computation
Requires specialized peripheral circuit design: DAC, ADC, etc.
[Figure: (a) an ANN with one input and one output layer, computing b_j from inputs a_1, a_2 and weights w_{i,j}; (b) using a ReRAM crossbar array for the same neural computation]
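
As a rough sketch of what those peripheral circuits do (assuming idealized devices; the bit widths and conductance range below are made up, not the paper's), the crossbar in (b) evaluates the layer in (a): weights are programmed as cell conductances, the DAC turns each input into a wordline voltage, every bitline current sums the voltage-conductance products, and the ADC digitizes the result.

    import numpy as np

    def quantize(x, bits, lo, hi):
        """Uniform quantization into 2**bits levels over [lo, hi] (models DAC/ADC)."""
        levels = 2 ** bits - 1
        q = np.round((np.clip(x, lo, hi) - lo) / (hi - lo) * levels)
        return q / levels * (hi - lo) + lo

    G_MIN, G_MAX = 1e-6, 1e-4                 # assumed conductance range (siemens)

    def crossbar_mvm(weights, inputs, dac_bits=3, adc_bits=6, v_max=1.0):
        G = G_MIN + weights * (G_MAX - G_MIN)                 # program weights (in [0,1]) as conductances
        v = quantize(inputs * v_max, dac_bits, 0.0, v_max)    # DAC: inputs -> wordline voltages
        i_out = v @ G                                         # analog MAC: bitline currents sum v_i * G_ij
        return quantize(i_out, adc_bits, 0.0, v_max * G_MAX * len(inputs))   # ADC

    W = np.array([[0.2, 0.7],                 # w_{1,1}, w_{1,2}
                  [0.9, 0.1]])                # w_{2,1}, w_{2,2}
    a = np.array([0.5, 1.0])                  # a_1, a_2
    print(crossbar_mvm(W, a))                 # approximates (b_1, b_2) up to scaling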

PRIME Architecture Details
A PRIME bank contains memory (Mem) subarrays, full-function (FF) subarrays, and a Buffer subarray, connected through the global row decoder (GWL), global data lines (GDL), and the global I/O row buffer.
Modifications (A-E in the figure):
A. Wordline decoder and driver with multi-level voltage sources;
B. Column multiplexer with analog subtraction and sigmoid circuitry;
C. Reconfigurable sense amplifier (SA) with counters for multi-level outputs, and added ReLU and 4-1 max-pooling function units;
D. Connection between the FF and Buffer subarrays;
E. PRIME controller.

Overcoming Precision Challenges
Technology limitations: input precision (input voltage, DAC), weight precision (MLC levels), and output precision (analog computation, ADC).
Proposed input and synapse composing scheme (see the sketch below):
Compose multiple low-precision input signals for one input;
Compose multiple cells for one weight;
Compose multiple phases for one computation.
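
A simplified sketch of the composing idea, under assumed widths (an 8-bit weight split across two 4-bit cells, a 6-bit input split across two 3-bit phases); PRIME's actual widths and its use of separate positive and negative weight crossbars are not reproduced here. The low-precision partial products are recombined by shift-and-add:

    W_BITS, CELL_BITS = 8, 4      # one 8-bit weight = two 4-bit cells
    IN_BITS, PHASE_BITS = 6, 3    # one 6-bit input  = two 3-bit phases

    def split(value, total_bits, chunk_bits):
        """Split an unsigned integer into chunks of chunk_bits, most significant first."""
        return [(value >> s) & ((1 << chunk_bits) - 1)
                for s in range(total_bits - chunk_bits, -1, -chunk_bits)]

    def composed_dot(weights, inputs):
        """Full-precision dot product computed from low-precision pieces only."""
        acc = 0
        for w, x in zip(weights, inputs):
            for i, wc in enumerate(split(w, W_BITS, CELL_BITS)):          # cells
                for j, xc in enumerate(split(x, IN_BITS, PHASE_BITS)):    # phases
                    shift = (W_BITS - (i + 1) * CELL_BITS) + (IN_BITS - (j + 1) * PHASE_BITS)
                    acc += (wc * xc) << shift     # shift-and-add recombination
        return acc

    w, x = [200, 17], [45, 63]
    assert composed_dot(w, x) == sum(a * b for a, b in zip(w, x))
    print(composed_dot(w, x))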

Implementing MLP/CNN Algorithms
MLP / fully-connected layer: matrix-vector multiplication and activation functions (sigmoid, ReLU).
Convolution layer (illustrated in the sketch below).
Pooling layer: max pooling, mean pooling.
Local Response Normalization (LRN) layer.
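
For the convolution and pooling layers, an illustrative (not PRIME-specific) sketch of why they reduce to the same matrix-vector products the crossbar provides: each kernel becomes one column of a weight matrix, each receptive field becomes one input vector, and ReLU and 4-to-1 max pooling are applied to the digitized outputs.

    import numpy as np

    def conv_as_mvm(image, kernels):
        """image: HxW array, kernels: list of kxk filters; valid convolution via MVMs."""
        k = kernels[0].shape[0]
        H, W = image.shape
        weight_mat = np.stack([f.reshape(-1) for f in kernels], axis=1)   # (k*k, n_kernels)
        out = np.zeros((H - k + 1, W - k + 1, len(kernels)))
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                patch = image[i:i + k, j:j + k].reshape(-1)   # one receptive field
                out[i, j, :] = patch @ weight_mat             # one crossbar MVM
        return np.maximum(out, 0.0)                           # ReLU

    def max_pool_2x2(x):
        """4-1 max pooling over non-overlapping 2x2 windows."""
        H, W, C = x.shape
        return x[:H // 2 * 2, :W // 2 * 2, :].reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

    img = np.random.rand(6, 6)
    filters = [np.ones((3, 3)), np.eye(3)]
    print(max_pool_2x2(conv_as_mvm(img, filters)).shape)      # (2, 2, 2)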

System-Level Design
Stage 1: Program. Offline NN training on the target code segment produces an NN parameter file; the modified code calls Map_Topology(); Program_Weight(); Config_Datapath(); Run(input_data); Post_Proc();.
Stage 2: Compile. Optimization I: NN mapping; Optimization II: data placement. Outputs: synaptic weights, datapath configuration, and data-flow control.
NN mapping optimization: small-scale NN: replication; medium-scale NN: split-merge; large-scale NN: inter-bank communication (a split-merge sketch follows).
Stage 3: Execute. The memory controller drives the mats of the FF subarrays in PRIME.
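
A toy sketch of the split-merge mapping for NNs larger than one crossbar: the weight matrix is tiled into 256x256 blocks (the crossbar size used in the evaluation), each tile produces a partial matrix-vector product on one physical crossbar, and the partial results are merged by addition; everything other than the tile size is illustrative.

    import numpy as np

    XBAR = 256   # crossbar dimension (256x256 cells)

    def split_merge_mvm(W, x):
        """Compute x @ W by splitting W into XBAR x XBAR tiles and merging partial sums."""
        n_in, n_out = W.shape
        y = np.zeros(n_out)
        for r in range(0, n_in, XBAR):          # split along the input dimension
            for c in range(0, n_out, XBAR):     # split along the output dimension
                tile = W[r:r + XBAR, c:c + XBAR]         # mapped onto one physical crossbar
                y[c:c + XBAR] += x[r:r + XBAR] @ tile    # partial product, merged by addition
        return y

    W = np.random.rand(1000, 500)
    x = np.random.rand(1000)
    assert np.allclose(split_merge_mvm(W, x), x @ W)
    print(split_merge_mvm(W, x)[:3])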

Evaluation
Benchmarks: 3 MLPs (S, M, L) and 3 CNNs (CNN-1, CNN-2, VGG-D), on MNIST and ImageNet.
PRIME configuration: 2 FF subarrays and 1 Buffer subarray per bank; each crossbar is 256x256 cells; area overhead is 5.6%.

Evaluation
Comparisons: baseline CPU-only; pNPU-co (a DianNao-like neural processing unit [1] attached as a co-processor); pNPU-pim (the same NPU integrated as PIM in 3D-stacked memory).
[1] T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," ASPLOS'14.

Performance results
[Figure: speedup normalized to CPU, log scale, for pNPU-co, pNPU-pim-x1, pNPU-pim-x64, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, VGG, and the geometric mean]
PRIME is even 4x better than pNPU-pim-x64.

Performance results
[Figure: execution-time breakdown (compute + buffer vs. memory), normalized to pNPU-co, for pNPU-co, pNPU-pim, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, and VGG-D]
PRIME reduces the memory-access overhead by roughly 90%.

Energy results
[Figure: energy savings normalized to CPU, log scale, for pNPU-co, pNPU-pim-x64, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, VGG, and the geometric mean]
PRIME is even 200x better than pNPU-pim-x64.

Executive Summary
Challenges: data movement is expensive, and applications such as neural networks demand high memory bandwidth.
Solutions: processing-in-memory (PIM); the ReRAM crossbar accelerates NN computation.
Our proposal: a PIM architecture for NN acceleration in ReRAM-based main memory, including a set of circuit and microarchitecture designs to enable NN computation and a software/hardware interface for developers to implement various NNs.
Results: PRIME improves energy efficiency significantly, achieves better system performance and scalability for MLP and CNN workloads, requires no extra processing units, and has low area overhead.

Thank you!