Scalable and Energy-Efficient Architecture Lab (SEAL)

PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-based Main Memory

Ping Chi*, Shuangchen Li*, Tao Zhang, Cong Xu, Jishen Zhao^δ, Yu Wang^#, Yongpan Liu^#, Yuan Xie*
* Electrical and Computer Engineering Department, University of California, Santa Barbara
Nvidia; HP Labs
δ University of California, Santa Cruz
# Tsinghua University, Beijing, China
Motivation
Challenges:
- Data movement is expensive
- Applications demand large memory bandwidth
Processing-in-memory (PIM):
- Minimize data movement by placing computation near data or in memory
3D stacking revives PIM:
- Embrace the large internal data-transfer bandwidth
- Reduce the overheads of data movement
[Micron, Hybrid Memory Cube, Hot Chips '11]
Motivation
Neural networks (NN) and deep learning (DL):
- Provide solutions to various applications
- Acceleration requires high memory bandwidth; PIM is a promising solution
  [Deng et al., "Reduced-Precision Memory Value Approximation for Deep Learning," HPL Report, 2015]
The size of NNs keeps increasing:
- e.g., 1.32 GB of synaptic weights for YouTube video object recognition
NN acceleration: GPU, FPGA, ASIC, ReRAM crossbar
Motivation
Resistive Random Access Memory (ReRAM):
- Data storage: alternative to DRAM and flash
- Computation: matrix-vector multiplication (NN)
Hu et al., "Dot-Product Engine (DPE) for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication," DAC '16:
- Uses the DPE to accelerate pattern recognition on MNIST
- No accuracy degradation vs. the software approach (99% accuracy), with only 4-bit DAC and ADC requirements
- 1,000x ~ 10,000x speed-efficiency product vs. a custom digital ASIC
Shafiee et al., "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars," ISCA '16
Key Idea
PRIME: processing in ReRAM-based main memory [1]
[Figure: the same ReRAM array in two modes — memory mode stores data; computation mode stores the weights w_{i,j} in the cells and computes outputs b_1, b_2 from inputs a_1, a_2, a_3]
[1] Xu et al., "Overcoming the Challenges of Crossbar Resistive Memory Architectures," HPCA '15
ReRAM Basics
[Figure: (a) conceptual view of a ReRAM cell — top electrode, metal oxide, bottom electrode; (b) I-V curve of bipolar switching — SET switches the cell to the low-resistance state (LRS, logic "1"), RESET to the high-resistance state (HRS, logic "0"); (c) schematic view of a crossbar architecture with wordlines and cells]
ReRAM-Based NN Computation
Requires specialized peripheral circuit design: DAC, ADC, etc.
[Figure: (a) an ANN with one input and one output layer — weights w_{1,1}, w_{1,2}, w_{2,1}, w_{2,2} map inputs a_1, a_2 to outputs b_1, b_2; (b) using a ReRAM crossbar array for the same neural computation]
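The crossbar computation above can be sketched numerically. This is a minimal model (values are illustrative, not from the paper): each cell's conductance G[i][j] encodes a synaptic weight w_{i,j}, wordline voltages encode the input vector a, and each bitline current is the corresponding dot product.

```python
import numpy as np

# Conductances encode the 2x2 weight matrix of the slide's example ANN.
G = np.array([[0.5, 0.2],   # w11, w12
              [0.3, 0.7]])  # w21, w22
a = np.array([1.0, 0.5])    # input voltages on the two wordlines

# By Kirchhoff's current law, each bitline sums the voltage-times-
# conductance products along its column in a single analog step.
b = a @ G                   # b[j] = sum_i a[i] * G[i][j]
print(b)                    # [0.65 0.55]
```

In the real array this sum happens in the analog domain in one read cycle, which is why the peripheral DAC/ADC circuitry dominates the design effort.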
PRIME Architecture Details
[Figure: bank organization — memory (Mem) subarrays, full-function (FF) subarrays, and a Buffer subarray, connected via the global row decoder (GWL), global data lines (GDL), and the global I/O row buffer; labeled components A-E as below]
A. Wordline decoder and driver with multi-level voltage sources;
B. Column multiplexer with analog subtraction and sigmoid circuitry;
C. Reconfigurable SA with counters for multi-level outputs, plus added ReLU and 4-1 max-pooling function units;
D. Connection between the FF and Buffer subarrays;
E. PRIME controller.
Overcoming Precision Challenges
Technology limitations:
- Input precision (input voltage, DAC)
- Weight precision (MLC level)
- Output precision (analog computation, ADC)
Proposed input and synapse composing scheme:
- Compose multiple low-precision input signals for one input
- Compose multiple cells for one weight
- Compose multiple phases for one computation
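The composing scheme can be illustrated in the digital domain. A sketch under assumed, illustrative parameters (3-bit input DACs and 4-bit MLC cells; the function name and bit widths are hypothetical, not from the paper): a full-precision dot product is split into several low-precision phases whose partial results are shifted and summed.

```python
import numpy as np

IN_BITS, CELL_BITS = 3, 4   # assumed DAC and MLC precisions

def compose_dot(a6, w8):
    """Compute a6 . w8 (6-bit inputs, 8-bit weights) from
    low-precision pieces only."""
    a_hi, a_lo = a6 >> IN_BITS, a6 & ((1 << IN_BITS) - 1)
    w_hi, w_lo = w8 >> CELL_BITS, w8 & ((1 << CELL_BITS) - 1)
    total = 0
    # Four low-precision phases; each partial dot product is
    # shifted by its combined significance and accumulated.
    for a_part, a_shift in ((a_hi, IN_BITS), (a_lo, 0)):
        for w_part, w_shift in ((w_hi, CELL_BITS), (w_lo, 0)):
            total += int(a_part @ w_part) << (a_shift + w_shift)
    return total

a = np.array([37, 5])       # 6-bit inputs
w = np.array([200, 19])     # 8-bit weights
print(compose_dot(a, w), int(a @ w))   # 7495 7495
```

In PRIME the high/low pieces map to multiple input signals, multiple cells per weight, and multiple computation phases; the shift-and-add happens in the peripheral circuitry.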
Implementing MLP/CNN Algorithms
MLP / fully-connected layer:
- Matrix-vector multiplication
- Activation functions (sigmoid, ReLU)
Convolution layer
Pooling layer:
- Max pooling, mean pooling
Local Response Normalization (LRN) layer
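The non-matrix operations listed above are simple element-wise or windowed functions, which is why PRIME can implement them with small added circuits (sigmoid units, ReLU units, 4-1 max-pooling units). A reference sketch of what those units compute (illustrative code, not the hardware design):

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation, approximated by dedicated circuitry in PRIME.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU activation: pass positives, clamp negatives to zero.
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    # 4-1 max pooling: the maximum of each non-overlapping 2x2 window.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1., 2., 0., 1.],
              [3., 0., 2., 4.],
              [1., 1., 5., 0.],
              [0., 2., 1., 1.]])
print(max_pool_2x2(x))   # [[3. 4.]
                         #  [2. 5.]]
```

Convolution itself reduces to the same crossbar matrix-vector multiplication by unrolling each receptive field into an input vector.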
System-Level Design
Stage 1: Program
- Offline NN training on the target code segment produces an NN parameter file
- Modified code: Map_Topology(); Program_Weight(); Config_Datapath(); Run(input_data); Post_Proc();
Stage 2: Compile
- Opt. I: NN mapping; Opt. II: data placement
- Outputs: synaptic weights, datapath configuration, data-flow control
- NN mapping optimization: small-scale NN — replication; medium-scale NN — split-merge; large-scale NN — inter-bank communication
Stage 3: Execute
- The memory controller dispatches the computation to the mats of the FF (PRIME) subarrays
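The three mapping cases above can be captured by a toy heuristic. This is a hypothetical sketch, not PRIME's actual compiler: the function name, thresholds, and return values are illustrative; only the crossbar size (256x256) comes from the evaluation setup.

```python
import math

XBAR = 256  # crossbar dimension used in the PRIME configuration

def map_layer(n_in, n_out, xbars_in_bank):
    """Pick a mapping strategy for one fully-connected layer."""
    # Number of crossbars needed to hold the weight matrix.
    need = math.ceil(n_in / XBAR) * math.ceil(n_out / XBAR)
    if need == 1 and xbars_in_bank > 1:
        # Small-scale NN: duplicate weights to process inputs in parallel.
        return ("replicate", xbars_in_bank)
    if need <= xbars_in_bank:
        # Medium-scale NN: split the matrix across crossbars, merge results.
        return ("split-merge", need)
    # Large-scale NN: spill across banks and communicate between them.
    return ("inter-bank", need)

print(map_layer(256, 256, 8))    # ('replicate', 8)
print(map_layer(1024, 512, 8))   # ('split-merge', 8)
print(map_layer(4096, 4096, 8))  # ('inter-bank', 256)
```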
Evaluation
Benchmarks:
- 3 MLPs (S, M, L), 3 CNNs (CNN-1, CNN-2, VGG-D)
- MNIST, ImageNet
PRIME configurations:
- 2 FF subarrays and 1 Buffer subarray per bank
- Each crossbar: 256x256 ReRAM cells
- Area overhead: 5.6%
Evaluation
Comparisons:
- Baseline: CPU-only
- pNPU-co, pNPU-pim [1]
[1] T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," ASPLOS '14
Performance Results
[Figure: speedup normalized to CPU, log scale, for pNPU-co, pNPU-pim-x1, pNPU-pim-x64, and PRIME across CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, VGG, and gmean]
PRIME is even ~4x better than pNPU-pim-x64
Performance Results
[Figure: latency breakdown (compute + buffer vs. memory) normalized to pNPU-co, for pNPU-co, pNPU-pim, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, and VGG-D]
PRIME removes ~90% of the memory access overhead
Energy Results
[Figure: energy savings normalized to CPU, log scale, for pNPU-co, pNPU-pim-x64, and PRIME across CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, VGG, and gmean]
PRIME is even ~200x better than pNPU-pim-x64
Executive Summary
Challenges:
- Data movement is expensive
- Applications demand high memory bandwidth, e.g., neural networks
Solutions:
- Processing-in-memory (PIM)
- ReRAM crossbars accelerate NN computation
Our proposal: a PIM architecture for NN acceleration in ReRAM-based main memory, including a set of circuit and microarchitecture designs to enable NN computation, and a software/hardware interface for developers to implement various NNs
- Improves energy efficiency significantly, achieves better system performance and scalability for MLP and CNN workloads, requires no extra processing units, and has low area overhead
Thank you!