Lecture 24 Near Data Computing II


EECS 570 Lecture 24: Near Data Computing II
Winter 2018
Prof. Satish Narayanasamy
http://www.eecs.umich.edu/courses/eecs570/

Readings
- ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, ISCA 2016, Shafiee et al.
- In-Memory Data Parallel Processor, ASPLOS 2018, Fujiki, Mahlke, Das.

Executive Summary
- Classifying images is in vogue; it involves lots of vector-matrix multiplication, and conv nets are the best at it.
- Analog memristor crossbars are a great fit, but analog-to-digital conversion overheads are large; smart encoding reduces them.
- ISAAC is 14.8x better in throughput and 5.5x better in energy than the digital state of the art (DaDianNao).
- A balanced pipeline is critical for high efficiency.
- Preserving high precision is essential in analog.

State-of-the-Art Convolutional Neural Networks
- Deep residual networks: 152 layers, 11 billion operations!
- Composed of convolution layers, pooling layers, and fully connected layers.

Convolution Operation
[Figure: three 2x2 kernels (K_x = K_y = 2) slide over an N_x x N_y input with N_i = 3 input channels at stride S_x = S_y = 1, producing N_o = 3 output feature maps.]

Memristor Dot-Product Engine
- Per cell, Ohm's law multiplies: I1 = V1·G1 and I2 = V2·G2; the shared bitline sums the currents: I = I1 + I2 = V1·G1 + V2·G2.
- A crossbar of cells w_00..w_33 driven by inputs x_0..x_3 therefore produces all outputs y_0..y_3 of a vector-matrix product at once.
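The analog vector-matrix product above can be mimicked numerically; a minimal sketch (illustrative values of my own, not from the slides), assuming weights are programmed as conductances and inputs are applied as row voltages:

import numpy as np

# Weights stored as cell conductances G (one column per output y_j),
# inputs applied as row voltages V. Ohm's law gives per-cell currents
# V_i * G_ij; Kirchhoff's law sums each column, so the column currents
# are exactly the vector-matrix product y = V @ G.
G = np.array([[0.10, 0.20, 0.05, 0.30],
              [0.25, 0.15, 0.10, 0.05],
              [0.05, 0.30, 0.20, 0.10],
              [0.20, 0.05, 0.25, 0.15]])   # w_ij as conductances (illustrative)
V = np.array([0.8, 0.2, 0.5, 1.0])         # x_i as row voltages (illustrative)

I_columns = V @ G                          # y_j = sum_i V_i * G_ij
print(I_columns)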

Memristor Dot-Product Engine (mapping the convolution)
[Figure: the three 2x2 kernels of the earlier convolution example (N_i = 3, N_o = 3, stride S_x = S_y = 1) mapped onto the memristor dot-product engine.]

Crossbar
- 16-bit input neurons are streamed 1 bit at a time, so evaluating the crossbar takes 16 iterations.
- Each 16-bit weight is spread across eight 2-bit memristor cells.
- The per-iteration, per-cell results are combined with shift-and-add to recover the full-precision dot product.
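A minimal numerical sketch of this bit-serial, bit-sliced scheme (my own illustrative model, not code from the paper), showing that streaming input bits over 16 iterations against 2-bit weight slices and shift-and-adding the column sums recovers the exact 16-bit dot product:

import numpy as np

def bitserial_dot(inputs, weights, in_bits=16, cell_bits=2):
    # Each 16-bit weight occupies in_bits // cell_bits = 8 cells of 2 bits;
    # each 16-bit input is applied one bit per iteration (LSB first here).
    cells = in_bits // cell_bits
    result = 0
    for it in range(in_bits):                      # 16 input-bit iterations
        in_bit = (inputs >> it) & 1                # 1-bit slice of every input
        for c in range(cells):
            w_slice = (weights >> (c * cell_bits)) & (2**cell_bits - 1)
            col = int(np.dot(in_bit, w_slice))     # analog column sum, digitized
            result += col << (it + c * cell_bits)  # shift-and-add
    return result

x = np.array([3, 50000, 123], dtype=np.int64)      # 16-bit inputs (illustrative)
w = np.array([7, 2, 40000], dtype=np.int64)        # 16-bit weights (illustrative)
assert bitserial_dot(x, w) == int(np.dot(x, w))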

ISAAC Organization
[Figure: one in-situ multiply-accumulate unit. A digital-to-analog stage drives the crossbar rows (0-127 and 128-255) from an input register over 16 iterations; analog-to-digital converters, shift-and-add units, a sigmoid unit, and output registers accumulate the partial outputs.]

An ISAAC Chip: Inter-Tile Pipelining
- Layers are mapped to tiles (Layer 1 -> Tile 1, Layer 2 -> Tile 2, Layer 3 -> Tile 3), each with its own eDRAM buffer, so consecutive layers execute as a pipeline across tiles.

Balanced Pipeline
- If layer i has strides S_x = 1 and S_y = 2, layer i-1 is replicated two times so that it produces data fast enough to keep the pipeline balanced.
- eDRAM storage allocation tracks each entry as: not computed yet, received from the previous layer, or serviced and released.

Balanced Pipeline
[Figure: thirteen 128x128 crossbars allocated across consecutive layers with strides (S_x, S_y) = (2, 2), (1, 2), and (2, 2); earlier layers receive more crossbar replicas so that every pipeline stage stays equally busy.]

The ADC Overhead
- ADCs occupy large area and are power hungry.
- ADC area and power increase exponentially with resolution and sampling frequency.

The ADC Overhead
- A crossbar with R rows of w-bit memristor cells and v-bit inputs needs an ADC resolution of log2(R) + v + w bits, or log2(R) + v + w - 1 bits when v = 1.
- For R = 128, v = 1, w = 2: 7 + 1 + 2 - 1 = 9 bits.
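A quick sanity check of that formula (my own worked example using the slide's parameters): the worst case is every row contributing its maximum cell value under the maximum input.

import math

# R rows, v-bit inputs (DAC), w-bit cells, as on the slide.
R, v, w = 128, 1, 2
max_column_sum = R * (2**v - 1) * (2**w - 1)          # worst-case analog column value
adc_bits = math.ceil(math.log2(max_column_sum + 1))   # bits needed to digitize it
assert adc_bits == int(math.log2(R)) + v + w - 1      # 7 + 1 + 2 - 1 = 9
print(adc_bits)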

Encoding Scheme
- Apply the maximal (all-ones) input to a column of cells w_{0,0}, w_{0,1}, ..., w_{0,R-1}. If the MSB of the result would be 1, store that column's weights in flipped (complemented) form so that the MSB is always 0.
- The effective ADC resolution required drops to 8 bits.
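A minimal sketch of the idea (my own illustration, assuming 1-bit inputs and 2-bit cells in a 128-row column): columns whose all-ones sum would set the 9th bit are stored complemented, the 8-bit ADC reading is taken, and the true sum is recovered digitally from the count of 1s in the input.

import numpy as np

R, w_bits, adc_bits = 128, 2, 9
rng = np.random.default_rng(0)
weights = rng.integers(0, 2**w_bits, R)       # original column weights
x = rng.integers(0, 2, R)                     # 1-bit inputs

# Flip (complement) the column iff its maximal-input sum would set the MSB.
flip = weights.sum() >= 2**(adc_bits - 1)
stored = (2**w_bits - 1) - weights if flip else weights
assert stored.sum() < 2**(adc_bits - 1)       # flipped column fits in 8 bits

analog = int(np.dot(x, stored))               # what the 8-bit ADC digitizes
ones = int(x.sum())                           # number of 1s in the input
true_sum = (2**w_bits - 1) * ones - analog if flip else analog
assert true_sum == int(np.dot(x, weights))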

Handling Signed Arithmetic
- Input neurons are in 2's complement: the MSB carries weight -2^15, so the 16th (MSB) iteration does a shift-and-subtract instead of a shift-and-add.
- Weights use a biased representation (like an FP exponent) with a bias of 2^15; the bias is removed by subtracting as many biases as there are 1s in the input.

Analysis Metrics
1) CE: Computational Efficiency -> GOPS/mm^2
2) PE: Power Efficiency -> GOPS/W
3) SE: Storage Efficiency -> MB/mm^2

Design Space Exploration
Parameters swept:
1) rows per crossbar
2) ADCs per IMA
3) crossbars per IMA
4) IMAs per tile

Design Space Exploration
[Figure: GOPS/mm^2 and GOPS/W across various design points, marking the ISAAC-CE, ISAAC-PE, and ISAAC-SE configurations that respectively maximize computational, power, and storage efficiency.]

Power Contribution
[Figure: breakdown of chip power by component, including the router and HyperTransport links.]

Improvement over DaDianNao (Throughput)
Throughput is 14.8x better (on deep neural net benchmarks) because:
1. Memristor crossbars have high computational parallelism.
2. DaDianNao fetches both inputs and weights from eDRAM; ISAAC fetches just the inputs.
3. DaDianNao suffers from bandwidth limitations in fully connected layers.
ISAAC draws more power, but for the same reasons it is 5.5x better in terms of energy.

Conclusion
- Takes advantage of analog in-situ computing and fetches just the input neurons.
- Handles ADC overheads with smart encoding, without compromising output precision.
- Faster than DaDianNao thanks to 8x better computational efficiency and a balanced pipeline that keeps all units busy.
- A few questions remain: can online training be integrated?

In-Memory Data Parallel Processor
Daichi Fujiki, Scott Mahlke, Reetuparna Das
M-Bits Research Group

"Data movement is what matters, not arithmetic" - Bill Dally
[Figure: CPUs (many-core, out-of-order, SIMD) and GPUs (many-thread, SIMT, SIMD) running data-parallel applications; data communication is annotated as 40x-1000x more costly than arithmetic.]

In-Memory Computing Exposes Parallelism While Minimizing Data Movement Cost
- Unlike CPUs and GPUs, in-memory computing performs in-situ computation: massive parallelism from SIMD slots over dense memory arrays.
- High bandwidth and low data movement.

In-Memory Computing Reduces Data Movement
(a) Addition: drive both rows at Vdd/2, so I11 = (Vdd/2)·C11 and I21 = (Vdd/2)·C21, and the column current is I1 = (Vdd/2)·(C11 + C21).
(b) Dot-product: I11 = V1·C11, I12 = V1·C12, I21 = V2·C21, I22 = V2·C22; the column currents are I1 = I11 + I21 and I2 = I12 + I22.

In-Memory Computing Exposes Parallelism

                       CPU (2 sockets)       GPU                  ReRAM
                       Intel Xeon E5-2597    NVIDIA TITAN Xp      Scaled from ISAAC*
Area (mm^2)            912.24                471                  494
TDP (W)                290                   250                  416
On-chip memory (MB)    78.96                 9.14                 8,590
SIMD slots             448                   3,840                2,097,152
Freq (GHz)             3.6                   1.585                0.02
SIMD x Freq product    3,227                 6,086                41,953

In-Memory Computing Today
- ReRAM dot-product accelerators perform multiplication and summation in the array: I11 = V1·C11, I12 = V1·C12, I21 = V2·C21, I22 = V2·C22, with column currents I1 = I11 + I21 and I2 = I12 + I22.
- Examples: PRIME [Chi 2016, ISCA], ISAAC [Shafiee 2016, ISCA], Dot-Product Engine [Hu 2016], PipeLayer [Song 2017, HPCA].

In-Memory Computing: No Demonstration of General-Purpose Computing
- How do you program it? There is no established programming model or execution model.
- Computation primitives are limited.

In-Memory Data Parallel Processor: Overview
- HW: a microarchitecture plus a memory ISA (ADD, DOT, MUL, SUB, MOV, MOVS, MOVI, MOVG, SHIFT{L/R}, MASK, LUT, REDUCE_SUM).
- Execution model: Modules split into instruction blocks (IB1, IB2, ...) to expose ILP.
- SW: a data-flow-graph programming model and the IMP compiler to expose DLP.

Computation Primitives: Storage
- Information is stored in analog form as cell conductance (C = 1/resistance).
- A value A is written as conductance C_A (likewise B as C_B) and read back by sensing the cell.

Computation Primitives: Addition
- Ohm's law provides the multiply: IA = (Vdd/2)·CA and IB = (Vdd/2)·CB.
- Kirchhoff's law provides the add: I = IA + IB.
- (a) is addition; (b), subtraction, is a new primitive (next slide).

Computation Primitives: Subtraction (new primitive)
- (a) Addition: I11 = (Vdd/2)·C11 and I21 = (Vdd/2)·C21, so I = IA + IB = (Vdd/2)·(CA + CB).
- (b) Subtraction (new primitive): the rows are driven so that the column current becomes I1 = (Vdd/2)·(CA - CB).

Computation Primitives: Dot-Product and Element-Wise Multiplication
- (c) Dot-product: [X Y] x [[A, C], [B, D]] = [AX + BY, CX + DY]. With VX and VY on the rows, the cell currents are IAX = VX·CA, IBY = VY·CB, ICX = VX·CC, IDY = VY·CD, and the column currents are I1 = IAX + IBY and I2 = ICX + IDY.
- (d) Element-wise multiplication (new primitive): [X Y] o [A B] = [AX, BY], with the multiplier applied as a voltage and the multiplicand stored as a conductance, each product on its own column.
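A small numerical sketch of these two primitives (illustrative values of my own; the element-wise case is modeled here simply as a diagonal conductance matrix so that each input meets only its own weight):

import numpy as np

X, Y = 0.7, 0.3                     # inputs as row voltages (illustrative)
A, B, C, D = 0.2, 0.4, 0.6, 0.8     # weights as conductances (illustrative)

# (c) Dot-product: column currents give [A*X + B*Y, C*X + D*Y]
dot = np.array([X, Y]) @ np.array([[A, C],
                                   [B, D]])

# (d) Element-wise multiplication: one weight per column, so each input
# voltage contributes to exactly one column current -> [A*X, B*Y]
elementwise = np.array([X, Y]) @ np.diag([A, B])

print(dot, elementwise)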

Microarchitecture
- A cluster contains several ReRAM processing units (PUs), a register file, a LUT, and a router.
- Each processing unit = row decoder + RRAM crossbar (XB) + sample-and-hold (S+H) + ADCs + shift-and-add (S+A) unit + registers.

Microarchitecture Parameters
- Array size: 128 x 128; R/W latency: 50 ns; multi-level cell: 2 bits.
- ADC resolution: 5 bits; ADC frequency: 1.2 GSps.
- 8 PUs per array; 128 registers per PU, resolution 2; LUT size: 256 x 8.
(S+H: sample and hold; S+A: shift and add)

ISA

Class                 Opcode        Format                       Cycles
In-situ computation   ADD           <MASK> <DST>                 3
                      DOT           <MASK> <REG_MASK> <DST>      18
                      MUL           <SRC> <SRC> <DST>            18
                      SUB           <SRC> <SRC> <DST>            3
Moves / R/W           MOV           <SRC> <DST>                  3
                      MOVS          <SRC> <DST> <MASK>           3
                      MOVI          <SRC> <IMM>                  1
                      MOVG          <GADDR> <GADDR>              Variable
Misc                  SHIFT{L/R}    <SRC> <SRC> <IMM>            3
                      MASK          <SRC> <SRC> <IMM>            3
                      LUT           <SRC> <SRC>                  4
                      REDUCE_SUM    <SRC> <GADDR>                Variable


Programming Model
KEY OBSERVATION: we need a programming language that merges the concepts of data-flow and SIMD to maximize parallelism.
- Data-flow: side-effect free; explicit dataflow exposes instruction-level parallelism; no dependence on shared-memory primitives.
- SIMD: exposes data-level parallelism.

Execution Model
[Figure: a data flow graph operating on input matrices A and B.]

Execution Model: Modules
- The data flow graph is decomposed by unrolling its innermost dimension.
- A Module is the modularized execution flow applied to the innermost dimension; replicating Modules across elements exposes DLP.

Execution Model: Instruction Blocks
- An Instruction Block (IB) is a partial execution sequence of a Module and is mapped to a single array.
- Splitting a Module into IB1, IB2, ... exposes ILP within the decomposed data flow graph.

Execution Model: Mapping
- Module: the modularized execution flow applied to the innermost dimension.
- Instruction Blocks (IBs): the components of a Module, each mapped to a single ReRAM array.

Execution Model: Summary
- Data flow graphs over input matrices A and B are decomposed into Modules (along the innermost dimension), and Modules into Instruction Blocks, each of which is mapped to a single ReRAM array.


Compilation Flow
- Front end: TensorFlow programs (written in Python, C++, or Java) emit a DFG as a Protocol Buffer, which feeds the IMP compiler.
- IMP compiler: semantic analysis; optimization (node merging, IB expansion, pipelining); backend (instruction lowering, IB scheduling, code generation), all guided by target machine modeling.


Compiler Optimizations: Node Merging
- Exploits multi-operand ADD/SUB and reduces redundant writebacks.
- Example DFG: the adds 2+6 and 3+5 followed by a reduce (8 + 8 = 16) are merged into a single multi-operand add/reduce that produces 16 directly.

Compiler Optimizations: IB Expansion
- Exposes more of a module's parallelism to the architecture.
- Example DFG: the adds 2+6 and 3+5 are expanded across instruction blocks, with pack/unpack operations inserted where values cross IB boundaries.

Compiler Optimizations: Pipelining
- Unpipelined, each operation's compute and writeback (WB) run back to back: Add_0 compute, WB; Add_1 compute, WB; Reduce compute, WB.
- Pipelined, the writeback of one operation overlaps the compute of the next.


Compiler Backend: Instruction Lowering
- Instruction lowering transforms high-level TensorFlow nodes into the memory ISA; e.g., a Div node is lowered via a Newton-Raphson / Maclaurin expansion into LUT, Mul, and Add instructions.
- Division algorithm, q = a / b:
  1. y0 = rcp(b)   (LUT)
  2. q0 = a * y0
  3. e0 = 1 - b * y0
  4. q1 = q0 + e0 * q0
  5. e1 = e0^2
  6. q2 = q1 + e1 * q1
- Supported TF operation nodes: Add, Sub, Mul, Div, Sqrt, Exp, Sum, Less, Conv2D.
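The same six steps in plain Python (my own sketch; the LUT seed here is a hypothetical 8-bit quantized reciprocal standing in for the hardware LUT):

def lowered_divide(a, b, lut_bits=8):
    # Step 1: coarse reciprocal seed, as a LUT lookup would provide.
    y0 = round((1.0 / b) * 2**lut_bits) / 2**lut_bits
    q0 = a * y0            # 2: first quotient estimate
    e0 = 1.0 - b * y0      # 3: relative error of the seed
    q1 = q0 + e0 * q0      # 4: first correction
    e1 = e0 * e0           # 5: error term shrinks quadratically
    q2 = q1 + e1 * q1      # 6: second correction
    return q2

print(lowered_divide(355.0, 113.0), 355.0 / 113.0)  # converges toward a/b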

Compiler Backend: IB Scheduling
- Target # of IBs = 1: the whole DFG runs on a single array, so there is no network delay but a large execution time.

Compiler Backend: IB Scheduling
- Target # of IBs = 2: execution time shrinks (good), but network delay between IB1 and IB2 appears (bad).

Compiler Backend: IB Scheduling with Bottom-Up Greedy [Ellis 1986]
- Collect candidate assignments, then make final assignments.
- Minimize data-transfer latency by taking both operand and successor locations into consideration.


Bottom-Up Greedy Example (step 1)
- Node 1 is assigned to IB1 because IB1 is closer to its operand locations.

Bottom-Up Greedy Example (step 2)
- Node 2 is assigned to IB2 because earlier time slots are available there.

Bottom-Up Greedy Example (step 3)
- The next node is assigned to IB1 because that gives better overlap of communication and computation.
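To make the flavor of this greedy placement concrete, here is a deliberately simplified sketch (my own illustration, not the paper's Bottom-Up Greedy algorithm, which also collects candidate sets and weighs successor locations): each node goes to the IB that minimizes an estimated cost combining operand-transfer delay and the earliest free slot.

# NETWORK_HOP_DELAY and the cost model below are illustrative assumptions.
NETWORK_HOP_DELAY = 4

def greedy_ib_schedule(dfg_nodes, num_ibs):
    """dfg_nodes: topologically ordered list of (node_id, [operand_ids])."""
    placement, ib_free_at = {}, [0] * num_ibs
    for node, operands in dfg_nodes:
        best_ib, best_cost = 0, None
        for ib in range(num_ibs):
            # Operands placed on other IBs must be moved over the network.
            comm = sum(NETWORK_HOP_DELAY
                       for op in operands if placement.get(op, ib) != ib)
            cost = max(ib_free_at[ib], comm)   # can start once both are ready
            if best_cost is None or cost < best_cost:
                best_ib, best_cost = ib, cost
        placement[node] = best_ib
        ib_free_at[best_ib] = best_cost + 1    # occupy one time slot
    return placement

print(greedy_ib_schedule([("a", []), ("b", []), ("c", ["a", "b"])], num_ibs=2))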

Evaluation Methodology
Benchmarks:
- PARSEC 3.0: Blackscholes, Canneal, Fluidanimate
- Rodinia: Backprop, Hotspot, Kmeans, Streamcluster
Platforms:
- CPU (2 sockets): Intel Xeon E5-2597 v3, 3.6 GHz, 28 cores, 56 threads; 78.96 MB on-chip memory; 64 GB DRAM off-chip; Intel VTune Amplifier (performance) and the Intel RAPL interface (power).
- GPU (1 card): NVIDIA Titan Xp, 1.6 GHz, 3840 CUDA cores; 9.14 MB on-chip memory; 12 GB DRAM off-chip; NVPROF (performance) and the NVIDIA System Management Interface (power).
- IMP: 20 MHz ReRAM, 4096 tiles, 64 ReRAM PUs per tile; 8,590 MB on-chip memory; cycle-accurate simulator (Booksim integrated) with trace-based power simulation.

Offloaded Kernel / Application Speedup (vs. CPU)
[Figure: offloaded-kernel speedup on a log scale, and normalized execution time broken into kernel, data loading, NoC, and sequential+barrier components, for Blackscholes, Fluidanimate, Canneal, Streamcluster, and their geomean, comparing CPU and IMP.]
- Offloaded kernels are 41x faster and whole applications 7.5x faster (geomean).
- The capacity limitation of IMP sets the upper bound on the performance improvement.