PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

1 PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning Presented by Nils Weller Hardware Acceleration for Data Processing Seminar, Fall 2017

2 PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning Purpose: - Processing-in-Memory (PIM) architecture to accelerate Convolutional Neural Networks (CNNs) - Based on novel resistive memory (ReRAM) technology - Incremental improvement on prior works

3 Background: CNNs

4 Background: CNNs Goal: Classify image contents Not shown: Nonlinear activation function after convolution

5 Background: CNNs Goal: Classify image contents Main layer type: Convolution

6 Convolution operation Filter matrix Dot product Input image Output feature map Image: Burger, W. (2016): Digital Image Processing. An Algorithmic Introduction Using Java.

7 Convolution operation Traditional: fixed filter, e.g. vertical Sobel Filter matrix Dot product Input image Output feature map Image: Burger, W. (2016): Digital Image Processing. An Algorithmic Introduction Using Java.

8 Convolution operation Traditional: fixed filter, e.g. vertical Sobel CNNs: learned weights for kernel Filter matrix Dot product Input image Output feature map Image: Burger, W. (2016): Digital Image Processing. An Algorithmic Introduction Using Java.
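
To make the operation concrete, here is a minimal NumPy sketch of the convolution on these slides: the same sliding dot product serves both a fixed Sobel kernel and a learned CNN kernel; only the weights differ. All names here are illustrative, not from the paper.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image and take
    the dot product at every position (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Fixed vertical Sobel kernel vs. a learned kernel: same operation.
sobel_v = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
image = np.random.rand(8, 8)
print(conv2d(image, sobel_v).shape)  # (6, 6) output feature map
```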

9 Background: CNNs Goal: Classify image contents

10 Background: CNNs Goal: Classify image contents Two phases: 1. Training 2. Testing (= the forward half of training)

11 Background: CNNs Phase 1: Training Process image Label: boat

12 Background: CNNs Phase 1: Training Process image Label: boat True value (label): dog (0) cat (0) boat (1) bird (0) Error E(output)

13 Background: CNNs Phase 1: Training Process image Label: boat True value (label): dog (0) cat (0) boat (1) bird (0) Error E(output) Backpropagate error, gradient descent method - Calculate error contribution for layers - Update weights to reduce error
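
A minimal sketch of the gradient-descent step this slide describes, for a toy single-layer softmax classifier with one-hot labels. The paper's layers are convolutional; this toy setup and every name in it are my own illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.random.rand(16)              # flattened feature vector
W = np.random.randn(4, 16) * 0.01   # weights for 4 classes
t = np.array([0., 0., 1., 0.])      # one-hot label: "boat"

y = softmax(W @ x)                  # forward pass
dE_dz = y - t                       # softmax + cross-entropy error gradient
dE_dW = np.outer(dE_dz, x)          # error contribution per weight
W -= 0.1 * dE_dW                    # gradient-descent update
```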

14 Background: CNNs Phase 1: Training...

15 Background: CNNs Summary: - Large amounts of data - Acceleration desirable - Particularly for training - Simple core operations (matrix/dot product) - Opportunities for parallelization (single- or multi-image) - Non-trivial training process - Error computations - Dependencies on intermediate results

16 Background: Resistive RAM (ReRAM)

17 Background: Resistive RAM (ReRAM) 1971: Theory of the Fourth Fundamental Circuit Element (Leon Chua): Resistor, Capacitor, Inductor, Memristor. Memristor = Memory + Resistance: - Passive element - Resistance depends on charge passed through it - Enables inherent computational capabilities (no separate processing units; electrical network theory) Image: Wikipedia

18 Background: Resistive RAM (ReRAM) 1971: Theory of the Fourth Fundamental Circuit Element (Leon Chua): Resistor, Capacitor, Inductor, Memristor. Memristor = Memory + Resistance: - Passive element - Resistance depends on charge passed through it - Enables inherent computational capabilities (no separate processing units; electrical network theory) 2008: Strukov et al. (HP Labs): The missing memristor found. In: Nature. Discovery in molecular electronics: - Memristor-like behavior through metal-oxide structures - Enabled through flow of oxygen atoms Image: Wikipedia

19 Background: Resistive RAM (ReRAM) 1971: Theory of the Fourth Fundamental Circuit Element (Leon Chua): Resistor, Capacitor, Inductor, Memristor. Memristor = Memory + Resistance: - Passive element - Resistance depends on charge passed through it - Enables inherent computational capabilities (no separate processing units; electrical network theory) 2008: Strukov et al. (HP Labs): The missing memristor found. In: Nature. Discovery in molecular electronics: - Memristor-like behavior through metal-oxide structures - Enabled through flow of oxygen atoms Since then: - Resistive memory designs and prototypes - Research in Processing-in-Memory with resistive memories Image: Wikipedia

20 Background: Resistive RAM (ReRAM) Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication

21 Background: Resistive RAM (ReRAM) Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication - Accumulation of currents (Kirchhoff's Current Law) - Resistance of memristors acts as weight - Parallel processing! (Figure labels: conductance matrix, feedback resistance)
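
A behavioral sketch of that dot-product engine: weights are stored as conductances, and applying input voltages to the rows yields column currents that are the dot products, all in one step. This is an idealized model of the idea, not the paper's circuit; the parameter names and values are assumptions.

```python
import numpy as np

def crossbar_mvm(voltages, conductances, r_feedback=1.0):
    """Idealized 1T1M crossbar: each column current is the sum of
    (input voltage x cell conductance) over all rows -- Kirchhoff's
    current law performs the dot product physically. The sensing
    amplifier's feedback resistance converts current back to voltage."""
    currents = voltages @ conductances   # I_j = sum_i V_i * G_ij
    return currents * r_feedback

V = np.random.rand(64)             # input vector applied as row voltages
G = np.random.rand(64, 32) * 1e-4  # weight matrix stored as conductances (S)
out = crossbar_mvm(V, G)           # all 32 dot products in parallel
```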

22 Background: Resistive RAM (ReRAM) Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication Naive!

23 Background: Resistive RAM (ReRAM) Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication Naive! - Assumes linear memristor conductance - Ignores circuit parasitics More things to consider, but the basic idea is sound

24 ReRAM-based PIM architecture

25 ReRAM-based PIM architecture Building a complete ReRAM system from building blocks: - HW structures for real CNN processing - Programmable for different CNNs - Process real benchmarks

27 ReRAM-based PIM architecture Building a complete ReRAM system from building blocks: - HW structures for real CNN processing - Programmable for different CNNs - Process real benchmarks Prior works: no training support - claim: pipeline design not suitable for training due to stalls - claim: ADC/DAC overhead could be improved - doesn't do CNNs

29 Side note Full CNN processing introduces further practical issues: 1. Computations are analog → errors will occur 2. Some CNN layers cannot be computed with ReRAM (example: AlexNet, 2012)

30 Side note Full CNN processing introduces further practical issues: 1. Computations are analog → errors will occur (but empirical results show NNs are resilient to errors) 2. Some CNN layers cannot be computed with ReRAM (example: AlexNet, 2012; by 2015, CNNs without LCN shown to work just as well)
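
A tiny experiment in the spirit of the resilience claim (my own illustration, not from the paper or the cited works): perturb a toy classifier's weights with analog-style noise and count how many predictions survive.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 16))        # toy 4-class linear classifier
xs = rng.random((1000, 16))             # random inputs

clean = np.argmax(xs @ W.T, axis=1)     # predictions with exact weights
noise = 0.05 * rng.standard_normal(W.shape)  # ~5% analog-style noise
noisy = np.argmax(xs @ (W + noise).T, axis=1)
print("predictions unchanged:", np.mean(clean == noisy))  # usually most survive
```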

31 PipeLayer: Architecture Main considerations: 1. Training support 2. Intra-Layer Parallelism 3. Inter-Layer Parallelism

32 PipeLayer: Architecture 1. Training support Figure 3: PipeLayer configured for training

33 PipeLayer: Architecture 1. Training support Intermediate memory (memory subarray) Computation and weight storage (morphable subarray) Partial derivative for weight (averaged) Training label Figure 3: PipeLayer configured for training

34 PipeLayer: Architecture 1. Training support Intermediate memory (memory subarray) Computation and weight storage (morphable subarray) Partial derivative for weight (averaged) Training label Figure 3: PipeLayer configured for training Concept of batching: - Process batch of images with fixed weights - Update weights after batch → reduces update overhead
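
A minimal sketch of the batching scheme: partial derivatives are accumulated over the batch while the (ReRAM-resident) weights stay fixed, then applied in one averaged update, which is what reduces the costly write overhead. Toy linear layer; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((2, 16))            # batch of 2 images, as on the slides
labels = rng.random((2, 4))
W = rng.standard_normal((4, 16)) * 0.01
lr = 0.1

def grad(W, x, t):
    """Toy linear-layer gradient of squared error; stands in for the
    full backward pass through the morphable subarrays."""
    err = W @ x - t
    return np.outer(err, x)

# Process the whole batch with fixed weights, accumulate derivatives,
# then apply a single averaged update (one ReRAM write per batch).
accum = np.zeros_like(W)
for x, t in zip(images, labels):
    accum += grad(W, x, t)
W -= lr * accum / len(images)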

35 PipeLayer: Architecture 1. Training support Figure 3: PipeLayer configured for training Process image 1 of 2-sized batch (ignoring parallelism)

38 PipeLayer: Architecture 1. Training support Figure 3: PipeLayer configured for training Process image 2 of 2-sized batch (ignoring parallelism)

40 PipeLayer: Architecture 1. Training support Figure 3: PipeLayer configured for training Batch complete - Weight update

41 PipeLayer: Architecture 1. Training support Image unclear: - Weight update path not shown - Text references nonexistent 'b' derivatives Figure 3: PipeLayer configured for training Batch complete - Weight update

42 PipeLayer: Architecture 2. Intra-layer parallelism

43 PipeLayer: Architecture 2. Intra-layer parallelism Without parallelism: basic crossbar-array matrix-vector computation scheme Added complexity: - Process batch of images in one go - Use multiple kernels

44 PipeLayer: Architecture 2. Intra-layer parallelism With parallelism: - Duplicate processing structure for parallelism - Break up computation arrays due to HW size constraints
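
A sketch of the break-up step: a weight matrix larger than one crossbar is split into crossbar-sized tiles, and partial results from tiles that share output columns are summed. The tile size here is an assumption for illustration, not the paper's configuration.

```python
import numpy as np

TILE = 128  # assumed crossbar dimension; a design parameter in practice

def tiled_mvm(v, W, tile=TILE):
    """Matrix-vector product split across crossbar-sized tiles.
    Each tile sees one slice of the input and one slice of the output;
    partial outputs for the same output slice are accumulated."""
    n_out, n_in = W.shape
    out = np.zeros(n_out)
    for r in range(0, n_in, tile):        # input (row-voltage) slices
        for c in range(0, n_out, tile):   # output (column) slices
            out[c:c+tile] += W[c:c+tile, r:r+tile] @ v[r:r+tile]
    return out

v = np.random.rand(512)
W = np.random.rand(256, 512)              # too big for a single crossbar
assert np.allclose(tiled_mvm(v, W), W @ v)
```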

45 PipeLayer: Architecture 3. Inter-layer parallelism

46 PipeLayer: Architecture 3. Inter-layer parallelism Conceptually: img2 img1

47 PipeLayer: Architecture 3. Inter-layer parallelism Conceptually: img3 img2 img1

48 PipeLayer: Architecture 3. Inter-layer parallelism Conceptually: img4 img3 img2 img1

49 PipeLayer: Architecture 3. Inter-layer parallelism Conceptually: img4 img3 img2 img1 Implications: - Need to buffer multiple intermediate results for later use

50 PipeLayer: Architecture 3. Inter-layer parallelism Conceptually: img4 img3 img2 img1 Implications: - Need to buffer multiple intermediate results for later use - Weight update requires pipeline flush (does it really?)

51 PipeLayer: Architecture 3. Inter-layer parallelism Paper seems to agree on flush/stall: the last image before an update leaves a gap of 2L+1 cycles; the update looks larger in the figure, but takes only 1 cycle

52 PipeLayer: Architecture 3. Inter-layer parallelism Paper seems to agree on flush/stall: the last image before an update leaves a gap of 2L+1 cycles; the update looks larger in the figure, but takes only 1 cycle but: how is this pipeline design superior to ISAAC's?
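
A back-of-envelope model of the schedule as I read these slides: one image enters the pipeline per cycle, the flush before an update costs the last image's full 2L+1-cycle pass, and the update itself takes one cycle. The combination is my reading, not a formula from the paper; it does suggest the flush cost amortizes away for larger batches, which may be the intended answer to the ISAAC comparison.

```python
def batch_cycles(num_layers, batch_size):
    """Approximate pipeline cycles per batch: (batch_size - 1) cycles
    to feed all but the last image, 2L+1 cycles for the last image's
    forward+backward pass (the flush gap from the slide), plus one
    cycle for the weight update itself. Assumed model, not the paper's."""
    L = num_layers
    return (batch_size - 1) + (2 * L + 1) + 1

# Flush cost per image shrinks as the batch grows:
for B in (2, 16, 64):
    print(B, batch_cycles(num_layers=5, batch_size=B) / B)
```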

53 PipeLayer: Implementation

54 PipeLayer: Implementation Spike coding (for energy/area reduction): analog input converted to a digital spike sequence without an ADC. Spike coding driver: input-to-weighted-spikes conversion. Output spike count = accumulated input*weight. Activation function component. Typical division into memory-only + memory/computation areas; details like error propagation not visualized.
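
A toy behavioral model of the spike-coding idea (my illustration, not the paper's circuit): each input becomes a spike count, each spike contributes its weight to the accumulator, and the accumulated total approximates the dot product without DAC/ADC conversions, at the cost of quantization error.

```python
import numpy as np

def spike_encode(x, max_spikes=16):
    """Encode a normalized input in [0, 1] as a discrete spike count."""
    return int(round(x * max_spikes))

def spike_mac(inputs, weights, max_spikes=16):
    """Accumulate weighted spikes: every spike of input i contributes
    weight w_i once, so the total approximates sum(x * w)."""
    total = 0.0
    for x, w in zip(inputs, weights):
        total += spike_encode(x, max_spikes) * w
    return total / max_spikes

xs = np.random.rand(8)
ws = np.random.rand(8)
print(spike_mac(xs, ws), np.dot(xs, ws))  # close, up to quantization error
```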

55 PipeLayer: Discussion - Limited ReRAM precision - Previous works showed NNs tolerate errors well

56 PipeLayer: Evaluation - Large improvements vs. reference GPU - Architecture is simulated (could the results be too optimistic?)

57 Summary The work: - Successful design of ReRAM-based memory architecture for PIM - Good improvements in test setup - Support for training is new (but not a groundbreaking idea) The paper: - Sensibly structured - Appropriate drawings - Many implicit assumptions; reasoning for claims often missing - Many grammatical errors

58 Take-aways 1. The work is made possible by progress in an interesting combination of fields: 1971: Memristor theory; 1990s: Initial PIM concepts; 2008: Molecular electronics; 2012: AlexNet CNN; 2015: Good CNNs without contrast normalization layer → ReRAM-based CNN accelerators 2. Various optimization techniques mentioned in this seminar are used: - Hardware acceleration / PIM - Various layers of parallelism - Precision-speed trade-offs

59 Thanks for your time! Questions?
