Deep Learning Hardware Acceleration

Size: px

Start display at page:

Download "Deep Learning Hardware Acceleration"

Shanon French
5 years ago
Views:

1 * Deep Learning Hardware Acceleration Jorge Albericio + Alberto Delmas Lascorz Patrick Judd Sayeh Sharify Tayler Hetherington* Natalie Enright Jerger Tor Aamodt* + now at NVIDIA Andreas Moshovos

2 Disclaimer The University of Toronto has filed patent applications for the mentioned technologies.

3 Deep Learning: Where Time Goes? Time: ~ 6% - 9% inner products X 1s - + 1s X Convolutional Neural Networks: e.g., Image Classification

4 Deep Learning: Where Time Goes? Time: ~ 6% - 9% inner products X 1s - 1s + X 4

5 SIMD: Exploit Computation Stucture 15 DaDianNao 4K terms/cycle x 15 x x256 x16 x 15 x 5

6 Our Approach 15 Improve by Exploiting Value Properties Maintaining: Filter 15 Massive Parallelism Wide Memory Accesses Filter 15 SIMD Lanes No Modifications to the Networks 15 6

7 Longer Term Goal 15 Filter One Architecture to Rule them All 15 Filter

8 Value Properties to Exploit? Many ~ values 8

9 Value Properties to Exploit? Varying Precision Needs X X X 9

Our Results: Performance 4 3 3 Accuracy 1% 99% 4.

10 Our Results: Performance Accuracy 1% 99% 4.3x 2 2.8x 2 1.6x 3.1x x 1.9x TARTAN + CNVLUTIN STRIPES PRAGMATIC ENGINE P ISCA 16 MICRO 16 Arxiv +ICLR Workshop vs. DaDianNao which was ~3x over GPUs 1

11 Our Results: Memory Footprint and Bandwidth Proteus: 44% less memory bandwidth + footprint 11

12 Roadmap Avoiding computations with ~ Performance from precision Performance from zero bits Reducing footprint and bandwidth 12

13 #1: Skipping Ineffectual Activations 13

14 Cnvlutin: ISCA 16 X + X Many ineffectual multiplications 14

None always zero Zero Activations: Runtime calculated None that

15 Many Activations and Weights are Intrinsically Ineffectual (zero) 45% of Runtime Values are zero % Stable for any Input None always zero Zero Activations: Runtime calculated None that are always zero Zero Weights: Known in Advance Not pursued in this work 15

16 Cnvlutin X + X Many ineffectual multiplications 16

17 Cnvlutin b X + a X Many more ineffectual multiplications 17

18 Cnvlutin Beating Fast and X Dumb SIMD is Hard + On-the-fly ineffectual product elimination Performance + energy X Optional: accuracy loss +performance 18

19 Cnvlutin No Accuracy Loss +52% performance -7% power +5% area Can relax the ineffectual criterion better performance: 6% even more w/ some accuracy loss

20 Deep Learning: Convolutional Neural Networks Beaver maybe image layers 1s-1 2

21 Deep Learning: Convolutional Networks input activations weights i N output activations N i K K > 6% -- 9% of time in Convolutional Layers 21

22 Why are there so many zero neurons? Hypothesis: Filter = feature N z 22

23 SIMD: Exploit Computation Stucture 15 DaDianNao 4K terms/cycle x 15 x x256 x16 x 15 x 23

24 Skipping Ineffectual Activations: Key Challenge Processing all Activations: All Lanes operate in lock step 15 x 15 x x Lane 15 Lane 15 x

25 Naïve Solution: No Wide Memory Accesses 16 independent narrow activation streams Lane 15 Lane 25

26 encode Neuron Mem Removing Zeroes: At the output of each layer Beaver maybe Layer i Layer i + 1

27 Maintaining Wide Accesses But Skipping Zeroes #1: Partition NM in 16 Slices over 16 Banks Processing order does not matter Neuron lane 15 Neuron lane 1 Neuron lane 27

28 Maintaining Wide Accesses But Skipping Zeroes #2: Fetch and Maintain One Container per Slice Container: up to 16 non-zero neurons Neuron lane 15 Neuron lane 1 Neuron lane 28

29 Maintaining Wide Accesses But Skipping Zeroes #3: Keep Neuron Lanes Supplied with One Neuron Per Cycle Lane 15 Lane 1 Lane 29

30 Maintaining Wide Accesses But Skipping Zeroes #4: When a container is exhausted, get the next one within the slice Lane 15 Lane 1 Lane 3

31 Maintaining Wide Accesses But Skipping Zeroes Container: stores only non-zeros Encoding: Value, 4-bit offset Could use 1 extra bit: encoded vs. raw

32 Inside Each Neuron Container ZFNAf: Enabling the Skipping of Ineffectual Neurons Zero-Free Neuron Array Format: Only non-zero neurons + offsets Brick-level 32

33 Cnvlutin: No Accuracy Loss better 33

34 Loosening the Ineffectual Neuron Criterion Treat Neurons close to zero as zero Open Questions: Are these robust? How to find the best? 34

35 #2: Exploiting Precision X X X 35

36 Another Property of CNNs 16 bits X + X Operand Precision Required Fixed? 16 bits? 36

37 CNNs: Precision Requirements Vary 16 bits p bits X + X Operand Precision Required Fixed Varies 5 bits to 13 bits 37

38 Stripes p bits X + X Execution Time = 16 / P Peformance + Energy Efficiency + Accuracy Knob 38

39 Stripes: Key Concept 2 2x2b Terms/Step 2 1x2b 4 1x2b Devil in the Details: Carefully chose what to serialize and what to reuse same input wires as baseline 39

40 SIMD: Exploit Computation Stucture 15 DaDianNao 4K terms/cycle x 15 x x16 x 15 x 4

41 Stripes Bit-Serial Engine x x x x 41

42 Compensating for Bit-Serial s Compute Bandwidth Loss neurons synapses 16 neurons Each Tile: 16 Windows Concurrently 16 neurons each 16 Filters 16 partial output neurons 42

43 Stripes No Accuracy Loss +192% performance -57% energy +32% area * More performance w/ accuracy loss * W/O Older: LeNet + Covnet

44 better Stripes: Performance Boost 44

45 Fully-Connected Layers? neurons synapses 16 neurons Each Tile: No Weight Reuse Cannot Have 16 Windows 45

46 Fully-Connected Layers synapses Input neurons Output neurons No Weight Reuse Cannot Have 16 Windows 46

47 TARTAN: Accelerating Fully-Connected Layers Bit-Parallel Engine V: activation I: weight Both 2 bits 47

48 Bit-Parallel Engine: Processing one Activation x Weight Cycle 1: Activation: a1 and Weight: W 48

49 Bit-Parallel Engine: Processing Another Pair Cycle 2: Activation: a2 and Weight: W a1 x W + a2 x W over two cycles 49

50 weights TARTAN engine 2 x 1b activation inputs 2b or 2 x 1b weight inputs activations 5

51 TARTAN: Convolutional Layer Processing Cycle 1: load 2b weight into BRs 51

52 TARTAN: Weight x 1 st bit of Two Activations Cycle 2: Multiply W with bit 1 of activations a1 and a2 52

53 TARTAN: Weight x 2 nd bit of Two Activations Cycle 3: multiply W with 2 nd bit of a1 and a2 Load new W into BR 3-stage pipeline to do 2: 2b activation x 2b weight 53

54 TARTAN: Fully-Connected Layers: Loading Weights What is different? Weights cannot be reused Cycle 1: Load first bit of two weights into Ars Bit 1 of Two Different Weights 54

55 TARTAN: Fully-Connected Layers: Loading Weights Cycle 2: Load 2 nd bit of w1 and w2 into ARs Bit 2 of Two Different Weights Loaded Different Weights to Each Unit 55

56 TARTAN: Fully-Connected Layers: Processing Activations Cycle 3: Move AR into BR and proceed as before over two cycles 5-stage pipeline to do: TWO of (2b activation x 2b weight) 56

57 TARTAN: Result Summary Bit-Serial TARTAN 2.4x faster than DaDiannao 1.25x more energy efficient at the same frequency 1.5x area overhead 2-bit at-a-time TARTAN 1.6x faster over DaDiannao Roughly same energy efficiency 1.25x area overhead 57

58 Bit-Pragmatic Engine X + X Operand Information Content Varies 58

59 Inner-Products Want to do A x B Let s look at A X B Which bits really matter? 59

60 Zero Bit Content: 16-bit fixed-point Only 8% of bits are non-zero once precision is reduced 15%-1% otherwise 6

61 Zero Bit Content: 8-bit Quantized (Tensorflow-like) Only 27% of bits are non-zero 61

62 Pragmatic Concept: Bit-Parallel Engine 62

63 Pragmatic Concept: Use Shift-and-Add Simply Modify Stripes? Too Large + Cross Lane Synchronization 63

64 Bit-Parallel Engine x x 64

65 STRIPES x x x x 65

66 Pragmatic: Naive STRIPES extension? Problem #1: Too Large offset 248 offset 15 offset 255 offset 16 >> 32 >> 32 BIG >> 32 >> 32 BIG = 3.7x area overhead just for the datapath 66

67 Solution to #1? 2-Stage Shifting Process in groups of Max N Difference Example with N = >> >> 2 OK >> Some opportunity loss, much lower area overhead 67

Difference Example with N = 4 1 1 1 1 15 1

68 Solution to #1? 2-Stage Shifting Process in groups of Max N Difference Example with N = >> >> 2 OK >> Some opportunity loss, much lower area overhead 68

69 Lane Synchronization Different # of 1 bits Lanes go out of sync May have to fetch up to 256 different activations from NM Keep Lanes Synchronized: No cost: All lanes Extra register for weights: Allow columns to advance by 1 Some cost but much better performance 7

70 Speedup and Energy Efficiency vs. DaDianNao 71

71 Bit-Pragmatic No Accuracy Loss X +31% performance - 48% Energy % Area X Better w/ 8-bit Quantization 4.3x with Encoding 72

72 Reducing Memory Footprint and Bandwidth 73

73 Proteus Operand Precision Required Varies X + X Proteus: Store in reduced precision in memory Less Bandwidth, Less Energy 74

74 Proteus: Pick Per Layer Precision Weights (synapses) and Data (activations/neurons) Layered Extension: Compatible with Existing Systems 75

75 Conventional Format: Base Precision 15 x 15 x Data Physically aligns with Unit Inputs

76 4K x 4K Crossbar Conventional Format: Base Precision 15 x 255 x Need Shuffling Network to Route Synapses 4K input bits Any 4K output bit position

77 Proteus Key Idea: Pack Along Data Lane Columns 15 x 15 x Local Shufflers: 16b input 16b output Much simpler 78

78 Proteus 44% less memory bandwidth

79 What s Next Training Prototype Design Space: lower-end confs Unified Architecture Dispatcher + Compute Other Workloads: Comp. Photo General Purpose Compute Class 8

80 A Value-Based Approach to Acceleration More properties to discover and exploit E.g., Filters do overlap significantly CNNs one class Other networks Use the same layers Relative importance different Training 81

Value-Based Deep- Learning Acceleration

THEME ARTICLE: Automotive Computing Value-Based Deep- Learning Acceleration Andreas Moshovos University of Toronto Jorge Albericio NVIDIA Patrick Judd University of Toronto Alberto Delmás Lascorz University