ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI


1 ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI. Bert Moons, Roel Uytterhoeven, Wim Dehaene, Marian Verhelst. ESAT/MICAS, KU Leuven.

2 Embedded Neural Networks: Augmented Reality, Face Recognition, Artificial Intelligence. [Figure: raw data is sent to a cloud GPU, which returns the extracted information.]

3 Embedded Neural Networks: Augmented Reality, Face Recognition, Artificial Intelligence. [Figure: the same applications served by local processing instead of the cloud.]

4 Embedded Neural Networks: 1-to-10 TOPS/W CNN processing is crucial for always-on embedded operation with local processing.

5 Always-on Neural Networks: large-scale, highly accurate CNNs are too expensive for embedded always-on operation. Example: VGG-16 recognition on LFW* (5760 classes, 92.5% accuracy) costs 15.4 GMACs per frame at a 15 MB model size; at 1 TOPS/W that is ~30 mJ/frame, draining a 1200 mAh, 1.5 V AAA battery in 2 h at ~30 fps. [*] Labeled Faces in the Wild dataset.
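
The battery claim checks out with back-of-the-envelope arithmetic (a sketch; 1 MAC is counted as 2 ops, and the ~30 fps rate is inferred so the slide's numbers agree):

```latex
\begin{aligned}
E_{\text{frame}} &\approx \frac{2 \times 15.4 \cdot 10^{9}\ \text{ops}}{10^{12}\ \text{ops/J}} \approx 31\ \text{mJ} \\
E_{\text{batt}} &= 1.2\ \text{Ah} \times 1.5\ \text{V} \approx 6480\ \text{J} \\
t &\approx \frac{6480\ \text{J}}{0.031\ \text{J/frame} \times 30\ \text{fps}} \approx 7000\ \text{s} \approx 2\ \text{h}
\end{aligned}
```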

6 Presentation Outline. A: 1. Hierarchical Recognition; 2. DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling. B: 1. Hardware Implementation; 2. Results.

7 Hierarchical Recognition: hierarchical processing enables always-on CNN-based visual recognition.

8 Hierarchical Face Recognition: hierarchical processing enables always-on compute. A small "Face detected?" network (6 MMACs) gates the large-scale recognition network (15.4 GMACs).

9 Hierarchical Face Recognition: hierarchical processing enables always-on compute. "Face detected?" (6 MMACs) gates "Owner detected?" (12 MMACs), which gates large-scale recognition (15.4 GMACs); a "no" (N) at any stage exits early, a "yes" (Y) wakes the next stage.

10 Hierarchical Face Recognition: hierarchical processing enables always-on compute. "Face detected?" (6 MMACs) gates "Owner detected?" (12 MMACs), then "Friend detected?" (500 MMACs), then large-scale recognition (15.4 GMACs).

11 Hierarchical Face Recognition: hierarchical processing enables always-on compute, as sketched below. CONV-1 "Face detected?": 6 MMACs, 22 kB, 5-44% zeros, 2-4b ops, 94% acc., always-on. CONV-2 "Owner detected?": 12 MMACs, 42 kB, 8-45% zeros, 3-4b ops, 96% acc., ~1% on. CONV-3 "Friend detected?": 500 MMACs, 742 kB, 8-47% zeros, 4-6b ops, 94% acc., ~0.1% on. CONV-4, large-scale recognition: 15 GMACs, 15 MB, 5-82% zeros, 4-6b ops, ~0.01% on. Number of classes, network size, fixed-point precision and energy per frame all increase down the hierarchy.
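
Such a cascade is straightforward to express in software; a minimal sketch (function names, stage ordering and cost bookkeeping are illustrative, not Envision's firmware):

```python
# Wake-up cascade: cheap, low-precision stages gate expensive,
# high-precision ones, so the 15.4 GMAC network runs rarely.

def hierarchical_recognition(frame, stages):
    """stages: (run_network, macs) pairs, ordered cheapest-first.
    run_network(frame) -> (detected, label); a miss powers down
    all later stages (early exit)."""
    label, spent_macs = None, 0
    for run_network, macs in stages:
        spent_macs += macs             # cost bookkeeping only
        detected, label = run_network(frame)
        if not detected:               # e.g. no face: stay in low power
            return None, spent_macs
    return label, spent_macs
```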

12 DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling, a run-time energy-vs-computational-precision trade-off.

13 Precision Scaling, DVAS: Dynamic-Voltage-Accuracy-Scaling. [Figure: a standard 4b multiplier (inputs x3..x0, y3..y0, outputs z3..z0) next to a DVAS multiplier in which the operand LSBs (x1, x0, y1, y0) are gated to zero at reduced precision, cutting switching activity.] As in [4] Moons, VLSI 2016; Moons, JSSC.

14 Precision Scaling, DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling. [Figure: a subword-parallel multiplier; at reduced precision the same datapath computes several independent subword products (x11*y11, x00*y00, ...) per cycle instead of gating its LSBs.]

15 Precision Scaling, DVAFS: DVAFS is a dynamic precision technique that lowers all run-time adaptable parameters: activity, frequency and supply voltage. [Figure: the subword-parallel multiplier of slide 14.]
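
A bit-level software model of the two techniques (my own sketch on unsigned operands, not the RTL): DVAS gates operand LSBs so the full-width multiplier toggles less, while DVAFS instead reuses the idle LSB hardware as N independent subword multipliers.

```python
def dvas_multiply(x, y, n_bits, full=16):
    """DVAS: force the (full - n_bits) LSBs of each operand to zero,
    then run one low-activity multiply on the full-width multiplier."""
    mask = ((1 << n_bits) - 1) << (full - n_bits)
    return (x & mask) * (y & mask)

def dvafs_multiply(xs, ys, n):
    """DVAFS: the same multiplier produces n independent (16/n)-bit
    products per cycle, n in {1, 2, 4} (1x16b, 2x8b, 4x4b)."""
    width = 16 // n
    mask = (1 << width) - 1
    return [(x & mask) * (y & mask) for x, y in zip(xs, ys)]
```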

16 Precision Scaling, System Level: DVAFS outperforms DVAS as it minimizes non-compute overheads at low precision. [Bar chart: energy per word under DVAS at high and low precision, split into memory, control & transfer, compute and compute overhead; only the compute term shrinks with precision, so the overheads dominate.]

17 Precision Scaling, System Level: under DVAFS the memory and control & transfer energy per word shrinks along with compute, since N words share each fetch and each cycle. [Bar chart: DVAFS energy-per-word breakdown at high and low precision.]

18 Precision Scaling, System Level. [Plot: relative energy per operation vs precision (bits) at constant throughput (*T = 76 GOPS): 8x energy scaling in DVAS, 20x in DVAFS.]
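
To first order (a rough model of my own, not the paper's equation), the DVAFS energy per operation at subword factor N combines three savings that DVAS alone cannot: lower activity alpha(N), N results per multiply, and the lower clock and supply that constant throughput T permits:

```latex
E_{\text{op}}(N) \;\propto\; \frac{\alpha(N)\, C_{\text{eff}}\, V_{DD}^{2}(N)}{N},
\qquad
f(N) \;=\; \frac{T}{N \cdot P}
```

with P the number of physical multipliers; the N-fold drop in required f is what allows V_DD(N) to scale down.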

19 Precision Scaling, Body Biasing in FDSOI: DVAFS modulates the leakage-vs-dynamic balance, and body-bias tuning minimizes energy. High precision (dynamic dominant): reduce V_T at constant (V_DD - V_T) and f. Low precision (leakage dominant): increase V_T at constant (V_DD - V_T) and f. [Bar chart: dynamic and leakage energy at nominal vs optimal body bias for both cases.]
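
The trade-off can be sketched as re-balancing two energy terms (first-order; the back-gate coefficient gamma is an assumed model parameter of FDSOI body biasing):

```latex
E \;\approx\; \underbrace{\alpha\, C\, V_{DD}^{2}}_{\text{dynamic}}
\;+\; \underbrace{V_{DD}\, I_{\text{leak}}\, t_{\text{op}}}_{\text{leakage}},
\qquad
I_{\text{leak}} \propto e^{-V_T / (n\, v_T)},
\qquad
V_T = V_{T0} - \gamma\, V_{BB}
```

Forward bias lowers V_T so V_DD can drop at constant speed (good when dynamic energy dominates); reverse bias raises V_T to cut leakage (good in slow, low-precision modes), which is why the optimal bias flips between modes.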

20 Processor Architecture exploits: A. parallelism and data reuse; B. network sparsity; C. varying precision through DVAFS.

21 Optimization, CNN Characteristics (A): convolution operators are highly parallel and the algorithm allows inherent data reuse. Three types of reuse are supported in Envision: convolutional reuse (one image, one filter), image reuse (one image, multiple filters) and filter reuse (multiple images, one filter). [3] Chen, ISSCC 2016.
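
The three reuse patterns map onto loop invariance in the convolution loop nest; a plain sketch (naming mine):

```python
import numpy as np

def conv_layer(images, filters):
    """images: (B, H, W); filters: (F, K, K).
    - convolutional reuse: one weight meets many (y, x) positions
    - image reuse: one image patch feeds all F filters
    - filter reuse: one filter sweeps all B images"""
    B, H, W = images.shape
    F, K, _ = filters.shape
    out = np.zeros((B, F, H - K + 1, W - K + 1))
    for b in range(B):                    # filter reuse across b
        for f in range(F):                # image reuse across f
            for y in range(H - K + 1):    # convolutional reuse
                for x in range(W - K + 1):
                    patch = images[b, y:y + K, x:x + K]
                    out[b, f, y, x] = np.sum(patch * filters[f])
    return out
```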

22 Optimization, CNN Characteristics (B, C): CNN weights and activations are sparse (ReLU activations), and precision varies between apps, networks and layers (non-uniform quantization at 99%* relative benchmark accuracy). Sparsity: LeNet-5 ...%, AlexNet 5-90%, VGG 5-82%. Precision: LeNet-5 1-5 bits, AlexNet 4-9 bits, VGG (*95%) 4-6 bits.

23 A 2D-SIMD DVAFS Architecture. [Block diagram: IO encoder/decoder, DMA, RISC controller with ALU, a 2D-SIMD MAC array, a 1D-SIMD processor (ReLU, max-pool, MAC), four data memories (DM A-D) with GRD guard memories, and input processing.]

24 A 2D-SIMD DVAFS Architecture. [The same block diagram as slide 23.]

25 A 2D-SIMD DVAFS Architecture. [Figure: 2D convolution, Filter * Image = Partial Sum.]

26 A 2D-SIMD DVAFS Architecture: no reuse in a scalar solution; 1 feature * 1 weight per cycle, with one 1x16b fetch each from the Feature SRAM and the Filter SRAM.

27 A 2D-SIMD DVAFS Architecture: convolutional reuse in 1D-SIMD; 16 features * 1 weight per cycle, with 16x16b / 1x16b feature fetches and a 1x16b filter fetch.

28 A 2D-SIMD DVAFS Architecture: convolutional reuse in 1D-SIMD, with a FIFO recirculating fetched features across overlapping convolution windows.

29 A 2D-SIMD DVAFS Architecture: convolutional + image reuse in 2D-SIMD; 16 features * 16 weights per cycle, with 16x16b filter fetches.

30 A 2D-SIMD DVAFS Architecture: convolutional + image + filter reuse in 2D-SIMD DVAFS; 16N features * 16N weights per cycle, with 16x(Nx16b/N) / 1x(Nx16b/N) feature fetches and 16x(Nx16b/N) filter fetches (shown for N=2).

31 A 2D-SIMD DVAFS Architecture, N = 1 (1x16b): 256 MAC units; each multiplies a 16b feature with a 16b filter word and accumulates into a 48b register (SR*). *Status Register.

32 A 2D-SIMD DVAFS Architecture, N = 2 (2x8b): 512 effective MAC units; each multiplier computes two 8b x 8b subword products (the cross partial-product regions are unused) and accumulates into 2x 24b.

33 A 2D-SIMD DVAFS Architecture, N = 4 (4x4b): 1024 effective MAC units; four 4b x 4b subword products per multiplier (cross partial products unused), accumulating into 4x 12b.
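
The three configurations follow one scaling rule: operand width, accumulator width and effective MAC count all scale with N. A small bookkeeping sketch (widths taken from the slides above):

```python
def dvafs_mode(n):
    """Array configuration for subword factor n (1, 2 or 4)."""
    assert n in (1, 2, 4)
    return {
        "subwords_per_multiplier": n,
        "operand_bits": 16 // n,      # 16b -> 8b -> 4b
        "accumulator_bits": 48 // n,  # 48b -> 2x24b -> 4x12b
        "effective_macs": 256 * n,    # 256 -> 512 -> 1024
    }

for n in (1, 2, 4):
    print(dvafs_mode(n))
```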

34 A 2D-SIMD DVAFS Architecture: guard the SRAMs and the 2D array from sparse operands; per-word GRD flags (0/1), stored in GRD SRAMs alongside the Feature and Filter SRAMs, gate both fetches and MAC activity for zero values. As in [4] Moons, VLSI 2016.
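
In a software model, guarding amounts to a one-bit flag per word that suppresses both the SRAM fetch and the multiplier toggling for zero operands (a hedged sketch of the mechanism, not the RTL):

```python
def guarded_mac(acc, features, weights, feat_flags, wgt_flags):
    """GRD flags mark nonzero words. A zero word is neither fetched
    nor multiplied; in hardware the SRAM read and the MAC inputs are
    gated, modeled here as a plain skip."""
    for f, w, ff, wf in zip(features, weights, feat_flags, wgt_flags):
        if ff and wf:          # both operands nonzero
            acc += f * w
    return acc
```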

35 Flexible Memory / IO Compression. [The same block diagram as slide 23.]

36 Flexible Memory / IO Compression: C-programmable (16b instructions); Huffman-based IO compression, up to 5.8x on AlexNet; 16 kB program memory; 128 kB of data memory in four banks (DM A-D) with 3-wise parallel access; 4 kB GRD SRAM holding the sparsity flags. As in [4] Moons, VLSI 2016.
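
Huffman coding pays off here because quantized, ReLU-sparse tensors have a heavily skewed value histogram (mostly zeros). A generic encoder sketch with the standard library; the deck does not disclose Envision's actual code tables, so this is illustrative only:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code: frequent symbols (e.g. 0) get short codes."""
    freq = Counter(symbols)
    # Heap entries: (count, tiebreak, {symbol: codeword}).
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        c0, _, code0 = heapq.heappop(heap)
        c1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in code0.items()}
        merged.update({s: "1" + w for s, w in code1.items()})
        heapq.heappush(heap, (c0 + c1, i, merged))
        i += 1
    return heap[0][2]

weights = [0, 0, 0, 3, 0, -1, 0, 0, 2, 0]   # toy quantized weights
code = huffman_code(weights)
stream = "".join(code[w] for w in weights)  # compressed bitstream
```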

37 Physical Implementation: efficiency and scalability through granular power and body-bias domains.

38 Physical Implementation, 28nm FDSOI. [Floorplan with granular supply and body-bias domains: V_MEM with grounded body bias (BB_GND) for the memories, V_2D with BB1 for the 2D-SIMD MAC array, and V_CTRL with BB2 for the controller.]

39 Physical Implementation, 28nm FDSOI. [Die photo: 1.29 mm x 1.45 mm = 1.87 mm2; 2D-SIMD MAC array, RISC, DMA, memories.]

40 Measurement Results: efficiencies from 0.25-to-10 TOPS/W depending on precision and network sparsity.

41 Measurement Results, 1x16b at nominal body bias: 0.25 TOPS/W at 1.05 V. [Plot: efficiency (TOPS/W) and supply voltage (V) vs throughput (GOPS).]

42 Measurement Results, adding 2x8b: 1 TOPS/W at 0.8 V.

43 Measurement Results, adding 4x4b: 4 TOPS/W at 0.67 V.

44 Measurement Results, adding 30-60% sparse 4x3-4b: 8.2 TOPS/W at 0.61 V.

45 Measurement Results, 1x16b at reduced frequency and throughput, nominal body bias (BB_nom = +/-0.6 V): 0.33 TOPS/W at V = 0.85 V.

46 Measurement Results, 1x16b with optimized body bias (BB_opt = +/-1.2 V): the supply drops to V = 0.70 V and efficiency rises from 0.33 to 0.53 TOPS/W, a 1.6x gain.

47 Measurement Results, 30-60% sparse 4x3-4b at reduced frequency and throughput, nominal body bias (BB_nom = +/-0.6 V): 8.2 TOPS/W at V = 0.61 V.

48 Measurement Results, 30-60% sparse 4x3-4b with optimized body bias (BB_opt = +/-0.2 V): 10 TOPS/W at V = 0.63 V, a 1.2x gain.

49 Measurement Results: overall, a 40x energy-efficiency scaling range, from 0.25 TOPS/W (1x16b, nominal) to 10 TOPS/W (sparse 4x3-4b at BB_opt).

50 Hierarchical Face Recognition Revisited: hierarchical processing enables always-on compute. 2-4b CONV at 4.2 TOPS/W, 3 µJ/frame (always-on); CONV at 4 TOPS/W, 6 µJ/frame (~1% on); CONV at 1.8 TOPS/W, 500 µJ/frame (~0.1% on); 4-6b large-scale CONV at 1.3 TOPS/W (~0.01% on).

51 Hierarchical Face Recognition Revisited: this functionality runs always-on at a 6 µJ/frame average CONV-layer energy consumption.
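
That average follows from weighting each stage's energy by its duty cycle; the fourth stage's per-frame energy is not legible in the transcription, so ~24 mJ is my estimate from 2 x 15.4 GMACs at 1.3 TOPS/W:

```latex
\bar{E} \;\approx\; 1 \cdot 3\,\mu\text{J}
\;+\; 0.01 \cdot 6\,\mu\text{J}
\;+\; 0.001 \cdot 500\,\mu\text{J}
\;+\; 0.0001 \cdot 24\,\text{mJ}
\;\approx\; 6\,\mu\text{J/frame}
```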

52 Comparison: A. highest scalability of energy vs computational precision (40x); B. efficiencies up to 10 TOPS/W.

53 Comparison with SotA (elided values marked "..."):

                  Eyeriss [3] ISSCC'16   Moons [4] VLSI'16   This work (N = 1, 2 or 4)
Technology        65nm LP                40nm LP             28nm FDSOI
f_nom / supply    200 MHz, 1 V           200 MHz, 1.1 V      200 MHz, 1 V
AlexNet CONV      ... mW @ 35 fps        ... fps             N x ... fps
VGG CONV          -                      -                   1.7 fps
Nominal GOPS      ... GOPS               ... GOPS            ... GOPS
Min. Eff.         0.17 TOPS/W            0.27 TOPS/W         0.25 TOPS/W
Max. Eff.         0.25 TOPS/W            2.60 TOPS/W         10.0 TOPS/W

54 Comparison with SotA: homes.esat.kuleuven.be/~mverhels/dlicsurvey.html. [Scatter plot: energy efficiency (TOPS/W) vs throughput (GOPS), grouped by precision (8-bit, 16-bit, ...): this work vs Moons [4], Chen (Eyeriss) and ISSCC'17 designs ID14.2, ID14.6 and ID14.7.]

55 Summary. Envision: a 0.25-to-10 TOPS/W CNN processor, trading energy vs computational precision.

56 Summary. Always-on operation through hierarchical computing. An energy-efficient CNN architecture: 1. a 2D-SIMD baseline; 2. DVAFS compatibility; 3. operator guarding and IO compression. Envision: 0.25-to-10 TOPS/W, with throughput (GOPS) varying with the required network precision. Acknowledgement: this work was partly funded by FWO and Intel Corporation. We thank Synopsys for tool support and STMicroelectronics for silicon donation.
