Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters


1 Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters
ARITH 2016, Silicon Valley, July 10-13, 2016
Michael Schaffner (1), Michael Gautschi (1), Frank K. Gürkaynak (1), Prof. Luca Benini (1,2)
Affiliations: (1) not recoverable from the source; (2) Università di Bologna

2 Advanced Processing in IoT
Sense, Analyze and Classify, Transmit
Low Power Processing System
Complex preprocessing close to the sensor, e.g.: feature extraction, regression, classification, compression, sensor fusion

3 Arithmetic with High Dynamic Range (HDR) Desirable
Low Power Processing System
[Figure labels: 100 µW - 2 mW; Fixed-Point, 1-10 mW; Idle: ~1 µW, Active: ~50 mW]
Fixed-point: labor intensive, error-prone, quality losses

4 Arithmetic with High Dynamic Range (HDR) Desirable
Low Power Processing System
[Figure labels: 100 µW - 2 mW; HDR Arithmetic, 1-10 mW; Idle: ~1 µW, Active: ~50 mW]
Fixed-point: labor intensive, error-prone, quality losses
Energy-efficient, low-cost HDR arithmetic desirable

5 Logarithmic Number System (LNS)
FP: integer exponent, integer mantissa
LNS: fixed-point exponent
Efficient MUL, DIV, SQRT (simple integer operations!):
MUL/DIV: $c = \log_2(2^a \cdot 2^{\pm b}) = \log_2(2^{a \pm b}) = a \pm b$
SQRT: $c = \log_2(\sqrt{2^a}) = \log_2(2^{0.5a}) = 0.5a = a \gg 1$
Nonlinear ADD, SUB, I2F, F2I: require a function interpolator, hence a large LNS unit (LNU)
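The integer-only nature of LNS multiplication, division and square root can be illustrated with a small Python sketch. It assumes an 8.23 exponent format and positive values only; sign and zero handling are left out, and the helper names (`to_lns`, `from_lns`, `lns_mul`, `lns_div`, `lns_sqrt`) are hypothetical, not taken from the slides.

```python
import math

FRAC_BITS = 23            # assumption: 8.23 format, 23 fractional exponent bits
SCALE = 1 << FRAC_BITS


def to_lns(x: float) -> int:
    """Encode a positive real as a fixed-point log2 exponent (sign bit omitted)."""
    if x <= 0.0:
        raise ValueError("this sketch only encodes positive values")
    return round(math.log2(x) * SCALE)


def from_lns(e: int) -> float:
    """Decode a fixed-point log2 exponent back to a real value."""
    return 2.0 ** (e / SCALE)


def lns_mul(a: int, b: int) -> int:
    return a + b          # log2(2^a * 2^b) = a + b


def lns_div(a: int, b: int) -> int:
    return a - b          # log2(2^a / 2^b) = a - b


def lns_sqrt(a: int) -> int:
    return a >> 1         # log2(sqrt(2^a)) = 0.5a, an arithmetic right shift


if __name__ == "__main__":
    x, y = to_lns(3.0), to_lns(5.0)
    print(from_lns(lns_mul(x, y)), from_lns(lns_div(x, y)), from_lns(lns_sqrt(x)))
```

Each operation is a single integer add, subtract or shift on the exponent word, which is why these operations are cheap compared to LNS addition and subtraction.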

6 Precision & Approximation
Bilateral filter example: LNS 8.23 (0.5 ulp, precise) vs. LNS 8.17 (16 ulp, approximate)
Error tolerant applications: full precision not always required
Additional tuning knob

7 Contributions
Generator framework for automatic generation of precise (0.5 ulp) and approximate (> 0.5 ulp) LNU instances.
Design space exploration of precise / approximate LNUs.
33%-71% smaller LNU (precise) with more functionality than previous designs [8,9,27].
Case study: accuracy/performance tradeoffs of a shared LNU in a 65nm CMOS multicore cluster.
[8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
[9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
[27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC

8 Problematic LNS Additions/Subtractions
$C = A \pm B$ with $A = 2^a$, $B = 2^b$, $C = 2^c$
Easy case (ADD): $c = \log_2(2^a + 2^b) = \max(a,b) + f^{+}(|a-b|)$
Hard case (SUB): $c = \log_2(2^a - 2^b) = \max(a,b) + f^{-}(|a-b|)$
[Figure label: critical region]
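A minimal floating-point sketch of the two cases follows. The slide does not spell out $f^{+}$ and $f^{-}$; the definitions $f^{\pm}(r) = \log_2(1 \pm 2^{-r})$ with $r = |a-b|$ used here are the standard LNS ones and are an assumption about what the slide plots. Sign tracking of the result is omitted, and the function names are hypothetical.

```python
import math


def f_plus(r: float) -> float:
    """Assumed definition: f+(r) = log2(1 + 2^-r), smooth and bounded for r >= 0."""
    return math.log2(1.0 + 2.0 ** (-r))


def f_minus(r: float) -> float:
    """Assumed definition: f-(r) = log2(1 - 2^-r), singular as r approaches 0."""
    return math.log2(1.0 - 2.0 ** (-r))


def lns_add(a: float, b: float) -> float:
    """log2(2^a + 2^b) computed from the exponents only."""
    return max(a, b) + f_plus(abs(a - b))


def lns_sub(a: float, b: float) -> float:
    """log2(|2^a - 2^b|); the sign of the difference is not tracked in this sketch."""
    r = abs(a - b)
    if r == 0.0:
        raise ValueError("difference is zero, which has no finite logarithm")
    return max(a, b) + f_minus(r)


if __name__ == "__main__":
    a, b = math.log2(6.0), math.log2(2.0)
    print(2.0 ** lns_add(a, b), 2.0 ** lns_sub(a, b))
    # f- becomes very steep for small r, which is what makes SUB hard to interpolate.
    print(f_plus(1e-6), f_minus(1e-6))
```

The last line shows the asymmetry: `f_plus` stays close to 1 near `r = 0`, while `f_minus` dives towards minus infinity, so a table or polynomial of fixed size cannot cover that region accurately.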

9 Critical Region Decomposition
Analytic transformation of $f^{-}$ into subfunctions
Literature: Coleman (1995) [5], Arnold (1998) [4], Vouzis (2007) [7], Coleman (2008) [8], Ismail (2011) [9], Gautschi, Popoff (2016) [27,11]
This work, using Paliouras (1996) [3]
ASIC complexity, 8.23 bit, 0.5 ulp (synthesis): 94 kGE, 63 kGE, 40 kGE, 27 kGE

10 Critical Region Decomposition
$c = \max(a,b) + f^{-}(r)$
$c = \max(a,b) + \log_2\left(\frac{1 - 2^{-r}}{r}\right) + \log_2(r)$, where $\mathrm{cotrans}(r) = \log_2\left(\frac{1 - 2^{-r}}{r}\right)$
[Figure label: critical region]
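The decomposition can be checked numerically. The sketch below assumes $f^{-}(r) = \log_2(1 - 2^{-r})$ and uses `math.expm1` to avoid cancellation for small `r`; it illustrates the identity only and says nothing about how the hardware evaluates the two terms.

```python
import math

LN2 = math.log(2.0)


def f_minus(r: float) -> float:
    """Assumed definition: f-(r) = log2(1 - 2^-r), evaluated via expm1 for small r."""
    return math.log2(-math.expm1(-r * LN2))


def cotrans(r: float) -> float:
    """cotrans(r) = log2((1 - 2^-r) / r), which stays bounded as r approaches 0."""
    return math.log2(-math.expm1(-r * LN2) / r)


if __name__ == "__main__":
    for r in (1e-9, 1e-3, 0.5, 3.0):
        # The singular part of f- is carried entirely by log2(r).
        print(r, f_minus(r), cotrans(r) + math.log2(r), cotrans(r))
```

Near `r = 0` the `cotrans` term tends to $\log_2(\ln 2)$, so the smooth part is easy to interpolate while the steep part reduces to a plain logarithm.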

11 Function Approximation
Approximating $f(r)$, e.g., for 8.23 LNS. Different methods:
1) LUT only (very large!)
2) High order polynomial: often high order required, large interpolator delay
3) LUT + piecewise polynomial: tradeoff between precomputation and interpolation
Half precision to single precision: 1st to 2nd order
[Figure: $f(r)$ and interpolation error for 1st, 2nd and 3rd order; labels $r$ and $d$]
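The tradeoff in method 3) can be sketched in Python for the addition function. The example stores per-segment coefficients in a table and evaluates a 1st or 2nd order polynomial in the offset from the segment midpoint. It uses Taylor coefficients for simplicity rather than the minimax coefficients named on the generator slide, and the segment counts, the range `r_max = 16` and all function names are assumptions chosen for illustration.

```python
import math


def f_plus(r: float) -> float:
    """Assumed target function: f+(r) = log2(1 + 2^-r)."""
    return math.log2(1.0 + 2.0 ** (-r))


def build_lut(n_seg: int, r_max: float, order: int):
    """Precompute Taylor coefficients at each segment midpoint (not minimax)."""
    width = r_max / n_seg
    lut = []
    for i in range(n_seg):
        m = (i + 0.5) * width
        c0 = f_plus(m)
        c1 = -1.0 / (1.0 + 2.0 ** m)                               # f+'(m)
        c2 = 0.5 * math.log(2.0) * 2.0 ** m / (1.0 + 2.0 ** m) ** 2  # f+''(m) / 2
        lut.append((c0, c1, c2 if order >= 2 else 0.0))
    return lut, width


def evaluate(lut, width: float, r: float) -> float:
    """Table lookup on the segment index, then Horner evaluation on the offset."""
    i = min(int(r / width), len(lut) - 1)
    d = r - (i + 0.5) * width
    c0, c1, c2 = lut[i]
    return c0 + d * (c1 + d * c2)


def max_error(n_seg: int, order: int, r_max: float = 16.0, samples: int = 20000) -> float:
    """Largest absolute deviation from f+ on a uniform sample grid."""
    lut, width = build_lut(n_seg, r_max, order)
    points = (k * r_max / samples for k in range(samples))
    return max(abs(evaluate(lut, width, r) - f_plus(r)) for r in points)


if __name__ == "__main__":
    for n_seg in (64, 128):
        print(n_seg, max_error(n_seg, 1), max_error(n_seg, 2))
```

Raising the order reduces the error at a fixed table size, while doubling the table reduces it at a fixed order; which combination is cheaper in hardware is the question the design space exploration addresses.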

12 LNU Generator Framework
Specs: bitwidth, accuracy, order
Iterative fitting heuristic (similar to [30])
Piecewise minimax polynomials (using Sollya [29])
[30] De Dinechin et al., Automatic Generation of Polynomial-Based Hardware Architectures for Function Evaluation, ASAP 2010
[29] Chevillard et al., Sollya: An Environment for the Development of Numerical Codes, ICMS
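One way to picture an iterative fitting loop is recursive bisection: fit a segment, measure its error, and split it when the accuracy target is missed. The sketch below is only an illustration of that idea, not the heuristic of the framework or of [30]. It substitutes a straight line through the segment endpoints for the minimax polynomial that Sollya would produce, and the error is estimated on a sample grid; `chord_error` and `segment` are hypothetical names.

```python
import math


def f_plus(r: float) -> float:
    """Assumed target function: f+(r) = log2(1 + 2^-r)."""
    return math.log2(1.0 + 2.0 ** (-r))


def chord_error(func, lo: float, hi: float, samples: int = 64) -> float:
    """Sampled max error of the line through (lo, func(lo)) and (hi, func(hi))."""
    f_lo, f_hi = func(lo), func(hi)
    worst = 0.0
    for k in range(samples + 1):
        r = lo + (hi - lo) * k / samples
        approx = f_lo + (f_hi - f_lo) * (r - lo) / (hi - lo)
        worst = max(worst, abs(approx - func(r)))
    return worst


def segment(func, lo: float, hi: float, target: float):
    """Bisect [lo, hi] until every segment meets the error target."""
    if chord_error(func, lo, hi) <= target:
        return [(lo, hi)]
    mid = 0.5 * (lo + hi)
    return segment(func, lo, mid, target) + segment(func, mid, hi, target)


if __name__ == "__main__":
    segs = segment(f_plus, 0.0, 16.0, 2.0 ** -10)
    print(len(segs), segs[0], segs[-1])
```

The resulting segments are narrow where the function curves strongly and wide where it is nearly flat, which is the effect a non-uniform table tries to exploit.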

13 Architecture Template
Preprocessing Block, Main Interpolator Block, Log/Exp Block, Postprocessing Block

14 LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
Main Interpolator Block, Log/Exp Block, Postprocessing Block

15 LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
LUTs, Log/Exp Block, Nth order interpolator, Postprocessing Block

16 LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
Postprocessing Block

17 LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$

18 Precise 32bit LNU: Features & Comparison
Design: ELM [8] | ROM-less [9] | ISSCC 16 [27] | This Work
Functionality: ADD, SUB | ADD, SUB | F2I, I2F, EXP, LOG, ADD, SUB | F2I, I2F, EXP, LOG, ADD, SUB
Technology: 180 nm | 180 nm | 65 nm | 65 nm
Max error [ulp]: 0.454 for ELM [8]; the other values are missing in the source
Rows whose values are missing in the source: LUT size [Kbit], Area [µm²], Post-synthesis [kGE], Min delay [ns], Max delay [ns]
[8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
[9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
[27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC

19 Design Space: Precision vs. Delay
umc65, post-synthesis; ulp in the LNS domain
[Figure annotations: -40%; tipping point, 1st to 2nd order]

20 Case Study: HW Platform
Parallel Ultra-Low-Power (PULP) Platform [31]
4x 32b OpenRISC cores (in-order)
16 kByte shared L1 (TCDM), 16 kByte L2 memory
Configurations:
1 shared LNU (Precise, Approx1, Approx2): 4, 3 or 2 pipeline registers; fair round-robin arbiter
4 private FPUs (reference): directly integrated into the cores; 2 pipeline registers
[Figure: PE0 to PE3 sharing one LNU; PE0 to PE3 each with a private FPU]
[31] M. Gautschi et al., Tailoring Instruction-Set Extensions for an Ultra-Low Power Tightly-Coupled Cluster of OpenRISC Cores, in VLSI-SoC,
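A behavioural sketch of fair round-robin arbitration for one shared unit is given below. It assumes that the shared LNU accepts one new operation per cycle and that every core keeps requesting while it has pending operations; these assumptions, the function names and the example workload are illustrative and are not taken from the PULP implementation.

```python
def round_robin_grant(requests, last):
    """Return the index of the next requester after `last`, or None if nobody requests."""
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last + offset) % n
        if requests[idx]:
            return idx
    return None


def simulate(pending):
    """Grant one core per cycle until all pending LNU operations are issued."""
    pending = list(pending)
    last = len(pending) - 1       # so that core 0 has priority in the first cycle
    order = []
    while any(pending):
        grant = round_robin_grant([p > 0 for p in pending], last)
        pending[grant] -= 1
        order.append(grant)
        last = grant
    return order


if __name__ == "__main__":
    # Cores 0..3 with 3, 1, 0 and 2 pending LNU operations (made-up workload).
    print(simulate([3, 1, 0, 2]))
```

Because the priority pointer moves past the most recently served core, no core can be starved, and a core that issues few LNU operations does not slow down the others.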

21 Chip Complexities
Name: FPU | Precise | Approx1 | Approx2
Format: IEEE754 | LNS | LNS | LNS
Precision: 0.5 ulp | 0.5 ulp | 4 ulp* | 16 ulp*
Rows whose values are missing in the source: Bitwidth, Order, Pipeline Stages, FPU/LNU [kGE] (a "4x" marker survives), Total Complexity [kGE]
* In the LNS domain

22 Kernel Level Results
umc65, post-layout
Pipeline depth is the relevant factor!
Energy efficiency gains mainly due to corresponding speedup!

23 Conclusions
Generator framework for precise and approximate LNUs
Very compact 8.23 bit LNU (33%-71% smaller)
Shared setting attractive for LNU: up to 4.2x more energy efficient than the private FPU baseline
Approximation: additional gains in area, speedup and energy efficiency
Energy-efficiency gains mainly due to lower latency and speedup: less time is needed to complete a task, hence lower system energy consumption


24 Outlook
Vectorization and trigonometric extensions
Optimization opportunities for many algorithms to leverage LNS and approximation
PULP Platform: Looking for Collaborators!
OpenRISC / RISC-V ISA
Open source, silicon proven
Extending DSP capabilities
pulp@pulp.ethz.ch

25 Q&A Acknowledgements: Nano Tera IcySoC project

26 Backup Slides

27 Outline
Motivation
Preliminaries: LNS Add/Sub and Interpolation
LNU Architecture and Generator Framework
Multicore Hardware Platform
Results
Conclusion
Q&A

28 Private FPUs
[Figure: Core 0 to Core 3 with INT operations, one private FPU per core for HDR-ADD/SUB/MUL; label: 50%]

29 Private LNUs
[Figure: Core 0 to Core 3 with INT operations, private FPUs and private LNUs; HDR MUL/DIV/SQRT and ADD/SUB]
Area: 1 LNU < 4 standard IEEE compliant FPU (no DIV)
Poor LNU utilization: ~0.2

30 Shared LNU
[Figure: Core 0 to Core 3 with INT operations and HDR-MUL/DIV/SQRT; arbiter and interconnect to one shared LNU for HDR-ADD/SUB/I2F/F2I]

31 Design Space Exploration
Bitwidth: half to single precision (formats from LNS 5.10 upward; the rest of the list is truncated in the source)
Accuracy: precise (0.5 ulp) and approximate (up to 16 ulp)
Order: 1st/2nd order interpolation

32 Design Space: Area vs. Delay
[Figure: design points for Precise, Approx1 and Approx2]
* Required # pipeline stages for 500 MHz target

33 Kernels
Linear Algebra: AXPY, GEMM, GEMV, DotP
Matrix Factorizations: Chol, QR
Geometry: Homographies, Distances, Projection Errors
Image: Gradient Magnitude, Bilateral, FIR
Audio: Butterworth, Sine, DCT-II
Other: Radial Basis Functions
[Figure labels: 50%, 25%]

34 LNU PULP Chips
Selene (ISSCC 16 [27]): UMC 65nm, 4 OpenRISC cores, 1 shared 32bit LNU
Phoebe: UMC 65nm, 4 OpenRISC cores, 1 shared 32bit LNUv2, 1 shared 2x16bit LNUv2

35 Comparison with SFU
Design | Functionality | Format | Precision | Order | NaN, INF support | Post-layout [kGE]
Caro et al. | SQRT, INVSQRT, INV, LOG, EXP, SQRT2, INVSQRT2 | IEEE754 | value missing in source [ulp] | 2 | no | 36.3
LNU | ADD, SUB, F2I, I2F, LOG, EXP, INV*, INVSQRT*, SQRT* | LNS | value missing in source [ulp] | 2 | yes | 36
* Evaluated in integer cores
D. D. Caro, N. Petra, and A. G. M. Strollo, High-Performance Special Function Unit for Programmable 3-D Graphics Processors, IEEE TCAS I, vol. 56, no. 9, pp , Sept

36 PULP Architecture with shared LNU
[Figure: periphery and L2 memory; 4-core cluster and L1 memory]

37 PULP Architecture with shared LNU

38 LNS Example
IEEE 754 float $= (-1)^0 \cdot (\ldots) \cdot 2^5$
LNS $= (-1)^0 \cdot \ldots$

39 Accuracy Impact (1)

40 Accuracy Impact (2)
