Iterative Refinement on FPGAs

1 Iterative Refinement on FPGAs. Tennessee Advanced Computing Laboratory, University of Tennessee. JunKyu Lee, July 19th, 2011. This work was partially supported by the National Science Foundation, grant NSF CHE.

2 Floating-Point Performance
Processors (CPU/GPU): fast, customized, static ALUs.
FPGAs: slower clock, but parallel, application-specific ALUs; pipelining; flexible precision.
1) CPUs/GPUs deliver good performance for single and double precision.
2) Can we exploit FPGA flexibility? Arbitrary precision.

3 Benefits from Lower Precision ALUs on FPGAs
Lowering ALU precision yields smaller ALUs than the larger, full-precision ones: shorter wires and shorter pipelines raise the clock rate, and more ALUs fit in a fixed area, increasing parallelism. SPEED UP!
So let us explore precisions on FPGAs. Which applications? Dense linear system solvers: iterative refinement on FPGAs, providing high performance for a prescribed accuracy.

4 Dense Linear System Solvers with Arbitrary Accuracy: extended Mixed Precision Iterative Refinement (XMIR)
1. Iterative Refinement Algorithm
2. Implementation of XMIR on FPGAs
3. Performance Comparison with GPGPUs (Xilinx XC6VSX475T vs. NVIDIA GTX480)
4. Conclusions

5 Iterative Refinement
i = 0; x^(0): a zero vector (n x 1)
Step 1: factor A with GEPP and solve LU x^(1) = P b   [O(n^3), P_L]
Repeat (i = i + 1):
  Step 2: r^(i) = b - A x^(i)   [O(n^2), P_H]
  Step 3: solve LU z^(i) = P r^(i)   [O(n^2), P_L]
  Step 4: x^(i+1) = x^(i) + z^(i)   [O(n), P_H]
Until ||x^(i+1) - x^(i)||_2 <= ε
Notes: GEPP = Gaussian elimination with partial pivoting; A is a square matrix (n x n); b is the right-hand-side vector of the system Ax = b; x is the solution vector; r is the residual vector; P is the permutation matrix from GEPP; P_L = lower precision, P_H = higher precision; ε is the prescribed accuracy. Steps 2-4 are computationally inexpensive compared with the O(n^3) factorization.
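To make the four steps concrete, here is a minimal software sketch of the loop in NumPy/SciPy, with float32 standing in for P_L and float64 for P_H. The function name, tolerance, iteration cap, and the relative form of the stopping test are illustrative assumptions, not details from the talk.

# Hedged sketch of mixed-precision iterative refinement (MPIR-style).
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mpir_solve(A, b, eps=1e-12, max_iter=50):
    # Step 1: LU with partial pivoting in low precision (P_L), O(n^3)
    lu, piv = lu_factor(A.astype(np.float32))
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                  # Step 2: residual in P_H, O(n^2)
        z = lu_solve((lu, piv), r.astype(np.float32))  # Step 3: correction in P_L, O(n^2)
        x = x + z.astype(np.float64)                   # Step 4: update in P_H, O(n)
        # relative form of the slide's stopping test (z = x^(i+1) - x^(i))
        if np.linalg.norm(z) <= eps * np.linalg.norm(x):
            break
    return x

# quick check on a small random system
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)
x = mpir_solve(A, b)
print(np.linalg.norm(b - A @ x))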

6 Direct Method
LU decomposition with partial pivoting (LUPP): P A = L U   [2/3 n^3 ops]
Forward substitution: L y = P b   [n^2 ops]
Back substitution: U x = y   [n^2 ops]
Success condition for iterative refinement: ||x - x*|| / ||x*|| = q with q < 1, where q = ϕ(n) κ(A) ε_0 and ε_0 is the working precision of the LUPP.
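For reference, the direct method on this slide maps onto a few SciPy calls. The helper names below and the choice ϕ(n) = n in the success-condition check are assumptions made only for illustration.

import numpy as np
import scipy.linalg as sla

def direct_solve(A, b):
    # LU decomposition with partial pivoting (A = P L U), ~2/3 n^3 ops
    P, L, U = sla.lu(A)
    y = sla.solve_triangular(L, P.T @ b, lower=True)   # forward substitution, n^2 ops
    x = sla.solve_triangular(U, y, lower=False)        # back substitution, n^2 ops
    return x

def ir_success_estimate(A, eps_low, phi=lambda n: n):
    # Success condition from the slide: q = phi(n) * kappa(A) * eps_0 must be < 1
    return phi(A.shape[0]) * np.linalg.cond(A) * eps_low

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100)); b = rng.standard_normal(100)
print(np.allclose(A @ direct_solve(A, b), b))
print(ir_success_estimate(A, np.finfo(np.float32).eps))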

7 MPIR vs. XMIR
MPIR: single precision for LUPP, then double precision refinement; if the refinement converges, DONE, otherwise fall back to double precision for LUPP.
XMIR: arbitrary precision for LUPP, then arbitrary precision refinement; if the refinement converges, DONE, otherwise fall back to a higher arbitrary precision for LUPP.

8 Impact of XMIR
Assume the time cost model T(ε) = α m^β, where m is the mantissa width needed for accuracy ε, and let γ = m_L / m_H. With k refinement iterations,
T_MPIR = T(ε_D) (2/3 n^3 γ^β + 2 n^2 k (1 + γ^β))
T_XMIR = T(ε_AH) (2/3 n^3 γ^β + 2 n^2 k (1 + γ^β))
If T(A_L) << T(A_H), the low-precision factorization dominates the savings in both schemes.
Achievable accuracy: XMIR: arbitrarily high. MPIR: double.
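A small numeric reading of this cost model is sketched below. The values of α, β, and the iteration count k are illustrative assumptions (the slide's second cost term is read here as involving the refinement iteration count), so the printed ratios are not figures from the talk.

# Hedged evaluation of T(ε) = α m^β with γ = m_L / m_H.
def solver_time(n, m_low, m_high, k=4, alpha=1.0, beta=2.0):
    gamma = m_low / m_high
    t_high = alpha * m_high**beta                 # per-unit cost at the target precision
    return t_high * ((2.0 / 3.0) * n**3 * gamma**beta
                     + 2.0 * n**2 * k * (1.0 + gamma**beta))

n = 4096
full = solver_time(n, m_low=53, m_high=53)        # everything at the target precision
mpir = solver_time(n, m_low=24, m_high=53)        # single-precision LUPP (MPIR-like)
xmir = solver_time(n, m_low=12, m_high=53)        # narrower custom LUPP (XMIR-like)
print(full / mpir, full / xmir)                   # model-predicted speedups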

9 Dense Linear System Solvers with Flexible Precisions
Direct Method: 1 precision. Accuracy: original precision.
Wilkinson's Iterative Refinement (WIR): 2 precisions (original/higher). Accuracy: original precision.
Mixed Precision IR (MPIR): 2 precisions (single / double (original)). Accuracy: double.
Arbitrary Initial Precision MPIR (AMIR): 2 precisions (arbitrary / double (original)). Accuracy: double.
extended MPIR (XMIR): 2 precisions (arbitrary / arbitrary (original)). Accuracy: arbitrary.
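Since XMIR's defining feature is that both precisions are tunable, a software stand-in using mpmath can illustrate the idea. The precisions, tolerance, and the re-solve inside the loop (instead of reusing one low-precision LU factorization, as a real XMIR would) are all simplifications of this sketch, not the FPGA implementation.

from mpmath import mp, matrix, lu_solve, norm

def xmir_sketch(A_list, b_list, prec_low=16, prec_high=160, tol="1e-40", iters=40):
    # Both the inner-solve precision and the refinement precision are arbitrary,
    # so the attainable accuracy is arbitrary as well.
    mp.prec = prec_high
    A, b = matrix(A_list), matrix(b_list)
    tol = mp.mpf(tol)
    x = matrix([0] * len(b_list))
    for _ in range(iters):
        r = b - A * x                     # refinement step at the high working precision
        mp.prec = prec_low
        z = lu_solve(matrix(A_list), r)   # correction at the low precision
        mp.prec = prec_high               # (a real XMIR reuses one low-precision LU here)
        x = x + z
        if norm(z) <= tol * norm(x):
            break
    return x

print(xmir_sketch([[4, 1], [1, 3]], [1, 2]))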

10 Implementation (Xilinx ISE/VHDL): Step 2, Residual Calculation r = b - A x
[Block diagram: a MicroBlaze controller and BRAM status flags manage two ping-pong BRAM pairs (matrix-side and vector-side); each PE contains a multiplier, an adder, a b register file, and a partial-sum register, and a toggling select signal switches which BRAM is being loaded while the other is being computed on.]
Execution time: T = (n / #PEs) (n + k + l + r) / f ≈ n^2 / (#PEs f), with 2 ops per PE per clock cycle; implementation block size = 1,024.
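A quick use of this execution-time model is sketched below; the PE count, clock frequency, and the pipeline-latency terms k, l, and r are placeholder assumptions, not the figures reported on the PAR slides.

def residual_time_seconds(n, num_pes, f_hz, k=12, l=8, r=4):
    # T = (n / #PEs) * (n + k + l + r) / f  ~  n^2 / (#PEs * f)
    return (n / num_pes) * (n + k + l + r) / f_hz

def pe_array_gflops(num_pes, f_hz):
    # 2 floating-point operations per PE per clock cycle (one multiply-add)
    return 2.0 * num_pes * f_hz / 1e9

n = 8192
print(residual_time_seconds(n, num_pes=64, f_hz=200e6))  # time for one residual r = b - Ax
print(pe_array_gflops(num_pes=64, f_hz=200e6))           # sustained GFLOPs of the PE array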

11 Implementation: Place and Route (PAR) Results, Step 2 (Residual Calculation)
[Tables, partially illegible in this transcription: post-PAR results listing, per precision, the exponent and mantissa sizes, pipeline depths (add/mult), DSP48E usage, slices, registers, LUTs, 36Kb BRAM count, number of PEs, post-PAR clock, and GFLOPs. On the Xilinx XC5VLX110T the single-precision design (8-bit exponent, 23-bit mantissa) reaches about 13 GFLOPs. On the Xilinx XC6VSX475T the designs at the four evaluated precisions, from single up to double, reach 71, 35, 21, and 23 GFLOPs.]

12 Implementation: Step 3, Triangular System Solver (Block Method), block size = 64
The lower-triangular matrix is processed in 64 x 64 blocks. After each diagonal block L_jj is solved for y_j, the trailing right-hand-side blocks are updated with the corresponding off-diagonal blocks:
z_2 = z_2 - L_21 y_1
z_3 = z_3 - L_31 y_1 - L_32 y_2
...
z_8 = z_8 - L_81 y_1 - L_82 y_2 - ... - L_87 y_7
so each remaining block z_i accumulates the contributions of all previously solved blocks y_1, ..., y_(i-1) before its own diagonal block is solved.

13 Implementation: Step 3, Triangular System Solver (small 64 x 64 diagonal blocks)
Within a diagonal block the solution is produced column by column:
x_0 = (1/l_00) b_0
x_1 = (1/l_11)(b_1 - l_10 x_0)
x_2 = (1/l_22)(b_2 - l_20 x_0 - l_21 x_1)
x_3 = (1/l_33)(b_3 - l_30 x_0 - l_31 x_1 - l_32 x_2)
In hardware the partial results are kept updated instead: once x_0 is available, z_1 = b_1 - l_10 x_0, z_2 = b_2 - l_20 x_0, and z_3 = b_3 - l_30 x_0; in the next iteration z_2 = z_2 - l_21 x_1 and z_3 = z_3 - l_31 x_1, and so on. A software sketch of this blocked forward substitution follows.
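The NumPy sketch below reproduces the blocked forward substitution of slides 12 and 13 in software (column-by-column solves on the diagonal blocks, matrix-vector updates of the trailing z blocks). The 64-entry block size matches the slides, while the test matrix and function name are assumptions.

import numpy as np

def blocked_forward_sub(L, z, bs=64):
    n = L.shape[0]
    y = z.copy()
    for j in range(0, n, bs):
        jb = min(bs, n - j)
        # Slide 13: solve the diagonal block column by column, updating the
        # remaining right-hand-side entries as each x_k becomes available.
        for k in range(jb):
            y[j + k] /= L[j + k, j + k]
            y[j + k + 1 : j + jb] -= L[j + k + 1 : j + jb, j + k] * y[j + k]
        # Slide 12: update the trailing z blocks with the newly solved y block
        # (z_i <- z_i - L_ij y_j), the matrix-vector work done by the PE array.
        y[j + jb :] -= L[j + jb :, j : j + jb] @ y[j : j + jb]
    return y

# quick check against a dense solve
rng = np.random.default_rng(1)
n = 256
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
z = rng.standard_normal(n)
print(np.allclose(blocked_forward_sub(L, z), np.linalg.solve(L, z)))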

14 Implementation: Step 3, Triangular System Solver (block size = 64)
[Block diagram: an arbiter coordinates the PE with the triangular-matrix BRAM (T BRAM) and the intermediate z-vector BRAM (z BRAM); a division unit (Div_inter) converts the b vector into the intermediate z vector, and the stored diagonal elements (Td BRAM) produce the final solution.] The latency of the division is hidden, and the unit sustains 2 operations per clock cycle.

15 Performance Comparison (GFLOPs vs. precision / mantissa size)
[Bar chart comparing the Xilinx XC6VSX475T (FPGA) with the NVIDIA GTX480 (GPU, MAGMA v0.2) across mantissa sizes; the FPGA bars include 71 and 35 GFLOPs and the GPU bars include 48, 47, and 32 GFLOPs.] Data transfer time from the host to either accelerator (GPGPU/FPGA) is excluded.

16 Conclusions
1. XMIR can produce arbitrary accuracy in dense linear system solvers.
2. For applications requiring very high accuracy, the impact of XMIR is maximized.
3. XMIR (FPGA) covers lower precision through beyond-double precision; MPIR (GPU) covers moderately high precision.
Future work: a hybrid platform (FPGA + GPU) with power-aware performance, and dynamic precision (updating precisions during the iteration).
Thank you. Any questions?
