Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler
Hardware/Software Co-Design, System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
HiStencils, Amsterdam, The Netherlands; January 20, 2015

Motivation
Multigrid methods are widely used:
- Solution of discretized PDEs
- Preconditioners for other iterative solvers
They are highly scalable: O(N) operations, but also on a device scale, from embedded hardware to supercomputers!
But: the most efficient implementation depends greatly on the numerical problem and the target architecture!
Our research: generation of multigrid-based solvers and automatic application of domain- and target-specific optimizations

Motivation
The code generator works for supercomputers, e.g., on JUQUEEN (TOP500 #8) [1].
[Figure: total runtime [s] (0 to 10) versus number of cores (512 to 256k)]
But can it also work at the other end of the scale, i.e., for energy-efficient embedded devices such as FPGAs?
[1] Christian Schmitt et al. "ExaSlang: A Domain-Specific Language for Highly Scalable Multigrid Solvers". In: Proceedings of the 4th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC). (New Orleans, LA, USA). IEEE Computer Society, Nov. 17, 2014, pp. 42-51. ISBN: 978-1-4799-7020-9. DOI: 10.1109/WOLFHPC.2014.11

Basic Multigrid Ideas
Multigrid method:
1. Pre-smoothing
2. Calculation of residual
3. Restriction
4. Recursive call(s) or solve (at coarsest level)
5. Prolongation
6. Correction
7. Post-smoothing
[Figure: two multigrid cycle schedules, level (0 to 4) over time]
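The seven steps above can be sketched as a recursive V-cycle. This is a minimal illustration, not the generated solver: it assumes a 1D Poisson problem -u'' = f with zero Dirichlet boundaries, damped Jacobi smoothing (damping 2/3), full-weighting restriction and linear interpolation. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<double>;

// Damped Jacobi for -u'' = f (stencil [-1 2 -1] / h^2); boundary values stay fixed.
void smooth(Grid& u, const Grid& f, double h, int sweeps) {
    const double omega = 2.0 / 3.0;  // assumed damping factor
    for (int s = 0; s < sweeps; ++s) {
        const Grid old = u;
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            u[i] = old[i] + omega * 0.5 * (h * h * f[i] + old[i - 1] + old[i + 1] - 2.0 * old[i]);
    }
}

// Residual r = f - A u at interior points (zero on the boundary).
Grid residual(const Grid& u, const Grid& f, double h) {
    Grid r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return r;
}

// Full-weighting restriction from 2^k + 1 points to 2^(k-1) + 1 points.
Grid restrictToCoarse(const Grid& fine) {
    Grid coarse((fine.size() - 1) / 2 + 1, 0.0);
    for (std::size_t j = 1; j + 1 < coarse.size(); ++j)
        coarse[j] = 0.25 * fine[2 * j - 1] + 0.5 * fine[2 * j] + 0.25 * fine[2 * j + 1];
    return coarse;
}

// Linear interpolation back to the next finer grid.
Grid prolongToFine(const Grid& coarse) {
    Grid fine(2 * (coarse.size() - 1) + 1, 0.0);
    for (std::size_t j = 0; j + 1 < coarse.size(); ++j) {
        fine[2 * j] = coarse[j];
        fine[2 * j + 1] = 0.5 * (coarse[j] + coarse[j + 1]);
    }
    fine.back() = coarse.back();
    return fine;
}

double maxAbs(const Grid& g) {
    double m = 0.0;
    for (double v : g) m = std::max(m, std::fabs(v));
    return m;
}

// One V(2,2) cycle; level 0 is the coarsest grid.
void vcycle(Grid& u, const Grid& f, double h, int level) {
    assert(u.size() == f.size() && u.size() >= 3);
    if (level == 0) {            // 4. solve at coarsest level by repeated Jacobi
        smooth(u, f, h, 50);
        return;
    }
    smooth(u, f, h, 2);                          // 1. pre-smoothing
    const Grid r = residual(u, f, h);            // 2. calculation of residual
    const Grid rc = restrictToCoarse(r);         // 3. restriction
    Grid ec(rc.size(), 0.0);
    vcycle(ec, rc, 2.0 * h, level - 1);          // 4. recursive call
    const Grid e = prolongToFine(ec);            // 5. prolongation
    for (std::size_t i = 0; i < u.size(); ++i)   // 6. correction
        u[i] += e[i];
    smooth(u, f, h, 2);                          // 7. post-smoothing
}
```

With 65 grid points and `level = 5`, the coarsest grid has three points, so the 50 Jacobi sweeps there act as a cheap stand-in for a direct solve.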

Basic Multigrid Ideas
[Figure panels: Residual on fine grid; Smoother applied; Residual on coarse grid]

FPGA Basics
Field Programmable Gate Arrays
- Array of lookup tables and registers: configurable logic blocks (CLBs)
- Switch matrices to connect CLBs
- Trade-off between performance of hardware (ASIC) and flexibility of software
- Programming via hardware description languages, e.g., VHDL, Verilog

FPGA Basics
Spread across chip: hard-IP cores
- Block RAM
  - Distributed memory
  - ~1000, 1-2 kb each
  - Very high memory bandwidth
- DSP blocks
  - Dedicated multiplier/adder units
  - Typically 16 × 16 bit
  - Clock: ~500 MHz
- High-speed serial I/O
  - PCIe hard IP for communication with host PC
  - Off-chip DDR3
  - Soft IP support for various protocols (Infiniband, Ethernet, ...)

FPGA Basics
Basic principle: spatial computing, e.g., stream processing
- Temporal computing: sequential execution
- Spatial computing: parallel execution
[Figure: execution over time for both models]
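A back-of-the-envelope cycle count makes the contrast concrete. The sketch below is an idealized model, not a measurement: it assumes every operation takes one clock cycle, the temporal version executes all operations for one element before starting the next, and the spatial version is a pipeline that accepts a new element every `ii` cycles. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Temporal computing: `stages` operations run one after another for each of n elements.
std::uint64_t temporalCycles(std::uint64_t n, std::uint64_t stages) {
    return n * stages;
}

// Spatial computing: the stages form a pipeline. The first result appears after
// `stages` cycles, every further element follows `ii` cycles later.
std::uint64_t spatialCycles(std::uint64_t n, std::uint64_t stages, std::uint64_t ii = 1) {
    assert(ii >= 1);
    return n == 0 ? 0 : stages + (n - 1) * ii;
}
```

For a 1024 × 1024 stream and three pipeline stages this model gives 3145728 cycles for the temporal case and 1048578 for the spatial case.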

FPGA Basics
Traditional workflow
- Hand-coding HDL
- Register Transfer Level (RTL)
- Post Place & Route (PPnR)
- Upload configuration file to FPGA
High-level synthesis (HLS)
- Behavioral description (algorithm, math. model), often in a subset of C/C++
- Conversion to structured description
  - Register-transfer level (RTL)
  - Connected blocks
HLS code example:
    struct Smoother_1Kernel {
      // Receives the 3x3 windows around the current grid point.
      double operator()(double rhsdata_1[3][3], double solutiondata_1[3][3]) {
        // Apply the 5-point Laplace stencil (4 in the center, -1 for each neighbor).
        double temp1 = 4.0f * solutiondata_1[1][1] - solutiondata_1[2][1] - solutiondata_1[0][1]
                       - solutiondata_1[1][2] - solutiondata_1[1][0];
        double temp2 = rhsdata_1[1][1] - temp1;
        double temp3 = solutiondata_1[1][1] + temp2 / 2;
        return temp3;
      }
    };
    void Smoother_1(hls::stream<double>& rhs_in, hls::stream<double>& data,
                    hls::stream<double>& sol, hls::stream<double>& rhs_out) {
      struct Smoother_1Kernel Smoother_1_inst;
      processmimo<16384, 32, 32, 3>(rhs_in, data, sol, rhs_out, 32, 32,
                                    Smoother_1_inst, BorderPadding::BORDER_CLAMP);
    }
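To try the kernel functor from this slide without an HLS toolchain, a plain C++ stand-in for the windowing can help. The sketch below is an assumption about what the `processmimo` call with `BORDER_CLAMP` does conceptually (gather a 3 × 3 window per grid point, repeating edge values at the border); it is not the support library's implementation, and `applyWindowed` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Same arithmetic as the slide's Smoother_1Kernel, with shortened parameter names.
struct SmootherKernel {
    double operator()(double rhs[3][3], double sol[3][3]) const {
        double temp1 = 4.0 * sol[1][1] - sol[2][1] - sol[0][1] - sol[1][2] - sol[1][0];
        double temp2 = rhs[1][1] - temp1;
        return sol[1][1] + temp2 / 2;
    }
};

// Software model: visit a row-major grid, build clamped 3x3 windows, call the kernel.
template <typename Kernel>
std::vector<double> applyWindowed(const std::vector<double>& rhs, const std::vector<double>& sol,
                                  int width, int height, Kernel kernel) {
    assert(width > 0 && height > 0);
    assert(rhs.size() == static_cast<std::size_t>(width) * height && sol.size() == rhs.size());
    std::vector<double> out(sol.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            double rhsWin[3][3], solWin[3][3];
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    // Clamp coordinates so that border points reuse the nearest valid value.
                    const int yy = std::max(0, std::min(y + dy, height - 1));
                    const int xx = std::max(0, std::min(x + dx, width - 1));
                    rhsWin[dy + 1][dx + 1] = rhs[static_cast<std::size_t>(yy) * width + xx];
                    solWin[dy + 1][dx + 1] = sol[static_cast<std::size_t>(yy) * width + xx];
                }
            }
            out[static_cast<std::size_t>(y) * width + x] = kernel(rhsWin, solWin);
        }
    }
    return out;
}
```

A constant solution with a zero right-hand side is left unchanged by this kernel, which is a quick sanity check for the stencil signs.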

ExaSlang
Multi-layered DSL
- Description of multigrid-based numerical solvers
- Layer 4 aimed at computer scientists
- Explicitly parallel by providing simple communication statements
- Definition of arrays, stencils, loops
- Explicit addressing of different multigrid levels
Level specifications
- Referencing of multigrid levels
  - Absolute: @0, @coarsest
  - Relative: @current, @coarser, @finer
- Used to implement multigrid recursion
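How a generator might turn such level specifications into concrete level numbers can be sketched as follows. This is an illustration only: `resolveLevel` is a hypothetical helper, the specifier is passed without the leading `@`, and levels are assumed to be numbered from the coarsest (lowest) to the finest (highest).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Map a level specifier to an absolute level, given the level currently being generated.
int resolveLevel(const std::string& spec, int current, int coarsest, int finest) {
    assert(coarsest <= current && current <= finest);
    int level = 0;
    if (spec == "current")       level = current;
    else if (spec == "coarser")  level = current - 1;
    else if (spec == "finer")    level = current + 1;
    else if (spec == "coarsest") level = coarsest;
    else                         level = std::stoi(spec);  // absolute number such as "0"
    if (level < coarsest || level > finest)
        throw std::out_of_range("level " + spec + " is outside the grid hierarchy");
    return level;
}
```

Resolving `coarser` once per level until `coarsest` is reached unrolls the multigrid recursion into one function instance per level.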

ExaSlang
Fields and Layouts
- Represent multi-dimensional arrays
- Size(s) determined automatically
- Layout determines datatype, communication
- Multiple fields can have the same layout
    Layout NoComm <Real> @all {
      ghostlayers     = [ 0, 0 ]
      duplicatelayers = [ 1, 1 ]
    }
    Field Solution <global, NoComm, 0.0> @all

ExaSlang
Stencils
    Stencil Laplace @all {
      [ 0,  0] =>  4.0,
      [ 1,  0] => -1.0, [-1,  0] => -1.0
      [ 0,  1] => -1.0, [ 0, -1] => -1.0
    }
Loops
    loop over Solution @current {
      Solution2 @current = Solution @current + (((1.0 / diag(Laplace @current)) * 0.8) * (RHS @current - Laplace @current * Solution @current))
    }
- Bounds of loop determined by field
- One loop can be mapped to one kernel function
- Arguments for function via dependency analysis
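The loop on this slide is a damped Jacobi update with damping 0.8. A plain C++ reading of it is sketched below. The data layout is an assumption (row-major grid, update on interior points only, boundary values copied unchanged), and `StencilEntry`, `kLaplace`, `diag` and `jacobiSweep` are hypothetical names, not generator output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct StencilEntry { int dx, dy; double coeff; };

// The Laplace stencil from the slide as offset/coefficient pairs.
const std::vector<StencilEntry> kLaplace = {
    {0, 0, 4.0}, {1, 0, -1.0}, {-1, 0, -1.0}, {0, 1, -1.0}, {0, -1, -1.0}};

// Coefficient of the central entry, mirroring diag(Laplace).
double diag(const std::vector<StencilEntry>& stencil) {
    for (const StencilEntry& e : stencil)
        if (e.dx == 0 && e.dy == 0) return e.coeff;
    return 0.0;
}

// Solution2 = Solution + (1 / diag(Laplace)) * omega * (RHS - Laplace * Solution)
std::vector<double> jacobiSweep(const std::vector<double>& solution, const std::vector<double>& rhs,
                                int w, int h, const std::vector<StencilEntry>& stencil,
                                double omega = 0.8) {
    assert(solution.size() == static_cast<std::size_t>(w) * h && rhs.size() == solution.size());
    std::vector<double> solution2 = solution;  // boundary values are kept as they are
    const double scale = (1.0 / diag(stencil)) * omega;
    for (int y = 1; y + 1 < h; ++y) {
        for (int x = 1; x + 1 < w; ++x) {
            const std::size_t idx = static_cast<std::size_t>(y) * w + x;
            double applied = 0.0;  // stencil applied to the old solution at (x, y)
            for (const StencilEntry& e : stencil)
                applied += e.coeff * solution[static_cast<std::size_t>(y + e.dy) * w + (x + e.dx)];
            solution2[idx] = solution[idx] + scale * (rhs[idx] - applied);
        }
    }
    return solution2;
}
```

Writing into a second field, as the slide does with Solution2, keeps every update independent of the others, which is what allows one loop to become one streaming kernel.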

Mapping to FPGAs
- Resolve stencil applications per multigrid level
- Map each "loop over" statement to a separate IP core
- Dependency analysis: add fields to IP core inputs/outputs
- Calculate field (stream) sizes for IP core
- Replace "loop over" statements with "process" statements
- Connect IP cores with streams and insert copy/split kernels to duplicate streams
- Add iteration intervals from simulation
- Resource sharing optimizations
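The copy-kernel step can be illustrated with a small graph transformation: a stream can feed only one consumer, so every stream read by several kernels gets a copy kernel that duplicates it. The sketch below is a simplified assumption of how such a pass might look. The `Kernel` struct, the `cpy_` prefix and the `_1`, `_2` suffixes are hypothetical naming choices.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Kernel {
    std::string name;
    std::vector<std::string> inputs;   // streams read by the kernel
    std::vector<std::string> outputs;  // streams written by the kernel
};

// For every stream with more than one reader, rename the readers' inputs to
// numbered copies and append a copy kernel that produces those copies.
std::vector<Kernel> insertCopyKernels(std::vector<Kernel> kernels) {
    std::map<std::string, int> readers;
    for (const Kernel& k : kernels)
        for (const std::string& in : k.inputs) ++readers[in];

    std::map<std::string, int> nextCopy;
    for (Kernel& k : kernels) {
        for (std::string& in : k.inputs) {
            if (readers[in] > 1) {
                const std::string original = in;
                in = original + "_" + std::to_string(++nextCopy[original]);
            }
        }
    }

    for (std::map<std::string, int>::const_iterator it = readers.begin(); it != readers.end(); ++it) {
        if (it->second <= 1) continue;
        assert(nextCopy[it->first] == it->second);  // each reader received its own copy
        Kernel copy;
        copy.name = "cpy_" + it->first;
        copy.inputs.push_back(it->first);
        for (int i = 1; i <= it->second; ++i)
            copy.outputs.push_back(it->first + "_" + std::to_string(i));
        kernels.push_back(copy);
    }
    return kernels;
}
```

After this pass every stream has exactly one reader, so the kernels can be wired up point to point.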

Mapping to FPGAs
Kernels connected via streams:
[Figure: data-flow graph of level 8 between data_in/rhs_in and data_out; labels include pre8, smooth8_1 to smooth8_4, sol8_1, sol8_2, rhs8, residual8, res8, cpy8, res8_1, res8_2, sm8, corr8, correction, up8, downsample8, upsample8, post8]
FIFO buffers are needed between cores for different stream sizes, i.e., for downsampling and upsampling

Mapping to FPGAs
Stages can be reused due to FIFO buffering:
[Figure: 8-stage multigrid solver between data_in and data_out. Each stage combines smooth, residual, restrict, prolong and correct kernels with buffers (B) in between. Iteration intervals shown: II=1 (stage 8), II=4 (stage 7), II=4096 (stage 2), II=16384 (stage 1, smooth1)]
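The iteration intervals in the figure follow a simple pattern: in 2D each coarser level holds a quarter of the grid points, so a coarser stage can take four times as many cycles per element without slowing the finest stage. The sketch below reproduces the figure's values under that assumption. The deck states that the intervals were added from simulation, so the closed formula and the function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Grid points per dimension on `level`, assuming the size halves with each coarser level.
std::uint64_t gridSize(int level, int finestLevel, std::uint64_t finestSize) {
    assert(level >= 0 && level <= finestLevel);
    return finestSize >> (finestLevel - level);
}

// Iteration interval for a 2D stage: 4^(finestLevel - level).
std::uint64_t iterationInterval(int level, int finestLevel) {
    assert(level >= 0 && level <= finestLevel);
    return std::uint64_t(1) << (2 * (finestLevel - level));
}
```

With 8 levels and a 4096 × 4096 input, level 1 is a 32 × 32 grid with II=16384, which is consistent with the values 16384 and 32 in the `Smoother_1` call shown earlier in the deck.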

Results
Setup
- V(2,2) solver for Poisson's equation
- 8 multigrid levels, fixed Jacobi smoothers
- Jacobi applied multiple times for coarse grid solving
- Input grid size of 4096 × 4096
High-level synthesis
- Xilinx Vivado HLS v14.2
- Small support library to help with IP core instantiation [2]
- Buffer sizes calculated using external simulation tool
[2] Moritz Schmid et al. "An Image Processing Library for C-based High-Level Synthesis". In: Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL). (Munich, Germany). Sept. 2-4, 2014

Results
Resource usage on FPGAs for double precision:
FPGA           LUTs   FFs   DSPs   BRAMs   Fmax [MHz]
Kintex-7 [3]   140%   43%   111%   124%    232.0
Virtex-7 [4]    73%   29%    33%    53%    229.4
- Sharing of stages due to FIFO buffers
- Increase coarser stages' iteration intervals (II) for resource sharing
- Double precision not possible for Kintex-7
- More stages could be added for Virtex-7
[3] Estimation by Vivado HLS. Place & Route not possible due to resource constraints.
[4] PPnR result

Results
Performance figures for a single V-cycle:
Target         Runtime [ms]   Throughput [Vps] [5]
FPGA [6]       83.1           12.3
Intel i7 [7]   223.1          4.5
[5] V-cycles per second
[6] Performance is the same for Kintex-7 (XC7K325T) and Virtex-7 (XC7VX485T). Single precision on Kintex-7, double precision on Virtex-7.
[7] Intel i7-3770, 3.40 GHz, single thread. Besides AVX, no optimizations were applied by our code generator. Double precision. Code is memory-bandwidth bound. Compiled with -O3.

Summary
Conclusions
- Code generation for FPGAs based on HDL
- ExaStencils code generator flexible enough to emit code for a fundamentally different computing model
- Performance of mid-range FPGAs already promising
Future work
- Research (algorithmic) optimization potential
- Smarter grid traversal for 3D
- Automatic calculation of buffer sizes
- Partitioning among multiple FPGA boards
- Automatic application of HLS and hardware synthesis
- Automatic re-configuration at runtime if convergence prediction insufficient

Thanks for listening. Questions?
ExaStencils: Advanced Stencil Code Engineering
http://www.exastencils.org
ExaStencils is funded by the German Research Foundation (DFG) as part of the Priority Program 1648 (Software for Exascale Computing).