Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

This is the author's version of the work. The definitive work was published in Proceedings of the 2nd International Workshop on High-Performance Stencil Computations (HiStencils), Amsterdam, The Netherlands, Jan. 20, 2015.

Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, and Harald Köstler
Hardware/Software Co-Design, Department of Computer Science
System Simulation, Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)

ABSTRACT

Field Programmable Gate Arrays (FPGAs) are an increasingly popular accelerator technology, and not only in the field of High-Performance Computing (HPC). However, they increase the heterogeneity of clusters, which might already be equipped with accelerators such as GPUs. As a result, expertise from different fields has to be combined, e.g., mathematical, algorithmic and technical experts are needed to create numerical solvers for such systems. To bridge this programmability gap, Domain-Specific Languages (DSLs) are a popular choice to generate low-level implementations from an abstract algorithm description. In this work, we demonstrate the generation of implementations of numerical solvers based on the multigrid method for FPGAs from the same codebase that is also used to generate code for CPUs using a hybrid parallelization of MPI and OpenMP. Our approach yields a hardware design that can compute up to 12 V-cycles per second with an input grid size of [...] on a mid-range FPGA, beating vectorized, single-threaded execution on an Intel i7 by a factor of almost three.

1. INTRODUCTION

Already today, a large percentage of clusters and supercomputers are equipped with accelerators, and we expect that, in order to achieve exascale performance, the use of such technologies will intensify and different solutions will be employed at the same time.
However, the implementation of numerical solvers that unleash such a machine's full potential poses a great challenge, even for programming experts. They would require excellent knowledge not only of the different technologies but also of the mathematical and algorithmic implementation details.

A common solution to this challenge are Domain-Specific Languages (DSLs). They decouple the algorithm from its implementation, allowing a description of the program to be formulated in native terms. This description is then transformed into a (binary) program by a DSL compiler. To add new optimizations, e.g., support for upcoming accelerator technologies, only the compiler has to be extended. Then, it merely takes a recompilation of the original DSL program to benefit from the compiler's improvements, whereas in traditional approaches the program has to be extended or often even rewritten from scratch in order to be efficiently parallelized and optimized towards the specifics of a novel architecture.

HiStencils 2015, Second International Workshop on High-Performance Stencil Computations, January 20, 2015, Amsterdam, The Netherlands. In conjunction with HiPEAC.

This approach is taken by the ExaStencils project [7], which researches the feasibility of generating highly scalable numerical solvers based on the multigrid method. As input, a multilayered DSL is proposed, making it possible to tailor each layer specifically towards a certain user group. Furthermore, this approach, in combination with a description of the target platform and profound built-in domain knowledge, enables a great variety of optimizations that can be done automatically by the DSL compiler.

FPGAs have been a popular choice for the implementation of signal processing for a long time. Due to their high computational power in combination with excellent energy efficiency, they are increasingly drawing interest from users of other domains.
Furthermore, newer approaches to supersede traditional hand-coding of Register Transfer Level (RTL) designs have matured: High-Level Synthesis (HLS) frameworks often provide an equal quality of results while achieving significant productivity gains by generating synthesizable hardware descriptions from behavioral algorithm descriptions on a higher abstraction level, e.g., C code. However, describing algorithms using such HLS-specific C code is still very specific to a certain implementation, whereas DSLs allow algorithms to be formulated in a much more abstract manner and thus enable even higher productivity improvements. Regardless of how it is developed, the end product of such a hardware development is synthesized into a so-called Intellectual Property (IP) core, which can then be loaded onto an FPGA or integrated as part of an Application Specific Integrated Circuit (ASIC).

In this work, we demonstrate the feasibility of generating hardware-based multigrid solvers from a Domain-Specific Language via High-Level Synthesis. We substantiate our claims by providing results of a solver that has been generated from our DSL, called ExaSlang 4, and processed by Vivado HLS. However, the presented approach is applicable to any C-based High-Level Synthesis in general.

The rest of this work is structured as follows: In section 2, related work is reviewed. In section 3, a brief introduction to multigrid methods is given. We provide a brief overview of our DSL and present its programming model in section 4. In section 5, the challenges and solutions arising from the shift towards code generation for FPGAs are expounded, whereas in section 6 evaluation results of the actual hardware implementation are presented. Lastly, conclusions of the presented research are drawn in section 7.

2. RELATED WORK

For HLS, numerous solutions have been developed. A popular approach is to start from a simple imperative programming language, e.g., a subset of C, and then translate it by stepwise refinement into a synthesizable hardware description language (HDL). Commercial examples include, besides the aforementioned Vivado HLS by Xilinx, Catapult by Calypto, Forte (now Cadence) Cynthesizer and Synopsys Synphony C Compiler. For specific application fields, programming aids in the form of libraries are available and are often shipped with HLS frameworks. For the domain of image processing, a partial port of the computer vision library OpenCV is shipped with Vivado HLS [16]. However, extending such a library can become quite a burden and poses problems when porting to new hardware. In contrast, DSL-based approaches separate algorithms from their implementation and provide greater flexibility by allowing easier extension to new platforms. PARO [4], for instance, is an HLS environment for the domain of image processing and provides domain-specific augmentations for border treatment and reductions such as median filtering. It is also capable of adaptive multiresolution filtering in medical imaging [5]. In previous work, the benefits of domain-specific optimization have been shown in various domains.
SPIRAL [9], for example, is a widely recognized framework for the generation of hardware and software implementations of digital signal processing algorithms (linear transformations, such as FIR filtering, FFT, and DCT). In the case of hardware generation, soft IP cores in synthesizable RTL Verilog are emitted. ATLAS [15] and FFTW [3] are examples of the generation of mathematical code from abstract descriptions for specific applications such as FFTs, where further optimizations are selected via auto-tuning.

In the field of scientific computing, and especially for stencil computations, numerous approaches building upon Domain-Specific Languages and code generation exist. Examples include Liszt [2], which adds abstractions to Java to ease stencil computations for unstructured problems, and Pochoir [14], which employs a divide-and-conquer skeleton on top of the parallel C extension Cilk to make stencil computations cache-oblivious. PATUS [1] uses auto-tuning techniques to improve performance. However, none of the aforementioned approaches support code generation for FPGAs.

Computations in image processing are often similar to stencil computations, and image processing is another popular area for DSLs. Halide [10] generates, among others, CUDA and OpenCL code from a DSL embedded into C++. The same description can be transformed to Verilog by Darkroom [6]. HIPAcc [8] provides native support for multigrid methods by offering appropriate language elements. It generates low-level accelerator code such as CUDA, OpenCL and Renderscript from a DSL embedded into C++ and was recently extended to emit code that can be processed to IP cores using Vivado HLS [12].

3. MULTIGRID METHODS

In scientific computing, multigrid methods are a popular choice for the solution of large systems of linear equations that may stem from the discretization of partial differential equations (PDEs). One of the most researched PDEs is Poisson's equation, which is used for modeling diffusion processes, e.g., in the simulation of temperature distributions.
The V-cycle, one variant of a multigrid method, is shown in Algorithm 1. In the pre- and post-smoothing steps, high-frequency components of the error are damped by smoothers such as the Jacobi or the Gauss-Seidel methods. In the algorithm, $\nu_1$ and $\nu_2$ denote the number of smoothing steps that are applied. Low-frequency components are transformed into high-frequency components by restricting them to a coarser level, thus making them good targets for the smoother once more. On the coarsest level, direct solving of the remaining linear system of equations is possible due to its low number of unknowns. However, it is also possible to apply a number of smoother iterations. In the case of a single unknown, one smoother iteration corresponds to directly solving the problem.

    if coarsest level then
        solve $A u = f$ exactly or by many smoothing iterations
    else
        $\bar{u}^{(k)} = S^{\nu_1}(u^{(k)}, A, f)$            {pre-smoothing}
        $r = f - A \bar{u}^{(k)}$                             {compute residual}
        $r_H = R r$                                           {restrict residual}
        $e_H = V_H(0, A_H, r_H, \nu_1, \nu_2)$                {recursion}
        $e = P e_H$                                           {interpolate error}
        $\tilde{u}^{(k)} = \bar{u}^{(k)} + e$                 {coarse grid correction}
        $u^{(k+1)} = S^{\nu_2}(\tilde{u}^{(k)}, A, f)$        {post-smoothing}
    end

Algorithm 1: Recursive V-cycle to solve $u^{(k+1)} = V(u^{(k)}, A, f, \nu_1, \nu_2)$.

4. PROGRAMMING MODEL

ExaStencils is a project researching the generation of efficient and scalable numerical solvers based on multigrid methods from a description of the problem formulated in a Domain-Specific Language. During the translation process, domain-specific and hardware-specific optimizations are applied to generate high-performance, scalable C++ code. ExaSlang, short for ExaStencils language, is a Domain-Specific Language consisting of four layers of abstraction, geared towards different classes of users from diverse domains. ExaSlang 4 constitutes the most concrete layer and allows the specification of standard and custom multigrid cycles by providing appropriate language elements such as fields and stencils. It is a procedural programming language, featuring control structures such as functions, loops and conditions.
Furthermore, this layer is explicitly parallel by providing very simple communication statements to specify data to be communicated. A more thorough description of the language and its code generation and transformation framework can be found in [13].

4.1 Language Elements

Being a language that targets the description of multigrid-based numerical solvers, ExaSlang 4 combines elements of procedural languages, such as functions and loops, with domain-specific elements, e.g., stencils and communication-enabled memory arrays. ExaSlang 4 features special data types that we call algorithmic data types, as they stem from the language's domain. Therefore, Stencil and Field can only be used as part of numerical calculations. They are explained in more detail in the next paragraphs.

Stencils are defined by listing the weights around the center using relative addressing. Weights can be represented by any type of expression, including binary expressions and function calls, but, naturally, constant values are possible as well. An example can be seen in Listing 1, where a weighted Jacobi smoother is described. Furthermore, this example showcases the loop over statement, a language element that is used to instantiate an iteration over the computational domain (or a part of it) by defining a Field that is used to determine the computational bounds.

    Stencil {
      [ 0,  0] =>
      [ 1,  0] =>
      [-1,  0] =>
      [ 0,  1] =>
      [ 0, -1] =>
    }

    Function () : Unit {
      loop over {
        = + ((( 1.0 / * 0.8) * - *
      }
    }

Listing 1: Example of a stencil definition and its use as part of a weighted Jacobi smoother in 2D.

Fields represent the actual data that stencils are applied to. They may be given by the user, e.g., as the right-hand side of a calculation, or the unknown to be solved for. A field is defined by providing a layout which specifies further options, e.g., enables communication among the domain partitions. Furthermore, an underlying numerical data type has to be provided.

A key element of ExaSlang 4 are Level Specifications.
They allow the definition of objects such as functions, variables, stencils and fields depending on multigrid levels and can be used to override defaults at specific levels, as illustrated in Listing 2. Here, a smoother that is different from the one called at other levels is implemented for the second-finest level. Identifiers such as current and finest can be used to ease modifying the depth of the multigrid recursion without rewriting large parts of the DSL code. Consequently, references to level-dependent entities, e.g., function calls, need to include the target level.

    Function () : Unit {
      /* ... */
    }

    Function - 1) () : Unit {
      /* ... */
    }

    Function () : Unit {
      repeat 3 times {
        ()
      }
      /* ... */
    }

Listing 2: Example of overriding the default smoother function at a specific multigrid level.

4.2 Code Generation

Multigrid algorithms described in ExaSlang 4 are transformed into C++ code by a transformation framework written in Scala. The input file is parsed and transformed into an abstract syntax tree (AST), to which target-platform-specific alterations and optimizations are applied. We call this stage, where most of the transformations take place, the intermediate representation (IR), since it is only available in the compiler instance. After numerous alterations, the IR is emitted as C++ code that can be compiled with standard compilers such as gcc, Clang, IBM XL and MSVC.

5. MAPPING TO HARDWARE

To generate C++ code that can be mapped and synthesized efficiently to a hardware architecture for FPGAs, the transformation chain has to respect a number of specifics.

5.1 Computational Model

The conventional way to perform stencil computations is to allocate a contiguous chunk of memory and apply the stencils by iterating sequentially over the memory, i.e., a multigrid algorithm is realized as a sequence of smoother, residual calculation, restriction and prolongation stencils.
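The conventional, sequential execution model can be illustrated with a small, self-contained sketch. The Python code below is not ExaStencils output. It is a minimal 1D V-cycle for -u'' = f with homogeneous Dirichlet boundaries, written under the assumptions of a grid with 2^k + 1 points, full-weighting restriction, linear interpolation, rediscretization on coarser grids, and a weighted Jacobi smoother with a damping factor of 0.8. All function names are hypothetical. Each kernel runs to completion over the whole array before the next one starts.

```python
def smooth(u, f, h, sweeps, omega=0.8):
    """Weighted Jacobi sweeps for the stencil [-1, 2, -1] / h^2."""
    for _ in range(sweeps):
        old = u[:]  # Jacobi reads only values of the previous sweep
        for i in range(1, len(u) - 1):
            jacobi = 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i])
            u[i] = (1.0 - omega) * old[i] + omega * jacobi
    return u


def residual(u, f, h):
    """r = f - A u on interior points; boundary entries stay zero."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting onto a grid with half as many intervals."""
    n_coarse = (len(r) - 1) // 2 + 1
    rc = [0.0] * n_coarse
    for i in range(1, n_coarse - 1):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def prolong(ec, n_fine):
    """Linear interpolation of the coarse-grid error to the fine grid."""
    e = [0.0] * n_fine
    for i, value in enumerate(ec):
        e[2 * i] = value
    for i in range(1, n_fine, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e


def v_cycle(u, f, h, nu1=2, nu2=2):
    """One recursive V(nu1, nu2) cycle, executed kernel by kernel."""
    if len(u) == 3:
        # Coarsest level with a single unknown: 2 u / h^2 = f.
        u[1] = 0.5 * h * h * f[1]
        return u
    u = smooth(u, f, h, nu1)                                # pre-smoothing
    r = residual(u, f, h)                                   # compute residual
    rc = restrict(r)                                        # restrict residual
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, nu1, nu2)    # recursion
    e = prolong(ec, len(u))                                 # interpolate error
    u = [ui + ei for ui, ei in zip(u, e)]                   # correction
    return smooth(u, f, h, nu2)                             # post-smoothing
```

Calling `v_cycle` repeatedly on a zero initial guess drives the residual towards zero. Because every call finishes before the next kernel begins, the total run time is the sum of the kernel times, which is the property the streaming model discussed in this section is meant to overcome.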
Usually, these kernels are executed linearly, where application of a new kernel starts only after completion of the previous one. As a consequence, to improve the overall performance, kernel execution times have to be reduced.

In contrast, FPGAs offer a massively parallel hardware architecture which can achieve the best results in combination with data streaming. The concept of implementing the multigrid algorithm as a sequence of stencils can be carried over to the FPGA architecture by converting the computational kernels into hardware modules and laying them out in parallel on the chip. The modules are then interconnected by data streams to form a pipeline, through which data is streamed from one entity to another. Once the pipeline is completely filled, all of the computations are carried out in parallel, providing a continuous output flow of results from a continuous delivery of input data. A key concept in hardware development is to design a component once and replicate it as often as necessary. For multigrid algorithms, we can make use of this principle by designing one stage of the algorithm and replicating it to implement the recursion levels.

Figure 1: Structural representation of the multigrid algorithm implementation.

An important fact to consider is that the lower stages always only have to process a fraction of the data of the next higher stage, e.g., a quarter in the case of 2D. Although this could be exploited in hardware by lowering the clock frequency of the lower stages, a much more sophisticated approach is to increase the pipeline interval, also often called Iteration Interval (II), which describes the number of clock cycles between the arrival of new data elements. To achieve high performance, the topmost level uses a pipeline interval of one, which means the architecture can accept new input data in every clock cycle. In consequence, it also produces results in every clock cycle, after a certain latency. To achieve this, however, a dedicated operator must be instantiated for each operation of the algorithm, and therefore the implementation requires a large amount of hardware resources. If the pipeline interval is increased on the lower levels, hardware operators can be shared among the operations, which leads to significantly lower resource requirements.

An overview of the structure of a multigrid solver is shown in Figure 1. In addition to the actual implementation of each stage, the figure also shows the II for the stages and the data streams, and indicates which connections require buffering. Although it is possible to instantiate buffers on every stream, there are actually only three cases where the interconnection requires buffering. These are (a) after down samplers (part of restrict), (b) before up samplers (part of prolong), and (c) before nodes that combine data streams and have different path lengths.
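The relationship between recursion level and iteration interval can be made concrete with a short calculation. The Python sketch below assumes that every coarser level holds 1/2^d of the data of the next finer level (a quarter in 2D) and that the finest level runs at II = 1. The function name is hypothetical.

```python
def iteration_intervals(levels, dims=2):
    """Map each multigrid level to its pipeline iteration interval (II).

    Assumption: the finest level (numbered `levels`) accepts one element
    per clock cycle, and each coarser level sees 1 / 2**dims as many
    elements, so it can wait 2**dims times longer between inputs.
    """
    factor = 2 ** dims
    return {level: factor ** (levels - level) for level in range(levels, 0, -1)}


if __name__ == "__main__":
    for level, ii in iteration_intervals(8).items():
        print(f"level {level}: II = {ii}")
```

For 8 levels in 2D this yields II = 1 on level 8, II = 4 on level 7, and II = 4^7 = 16384 on level 1. A larger II leaves more clock cycles in which a single hardware operator can be reused for several operations of the same stage.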
The necessity for buffering after down-sampling and before up-sampling is due to different iteration intervals between the stages. The buffer requirement for combining nodes, such as the correction step of the prolongation, becomes evident from the structure of the accelerator. For example, the connection between restriction and prolongation requires a very large buffer, since prolongation must wait until data arrives after having traversed all of the lower stages. Other interconnections in the architecture can be set to simple registered handshake connections, which lowers the total amount of hardware resources required for buffer implementation.

A limitation of the current HLS approach is that the re-use of streams is problematic. In case a data stream is needed as input for multiple kernels, e.g., the result of the residual computation is used for down-sampling and for correction, it has to be duplicated accordingly by inserting kernels that copy, or split, the stream into the required number of output streams. Currently, HLS tools do not automatically duplicate such streams or insert appropriate copy operations. Thus, the required split kernels need to be identified and placed into the pipeline at the correct positions. In the remainder of this section, we explain how code generation must be altered to achieve efficient hardware accelerators by HLS.

5.2 Stencils and Kernels

As already explained, computations are implemented by kernels and connected via streams that represent input and output elements. In previous work, we have introduced a library of standard components to facilitate code generation for C-based HLS [11]. In addition to supporting point and local operators with an arbitrary number of input and output ports, the library also provides operations to handle data streams in complex pipelines.
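The roles of point operators, down samplers and split kernels can be mimicked in software with Python generators. The sketch below is only an analogy under stated assumptions: it does not reproduce the component library of [11] or any HLS API, and the names `split`, `point_op` and `downsample` are hypothetical. Streams are modeled as iterators, and each output of the split kernel owns a FIFO.

```python
from collections import deque


def split(stream, n_outputs=2):
    """Split kernel: duplicate one input stream into n_outputs streams."""
    source = iter(stream)
    fifos = [deque() for _ in range(n_outputs)]

    def output(fifo):
        while True:
            if not fifo:
                try:
                    value = next(source)
                except StopIteration:
                    return
                # Every element read from the source is copied to all FIFOs.
                for queue in fifos:
                    queue.append(value)
            yield fifo.popleft()

    return [output(fifo) for fifo in fifos]


def point_op(func, *streams):
    """Point operator: combine one element from each input stream."""
    for values in zip(*streams):
        yield func(*values)


def downsample(stream, factor=2):
    """Down sampler: forward every factor-th element of the stream."""
    for index, value in enumerate(stream):
        if index % factor == 0:
            yield value


if __name__ == "__main__":
    to_restrict, to_correct = split(iter(range(6)))
    print(list(downsample(to_restrict)))
    print(list(point_op(lambda x: 10 * x, to_correct)))
```

When one consumer of a split stream runs ahead of the other, the FIFO of the slower consumer grows. This is the software counterpart of the buffering requirement for nodes that combine streams with different path lengths.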
Stencils can be expressed by local operators, where the center corresponds to the output element being calculated and neighboring points are addressed in a relative manner, similar to the way stencils are defined in ExaSlang 4. As a consequence of the shift towards the stream processing model, iterations over the computational domain need to be transformed into separate kernels. As an iteration is declared by the loop over statement and all computational domain sizes are known at compile time, a corresponding kernel function with the correct number of stream elements can be derived directly and transformed into a computational kernel. The original loop statement is replaced with an instantiation and call of the kernel. Vivado HLS synthesizes each instantiated kernel into a dedicated hardware module and generates the data streaming interconnect fabric.

During the generation of CPU code, stencil weights are resolved directly into the calculations to reduce memory accesses and enable further optimizations. The same approach is used for the HLS code generation, where placing the coefficients directly into the code yields a lower number of memory streams that need to be processed, i.e., streams that only provide constants are not generated.

5.3 Loops and Recursion

To enable pipelining and parallel execution of the kernels, the loop and recursion constructs of ExaSlang 4 must be unrolled and flattened. The principle can be easily applied to the repeat N times loop, as N is a constant integer. An example of this is the repeated application of the smoother. Since the number of applications is known at compile time, we can simply unroll the loop and generate appropriate kernels. A control structure that requires more attention is recursion. For example, it is used to define the V-cycle, as depicted in Algorithm 1. However, due to ExaSlang's static approach, this information is also available at compile time, which enables us to unroll the recursion and instantiate appropriate kernels.
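Unrolling the recursion at generation time can be sketched as follows. The Python function below is a hypothetical illustration, not the ExaStencils transformation itself: it flattens a V-cycle of a given depth into the ordered list of kernel instances that would have to be created, assuming fixed smoothing counts and a coarsest level that is handled by smoother steps. The kernel names are illustrative.

```python
def unroll_v_cycle(level, nu1=2, nu2=2, coarse_sweeps=1):
    """Flatten the recursive V-cycle into an ordered list of kernel names.

    Assumptions: levels are numbered from `level` (finest) down to 1
    (coarsest), and the coarsest level applies `coarse_sweeps` smoother
    steps instead of a direct solve.
    """
    if level == 1:
        return [f"smooth{level}"] * coarse_sweeps
    kernels = [f"smooth{level}"] * nu1                    # unrolled pre-smoothing
    kernels += [f"residual{level}", f"restrict{level}"]
    kernels += unroll_v_cycle(level - 1, nu1, nu2, coarse_sweeps)
    kernels += [f"prolong{level}", f"correct{level}"]
    kernels += [f"smooth{level}"] * nu2                   # unrolled post-smoothing
    return kernels


if __name__ == "__main__":
    print(unroll_v_cycle(2, nu1=1, nu2=1))
    print(len(unroll_v_cycle(8)))
```

Because the depth and the smoothing counts are constants, the result is a fixed, linear sequence with no remaining control flow. In a streaming design, each entry would correspond to a separately instantiated kernel.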
In addition, we must adjust the iteration interval and the loop bounds of the individual kernels, instantiate restriction and prolongation operators, as well as duplicate data streams where necessary. Moreover, ExaSlang contains repeat until loops, which are executed until a certain criterion is fulfilled. Supporting these constructs inside of the V-cycle, for example for solving the coarsest grid, is difficult, since it would require the repeated sequential execution of a kernel, which would interrupt the streaming pipeline and require extensive buffering of the results of the higher levels. Another use case for the construct is the repeated application of the V-cycle until the desired precision has been reached. However, this does not affect hardware generation.

6. CASE STUDY

To evaluate our approach, we consider a typical example problem: a solution to a finite differences (FD) discretization of Poisson's equation with Dirichlet boundary conditions. We have implemented a typical multigrid solver in ExaSlang 4 using a V(2,2) cycle and a recursion depth of 8. For smoothing, a weighted Jacobi with a pre-calculated optimal ω is used. The solution on the coarsest grid of [...] is approximated by multiple smoother steps. The code generated by ExaSlang 4 was used to infer a hardware description of the multigrid solver using Xilinx Vivado HLS v14.2. As evaluation hardware, we have chosen a mid-range Xilinx Kintex 7 (xc7k325t) FPGA.

Although interconnecting individual modules using FIFO streams and executing these concurrently is supported by the tool using the data flow directive, the buffer size is statically set to 1 and there is no mechanism for automatic determination of the necessary depth of the buffer. To allow uninterrupted execution, buffers on all levels must provide enough space to not produce back-pressure on the pipeline, which might result in a deadlock once the uppermost level is affected.
In addition to static delays between the interconnected functions, due to the latencies of the hardware modules, a complex pipeline consisting of multiple up- and down-sampling steps also produces runtime delays, which vary according to the chosen architecture and are hard to anticipate. As under-provisioning may at least decrease the performance and over-provisioning may waste resources, we have used logic simulation to determine the necessary buffer sizes, which are shown in Table 1.

Table 1: Buffer sizes for interconnecting the stages of the V-cycle (columns: Level, RHS buffer, Result buffer, 4K Pages).

For the chosen grid size of [...] floating point values and 8 recursion levels, the buffers on the top four levels become very large and would overwhelm the amount of available resources, even on very large FPGAs. A solution to allow the fastest possible execution and keep within the maximum amount of available resources is to offload the most challenging buffers to external DDR3 memory. A drawback is that HLS tools, such as the Vivado HLS used here, do not support this from within the tool, but require an FPGA support design to facilitate this. We add input and output arguments for the streams to be externalized to the function definition and specify their type as AXI4-Streaming. In this way, we obtain a high-performance interface to the FPGA fabric for each data connection and do not need to make extensive modifications to the actual accelerator source code.

The FPGA support design uses an interconnect (IC) built on top of a virtual FIFO as an abstraction to an off-chip DDR3 memory. A structural overview of the design is shown in Figure 2. The virtual FIFO is an IP core from Xilinx and can be configured to support up to eight full duplex data channels using word widths of up to 128 bytes. The core uses round robin arbitration between the channels, which can be weighted in terms of how many data bursts are executed in sequence before arbitrating to the next channel.
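Weighted round robin arbitration of this kind can be sketched in a few lines. The Python function below is a generic illustration and makes no claim about the internals of the Xilinx virtual FIFO core. The function name, the representation of channels as lists of pending bursts, and the example weights are all assumptions.

```python
from collections import deque


def weighted_round_robin(channels, weights):
    """Return the service order as (channel index, burst) pairs.

    Channel c may send up to weights[c] bursts in sequence before the
    arbiter moves on to the next channel; empty channels are skipped.
    """
    if len(channels) != len(weights) or any(w < 1 for w in weights):
        raise ValueError("need one weight >= 1 per channel")
    queues = [deque(bursts) for bursts in channels]
    order = []
    while any(queues):
        for index, queue in enumerate(queues):
            for _ in range(weights[index]):
                if not queue:
                    break
                order.append((index, queue.popleft()))
    return order


if __name__ == "__main__":
    pending = [["a0", "a1", "a2"], ["b0"], ["c0", "c1"]]
    for channel, burst in weighted_round_robin(pending, [2, 1, 1]):
        print(channel, burst)
```

With weights of 2, 1 and 1, the first channel receives two consecutive bursts per round, so a channel that produces data faster can be given a proportionally larger share of the memory bandwidth.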
The size of the data burst defines the maximum amount of memory space that can be allocated to each channel. For our application, we use a 64 byte word width, as this is also the word width supported by the underlying DDR3 memory, and select the smallest available burst size of 128 bytes. As the individual channels have different data production rates, we assign different weights for appropriate bandwidth allocation. To allow uninterrupted data exchange between the DDR3 and the accelerator, the AXI IC aggregates data to a very large word width of 64 bytes before passing it to the virtual FIFO.

To adjust the data from the accelerator to the requirements for buffering on the off-chip memory, we have designed an AXI4-Streaming interconnect, as shown in Figure 2. It uses an internal 64 byte data bus and is situated in the same clock domain as the memory interface. Incoming data streams are first adjusted to the internal data width before they are transferred to the internal clock domain using asynchronous FIFO buffers. A switch merges the data streams onto the interface of the virtual FIFO, which stores the data in the channel's memory region according to the destination identifier of the data stream. An equivalent path is used in reverse order to transfer data from the external memory back to the output ports of the interconnect and from there to the accelerator.

Figure 2: Structural representation of the FPGA support design.

Table 2 shows evaluation results of the hardware synthesis from Vivado HLS, which give a rough, conservative estimate of the resource requirements of the design. Indeed, after externalizing the top four largest buffers for results and RHS, the design can be fit onto the chip. Table 3 lists the post place and route (PPnR) results after integrating the modified accelerator into the described FPGA support system and shows that the multigrid solver can be implemented on a Kintex 7 and that the design achieves a maximum clock frequency of over 200 MHz.

Table 2: HLS resource estimates for the multigrid solver design comparing on-chip and external buffering (columns: Resource, On-Chip, External, Available; rows: FF, LUT, BRAM, DSP).

Table 3: PPnR resource requirements of the complete multigrid solver design (columns: LUT, FFs, DSPs, BRAMs, Slices, F_max [MHz]).

We have evaluated the PPnR hardware results on the Kintex-7 FPGA to measure the performance in terms of how many clock cycles it takes to process a grid of [...] floating point values. In combination with the clock frequency of the design, this yields an accurate measurement of the performance. In contrast to a software solution, the pipelining principle also applies here. Thus, it is not necessary to wait until the result is ready, but we can start processing a new grid as soon as all of the input values of the previous grid have been consumed. Using the concept of data streaming, we can also completely hide the communication time with a host that supplies the input data, as current high-speed serial interconnects, such as PCI Express and others, can achieve data rates that exceed the throughput of the accelerator by far.

In order to compare the hardware accelerator to state-of-the-art approaches, we have used the specification of the multigrid solver in ExaSlang to generate C++ code for a single machine. The evaluation was done on a single core of an Intel i7-3770, which is clocked at 3.40 GHz and features L2 and L3 cache sizes of 1 MB and 8 MB, respectively. For the CPU, AVX vectorization was enabled. Table 4 lists the performance results in terms of latency in milliseconds for processing a single iteration of the V-cycle and the throughput in terms of how many iterations of the V-cycle can be processed per second ([Vps]), on average.

Table 4: Comparison of the performance of the multigrid solver on different hardware targets (columns: Target, Latency [ms], Throughput [Vps]; rows: Kintex-7, Intel i7).

It is also worth mentioning that it is irrelevant to the performance of the accelerator whether the input data uses only single-precision floating point or is implemented for double-precision input data. Although the chosen mid-range Kintex-7 FPGA cannot provide the necessary amount of logic and memory resources, switching to a larger FPGA, such as a member of the Virtex-7 family, here an XC7VX485T, can easily solve this issue. To underline this, we have generated source code for HLS of the multigrid solver using double precision floating point arithmetic, for which the estimated resource requirements are listed in Table 5.

Table 5: HLS synthesis estimates for implementation of the multigrid solver using double precision arithmetic. Values are given as percentage of available resource type (columns: FPGA, LUT, FFs, DSPs, BRAMs, F_max [MHz]; rows: Kintex-7, Virtex-7).

7. CONCLUSIONS

In this work, we have presented an approach to map descriptions of multigrid algorithms in a Domain-Specific Language to hardware designs for execution on FPGAs by generating C++ code that can be used with C-based High-Level Synthesis tools. Furthermore, we have outlined the specifics of implementing stencil-based calculations on FPGAs via HLS tools and highlighted differences from the process of code generation for traditional CPU-based programs. We verified our approach by generating a multigrid-based solver for Poisson's equation for two different hardware targets, a Kintex-7 FPGA and an Intel i7 CPU. Both implementations were generated from the same code base in ExaSlang. Additionally, evaluation numbers show that employing FPGAs in HPC is a promising approach to increase computing power while at the same time reducing the energy footprint of clusters.

8. FUTURE WORK

While ExaSlang can be used to describe multigrid-based solvers for three and more dimensions and code generation works, the underlying concept of streaming and buffering needs some refinement. For large datasets in more than two dimensions, current-generation FPGA boards lack sufficiently large memories. Instead, data will have to be stored in the host's memory, utilizing the PCI Express bus and incurring a huge performance penalty. Nevertheless, we will re-evaluate our case study with newer generations of FPGA boards. Additionally, to improve dataset sizes by incorporating onboard DDR3 RAM, memory bandwidths could be improved by employing multiple memory controllers in the design.

Automatic insertion of split kernels to duplicate streams is currently unavailable, as data flow and dependency analysis as part of the ExaStencils transformation framework is not yet finished. From such information, an information-rich call graph could be built, easing generation of linearized kernel executions. Partitioning of data for multiple FPGAs, similar to the way data is partitioned across cluster nodes, is another area worth looking into, especially in the light of HPC, where large datasets are common.

9. ACKNOWLEDGMENTS

This work is supported by the German Research Foundation (DFG), as part of the Priority Programme 1648 "Software for Exascale Computing", under contracts TE 163/17-1 and RU 422/15-1.

References

[1] M. Christen, O. Schenk, and H. Burkhart. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In: Proc. IEEE Int. Parallel & Distributed Processing Symp. (IPDPS). IEEE, 2011.
[2] Z. DeVito et al. Liszt: A domain specific language for building portable mesh-based PDE solvers. In: Proc. Conf. on High Performance Computing Networking, Storage and Analysis (SC). Paper 9, 12 pp. ACM.
[3] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. In: Proc. IEEE 93.2 (Feb. 2005).
[4] F. Hannig, H. Ruckdeschel, H. Dutta, and J. Teich. PARO: Synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications. In: Proc. of the Fourth International Workshop on Applied Reconfigurable Computing (ARC). (London, United Kingdom). Lecture Notes in Computer Science (LNCS). Springer, Mar. 2008.
[5] F. Hannig, M. Schmid, J.
Teic, and H. Hornegger. A deeply pipelined and parallel arcitecture for denoising medical images. In: Proc. of te IEEE International Conference on Field Programmable Tecnology (FPT). (eijing, Cina). IEEE, Dec. 8 10, 2010, pp [6] J. Hegarty, J. runaver, Z. DeVito, J. Ragan-Kelley, N. Coen, S. ell, A. Vasilyev, M. Horowitz, and P. Hanraan. Darkroom: Compiling ig-level image processing code into ardware pipelines. In: ACM Transactions on Grapics (TOG) 33.4 (July 2014), 144:1 144:11. [7] C. Lengauer et al. ExaStencils: Advanced Stencil-Code Engineering First Project Report. Tec. rep. MIP Department of Computer Science and Matematics, University of Passau, June [8] R. Membart, O. Reice, C. Scmitt, F. Hannig, J. Teic, M. Stürmer, and H. Köstler. Towards a performanceportable description of geometric multigrid algoritms using a domain-specific language. In: Journal of Parallel and Distributed Computing (Nov. 2014), 32 pp. [9] M. Püscel, F. Francetti, and Y. Voronenko. SPIRAL. In: Encyclopedia of Parallel Computing. Ed. by D. A. Padua. Springer, 2011, pp [10] J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinge, and F. Durand. Decoupling algoritms from scedules for easy optimization of image processing pipelines. In: ACM Transactions on Grapics (TOG) 31.4 (July 2012), 32:1 32:12. [11] M. Scmid, N. Apelt, F. Hannig, and J. Teic. An image processing library for C-based ig-level syntesis. In: Proceedings of te 24t International Conference on Field Programmable Logic and Applications (FPL). (Munic, Germany). Sept. 2 4, [12] M. Scmid, O. Reice, C. Scmitt, F. Hannig, and J. Teic. Code generation for ig-level syntesis of multiresolution applications on fpgas. In: Proceedings of te First International Worksop on FPGAs for Software Programmers (FSP). (Munic, Germany). Sept. 1, 2014, pp arxiv: [13] C. Scmitt, S. Kuckuk, F. Hannig, H. Köstler, and J. Teic. Exaslang: a domain-specific language for igly scalable multigrid solvers. 
In: Proceedings of te 4t International Worksop on Domain-Specific Languages and Hig-Level Frameworks for Hig Performance Computing (WOLFHPC). To appear. (New Orleans, LA, USA). IEEE, Nov. 17, 2014, 10 pp. [14] Y. Tang, R. A. Cowdury,. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. Te Pocoir stencil compiler. In: Proc. 23rd ACM Symp. on Parallelism in Algoritms and Arcitectures (SPAA). ACM, 2011, pp [15] R. C. Waley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and te ATLAS project. In: Parallel Computing 27.1 (2001), pp [16] Xilinx Inc. Vivado Design Suite User Guide HLS. User Guide