Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

This is the author's version of the work. The definitive work was published in Proceedings of the 2nd International Workshop on High-Performance Stencil Computations (HiStencils), Amsterdam, The Netherlands, Jan 20, 2015.

Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, and Harald Köstler
Hardware/Software Co-Design, Department of Computer Science
System Simulation, Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)

ABSTRACT

Field Programmable Gate Arrays (FPGAs) are an increasingly popular accelerator technology, and not only in the field of High-Performance Computing (HPC). However, they increase the heterogeneity of clusters, which might already be equipped with accelerators such as GPUs. As a result, expertise from different fields has to be combined, e.g., mathematical, algorithmic and technical experts are needed to create numerical solvers for such systems. To bridge this programmability gap, Domain-Specific Languages (DSLs) are a popular choice to generate low-level implementations from an abstract algorithm description. In this work, we demonstrate the generation of implementations of numerical solvers based on the multigrid method for FPGAs from the same codebase that is also used to generate code for CPUs using a hybrid parallelization of MPI and OpenMP. Our approach yields a hardware design that can compute up to 12 V-cycles per second with an input grid size of on a mid-range FPGA, beating vectorized, single-threaded execution on an Intel i7 by a factor of almost three.

1. INTRODUCTION

Already today, a large percentage of clusters and supercomputers are equipped with accelerators and we expect that, in order to achieve exascale performance, the use of such technologies will intensify and different solutions will be employed at the same time.
However, the implementation of numerical solvers that unleash such a machine's full potential poses a great challenge, even for programming experts. They would not only require excellent knowledge of the different technologies but also of the mathematical and algorithmic implementation details. A common solution to this challenge is the use of Domain-Specific Languages (DSLs). They decouple the algorithm from its implementation, allowing a description of the program to be formulated in native terms. This description is then transformed into a (binary) program by a DSL compiler. To add new optimizations, e.g., support for upcoming accelerator technologies, only the compiler has to be extended. Then, it merely takes a recompilation of the original DSL program to benefit from the compiler's improvements, whereas in traditional approaches the program has to be extended or often even rewritten from scratch in order to be efficiently parallelized and optimized towards the specifics of a novel architecture.

HiStencils 2015, Second International Workshop on High-Performance Stencil Computations, January 20, 2015, Amsterdam, The Netherlands. In conjunction with HiPEAC. http://

This approach is used by project ExaStencils [7], which researches the feasibility of generating highly scalable numerical solvers based on the multigrid method. As input, a multi-layered DSL is proposed, making it possible to tailor each layer especially towards a certain user group. Furthermore, this approach, in combination with a description of the target platform and profound built-in domain knowledge, enables a great variety of optimizations that can be done automatically by the DSL compiler.

FPGAs have been a popular choice for the implementation of signal processing for a long time. Due to their high computational power in combination with excellent energy efficiency, they are increasingly drawing interest from users of other domains.
Furthermore, newer approaches to supersede traditional hand-coding of Register Transfer Level (RTL) designs have matured: High-Level Synthesis (HLS) frameworks often provide an equal quality of results while achieving significant productivity gains by generating synthesizable hardware descriptions from behavioral algorithm descriptions on a higher abstraction level, e.g., C code. However, describing algorithms using such HLS-specific C code is still very specific towards a certain implementation, whereas DSLs allow algorithms to be formulated in a much more abstract manner and thus enable even higher productivity improvements. Regardless of how it is developed, the end product of such a hardware development is synthesized into a so-called Intellectual Property (IP) core, which can then be loaded onto an FPGA or integrated as part of an Application Specific Integrated Circuit (ASIC).

In this work, we demonstrate the feasibility of generating hardware-based multigrid solvers from a Domain-Specific Language via High-Level Synthesis. We substantiate our claims by providing results of a solver that has been generated from our DSL, called ExaSlang 4, and processed by Vivado HLS. However, the presented approach is applicable to any C-based High-Level Synthesis in general.

The rest of this work is structured as follows: In Section 2, related work is reviewed. In Section 3, a brief introduction to multigrid methods is given. We provide a brief overview of our DSL and present its programming model in Section 4. In Section 5, the challenges and solutions arising from the shift towards code generation for FPGAs are expounded, whereas in Section 6 evaluation results of the actual hardware implementation are presented. Lastly, conclusions of the presented research are drawn in Section 7.

2. RELATED WORK

For HLS, numerous solutions have been developed. A popular approach is to start from a simple imperative programming language, e.g., a subset of C, and then translate it by stepwise refinement into a synthesizable hardware description language (HDL). Commercial examples include, besides the aforementioned Vivado HLS by Xilinx, Catapult by Calypto, Forte (now Cadence) Cynthesizer and Synopsys Synphony C Compiler. For specific application fields, programming aids in the form of libraries are available and are often shipped with HLS frameworks. For the domain of image processing, a partial port of the computer vision library OpenCV is shipped with Vivado HLS [16]. However, extending such a library can become quite a burden and poses problems when porting to new hardware. In contrast, DSL-based approaches separate algorithms from their implementation and provide greater flexibility by allowing easier extension to new platforms. PARO [4], for instance, is an HLS environment for the domain of image processing and provides domain-specific augmentations for border treatment and reductions such as median filtering. It is also capable of adaptive multiresolution filtering in medical imaging [5].

In previous work, the benefits of domain-specific optimization have been shown in various domains.
SPIRAL [9], for example, is a widely recognized framework for the generation of hardware and software implementations of digital signal processing algorithms (linear transformations, such as FIR filtering, FFT, and DCT). In the case of hardware generation, soft IP cores in synthesizable RTL Verilog are emitted. ATLAS [15] and FFTW [3] are examples of the generation of mathematical code from abstract descriptions for specific applications such as FFTs, where further optimizations are selected via auto-tuning.

In the field of scientific computing, and especially for stencil computations, numerous approaches building upon Domain-Specific Languages and code generation exist. Examples include Liszt [2], which adds abstractions to Java to ease stencil computations for unstructured problems, and Pochoir [14], which employs a divide-and-conquer skeleton on top of the parallel C extension Cilk to make stencil computations cache-oblivious. PATUS [1] uses auto-tuning techniques to improve performance. However, none of the aforementioned approaches support code generation for FPGAs.

Computations in image processing are often similar to stencil computations, and image processing is another popular area for DSLs. Halide [10] generates, among others, CUDA and OpenCL code from a DSL embedded into C++. The same description can be transformed to Verilog by Darkroom [6]. HIPAcc [8] provides native support for multigrid methods by offering appropriate language elements. It generates low-level accelerator code such as CUDA, OpenCL and Renderscript from a DSL embedded into C++ and was recently extended to emit code that can be processed to IP cores using Vivado HLS [12].

3. MULTIGRID METHODS

In scientific computing, multigrid methods are a popular choice for the solution of large systems of linear equations that may stem from the discretization of partial differential equations (PDEs). One of the most researched PDEs is Poisson's equation, which is used for modeling diffusion processes, e.g., in the simulation of temperature distributions.
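For reference, a common textbook form of this model problem is sketched below. The domain $\Omega$, the boundary data $g$, the mesh width $h$ and the index notation are assumptions made for illustration and are not taken from this paper. Poisson's equation with Dirichlet boundary conditions reads

```latex
-\Delta u = f \quad \text{in } \Omega, \qquad u = g \quad \text{on } \partial\Omega .
```

A second-order finite-difference discretization on a uniform 2D grid with mesh width $h$ then gives one equation per interior grid point,

```latex
\frac{1}{h^2}\left(4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}\right) = f_{i,j} ,
```

which together form a linear system of the kind $A_h u_h = f_h$ that a multigrid method solves.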
The V-cycle, one variant of a multigrid method, is shown in Algorithm 1. In the pre- and post-smoothing steps, high-frequency components of the error are damped by smoothers such as the Jacobi or the Gauss-Seidel methods. In the algorithm, $\nu_1$ and $\nu_2$ denote the number of smoothing steps that are applied. Low-frequency components are transformed into high-frequency components by restricting them to a coarser level, thus making them good targets for the smoother once more. On the coarsest level, direct solving of the remaining linear system of equations is possible due to its low number of unknowns. However, it is also possible to apply a number of smoother iterations. In the case of a single unknown, one smoother iteration corresponds to directly solving the problem.

    if coarsest level then
        solve $A_h u_h = f_h$ exactly or by many smoothing iterations
    else
        $\bar{u}_h^{(k)} = S_h^{\nu_1}\left(u_h^{(k)}, A_h, f_h\right)$        {pre-smoothing}
        $r_h = f_h - A_h \bar{u}_h^{(k)}$        {compute residual}
        $r_H = R\, r_h$        {restrict residual}
        $e_H = V_H\left(0, A_H, r_H, \nu_1, \nu_2\right)$        {recursion}
        $e_h = P\, e_H$        {interpolate error}
        $\tilde{u}_h^{(k)} = \bar{u}_h^{(k)} + e_h$        {coarse grid correction}
        $u_h^{(k+1)} = S_h^{\nu_2}\left(\tilde{u}_h^{(k)}, A_h, f_h\right)$        {post-smoothing}
    end

Algorithm 1: Recursive V-cycle to solve $u_h^{(k+1)} = V_h\left(u_h^{(k)}, A_h, f_h, \nu_1, \nu_2\right)$.

4. PROGRAMMING MODEL

ExaStencils is a project researching the generation of efficient and scalable numerical solvers based on multigrid methods from a description of the problem formulated in a Domain-Specific Language. During the translation process, domain-specific and hardware-specific optimizations are applied to generate high-performance, scalable C++ code.

ExaSlang (short for ExaStencils language) is a Domain-Specific Language consisting of four layers of abstraction, geared towards different classes of users from diverse domains. ExaSlang 4 constitutes the most concrete layer and allows standard and custom multigrid cycles to be specified by providing appropriate language elements such as fields and stencils. It is a procedural programming language, featuring control structures such as functions, loops and conditions.
Furthermore, this layer is explicitly parallel by providing very simple communication statements to specify data to be communicated. A more thorough description of the language and its code generation and transformation framework can be found in [13].

    Stencil {
      [ 0,  0] =>
      [ 1,  0] =>
      [-1,  0] =>
      [ 0,  1] =>
      [ 0, -1] =>
    }

    Function () : Unit {
      loop over {
        = + ((( 1.0 / * 0.8) * - *
      }
    }

Listing 1: Example of a stencil definition and its use as part of a weighted Jacobi smoother in 2D.

4.1 Language Elements

Being a language that targets the description of multigrid-based numerical solvers, ExaSlang 4 combines elements of procedural languages, such as functions and loops, with domain-specific elements, e.g., stencils and communication-enabled memory arrays.

ExaSlang 4 features special data types that we call algorithmic data types, as they stem from the language's domain. Therefore, Stencil and Field can only be used as part of numerical calculations. They are explained in more detail in the next paragraphs.

Stencils are defined by listing the weights around the center using relative addressing. Weights can be represented by any type of expression, including binary expressions and function calls, but, naturally, constant values are possible as well. An example can be seen in Listing 1, where a weighted Jacobi smoother is described. Furthermore, this example showcases the loop over statement, a language element that is used to instantiate an iteration over the computational domain (or a part of it) by defining a Field that is used to determine the computational bounds.

Fields represent the actual data that stencils are applied to. They may be given by the user, e.g., as the right-hand side of a calculation, or the unknown to be solved for. A field is defined by providing a layout which specifies further options, e.g., enables communication among the domain partitions. Furthermore, an underlying numerical data type has to be provided.

A key element of ExaSlang 4 are Level Specifications.
They allow the definition of objects such as functions, variables, stencils and fields depending on multigrid levels and can be used to override defaults at specific levels, as illustrated in Listing 2. Here, a smoother that is different from the one called at other levels is implemented for the second-finest level. Identifiers such as current and finest can be used to ease modifying the depth of the multigrid recursion without rewriting large parts of the DSL code. Consequently, references to level-dependent entities, e.g., function calls, need to include the target level.

    Function () : Unit {
      /*... */
    }

    Function - 1) () : Unit {
      /*... */
    }

    Function () : Unit {
      repeat 3 times {
        ()
      }
      /*... */
    }

Listing 2: Example of overriding the default smoother function at a specific multigrid level.

4.2 Code Generation

Multigrid algorithms described in ExaSlang 4 are transformed into C++ code by a transformation framework written in Scala. The input file is parsed and transformed into an abstract syntax tree (AST), to which target-platform-specific alterations and optimizations are applied. We call this stage, where most of the transformations take place, the intermediate representation (IR), since it is only available in the compiler instance. After numerous alterations, the IR is emitted as C++ code that can be compiled with standard compilers such as gcc, Clang, IBM XL and MSVC.

5. MAPPING TO HARDWARE

To generate C++ code that can be mapped and synthesized efficiently to a hardware architecture for FPGAs, the transformation chain has to respect a number of specifics.

5.1 Computational Model

The conventional way to perform stencil computations is to allocate a contiguous chunk of memory and apply the stencils by iterating sequentially over the memory, i.e., a multigrid algorithm is realized as a sequence of smoother, residual calculation, restriction and prolongation stencils.
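The memory-based realization described above, a sequence of smoother, residual, restriction and prolongation kernels applied one after another, can be sketched in plain Python. This is a minimal illustration and not the code emitted by the generator: it assumes a 1D Poisson problem with zero Dirichlet boundaries, a grid of 2^L - 1 interior points, full-weighting restriction, linear interpolation and a damping factor of 2/3, none of which is taken from the paper. All function names are hypothetical.

```python
def neighbours(v, i):
    """Left and right neighbour of v[i]; outside the grid lies the zero boundary."""
    left = v[i - 1] if i > 0 else 0.0
    right = v[i + 1] if i < len(v) - 1 else 0.0
    return left, right


def smooth(u, f, h, steps, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps for -u'' = f (omega is an assumed damping factor)."""
    for _ in range(steps):
        old = u[:]  # Jacobi reads only values of the previous sweep
        for i in range(len(u)):
            left, right = neighbours(old, i)
            jacobi = 0.5 * (h * h * f[i] + left + right)
            u[i] = (1.0 - omega) * old[i] + omega * jacobi
    return u


def residual(u, f, h):
    """r = f - A u for the three-point stencil (-1, 2, -1) / h^2."""
    r = []
    for i in range(len(u)):
        left, right = neighbours(u, i)
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r


def restrict(r):
    """Full weighting; coarse point j coincides with fine point 2j + 1."""
    n_coarse = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(n_coarse)]


def prolong(e_coarse, n_fine):
    """Linear interpolation of the coarse-grid error to the fine grid."""
    e = [0.0] * n_fine
    for j, value in enumerate(e_coarse):
        e[2 * j] += 0.5 * value
        e[2 * j + 1] += value
        e[2 * j + 2] += 0.5 * value
    return e


def v_cycle(u, f, h, nu1=2, nu2=2):
    """One recursive V(nu1, nu2) cycle following the steps of Algorithm 1."""
    if len(u) == 1:
        u[0] = 0.5 * h * h * f[0]  # single unknown: solve directly
        return u
    smooth(u, f, h, nu1)                                   # pre-smoothing
    r_coarse = restrict(residual(u, f, h))                 # residual, restriction
    e_coarse = v_cycle([0.0] * len(r_coarse), r_coarse, 2.0 * h, nu1, nu2)
    e = prolong(e_coarse, len(u))                          # interpolate error
    for i in range(len(u)):
        u[i] += e[i]                                       # coarse grid correction
    return smooth(u, f, h, nu2)                            # post-smoothing
```

Each function here runs to completion over its whole array before the next one starts, which is exactly the execution order that the streaming design replaces.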
Usually, these kernels are executed linearly, where application of a new kernel starts only after completion of the previous one. As a consequence, to improve the overall performance, kernel execution times have to be reduced. In contrast, FPGAs offer a massively parallel hardware architecture which can achieve the best results in combination with data streaming. The concept of implementing the multigrid algorithm as a sequence of stencils can be carried over to the FPGA architecture by converting the computational kernels into hardware modules and laying them out in parallel on the chip. The modules are then interconnected by data streams to form a pipeline, through which data is streamed from one entity to another. Once the pipeline is completely filled, all of the computations are carried out in parallel, providing a continuous output flow of results from a continuous delivery of input data.

A key concept in hardware development is to design a component once and replicate it as often as necessary. For multigrid algorithms, we can make use of this principle by designing one stage of the algorithm and replicating it to implement the recursion levels. An important fact to consider is that the lower stages only ever have to process a fraction of the data of the next higher stage, e.g., a quarter in the case of 2D. Although this could be exploited in hardware by lowering the clock frequency of the lower stages, a much more sophisticated approach is to increase the pipeline interval, also often called Iteration Interval (II), which describes the number of clock cycles between the arrival of new data elements. To achieve high performance, the topmost level uses a pipeline interval of one, which means the architecture can accept new input data in every clock cycle. In consequence, it also produces results in every clock cycle, after a certain latency. To achieve this, however, a dedicated operator must be instantiated for each operation of the algorithm, and therefore the implementation requires a large amount of hardware resources. If the pipeline interval is increased on the lower levels, hardware operators can be shared among the operations, which leads to significantly lower resource requirements.

Figure 1: Structural representation of the multigrid algorithm implementation.

An overview of the structure for a multigrid solver is shown in Figure 1. In addition to the actual implementation of each stage, the figure also shows the II for the stages and the data streams, and indicates which connections require buffering (depicted as ). Although it is possible to instantiate buffers on every stream, there are actually only three cases where the interconnection requires buffering. These are (a) after down samplers (part of restrict), (b) before up samplers (part of prolong), (c) before nodes that combine data streams and have different path lengths.
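The relation between stage depth and iteration interval can be made concrete with a small calculation. The sketch below assumes that every coarser stage processes exactly 1/2^d of the data of the stage above it in d dimensions, so that its II can grow by the same factor. The second helper is a deliberately simplified sharing model (one operator instance can serve one operation per clock cycle of the interval) and is an assumption for illustration, not the HLS tool's actual binding algorithm. Both function names are hypothetical.

```python
def iteration_intervals(stages, dims=2):
    """Return the iteration interval (II) of each stage, finest stage first.

    Assumes each coarser stage sees 1 / 2**dims of the data of the stage
    above it, so its II may grow by a factor of 2**dims.
    """
    factor = 2 ** dims
    return [factor ** k for k in range(stages)]


def operators_needed(operations, ii):
    """Lower bound on operator instances when each instance can be reused
    once per clock cycle of the interval (ceiling division)."""
    return -(-operations // ii)
```

For eight stages in 2D, `iteration_intervals(8)` gives intervals from 1 on the finest stage up to 16384 on the coarsest one. Under the simplified model, a kernel with five multiplications would need five multipliers at II=1 but only two at II=4.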
The necessity for buffering after downsampling and before upsampling is due to different iteration intervals between the stages. The buffer requirement for combining nodes, such as the correction step of the prolongation, becomes evident from the structure of the accelerator. For example, the connection between restriction and prolongation requires a very large buffer, since prolongation must wait until data arrives after having traversed all of the lower stages. Other interconnections in the architecture can be set to simple registered handshake connections, which lowers the total amount of hardware resources required for buffer implementation.

A limitation of the current HLS approach is that the re-use of streams is problematic. If a data stream is needed as input for multiple kernels, e.g., the result of the residual computation is used for downsampling and for correction, it has to be duplicated accordingly by inserting kernels that copy, or split, the stream into the required number of output streams. Currently, HLS tools do not automatically duplicate such streams or insert appropriate copy operations. Thus, the required split kernels need to be identified and placed into the pipeline at the correct positions. In the remainder of this section, we explain how code generation must be altered to achieve efficient hardware accelerators by HLS.

5.2 Stencils and Kernels

As already explained, computations are implemented by kernels and connected via streams that represent input and output elements. In previous work, we have introduced a library of standard components to facilitate code generation for C-based HLS [11]. In addition to supporting point and local operators with an arbitrary number of input and output ports, the library also provides operations to handle data streams in complex pipelines.
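As an illustration of such a stream-handling operation, the split kernel described in Section 5.1 can be modelled with Python generators. This is only a software analogy under stated assumptions: streams are modelled as iterators, each output gets its own unbounded FIFO, and the names `split` and `downsample` are hypothetical rather than part of the library in [11].

```python
from collections import deque


def split(stream, n_outputs=2):
    """Copy one input stream into n_outputs independent output streams.

    Every element read from the source is appended to the FIFO of each
    output, so consumers may read at different paces.
    """
    source = iter(stream)
    queues = [deque() for _ in range(n_outputs)]

    def reader(queue):
        while True:
            if not queue:
                try:
                    item = next(source)
                except StopIteration:
                    return  # source exhausted and own FIFO drained
                for q in queues:
                    q.append(item)
            yield queue.popleft()

    return [reader(q) for q in queues]


def downsample(stream):
    """Keep every second element, as a down sampler would in 1D."""
    for index, item in enumerate(stream):
        if index % 2 == 0:
            yield item
```

With `to_restrict, to_correct = split(residual_stream)`, one copy can feed `downsample` while the other is held back for the correction step. In this model, the FIFO of the slower consumer grows while it waits, which mirrors why combining nodes with different path lengths need buffering in hardware.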
Stencils can be expressed by local operators, where the center corresponds to the output element being calculated and neighboring points are addressed in a relative manner, similar to the way stencils are defined in ExaSlang 4.

As a consequence of the shift towards the stream processing model, iterations over the computational domain need to be transformed into separate kernels. As an iteration is declared by the loop over statement and all computational domain sizes are known at compile time, a corresponding kernel function with the correct number of stream elements can be derived directly and transformed into a computational kernel. The original loop statement is replaced with an instantiation and call of the kernel. Vivado HLS synthesizes each instantiated kernel into a dedicated hardware module and generates the data streaming interconnect fabric.

During the generation of CPU code, stencil weights are resolved directly into the calculations to reduce memory accesses and enable further optimizations. The same approach is used for the HLS code generation, where placing the coefficients directly into the code yields a lower number of memory streams that need to be processed, i.e., streams that only provide constants are not generated.

5.3 Loops and Recursion

To enable pipelining and parallel execution of the kernels, the loop and recursion constructs of ExaSlang 4 must be unrolled and flattened. The principle can be easily applied to the repeat N times loop, as N is a constant integer. An example of this is the repeated application of the smoother. Since the number of applications is known at compile time, we can simply unroll the loop and generate appropriate kernels.

A control structure that requires more attention is recursion. For example, it is used to define the V-cycle, as depicted in Algorithm 1. However, due to ExaSlang's static approach, this information is also available at compile time, which enables us to unroll the recursion and instantiate appropriate kernels.
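The unrolling of loops and recursion can be pictured as a compile-time expansion of the V-cycle into a flat list of kernel instances. The sketch below is an illustration only: the kernel names are modelled on the labels of Figure 1, and the number of smoother steps on the coarsest level (`coarse_steps`) is an assumed parameter, since the text does not state it.

```python
def unroll_v_cycle(level, coarsest=1, nu1=2, nu2=2, coarse_steps=4):
    """Flatten the recursive V-cycle into a list of kernel instances.

    All arguments are compile-time constants, so both the repeat loops
    (nu1, nu2, coarse_steps) and the recursion can be expanded fully.
    """
    if level == coarsest:
        # Coarsest grid: approximate the solution by repeated smoothing.
        return [f"smooth{level}"] * coarse_steps
    kernels = [f"smooth{level}"] * nu1                 # unrolled pre-smoothing
    kernels += [f"residual{level}", f"restrict{level}"]
    kernels += unroll_v_cycle(level - 1, coarsest, nu1, nu2, coarse_steps)
    kernels += [f"prolong{level}", f"correct{level}"]
    kernels += [f"smooth{level}"] * nu2                # unrolled post-smoothing
    return kernels
```

For a V(2,2) cycle with eight levels and the assumed four coarse-grid steps, the expansion contains 60 entries. In the streaming design, each entry would become its own hardware module rather than a function call executed in sequence.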
In addition, we must adjust the iteration interval and the loop bounds of the individual kernels, instantiate restriction and prolongation operators, as well as duplicate data streams where necessary.

Moreover, ExaSlang contains repeat until loops, which are executed until a certain criterion is fulfilled. Supporting these constructs inside of the V-cycle, for example for solving the coarsest grid, is difficult, since it would require the repeated sequential execution of a kernel, which would interrupt the streaming pipeline and require extensive buffering of the results of the higher levels. Another use case for the construct is the repeated application of the V-cycle until the desired precision has been reached. However, this does not affect hardware generation.

6. CASE STUDY

To evaluate our approach, we consider a typical example problem: a solution to a finite differences (FD) discretization of Poisson's equation with Dirichlet boundary conditions. We have implemented a typical multigrid solver in ExaSlang 4 using a V(2,2) cycle and a recursion depth of 8. For smoothing, a weighted Jacobi with a pre-calculated optimal ω is used. The solution on the coarsest grid of is approximated by multiple smoother steps. The code generated by ExaSlang 4 was used to infer a hardware description of the multigrid solver using Xilinx Vivado HLS v14.2. As evaluation hardware, we have chosen a mid-range Xilinx Kintex 7 (xc7k325t) FPGA.

Although interconnecting individual modules using FIFO streams and executing these concurrently is supported by the tool using the data flow directive, the buffer size is statically set to 1 and there is no mechanism for automatic determination of the necessary depth of the buffer. To allow uninterrupted execution, buffers on all levels must provide enough space to not produce back-pressure on the pipeline, which might result in a deadlock once the uppermost level is affected.
In addition to static delays between the interconnected functions, due to the latencies of the hardware modules, a complex pipeline consisting of multiple up- and down-sampling steps also produces runtime delays, which vary according to the chosen architecture and are hard to anticipate. As under-provisioning may at least decrease the performance and over-provisioning may waste resources, we have used logic simulation to determine the necessary buffer sizes, which are shown in Table 1. For the chosen grid size of floating point values and 8 recursion levels, the buffers on the top four levels become very large and would overwhelm the amount of available resources, even on very large FPGAs.

Table 1: Buffer sizes for interconnecting the stages of the V-cycle (columns: Level, RHS buffer, Result buffer, 4K Pages).

A solution to allow the fastest possible execution and keep within the maximum amount of available resources is to offload the most challenging buffers to external DDR3 memory. A drawback is that HLS tools, such as the Vivado HLS used here, do not support this from within the tool, but require an FPGA support design to facilitate this. We add input and output arguments for the streams to be externalized to the function definition and specify their type as AXI4-Streaming (). In this way, we obtain a high-performance interface to the FPGA fabric for each data connection and do not need to make extensive modifications to the actual accelerator source code.

The FPGA support design uses an interconnect (IC) built on top of a virtual FIFO as an abstraction to an off-chip DDR3 memory. A structural overview of the design is shown in Figure 2. The virtual FIFO is an IP core from Xilinx and can be configured to support up to eight full duplex data channels using word widths of up to 128 bytes. The core uses round robin arbitration between the channels, which can be weighted in terms of how many data bursts are executed in sequence before arbitrating to the next channel.
The size of the data burst defines the maximum amount of memory space that can be allocated to each channel. For our application, we use a 64 byte word width, as this is also the word width supported by the underlying DDR3 memory, and select the smallest available burst size of 128 bytes. As the individual channels have different data production rates, we assign different weights for appropriate bandwidth allocation. To allow uninterrupted data exchange between the DDR3 and the accelerator, the AXI IC aggregates data to a very large word width of 64 bytes before passing it to the Virtual FIFO.

To adjust the data from the accelerator to the requirements for buffering on the off-chip memory, we have designed an AXI4-Streaming interconnect, as shown in Figure 2. It uses an internal 64 byte data bus and is situated in the same clock domain as the memory interface. Incoming data streams are first adjusted to the internal data width before they are transferred to the internal clock domain using asynchronous FIFO buffers. A switch merges the data stream onto the interface of the virtual FIFO, which stores the data in the channel's memory region according to the destination identifier of the data stream. An equivalent path is used in reverse order to transfer data from the external memory back to the output ports of the interconnect and from there to the accelerator.

Table 2 shows evaluation results of the hardware synthesis from Vivado HLS, which give a rough, conservative estimate of the resource requirements of the design. Indeed, after externalizing the top four largest buffers for results and RHS, the design can be fit onto the chip.

Table 2: HLS resource estimates for the multigrid solver design comparing on-chip and external buffering (columns: Resource, On-Chip, External, Available; rows: FF, LUT, BRAM, DSP).

Table 3 lists the post place and route (PPnR) results after integrating the modified accelerator into the described FPGA support system and shows that the multigrid solver can be implemented on a Kintex 7 and

the design achieves a maximum clock frequency of over 200 MHz.

Figure 2: Structural representation of the FPGA support design.

Table 3: PPnR resource requirements of the complete multigrid solver design (columns: LUT, FFs, DSPs, BRAMs, Slices, F_max [MHz]).

We have evaluated the PPnR hardware results on the Kintex-7 FPGA to measure the performance in terms of how many clock cycles it takes to process a grid of floating point values. In combination with the clock frequency of the design, this yields an accurate measurement of the performance. In contrast to a software solution, the pipelining principle also applies here. Thus, it is not necessary to wait until the result is ready; we can start processing a new grid as soon as all of the input values of the previous grid have been consumed. Using the concept of data streaming, we can also completely hide the communication time with a host that supplies the input data, as current high-speed serial interconnects, such as PCI Express and others, can achieve data rates that exceed the throughput of the accelerator by far.

In order to compare the hardware accelerator to state-of-the-art approaches, we have used the specification of the multigrid solver in ExaSlang to generate C++ code for a single machine. The evaluation was done on a single core of an Intel i7-3770, which is clocked at 3.40 GHz and features L2 and L3 cache sizes of 1 MB and 8 MB, respectively. For the CPU, AVX vectorization was enabled. Table 4 lists the performance results in terms of latency in milliseconds for processing a single iteration of the V-cycle and the throughput in terms of how many iterations of the V-cycle can be processed per second ([Vps]), on average.

Table 4: Comparison of the performance of the multigrid solver on different hardware targets (columns: Target, Latency [ms], Throughput [Vps]).

It is also worth mentioning that it is irrelevant to the performance of the accelerator whether the input data uses only single-precision floating point or is implemented for double-precision input data. Although the chosen mid-range Kintex-7 FPGA cannot provide the necessary amount of logic and memory resources, switching to a larger FPGA, such as a member of the Virtex-7 family, here an XC7VX485T, can easily solve this issue. To underline this, we have generated source code for HLS of the multigrid solver using double precision floating point arithmetic, for which the estimated resource requirements are listed in Table 5.

Table 5: HLS synthesis estimates for implementation of the multigrid solver using double precision arithmetic. Values are given as percentage of available resource type (columns: FPGA, LUT, FFs, DSPs, BRAMs, F_max [MHz]; rows: Kintex, Virtex).

7. CONCLUSIONS

In this work, we have presented an approach to map descriptions of multigrid algorithms in a Domain-Specific Language to hardware designs for execution on FPGAs by generating C++ code that can be used with C-based High-Level Synthesis tools. Furthermore, we have outlined the specifics of implementing stencil-based calculations on FPGAs via HLS tools and highlighted differences from the process of code generation for traditional CPU-based programs. We verified our approach by synthesizing a multigrid-based solver for Poisson's equation onto two different hardware targets, a Kintex-7 FPGA and an Intel i7 CPU. Both implementations were generated from the same code base in ExaSlang. Additionally, evaluation numbers show that employing FPGAs in HPC is a promising approach to increase computing power while, at the same time, reducing the energy footprint of clusters.

8. FUTURE WORK

While ExaSlang can be used to describe multigrid-based solvers for three and more dimensions and code generation works, the underlying concept of streaming and buffering needs some refinement. For large datasets in higher dimensions than 2D, current-generation FPGA boards lack sufficiently large memories. Instead, data will have to be stored

in the host's memory, utilizing the PCI Express bus and incurring a huge performance penalty. Nevertheless, we will re-evaluate our case study with newer generations of FPGA boards. Additionally, to improve dataset sizes by incorporating onboard DDR3 RAM, memory bandwidths could be improved by employing multiple memory controllers in the design.

Automatic insertion of split kernels to duplicate streams is currently unavailable, as data flow and dependency analysis as part of the ExaStencils transformation framework is not yet finished. From such information, an information-rich call graph could be built up, easing generation of linearized kernel executions.

Partitioning of data for multiple FPGAs, similar to the way data is partitioned across cluster nodes, is another area worth looking into, especially in the light of HPC, where large datasets are common.

9. ACKNOWLEDGMENTS

This work is supported by the German Research Foundation (DFG), as part of the Priority Programme 1648 Software for Exascale Computing, under contracts TE 163/17-1 and RU 422/15-1.

References

[1] M. Christen, O. Schenk, and H. Burkhart. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In: Proc. IEEE Int. Parallel & Distributed Processing Symp. (IPDPS). IEEE, 2011.
[2] Z. DeVito et al. Liszt: A domain specific language for building portable mesh-based PDE solvers. In: Proc. Conf. on High Performance Computing Networking, Storage and Analysis (SC). Paper 9, 12 pp. ACM.
[3] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. In: Proc. IEEE 93.2 (Feb. 2005).
[4] F. Hannig, H. Ruckdeschel, H. Dutta, and J. Teich. PARO: Synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications. In: Proc. of the Fourth International Workshop on Applied Reconfigurable Computing (ARC). (London, United Kingdom). Lecture Notes in Computer Science (LNCS). Springer, Mar. 2008.
[5] F. Hannig, M. Schmid, J. Teich, and H. Hornegger. A deeply pipelined and parallel architecture for denoising medical images. In: Proc. of the IEEE International Conference on Field Programmable Technology (FPT). (Beijing, China). IEEE, Dec. 8-10, 2010.
[6] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, A. Vasilyev, M. Horowitz, and P. Hanrahan. Darkroom: Compiling high-level image processing code into hardware pipelines. In: ACM Transactions on Graphics (TOG) 33.4 (July 2014), 144:1-144:11.
[7] C. Lengauer et al. ExaStencils: Advanced Stencil-Code Engineering, First Project Report. Tech. rep. MIP. Department of Computer Science and Mathematics, University of Passau, June.
[8] R. Membarth, O. Reiche, C. Schmitt, F. Hannig, J. Teich, M. Stürmer, and H. Köstler. Towards a performance-portable description of geometric multigrid algorithms using a domain-specific language. In: Journal of Parallel and Distributed Computing (Nov. 2014), 32 pp.
[9] M. Püschel, F. Franchetti, and Y. Voronenko. SPIRAL. In: Encyclopedia of Parallel Computing. Ed. by D. A. Padua. Springer, 2011.
[10] J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. In: ACM Transactions on Graphics (TOG) 31.4 (July 2012), 32:1-32:12.
[11] M. Schmid, N. Apelt, F. Hannig, and J. Teich. An image processing library for C-based high-level synthesis. In: Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL). (Munich, Germany). Sept. 2-4.
[12] M. Schmid, O. Reiche, C. Schmitt, F. Hannig, and J. Teich. Code generation for high-level synthesis of multiresolution applications on FPGAs. In: Proceedings of the First International Workshop on FPGAs for Software Programmers (FSP). (Munich, Germany). Sept. 1, 2014. arXiv:
[13] C. Schmitt, S. Kuckuk, F. Hannig, H. Köstler, and J. Teich. ExaSlang: A domain-specific language for highly scalable multigrid solvers. In: Proceedings of the 4th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC). To appear. (New Orleans, LA, USA). IEEE, Nov. 17, 2014, 10 pp.
[14] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In: Proc. 23rd ACM Symp. on Parallelism in Algorithms and Architectures (SPAA). ACM, 2011.
[15] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. In: Parallel Computing 27.1 (2001).
[16] Xilinx Inc. Vivado Design Suite User Guide HLS. User Guide.

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System

More information

Code Generation for High-Level Synthesis of Multiresolution Applications on FPGAs

Code Generation for High-Level Synthesis of Multiresolution Applications on FPGAs Code Generation for Hig-Level Syntesis of Multiresolution Applications on FPGAs Moritz Scmid, Oliver Reice, Cristian Scmitt, Frank Hannig, and Jürgen Teic Hardware/Software Co-Design, Department of Computer

More information

Towards Domain-specific Computing for Stencil Codes in HPC

Towards Domain-specific Computing for Stencil Codes in HPC Tis is te autor s version of te work. Te definitive work was publised in Proceedings of te 2nd International Worksop on Domain-Specific Languages and Hig-Level Frameworks for Hig Performance Computing

More information

Preprint Version. Reconfigurable Hardware Generation of Multigrid Solvers with Conjugate Gradient Coarse-Grid Solution

Preprint Version. Reconfigurable Hardware Generation of Multigrid Solvers with Conjugate Gradient Coarse-Grid Solution Preprint Version Reconfigurable Hardware Generation of Multigrid Solvers with Conjugate Gradient Coarse-Grid Solution CHRISTIAN SCHMITT, MORITZ SCHMID, SEBASTIAN KUCKUK, HARALD KÖSTLER, JÜRGEN TEICH, and

More information

A Cost Model for Distributed Shared Memory. Using Competitive Update. Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science

A Cost Model for Distributed Shared Memory. Using Competitive Update. Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science A Cost Model for Distributed Sared Memory Using Competitive Update Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, Texas, 77843-3112, USA E-mail: fjkim,vaidyag@cs.tamu.edu

More information

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Automatic Generation of Algorithms and Data Structures for Geometric Multigrid Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Introduction Multigrid Goal: Solve a partial differential

More information

Bounding Tree Cover Number and Positive Semidefinite Zero Forcing Number

Bounding Tree Cover Number and Positive Semidefinite Zero Forcing Number Bounding Tree Cover Number and Positive Semidefinite Zero Forcing Number Sofia Burille Mentor: Micael Natanson September 15, 2014 Abstract Given a grap, G, wit a set of vertices, v, and edges, various

More information

Parallel Simulation of Equation-Based Models on CUDA-Enabled GPUs

Parallel Simulation of Equation-Based Models on CUDA-Enabled GPUs Parallel Simulation of Equation-Based Models on CUDA-Enabled GPUs Per Ostlund Department of Computer and Information Science Linkoping University SE-58183 Linkoping, Sweden per.ostlund@liu.se Kristian

More information

PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION

PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION Martin Kraus Computer Grapics and Visualization Group, Tecnisce Universität Müncen, Germany krausma@in.tum.de Magnus Strengert Visualization and Interactive

More information

Two Modifications of Weight Calculation of the Non-Local Means Denoising Method

Two Modifications of Weight Calculation of the Non-Local Means Denoising Method Engineering, 2013, 5, 522-526 ttp://dx.doi.org/10.4236/eng.2013.510b107 Publised Online October 2013 (ttp://www.scirp.org/journal/eng) Two Modifications of Weigt Calculation of te Non-Local Means Denoising

More information

A UPnP-based Decentralized Service Discovery Improved Algorithm

A UPnP-based Decentralized Service Discovery Improved Algorithm Indonesian Journal of Electrical Engineering and Informatics (IJEEI) Vol.1, No.1, Marc 2013, pp. 21~26 ISSN: 2089-3272 21 A UPnP-based Decentralized Service Discovery Improved Algoritm Yu Si-cai*, Wu Yan-zi,

More information

Density Estimation Over Data Stream

Density Estimation Over Data Stream Density Estimation Over Data Stream Aoying Zou Dept. of Computer Science, Fudan University 22 Handan Rd. Sangai, 2433, P.R. Cina ayzou@fudan.edu.cn Ziyuan Cai Dept. of Computer Science, Fudan University

More information

An Algorithm for Loopless Deflection in Photonic Packet-Switched Networks

An Algorithm for Loopless Deflection in Photonic Packet-Switched Networks An Algoritm for Loopless Deflection in Potonic Packet-Switced Networks Jason P. Jue Center for Advanced Telecommunications Systems and Services Te University of Texas at Dallas Ricardson, TX 75083-0688

More information

Vector Processing Contours

Vector Processing Contours Vector Processing Contours Andrey Kirsanov Department of Automation and Control Processes MAMI Moscow State Tecnical University Moscow, Russia AndKirsanov@yandex.ru A.Vavilin and K-H. Jo Department of

More information

Lehrstuhl für Informatik 10 (Systemsimulation)

Lehrstuhl für Informatik 10 (Systemsimulation) FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lerstul für Informatik 10 (Systemsimulation) PDE based Video Compression in Real

More information

2 The Derivative. 2.0 Introduction to Derivatives. Slopes of Tangent Lines: Graphically

2 The Derivative. 2.0 Introduction to Derivatives. Slopes of Tangent Lines: Graphically 2 Te Derivative Te two previous capters ave laid te foundation for te study of calculus. Tey provided a review of some material you will need and started to empasize te various ways we will view and use

More information

Minimizing Memory Access By Improving Register Usage Through High-level Transformations

Minimizing Memory Access By Improving Register Usage Through High-level Transformations Minimizing Memory Access By Improving Register Usage Troug Hig-level Transformations San Li Scool of Computer Engineering anyang Tecnological University anyang Avenue, SIGAPORE 639798 Email: p144102711@ntu.edu.sg

More information

Multi-Stack Boundary Labeling Problems

Multi-Stack Boundary Labeling Problems Multi-Stack Boundary Labeling Problems Micael A. Bekos 1, Micael Kaufmann 2, Katerina Potika 1 Antonios Symvonis 1 1 National Tecnical University of Atens, Scool of Applied Matematical & Pysical Sciences,

More information

Fast Calculation of Thermodynamic Properties of Water and Steam in Process Modelling using Spline Interpolation

Fast Calculation of Thermodynamic Properties of Water and Steam in Process Modelling using Spline Interpolation P R E P R N T CPWS XV Berlin, September 8, 008 Fast Calculation of Termodynamic Properties of Water and Steam in Process Modelling using Spline nterpolation Mattias Kunick a, Hans-Joacim Kretzscmar a,

More information

Symmetric Tree Replication Protocol for Efficient Distributed Storage System*

Symmetric Tree Replication Protocol for Efficient Distributed Storage System* ymmetric Tree Replication Protocol for Efficient Distributed torage ystem* ung Cune Coi 1, Hee Yong Youn 1, and Joong up Coi 2 1 cool of Information and Communications Engineering ungkyunkwan University

More information

Investigating an automated method for the sensitivity analysis of functions

Investigating an automated method for the sensitivity analysis of functions Investigating an automated metod for te sensitivity analysis of functions Sibel EKER s.eker@student.tudelft.nl Jill SLINGER j..slinger@tudelft.nl Delft University of Tecnology 2628 BX, Delft, te Neterlands

More information

Asynchronous Power Flow on Graphic Processing Units

Asynchronous Power Flow on Graphic Processing Units 1 Asyncronous Power Flow on Grapic Processing Units Manuel Marin, Student Member, IEEE, David Defour, and Federico Milano, Senior Member, IEEE Abstract Asyncronous iterations can be used to implement fixed-point

More information

Our Calibrated Model has No Predictive Value: An Example from the Petroleum Industry

Our Calibrated Model has No Predictive Value: An Example from the Petroleum Industry Our Calibrated Model as No Predictive Value: An Example from te Petroleum Industry J.N. Carter a, P.J. Ballester a, Z. Tavassoli a and P.R. King a a Department of Eart Sciences and Engineering, Imperial

More information

Computer Physics Communications. Multi-GPU acceleration of direct pore-scale modeling of fluid flow in natural porous media

Computer Physics Communications. Multi-GPU acceleration of direct pore-scale modeling of fluid flow in natural porous media Computer Pysics Communications 183 (2012) 1890 1898 Contents lists available at SciVerse ScienceDirect Computer Pysics Communications ournal omepage: www.elsevier.com/locate/cpc Multi-GPU acceleration

More information

Distributed and Optimal Rate Allocation in Application-Layer Multicast

Distributed and Optimal Rate Allocation in Application-Layer Multicast Distributed and Optimal Rate Allocation in Application-Layer Multicast Jinyao Yan, Martin May, Bernard Plattner, Wolfgang Mülbauer Computer Engineering and Networks Laboratory, ETH Zuric, CH-8092, Switzerland

More information

Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks

Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks Okan Yilmaz and Ing-Ray Cen Computer Science Department Virginia Tec {oyilmaz, ircen}@vt.edu

More information

Real-Time Wireless Routing for Industrial Internet of Things

Real-Time Wireless Routing for Industrial Internet of Things Real-Time Wireless Routing for Industrial Internet of Tings Cengjie Wu, Dolvara Gunatilaka, Mo Sa, Cenyang Lu Cyber-Pysical Systems Laboratory, Wasington University in St. Louis Department of Computer

More information

On the Use of Radio Resource Tests in Wireless ad hoc Networks

On the Use of Radio Resource Tests in Wireless ad hoc Networks Tecnical Report RT/29/2009 On te Use of Radio Resource Tests in Wireless ad oc Networks Diogo Mónica diogo.monica@gsd.inesc-id.pt João Leitão jleitao@gsd.inesc-id.pt Luis Rodrigues ler@ist.utl.pt Carlos

More information

Optimal In-Network Packet Aggregation Policy for Maximum Information Freshness

Optimal In-Network Packet Aggregation Policy for Maximum Information Freshness 1 Optimal In-etwork Packet Aggregation Policy for Maimum Information Fresness Alper Sinan Akyurek, Tajana Simunic Rosing Electrical and Computer Engineering, University of California, San Diego aakyurek@ucsd.edu,

More information

H-Adaptive Multiscale Schemes for the Compressible Navier-Stokes Equations Polyhedral Discretization, Data Compression and Mesh Generation

H-Adaptive Multiscale Schemes for the Compressible Navier-Stokes Equations Polyhedral Discretization, Data Compression and Mesh Generation H-Adaptive Multiscale Scemes for te Compressible Navier-Stokes Equations Polyedral Discretization, Data Compression and Mes Generation F. Bramkamp 1, B. Gottsclic-Müller 2, M. Hesse 1, P. Lamby 2, S. Müller

More information

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Spring

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Spring Has-Based Indexes Capter 11 Comp 521 Files and Databases Spring 2010 1 Introduction As for any index, 3 alternatives for data entries k*: Data record wit key value k

More information

Extended Synchronization Signals for Eliminating PCI Confusion in Heterogeneous LTE

Extended Synchronization Signals for Eliminating PCI Confusion in Heterogeneous LTE 1 Extended Syncronization Signals for Eliminating PCI Confusion in Heterogeneous LTE Amed H. Zaran Department of Electronics and Electrical Communications Cairo University Egypt. azaran@eecu.cu.edu.eg

More information

Linear Interpolating Splines

Linear Interpolating Splines Jim Lambers MAT 772 Fall Semester 2010-11 Lecture 17 Notes Tese notes correspond to Sections 112, 11, and 114 in te text Linear Interpolating Splines We ave seen tat ig-degree polynomial interpolation

More information

Fault Localization Using Tarantula

Fault Localization Using Tarantula Class 20 Fault localization (cont d) Test-data generation Exam review: Nov 3, after class to :30 Responsible for all material up troug Nov 3 (troug test-data generation) Send questions beforeand so all

More information

Cubic smoothing spline

Cubic smoothing spline Cubic smooting spline Menu: QCExpert Regression Cubic spline e module Cubic Spline is used to fit any functional regression curve troug data wit one independent variable x and one dependent random variable

More information

Asynchronous Power Flow on Graphic Processing Units

Asynchronous Power Flow on Graphic Processing Units Asyncronous Power Flow on Grapic Processing Units Manuel Marin, David Defour, Federico Milano To cite tis version: Manuel Marin, David Defour, Federico Milano. Asyncronous Power Flow on Grapic Processing

More information

2.8 The derivative as a function

2.8 The derivative as a function CHAPTER 2. LIMITS 56 2.8 Te derivative as a function Definition. Te derivative of f(x) istefunction f (x) defined as follows f f(x + ) f(x) (x). 0 Note: tis differs from te definition in section 2.7 in

More information

12.2 Techniques for Evaluating Limits

12.2 Techniques for Evaluating Limits 335_qd /4/5 :5 PM Page 863 Section Tecniques for Evaluating Limits 863 Tecniques for Evaluating Limits Wat ou sould learn Use te dividing out tecnique to evaluate its of functions Use te rationalizing

More information

Low-complexity Image-based 3D Gaming

Low-complexity Image-based 3D Gaming Low-complexity Image-based 3D Gaming Ingo Bauermann and Eckeard Steinbac Institute of Communication Networks, Media Tecnology Group Tecnisce Universität Müncen Munic, Germany {ingo.bauermann,eckeard.steinbac}@tum.de

More information

Network Coding to Enhance Standard Routing Protocols in Wireless Mesh Networks

Network Coding to Enhance Standard Routing Protocols in Wireless Mesh Networks Downloaded from vbn.aau.dk on: April 7, 09 Aalborg Universitet etwork Coding to Enance Standard Routing Protocols in Wireless Mes etworks Palevani, Peyman; Roetter, Daniel Enrique Lucani; Fitzek, Frank;

More information

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Fall

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Fall Has-Based Indexes Capter 11 Comp 521 Files and Databases Fall 2012 1 Introduction Hasing maps a searc key directly to te pid of te containing page/page-overflow cain Doesn t require intermediate page fetces

More information

Multi-Objective Particle Swarm Optimizers: A Survey of the State-of-the-Art

Multi-Objective Particle Swarm Optimizers: A Survey of the State-of-the-Art Multi-Objective Particle Swarm Optimizers: A Survey of te State-of-te-Art Margarita Reyes-Sierra and Carlos A. Coello Coello CINVESTAV-IPN (Evolutionary Computation Group) Electrical Engineering Department,

More information

An Anchor Chain Scheme for IP Mobility Management

An Anchor Chain Scheme for IP Mobility Management An Ancor Cain Sceme for IP Mobility Management Yigal Bejerano and Israel Cidon Department of Electrical Engineering Tecnion - Israel Institute of Tecnology Haifa 32000, Israel E-mail: bej@tx.tecnion.ac.il.

More information

Tuning MAX MIN Ant System with off-line and on-line methods

Tuning MAX MIN Ant System with off-line and on-line methods Université Libre de Bruxelles Institut de Recerces Interdisciplinaires et de Développements en Intelligence Artificielle Tuning MAX MIN Ant System wit off-line and on-line metods Paola Pellegrini, Tomas

More information

Experimental Studies on SMT-based Debugging

Experimental Studies on SMT-based Debugging Experimental Studies on SMT-based Debugging Andre Sülflow Görscwin Fey Rolf Drecsler Institute of Computer Science University of Bremen 28359 Bremen, Germany {suelflow,fey,drecsle}@informatik.uni-bremen.de

More information

CESILA: Communication Circle External Square Intersection-Based WSN Localization Algorithm

CESILA: Communication Circle External Square Intersection-Based WSN Localization Algorithm Sensors & Transducers 2013 by IFSA ttp://www.sensorsportal.com CESILA: Communication Circle External Square Intersection-Based WSN Localization Algoritm Sun Hongyu, Fang Ziyi, Qu Guannan College of Computer

More information

MULTIPLE TOKEN DISTRIBUTED LOOP LOCAL AREA NETWORKS: ANALYSIS

MULTIPLE TOKEN DISTRIBUTED LOOP LOCAL AREA NETWORKS: ANALYSIS ULTIPLE TOKEN DISTRIBUTED LOOP LOCAL AREA NETWORKS: ANALYSIS Nimmagadda Calamaia æ Dept. of CSE JNTU College of Engineering Kakinada, India 533 003. calm@cse.iitkgp.ernet.in Badrinat Ramamurty y Dept.

More information

Author's personal copy

Author's personal copy Autor's personal copy Information Processing Letters 09 (009) 868 875 Contents lists available at ScienceDirect Information Processing Letters www.elsevier.com/locate/ipl Elastic tresold-based admission

More information

The Euler and trapezoidal stencils to solve d d x y x = f x, y x

The Euler and trapezoidal stencils to solve d d x y x = f x, y x restart; Te Euler and trapezoidal stencils to solve d d x y x = y x Te purpose of tis workseet is to derive te tree simplest numerical stencils to solve te first order d equation y x d x = y x, and study

More information

Unsupervised Learning for Hierarchical Clustering Using Statistical Information

Unsupervised Learning for Hierarchical Clustering Using Statistical Information Unsupervised Learning for Hierarcical Clustering Using Statistical Information Masaru Okamoto, Nan Bu, and Tosio Tsuji Department of Artificial Complex System Engineering Hirosima University Kagamiyama

More information

Numerical Derivatives

Numerical Derivatives Lab 15 Numerical Derivatives Lab Objective: Understand and implement finite difference approximations of te derivative in single and multiple dimensions. Evaluate te accuracy of tese approximations. Ten

More information

Packet Switching Networks. Jonathan S. Turner. Computer and Communications Research Center. the result of user connections that pass through the

Packet Switching Networks. Jonathan S. Turner. Computer and Communications Research Center. the result of user connections that pass through the Fluid Flow Loading Analysis of Packet Switcing Networks Jonatan S. Turner Computer and Communications Researc Center Wasington University, St. Louis, MO 63130 Abstract Recent researc in switcing as concentrated

More information

13.5 DIRECTIONAL DERIVATIVES and the GRADIENT VECTOR

13.5 DIRECTIONAL DERIVATIVES and the GRADIENT VECTOR 13.5 Directional Derivatives and te Gradient Vector Contemporary Calculus 1 13.5 DIRECTIONAL DERIVATIVES and te GRADIENT VECTOR Directional Derivatives In Section 13.3 te partial derivatives f x and f

More information

Intra- and Inter-Session Network Coding in Wireless Networks

Intra- and Inter-Session Network Coding in Wireless Networks Intra- and Inter-Session Network Coding in Wireless Networks Hulya Seferoglu, Member, IEEE, Atina Markopoulou, Member, IEEE, K K Ramakrisnan, Fellow, IEEE arxiv:857v [csni] 3 Feb Abstract In tis paper,

More information

SLOTTED-RING LOCAL AREA NETWORKS WITH MULTIPLE PRIORITY STATIONS. Hewlett-Packard Company East Mission Avenue. Bogazici University

SLOTTED-RING LOCAL AREA NETWORKS WITH MULTIPLE PRIORITY STATIONS. Hewlett-Packard Company East Mission Avenue. Bogazici University SLOTTED-RING LOCAL AREA NETWORKS WITH MULTIPLE PRIORITY STATIONS Sanuj V. Sarin 1, Hakan Delic 2 and Jung H. Kim 3 1 Hewlett-Packard Company 24001 East Mission Avenue Spokane, Wasington 99109, USA 2 Signal

More information

An Analytical Approach to Real-Time Misbehavior Detection in IEEE Based Wireless Networks

An Analytical Approach to Real-Time Misbehavior Detection in IEEE Based Wireless Networks Tis paper was presented as part of te main tecnical program at IEEE INFOCOM 20 An Analytical Approac to Real-Time Misbeavior Detection in IEEE 802. Based Wireless Networks Jin Tang, Yu Ceng Electrical

More information

Geometric Multigrid on Multicore Architectures: Performance-Optimized Complex Diffusion

Geometric Multigrid on Multicore Architectures: Performance-Optimized Complex Diffusion Geometric Multigrid on Multicore Architectures: Performance-Optimized Complex Diffusion M. Stürmer, H. Köstler, and U. Rüde Lehrstuhl für Systemsimulation Friedrich-Alexander-Universität Erlangen-Nürnberg

More information

Piecewise Polynomial Interpolation, cont d

Piecewise Polynomial Interpolation, cont d Jim Lambers MAT 460/560 Fall Semester 2009-0 Lecture 2 Notes Tese notes correspond to Section 4 in te text Piecewise Polynomial Interpolation, cont d Constructing Cubic Splines, cont d Having determined

More information

MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2

MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2 MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2 Note: Tere will be a very sort online reading quiz (WebWork) on eac reading assignment due one our before class on its due date. Due dates can be found

More information

Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks

Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks Peyman Faizian, Md Safayat Raman, Md Atiqul Molla, Xin Yuan Department of Computer Science Florida State University

More information

Design of PSO-based Fuzzy Classification Systems

Design of PSO-based Fuzzy Classification Systems Tamkang Journal of Science and Engineering, Vol. 9, No 1, pp. 6370 (006) 63 Design of PSO-based Fuzzy Classification Systems Cia-Cong Cen Department of Electronics Engineering, Wufeng Institute of Tecnology,

More information

Announcements SORTING. Prelim 1. Announcements. A3 Comments 9/26/17. This semester s event is on Saturday, November 4 Apply to be a teacher!

Announcements SORTING. Prelim 1. Announcements. A3 Comments 9/26/17. This semester s event is on Saturday, November 4 Apply to be a teacher! Announcements 2 "Organizing is wat you do efore you do someting, so tat wen you do it, it is not all mixed up." ~ A. A. Milne SORTING Lecture 11 CS2110 Fall 2017 is a program wit a teac anyting, learn

More information

Alternating Direction Implicit Methods for FDTD Using the Dey-Mittra Embedded Boundary Method

Alternating Direction Implicit Methods for FDTD Using the Dey-Mittra Embedded Boundary Method Te Open Plasma Pysics Journal, 2010, 3, 29-35 29 Open Access Alternating Direction Implicit Metods for FDTD Using te Dey-Mittra Embedded Boundary Metod T.M. Austin *, J.R. Cary, D.N. Smite C. Nieter Tec-X

More information

Coarticulation: An Approach for Generating Concurrent Plans in Markov Decision Processes

Coarticulation: An Approach for Generating Concurrent Plans in Markov Decision Processes Coarticulation: An Approac for Generating Concurrent Plans in Markov Decision Processes Kasayar Roanimanes kas@cs.umass.edu Sridar Maadevan maadeva@cs.umass.edu Department of Computer Science, University

More information

SlidesGen: Automatic Generation of Presentation Slides for a Technical Paper Using Summarization

SlidesGen: Automatic Generation of Presentation Slides for a Technical Paper Using Summarization Proceedings of te Twenty-Second International FLAIRS Conference (2009) SlidesGen: Automatic Generation of Presentation Slides for a Tecnical Paper Using Summarization M. Sravanti, C. Ravindranat Cowdary

More information

12.2 TECHNIQUES FOR EVALUATING LIMITS

12.2 TECHNIQUES FOR EVALUATING LIMITS Section Tecniques for Evaluating Limits 86 TECHNIQUES FOR EVALUATING LIMITS Wat ou sould learn Use te dividing out tecnique to evaluate its of functions Use te rationalizing tecnique to evaluate its of

More information

Computer Physics Communications

Computer Physics Communications Computer Pysics Communications 233 (2018) 156 166 Contents lists available at ScienceDirect Computer Pysics Communications journal omepage: www.elsevier.com/locate/cpc Afivo: A framework for quadtree/octree

More information

CORDIC-Based LMMSE Equalizer for Software Defined Radio

CORDIC-Based LMMSE Equalizer for Software Defined Radio CORDIC-Based LMMSE Equalizer for Software Defined Radio Murugappan SENTHILVELAN, Javier HORMIGO, Joon Hwa CHUN, Miai SIMA *, Daniel IANCU, Micael SCHULTE, Jon GLOSSNER Optimum Semiconductor Tecnologies,

More information

3.6 Directional Derivatives and the Gradient Vector

3.6 Directional Derivatives and the Gradient Vector 288 CHAPTER 3. FUNCTIONS OF SEVERAL VARIABLES 3.6 Directional Derivatives and te Gradient Vector 3.6.1 Functions of two Variables Directional Derivatives Let us first quickly review, one more time, te

More information

An Effective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Chiming Huang and Rei-Heng Cheng

An Effective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Chiming Huang and Rei-Heng Cheng An ffective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Ciming Huang and ei-heng Ceng 5 De c e mbe r0 International Journal of Advanced Information Tecnologies (IJAIT),

More information

Arrays in a Lazy Functional Language a case study: the Fast Fourier Transform

Arrays in a Lazy Functional Language a case study: the Fast Fourier Transform Arrays in a Lazy Functional Language a case study: te Fast Fourier Transform Pieter H. Hartel and Willem G. Vree Department of Computer Systems University of Amsterdam Kruislaan 403, 1098 SJ Amsterdam,

More information

Forward End-To-End delay Analysis for AFDX networks

Forward End-To-End delay Analysis for AFDX networks Forward End-To-End delay Analysis for AFDX networks Nassima Benammar, Frederic Ridouard, Henri Bauer and Pascal Ricard e-mail: {nassima.benammar,frederic.ridouard, pascal.ricard}@univ-poitiers.fr enri.bauer@ensma.fr

More information

Proceedings of the 8th WSEAS International Conference on Neural Networks, Vancouver, British Columbia, Canada, June 19-21,

Proceedings of the 8th WSEAS International Conference on Neural Networks, Vancouver, British Columbia, Canada, June 19-21, Proceedings of te 8t WSEAS International Conference on Neural Networks, Vancouver, Britis Columbia, Canada, June 9-2, 2007 3 Neural Network Structures wit Constant Weigts to Implement Dis-Jointly Removed

More information

4.2 The Derivative. f(x + h) f(x) lim

4.2 The Derivative. f(x + h) f(x) lim 4.2 Te Derivative Introduction In te previous section, it was sown tat if a function f as a nonvertical tangent line at a point (x, f(x)), ten its slope is given by te it f(x + ) f(x). (*) Tis is potentially

More information

UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING

UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING Raffaele Gaetano, Giuseppe Scarpa, Giovanni Poggi, and Josiane Zerubia Dip. Ing. Elettronica e Telecomunicazioni,

More information

George Xylomenos and George C. Polyzos. with existing protocols and their eæciency in terms of

George Xylomenos and George C. Polyzos. with existing protocols and their eæciency in terms of IP MULTICASTING FOR WIRELESS MOBILE OSTS George Xylomenos and George C. Polyzos fxgeorge,polyzosg@cs.ucsd.edu Computer Systems Laboratory Department of Computer Science and Engineering University of California,

More information

Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid Flows

Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid Flows Journal of Computational Pysics 176, 40 69 (2002) doi:10.1006/jcp.2001.6967, available online at ttp://www.idealibrary.com on Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid

More information

The (, D) and (, N) problems in double-step digraphs with unilateral distance

The (, D) and (, N) problems in double-step digraphs with unilateral distance Electronic Journal of Grap Teory and Applications () (), Te (, D) and (, N) problems in double-step digraps wit unilateral distance C Dalfó, MA Fiol Departament de Matemàtica Aplicada IV Universitat Politècnica

More information

A Novel QC-LDPC Code with Flexible Construction and Low Error Floor

A Novel QC-LDPC Code with Flexible Construction and Low Error Floor A Novel QC-LDPC Code wit Flexile Construction and Low Error Floor Hanxin WANG,2, Saoping CHEN,2,CuitaoZHU,2 and Kaiyou SU Department of Electronics and Information Engineering, Sout-Central University

More information

Software Fault Prediction using Machine Learning Algorithm Pooja Garg 1 Mr. Bhushan Dua 2

Software Fault Prediction using Machine Learning Algorithm Pooja Garg 1 Mr. Bhushan Dua 2 IJSRD - International Journal for Scientific Researc & Development Vol. 3, Issue 04, 2015 ISSN (online): 2321-0613 Software Fault Prediction using Macine Learning Algoritm Pooja Garg 1 Mr. Busan Dua 2

A VOLUME-BASED METHOD FOR DENOISING ON CURVED SURFACES

Harry Biddle Double Negative Visual Effects London, UK b@dneg.com Ingrid von Glehn, Colin B. Macdonald, Thomas März Mathematical Institute University

An Evaluation of Domain-Specific Language Technologies for Code Generation

Christian Schmitt, Sebastian Kuckuk, Harald Köstler, Frank Hannig, Jürgen Teich Hardware/Software Co-Design, System Simulation,

Brief Contributions. A Hybrid Flash File System Based on NOR and NAND Flash Memories for Embedded Devices 1 INTRODUCTION

1002 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 7, JULY 2008 Brief Contributions A Hybrid Flash File System Based on NOR and NAND Flash Memories for Embedded Devices Chul Lee, Student Member, IEEE, Sung

Haar Transform CS 430 Denbigh Starkey

1. Background 2. Computing the transform 3. Restoring the original image from the transform 4. Producing the transform matrix 5. Using Haar for lossless compression 6. Using Haar

Application of a Key Value Paradigm to Logic Factoring

INVITED PAPER Application of a Key Value Paradigm to Logic Factoring The author first revisits a classic algorithm for algebraic factoring to establish a stronger connection to the functional intent rather than

Analytical CHEMISTRY

ISSN : 974-749 Graph kernels and applications in protein classification Jiang Qiangrong*, Xiong Zikang, Zai Can Department of Computer Science, Beijing University of Technology, Beijing, (CHINA) E-mail:

Computing geodesic paths on manifolds

Proc. Natl. Acad. Sci. USA Vol. 95, pp. 8431-8435, July 1998 Applied Mathematics Computing geodesic paths on manifolds R. Kimmel* and J. A. Sethian Department of Mathematics and Lawrence Berkeley National

Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined with Geometric Method

International Journal of Statistics and Applications 0, (): -0 DOI: 0.9/j.statistics.000.0 Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined

Redundancy Awareness in SQL Queries

Bin Cao and Antonio Badia Computer Engineering and Computer Science Department University of Louisville {bin.cao,abadia}@louisville.edu Abstract In this paper, we study SQL

Integrating Multimedia Applications in Hard Real-Time Systems

Luca Abeni and Giorgio Buttazzo Scuola Superiore S. Anna, Pisa luca@arti.sssup.it, giorgio@sssup.it Abstract This paper focuses on the problem

A Statistical Approach for Target Counting in Sensor-Based Surveillance Systems

Proceedings IEEE INFOCOM A Statistical Approach for Target Counting in Sensor-Based Surveillance Systems Dengyuan Wu, Dechang Chen, Kai Xing, Xiuzhen Cheng Department of Computer Science, The George Washington University,

FOUR POINT HIGH ORDER COMPACT ITERATIVE SCHEMES FOR THE SOLUTION OF THE HELMHOLTZ EQUATION TENG WAI PING UNIVERSITI SAINS MALAYSIA

2015 FOUR POINT HIGH ORDER COMPACT ITERATIVE SCHEMES FOR THE SOLUTION

Network Coding-Aware Queue Management for Unicast Flows over Coded Wireless Networks

Hulya Seferoglu, Athina Markopoulou EECS Dept, University of California, Irvine {seferog, athina}@uci.edu Abstract We

4.1 Tangent Lines

4.1 Tangent Lines Introduction Recall that the slope of a line tells us how fast the line rises or falls. Given distinct points (x_1, y_1) and (x_2, y_2), the slope of the line through these two points is change

CE 221 Data Structures and Algorithms

Chapter 4: Trees (AVL Trees) Text: Read Weiss, 4.4 Izmir University of Economics AVL Trees An AVL (Adelson-Velskii and Landis) tree is a binary search tree with a balance

SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

SEATTLE: A Scalable Ethernet Architecture for Large Enterprises CHANGHOON KIM, Microsoft MATTHEW CAESAR, University of

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

Large scale Imaging on Current Many-Core Platforms

SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

Nonlinear Diffusion vs. Wavelet Based Noise Reduction in CT Using Correlation Analysis

M. Mayer, A. Borsdorf, H. Köstler, J. Hornegger, and U. Rüde Lehrstuhl für Mustererkennung & Lehrstuhl für Systemsimulation

Test Generation for Acyclic Sequential Circuits with Hold Registers

Tomoo Inoue, Debesh Kumar Das, Chiiho Sano, Takahiro Mihara, and Hideo Fujiwara Faculty of Information Sciences Computer Science and Engineering
