Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators
Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler
Hardware/Software Co-Design, System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
HiStencils, Amsterdam, The Netherlands; January 20, 2015

Motivation
- Multigrid methods are widely used:
  - Solution of discretized PDEs
  - Preconditioners for other iterative solvers
- and highly scalable:
  - O(N) operations
  - but also on a device scale: from embedded hardware to supercomputers!
- But: the most efficient implementation varies greatly with the numerical problem and the target architecture!
- Our research: generation of multigrid-based solvers and automatic application of domain- and target-specific optimizations
Christian Schmitt, FAU. Generation of Multigrid-based Numerical Solvers for FPGA Accelerators. HiStencils15

Motivation
- The code generator works for supercomputers, e.g., on JUQUEEN (TOP500 #8) [1].
  [Figure: total runtime in seconds (axis from 0 to 10) over the number of cores (512 to 256k).]
- But can it also work on the other end of the scale, i.e., for energy-efficient embedded devices such as FPGAs?

[1] Christian Schmitt et al. ExaSlang: A Domain-Specific Language for Highly Scalable Multigrid Solvers. In: Proceedings of the 4th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC). (New Orleans, LA, USA). IEEE Computer Society, Nov. 17, 2014, pp. 42–51. ISBN: 978-1-4799-7020-9. DOI: 10.1109/WOLFHPC.2014.11

Basic Multigrid Ideas
Multigrid method:
1. Pre-smoothing
2. Calculation of residual
3. Restriction
4. Recursive call(s) or solve (at coarsest level)
5. Prolongation
6. Correction
7. Post-smoothing
[Figure: two cycle diagrams plotting the multigrid level (0 to 4) over time.]
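
The seven steps can be sketched as a recursive V-cycle. The following is a minimal illustration only, not the generated code from the talk: it assumes a 1D Poisson problem -u'' = f with zero boundary values on a grid of 2^k + 1 points, and the helper names (`smooth`, `residual`, `restrict_fw`, `prolongate`, `v_cycle`), the damping factor 2/3, and the transfer operators are all choices made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<double>;  // grid values including both boundary points

// Damped Jacobi sweeps for the stencil (-1, 2, -1) / h^2 (illustrative helper).
void smooth(Grid& u, const Grid& f, double h, int sweeps) {
  const double omega = 2.0 / 3.0;  // assumed damping factor for this 1D example
  for (int s = 0; s < sweeps; ++s) {
    Grid old = u;
    for (std::size_t i = 1; i + 1 < u.size(); ++i) {
      double jacobi = 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i]);
      u[i] = (1.0 - omega) * old[i] + omega * jacobi;
    }
  }
}

// Residual r = f - A u at interior points; boundary entries stay zero.
Grid residual(const Grid& u, const Grid& f, double h) {
  Grid r(u.size(), 0.0);
  for (std::size_t i = 1; i + 1 < u.size(); ++i)
    r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
  return r;
}

// Full-weighting restriction to a grid with half as many intervals.
Grid restrict_fw(const Grid& fine) {
  Grid coarse((fine.size() - 1) / 2 + 1, 0.0);
  for (std::size_t i = 1; i + 1 < coarse.size(); ++i)
    coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i] + 0.25 * fine[2 * i + 1];
  return coarse;
}

// Linear interpolation back to the next finer grid.
Grid prolongate(const Grid& coarse) {
  Grid fine(2 * (coarse.size() - 1) + 1, 0.0);
  for (std::size_t i = 0; i + 1 < coarse.size(); ++i) {
    fine[2 * i] = coarse[i];
    fine[2 * i + 1] = 0.5 * (coarse[i] + coarse[i + 1]);
  }
  fine.back() = coarse.back();
  return fine;
}

void v_cycle(Grid& u, const Grid& f, double h) {
  assert(u.size() == f.size() && u.size() >= 3);
  if (u.size() == 3) {  // coarsest level: one interior point, solve directly
    u[1] = 0.5 * (u[0] + u[2] + h * h * f[1]);
    return;
  }
  smooth(u, f, h, 2);                        // 1. pre-smoothing
  Grid r = residual(u, f, h);                // 2. calculation of residual
  Grid r_coarse = restrict_fw(r);            // 3. restriction
  Grid e_coarse(r_coarse.size(), 0.0);
  v_cycle(e_coarse, r_coarse, 2.0 * h);      // 4. recursive call
  Grid e = prolongate(e_coarse);             // 5. prolongation
  for (std::size_t i = 1; i + 1 < u.size(); ++i)
    u[i] += e[i];                            // 6. correction
  smooth(u, f, h, 2);                        // 7. post-smoothing
}
```

Calling `v_cycle` repeatedly on a grid of 65 points with f = 1 drives the solution toward u(x) = x(1 - x)/2. The two sweeps before and after the recursion correspond to a V(2,2) cycle.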

Basic Multigrid Ideas
[Figure panels: residual on fine grid; smoother applied; residual on coarse grid.]

FPGA Basics
Field Programmable Gate Arrays
- Array of lookup tables and registers: configurable logic blocks (CLBs)
- Switch matrices to connect CLBs
- Trade-off between performance of hardware (ASIC) and flexibility of software
- Programming via Hardware Description Language, e.g., VHDL, Verilog

FPGA Basics
Spread across chip: Hard-IP cores
- Block RAM
  - Distributed memory
  - ~1000, 1-2 kb each
  - Very high memory bandwidth
- DSP Blocks
  - Dedicated multiplier/adder units
  - Typically 16 × 16 bit
  - Clock: ~500 MHz
- High-speed serial I/O
- PCIe hard-IP for communication with host PC
- Off-chip DDR3
- Soft IP support for various protocols (InfiniBand, Ethernet, ...)

FPGA Basics
- Basic principle: spatial computing, e.g., stream processing
- Temporal computing: sequential execution
- Spatial computing: parallel execution
[Figure: execution over time for temporal and spatial computing.]

FPGA Basics
Traditional Workflow
- Hand-coding HDL
- Register Transfer Level (RTL)
- Post Place & Route (PPnR)
- Upload configuration file to FPGA

High-level Synthesis (HLS)
- Behavioral description (algorithm, math. model), often in a subset of C/C++
- Conversion to structured description
- Register-transfer level (RTL)
- Connected blocks

Code shown on the slide:

    #include <hls_stream.h>  // Vivado HLS stream type
    // processmimo and BorderPadding are not defined on the slide; they
    // presumably come from the support library mentioned under Results.

    struct Smoother_1Kernel {
      double operator()(double rhsdata_1[3][3], double solutiondata_1[3][3]) {
        // 5-point Laplace stencil applied to the 3x3 solution window
        double temp1 = 4.0f * solutiondata_1[1][1] - solutiondata_1[2][1]
                     - solutiondata_1[0][1] - solutiondata_1[1][2]
                     - solutiondata_1[1][0];
        // local residual at the window center
        double temp2 = rhsdata_1[1][1] - temp1;
        // updated solution value at the window center
        double temp3 = solutiondata_1[1][1] + temp2 / 2;
        return temp3;
      }
    };

    void Smoother_1(hls::stream<double>& rhs_in, hls::stream<double>& data,
                    hls::stream<double>& sol, hls::stream<double>& rhs_out) {
      struct Smoother_1Kernel Smoother_1_inst;
      processmimo<16384, 32, 32, 3>(rhs_in, data, sol, rhs_out, 32, 32,
                                    Smoother_1_inst, BorderPadding::BORDER_CLAMP);
    }

ExaSlang
Multi-layered DSL
- Description of multigrid-based numerical solvers
- Layer 4 aimed at computer scientists
- Explicitly parallel by providing simple communication statements
- Definition of arrays, stencils, loops
- Explicit addressing of different multigrid levels

Level Specifications
- Referencing of multigrid levels
  - Absolute: @0, @coarsest
  - Relative: @current, @coarser, @finer
- Used to implement multigrid recursion

ExaSlang
Fields and Layouts
- Represent multi-dimensional arrays
- Size(s) determined automatically
- Layout determines datatype, communication
- Multiple fields can have the same layout

    Layout NoComm <Real> @all {
      ghostlayers = [ 0, 0 ]
      duplicatelayers = [ 1, 1 ]
    }
    Field Solution <global, NoComm, 0.0> @all

ExaSlang
Stencils

    Stencil Laplace @all {
      [ 0, 0] => 4.0, [ 1, 0] => -1.0, [-1, 0] => -1.0
      [ 0, 1] => -1.0, [ 0, -1] => -1.0
    }

Loops

    loop over Solution @current {
      Solution2 @current = Solution @current
        + (((1.0 / diag(Laplace @current)) * 0.8)
           * (RHS @current - Laplace @current * Solution @current))
    }

- Bounds of loop determined by field
- One loop can be mapped to one kernel function
- Arguments for function via dependency analysis
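
For readers unfamiliar with the notation, the loop body is one damped Jacobi sweep: the new value is the old value plus 0.8 / diag times the local residual. A minimal software sketch of the same update in plain C++ follows. It is an illustration under assumptions, not the generator's output: the `Field` type, the function name `jacobi_sweep`, and the choice to leave boundary values untouched are all made up for this example. The stencil weights (4 at the center, -1 for the four neighbors) and the factor 0.8 are taken from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Field = std::vector<std::vector<double>>;  // row-major 2D array

// One damped Jacobi sweep over the interior points:
//   next = solution + (1 / diag) * omega * (rhs - Laplace * solution)
Field jacobi_sweep(const Field& solution, const Field& rhs, double omega = 0.8) {
  assert(solution.size() == rhs.size());
  const double diag = 4.0;  // center weight of the Laplace stencil
  Field next = solution;    // boundary values are copied unchanged
  for (std::size_t y = 1; y + 1 < solution.size(); ++y) {
    for (std::size_t x = 1; x + 1 < solution[y].size(); ++x) {
      // Apply the 5-point stencil to the old solution.
      double applied = diag * solution[y][x]
                     - solution[y][x + 1] - solution[y][x - 1]
                     - solution[y + 1][x] - solution[y - 1][x];
      next[y][x] = solution[y][x] + (1.0 / diag) * omega * (rhs[y][x] - applied);
    }
  }
  return next;
}
```

Writing into a second array mirrors the use of `Solution2` on the slide: every update reads only old values, so all points can be processed independently.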

Mapping to FPGAs
- Resolve stencil applications per multigrid level
- Map "loop over" statements to separate IP cores
- Dependency analysis: add fields to IP core inputs/outputs
- Calculate field (stream) sizes for IP core
- Replace "loop over" statements with process statements
- Connect IP cores with streams and insert copy/split kernels to duplicate streams
- Add iteration intervals from simulation
- Resource sharing optimizations
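
Copy/split kernels are needed because a value read from a FIFO stream is consumed, so a field that feeds two IP cores has to be duplicated explicitly. The sketch below models that idea in software only. It uses `std::queue` as a stand-in for a hardware stream, and the names `Stream` and `copy_kernel` are hypothetical; the generated kernels from the talk are not shown in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

using Stream = std::queue<double>;  // software stand-in for a hardware FIFO

// Copy kernel: consumes `count` values from `in` and forwards each value
// to both outputs, so two downstream kernels can read the same field.
void copy_kernel(Stream& in, Stream& out_a, Stream& out_b, std::size_t count) {
  assert(in.size() >= count);
  for (std::size_t i = 0; i < count; ++i) {
    double value = in.front();
    in.pop();           // reading a stream element removes it
    out_a.push(value);
    out_b.push(value);
  }
}
```

The `count` argument plays the role of the field (stream) size calculated for each IP core in the preceding step.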

Mapping to FPGAs
- Kernels connected via streams:
  [Figure: dataflow graph of the level-8 kernels (smoothing, residual, copy, downsampling, upsampling, correction) connected by streams between the inputs data_in and rhs_in and the output data_out.]
- FIFO buffers needed between cores for different stream sizes, i.e., downsampling and upsampling

Mapping to FPGAs
- Stages can be reused due to FIFO buffering:
  [Figure: 8-stage multigrid solver. The stages for levels 8 down to 1 (smooth, residual, restrict, prolong, correct) are connected through buffers (B). Annotated iteration intervals: II=1 at level 8, II=4 at level 7, II=4096 at level 2, II=16384 at level 1.]
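
The annotated iteration intervals fit a simple pattern: each coarser level has half as many points per dimension, hence a quarter of the data in 2D, so its stage can take four times as many cycles per element and still keep pace with the finest level. The sketch below reproduces the annotated values under that reading. It is an interpretation of the figure, not the method used in the talk, where the intervals come from simulation; the struct `LevelInfo`, the function `level_info`, and the formula II = 4^(finest - level) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

struct LevelInfo {
  std::uint64_t points_per_dim;  // grid points along one dimension
  std::uint64_t points;          // total grid points on this level (2D)
  std::uint64_t ii;              // assumed iteration interval of the stage
};

// Grid size and iteration interval for `level`, with `finest_level` being
// the level that processes `finest_points_per_dim` points per dimension.
LevelInfo level_info(int level, int finest_level,
                     std::uint64_t finest_points_per_dim) {
  assert(level >= 1 && level <= finest_level);
  const int shift = finest_level - level;
  LevelInfo info;
  info.points_per_dim = finest_points_per_dim >> shift;  // halved per level
  info.points = info.points_per_dim * info.points_per_dim;
  info.ii = std::uint64_t{1} << (2 * shift);             // 4^(finest - level)
  return info;
}
```

For 8 levels and a finest grid of 4096 points per dimension, `level_info(1, 8, 4096)` gives a 32 by 32 grid with an interval of 16384, which matches both the figure annotation and the template arguments in the smoother code shown earlier.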

Results
Setup
- V(2,2) solver for Poisson's equation
- 8 multigrid levels (fixed)
- Jacobi smoothers
- Jacobi applied multiple times for coarse grid solving
- Input grid size of 4096 × 4096

High-level synthesis
- Xilinx Vivado HLS v14.2
- Small support library to help with IP core instantiation [2]
- Buffer sizes calculated using external simulation tool

[2] Moritz Schmid et al. An Image Processing Library for C-based High-Level Synthesis. In: Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL). (Munich, Germany). Sept. 2–4, 2014

Results
Resource usage on FPGAs for double precision:

    FPGA           LUTs   FFs   DSPs   BRAMs   F_max [MHz]
    Kintex-7 [3]   140%   43%   111%   124%    232.0
    Virtex-7 [4]    73%   29%    33%    53%    229.4

- Sharing of stages due to FIFO buffers
- Increase the iteration intervals (II) of coarser stages for resource sharing
- Double precision not possible for Kintex-7
- More stages could be added for Virtex-7

[3] Estimation by Vivado HLS. Place & Route not possible due to resource constraints.
[4] PPnR result.

Results
Performance figures for a single V-cycle:

    Target         Runtime [ms]   Throughput [Vps] [5]
    FPGA [6]        83.1          12.3
    Intel i7 [7]   223.1           4.5

[5] V-cycles per second.
[6] Performance is the same for Kintex-7 (XC7K325T) and Virtex-7 (XC7VX485T). Single precision on Kintex-7, double precision on Virtex-7.
[7] Intel i7-3770, 3.40 GHz, single thread. Besides AVX, no optimizations were applied by our code generator. Double precision. Code is memory-bandwidth bound. Compiled with -O3.

Summary
Conclusions
- Code generation for FPGAs based on HDL
- ExaStencils code generator flexible enough to emit code for a fundamentally different computing model
- Performance of mid-range FPGAs already promising

Future work
- Research (algorithmic) optimization potential
- Smarter grid traversal for 3D
- Automatic calculation of buffer sizes
- Partitioning among multiple FPGA boards
- Automatic application of HLS and hardware synthesis
- Automatic re-configuration at runtime if convergence prediction insufficient

Thanks for listening. Questions?
ExaStencils: Advanced Stencil Code Engineering
http://www.exastencils.org
ExaStencils is funded by the German Research Foundation (DFG) as part of the Priority Program 1648 (Software for Exascale Computing).