OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch

Size: px

Start display at page:

Download "OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch"

Octavia Richardson
5 years ago
Views:

1 OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch

2 Motivation Increasing use of heterogeneous systems to overcome CPU power limitations OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 2

3 Motivation Increasing use of heterogeneous systems to overcome CPU power limitations FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 32

4 Motivation Increasing use of heterogeneous systems to overcome CPU power limitations FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) Programming of heterogeneous systems is non-trivial OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 42

5 Motivation Increasing use of heterogeneous systems to overcome CPU power limitations FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) Programming of heterogeneous systems is non-trivial Desirable: Programming with a single, portable code base OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 52

6 OpenMP Device Offloading #pragma omp target \ map(to:x[0:size]) \ map(tofrom:y[0:size]) { #pragma omp parallel for[...] for(i=0; i<size; i++){ Target Region y[i] = a*x[i]+y[i]; } } Denote target regions to execute on device OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 63

7 OpenMP Device Offloading #pragma omp target \ map(to:x[0:size]) \ map(tofrom:y[0:size]) { #pragma omp parallel for[...] for(i=0; i<size; i++){ Target Region y[i] = a*x[i]+y[i]; } } Denote target regions to execute on device Specify which and how data is transferred to device memory OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 73

8 OpenMP Device Offloading #pragma omp target \ map(to:x[0:size]) \ map(tofrom:y[0:size]) { #pragma omp parallel for[...] for(i=0; i<size; i++){ y[i] = a*x[i]+y[i]; } } Denote target regions to execute on device Specify which and how data is transferred to device memory Use additional parallel constructs inside target region (also targetspecific, e.g. teams, distribute,...) OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 83

9 Goal Implement mapping of target regions to FPGA accelerators in LLVM Clang Preserve FPGA-specific pragmas (e.g. Vivado HLS) Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 94

10 Goal Implement mapping of target regions to FPGA accelerators in LLVM Clang Preserve FPGA-specific pragmas (e.g. Vivado HLS) Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable Extend LLVM OpenMP Runtime Manage data-transfers between host and FPGA Control device execution on the FPGA accelerator OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 104

11 ThreadPoolComposer Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems Synthesize accelerator from kernel code TPC is available as open source: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 115

12 ThreadPoolComposer Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems Assemble (multiple) instances of different kernels in top-level design, combined with standardized host- and memory connection TPC is available as open source: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 125

13 ThreadPoolComposer Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems Control execution and datatransfer using two-layered API Higher-level TPC API is device/platform-agnostic, allows for portable implementation (write once, run everywhere) TPC is available as open source: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 135

14 Compilation Flow Start from a single, portable source file OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 146

15 Compilation Flow Start from a single, portable source file Standard hostcompilation, including fallback if offloading fails OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 156

16 Compilation Flow Start from a single, portable source file Standard hostcompilation, including fallback if offloading fails One device-specific compilation flow per device type Limited to extracted target regions OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 166

17 Compilation Flow Custom Clang toolchain for TPC-based offloading to FPGA accelerators Identified with new LLVM target triple Preserves FPGAspecific pragmas, e.g. Vivado HLS pragmas Yields three artifacts OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 176

18 Compilation Flow TPC-specific software binary Entry point for FPGA device execution Transfers kernel arguments and launches hardware execution using TPC API OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 186

19 Compilation Flow TPC-specific software binary Entry point for FPGA device execution Transfers kernel arguments and launches hardware execution using TPC API Included in the combined binary OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 196

20 Compilation Flow Hardware kernel code extracted from target region OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 206

21 Compilation Flow Hardware kernel code extracted from target region Description of input argument types OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 216

22 Compilation Flow Hardware kernel code extracted from target region Description of input argument types TPC automates synthesis from kernel code and description to full FPGA design No additional user input required! OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 226

23 Compilation Flow Hardware kernel code extracted from target region Description of input argument types TPC automates synthesis from kernel code and description to full FPGA design Resulting bitstream features standardized host- and memory connection OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 236

24 Host Runtime Flow Components: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 247

25 Host Runtime Flow Components: LLVM OpenMP Runtime Infrastructure OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 257

26 Runtime Flow Host Components: LLVM OpenMP Runtime Infrastructure TPC-based plugin for LLVM OpenMP Runtime OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 267

27 Runtime Flow Host Components: LLVM OpenMP Runtime Infrastructure TPC-based plugin for LLVM OpenMP Runtime TPC-specific software binary resulting from compilation OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 277

28 Runtime Flow Host Components: LLVM OpenMP Runtime Infrastructure TPC-based plugin for LLVM OpenMP Runtime TPC-specific software binary resulting from compilation FPGA abstraction as provided by TPC OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 287

29 Host Runtime Flow Host-centric: Execution starts on the host If target region is encountered: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 297

30 Runtime Flow Host Host-centric: Execution starts on the host If target region is encountered: Transfer data to FPGA memory OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 307

31 Runtime Flow Host Host-centric: Execution starts on the host If target region is encountered: Transfer data to FPGA memory Invoke binary Sets kernel arguments Launches hardware execution OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 317

32 Runtime Flow Host Host-centric: Execution starts on the host If target region is encountered: Transfer data to FPGA memory Invoke binary Sets kernel arguments Launches hardware execution Transfer data back to host OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 327

33 Evaluation Proof-of-concept implementation based on development version of LLVM/Clang/LLVM OpenMP Runtime Evaluation using integer BLAS kernels from Adept benchmark suite AXPY Vector scaling Vector dot product Dense matrix vector multiplication Euclidean norm for vector OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 338

34 Evaluation Using Xilinx Vivado HLS Kernels annotated with Vivado HLS pipeline pragma 250 MHz kernel operation frequency #pragma omp target \ map(to:x[0:size]) \ map(tofrom:y[0:size]) { #pragma omp parallel for[...] for(i=0; i<size; i++){ #pragma HLS PIPELINE II=1 y[i] = a*x[i]+y[i]; } } OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 349

35 Evaluation FPGA-board: VC709 (Virtex 7), 4 GiB on-board memory Single instance of each kernel ( single-core ) Comparison against x86-cpu (i7-6700k, 16 GiB DDR4- RAM) running 4 OpenMP threads OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 35 10

36 Insights Fully functional implementation of OpenMP offloading to FPGAs Custom compilation flow mapping target region to hardware kernel without additional user input Extension of runtime library based on ThreadPoolComposer API OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 36 11

37 Insights Fully functional implementation of OpenMP offloading to FPGAs Custom compilation flow mapping target region to hardware kernel without additional user input Extension of runtime library based on ThreadPoolComposer API Program a FPGA-based heterogeneous system with a single, portable code-base OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 37 11

38 Insights Fully functional implementation of OpenMP offloading to FPGAs Custom compilation flow mapping target region to hardware kernel without additional user input Extension of runtime library based on ThreadPoolComposer API Program a FPGA-based heterogeneous system with a single, portable code-base Offloading overhead primarily dependent on size of data transfered to/from device memory OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 38 11

39 Insights Fully functional implementation of OpenMP offloading to FPGAs Custom compilation flow mapping target region to hardware kernel without additional user input Extension of runtime library based on ThreadPoolComposer API Program a FPGA-based heterogeneous system with a single, portable code-base Offloading overhead primarily dependent on size of data transfered to/from device memory Pipelining results in 2x speedup over non-pipelined kernel OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 39 11

40 Insights Single PE execution slower than quad-core X86 (6.7x/3.4x) OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 40 12

41 Insights Single PE execution slower than quad-core X86 (6.7x/3.4x) Kernels very area-efficient (1-5% logic utilization) OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 412

42 Insights Single PE execution slower than quad-core X86 (6.7x/3.4x) Kernels very area-efficient (1-5% logic utilization) Future work: Make use of coarse-grain parallelism (e.g., teams distribute) Distribute computation across multiple PEs OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 42 12

43 Stop by at our poster Questions? Get in touch: Download open-source release of ThreadPoolComposer: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 43 13

OpenMP Device Offloading to FPGA Accelerators

OpenMP Device Offloading to FPGA Accelerators Lukas Sommer Embedded Systems and Applications Group sommer@esa.tu-darmstadt.de Jens Korinth Computer Systems Group korinth@rs.tu-darmstadt.de Andreas Koch