FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA

Similar documents
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Porting Performance across GPUs and FPGAs

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

HIERARCHICAL AND SCALABLE BUS ARCHITECTURE GENERATION ON FPGAS WITH HIGH-LEVEL SYNTHESIS YING CHEN THESIS

Lecture 21: High-level Synthesis (2)

Exploring OpenCL Memory Throughput on the Zynq

Register and Thread Structure Optimization for GPUs Yun (Eric) Liang, Zheng Cui, Kyle Rupnow, Deming Chen

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System

Copyright 2014 Xilinx

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.

OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch

SDSoC: Session 1

GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray

Vivado HLx Design Entry. June 2016

FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS

HEAD HardwarE Accelerated Deduplication

Performance Verification for ESL Design Methodology from AADL Models

Creating a Processor System Lab

ECE 5775 (Fall 17) High-Level Digital Design Automation. Hardware-Software Co-Design

Analyzing the Generation and Optimization of an FPGA Accelerator using High Level Synthesis

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

SoC Systeme ultra-schnell entwickeln mit Vivado und Visual System Integrator

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

Dr. Yassine Hariri CMC Microsystems

FPGA Entering the Era of the All Programmable SoC

High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication

NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Yet Another Implementation of CoRAM Memory

Introduction to Embedded System Design using Zynq

A So%ware Developer's Journey into a Deeply Heterogeneous World. Tomas Evensen, CTO Embedded So%ware, Xilinx

81920**slide. 1Developing the Accelerator Using HLS

Design methodology for multi processor systems design on regular platforms

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

MATLAB/Simulink 기반의프로그래머블 SoC 설계및검증

Burrows-Wheeler Short Read Aligner on AWS EC2 F1 Instances

GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray

Introduction Warp Processors Dynamic HW/SW Partitioning. Introduction Standard binary - Separating Function and Architecture

Optimised OpenCL Workgroup Synthesis for Hybrid ARM-FPGA Devices

Exploring Automatically Generated Platforms in High Performance FPGAs

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API

Paralization on GPU using CUDA An Introduction

Simulation, prototyping and verification of standards-based wireless communications

Versal: AI Engine & Programming Environment

Computing to the Energy and Performance Limits with Heterogeneous CPU-FPGA Devices. Dr Jose Luis Nunez-Yanez University of Bristol

CUDA Programming Model

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

Yafit Snir Arindam Guha Cadence Design Systems, Inc. Accelerating System level Verification of SOC Designs with MIPI Interfaces

FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS

Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)

Towards Optimal Custom Instruction Processors

Extending Model-Based Design for HW/SW Design and Verification in MPSoCs Jim Tung MathWorks Fellow

Hardware-Software Co-Design and Prototyping on SoC FPGAs Puneet Kumar Prateek Sikka Application Engineering Team

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

SoC Systeme ultra-schnell entwickeln mit Vivado und Visual System Integrator

Integrated Workflow to Implement Embedded Software and FPGA Designs on the Xilinx Zynq Platform Puneet Kumar Senior Team Lead - SPC

Implementation of Hardware Accelerators on Zynq

Extending the Power of FPGAs to Software Developers:

XPU A Programmable FPGA Accelerator for Diverse Workloads

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Hardware Software Co-design and SoC. Neeraj Goel IIT Delhi

Designing a Hardware in the Loop Wireless Digital Channel Emulator for Software Defined Radio

LegUp: Accelerating Memcached on Cloud FPGAs

Designing a Multi-Processor based system with FPGAs

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University

SDA: Software-Defined Accelerator for general-purpose big data analysis system

Project Kickoff CS/EE 217. GPU Architecture and Parallel Programming

TP : System on Chip (SoC) 1

Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path

Addressing Heterogeneity in Manycore Applications

FiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers

An OpenCL-based Framework for Rapid Virtual Prototyping of Heterogeneous Architectures

OpenMP Device Offloading to FPGA Accelerators

Hybrid Implementation of 3D Kirchhoff Migration

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning

Extending the Power of FPGAs

Time-Shared Execution of Realtime Computer Vision Pipelines by Dynamic Partial Reconfiguration

Efficient Compilation of CUDA Kernels for High-Performance Computing on FPGAs

Hardware/Software Co-design

Tutorial on Software-Hardware Codesign with CORDIC

Elaborazione dati real-time su architetture embedded many-core e FPGA

Efficient Hardware Context- Switch for Task Migration between Heterogeneous FPGAs

Requirement ZYNQ SOC Development Board: Z-Turn by MYiR ZYNQ-7020 (XC7Z020-1CLG400C) Vivado and Xilinx SDK TF Card Reader (Micro SD) Windows 7

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

Multilevel Granularity Parallelism Synthesis on FPGAs

Co-synthesis and Accelerator based Embedded System Design

FPGA design with National Instuments

Embedded HW/SW Co-Development

Manycore and GPU Channelisers. Seth Hall High Performance Computing Lab, AUT

GPUfs: Integrating a file system with GPUs

The Use of Cloud Computing Resources in an HPC Environment

借助 SDSoC 快速開發複雜的嵌入式應用

Ultra-Fast NoC Emulation on a Single FPGA

Transcription:

1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen, swathi.g, k.rupnow}@adsc.com.sg 2 Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA dchen@illinois.edu

2 Motivations GPUs and FPGAs: accelerators in HPC NVIDIA GPUs (Google, Baidu, Supercomputers, ) FPGAs (Microsoft Bing search, ) NVIDIA GPUs Massive parallelism thanks to the CUDA parallel programming model Power hungry FPGAs Low energy consumption, high flexibility Low programming abstraction High Level Synthesis But parallelism extraction may be limited

3 FCUDA History FCUDA Introduction CUDA-to-FPGA compiler [ASAP 09], [FCCM 11], [DAC 13] Front-end: CUDA-to-C (Cetus + MCUDA) Back-end: C-to-RTL (High Level Synthesis tool --Vivado HLS) Source-to-source transformation Does not employ NVIDIA compiler infrastructure (NVCC, NVVM IR,...) CUDA C kernel Pragma annotations CUDA C kernel (annot.) CUDA-to-C Transformation Synthesizable C kernel High Level Synthesis RTL

4 FCUDA History FCUDA Introduction CUDA-to-FPGA compiler [ASAP 09], [FCCM 11], [DAC 13] Front-end: CUDA-to-C (Cetus + MCUDA) Back-end: C-to-RTL (High Level Synthesis tool --Vivado HLS) Source-to-source transformation Does not employ NVIDIA compiler infrastructure (NVCC, NVVM IR,...) Why FCUDA? Enable CUDA kernel's execution on FPGA CUDA s SIMD programming model is attractive to FPGA

5 FCUDA System Backend Prior FCUDA works focused on building a prototype Single top-level function with centralized control of cores No discussion on a full system integration of cores vs. external memory System implementation works: FCUDA NoC: Mesh-based NoC [TVLSI 15] FCUDA HB: Hierarchical AXI Bus (under review) FCUDA SoC (this work)

6 Why SoC FPGA? ARM core(s) tightly coupled with FPGA fabric via AXI interconnect ARM cores for control, FPGA for accelerator Communication supported by Xilinx software (drivers) a suitable candidate for full system implementation!

7 FCUDA Open-source Available to download at: http://dchen.ece.illinois.edu/tools.html The open-source software includes: FCUDA CUDA-to-C compiler FCUDA Benchmarks: set of benchmarks for testing FCUDA FCUDA SoC tool flow: scripts for the automation of FCUDA system on SoC platform Future plan: FCUDA NoC, FCUDA HB, Design Space Exploration framework

FCUDA SoC An overview flow 8

FCUDA SoC Preproc. flow 9

FCUDA SoC HW flow 10

FCUDA SoC SW flow 11

12 FCUDA SoC Preproc. flow 1. Separation of CUDA kernels from Host code 2. Pragma annotations FCUDA pragmas Vivado HLS interface pragmas 3. CUDA-to-C Transformation 4. Generation of wrappers

13 FCUDA SoC Pragma annotations CUDA code Gen. C code Pragmas #pragma FCUDA TRANSFER #pragma FCUDA COMPUTE #pragma HLS INTERFACE ap_bus port=x #pragma HLS RESOURCE core=axi4m variable=x #pragma HLS INTERFACE ap_none register port=x #pragma HLS RESOURCE core=axi4lites variable=x #pragma HLS RESOURCE core=axi4lites variable=return Purpose Translate assignment statement to memcpy() Insert thread loops at synchronization points Apply ap_bus protocol to pointer X Create AXI4M bus wrapper for pointer X Apply ap_register protocol to scalar x Create AXI4LiteS bus wrapper for scalar x Generate driver files for the HLS IP

14 FCUDA SoC An Example C = A * x + B * y

15 FCUDA SoC CUDA-to-C transformation Workload distribution

16 FCUDA SoC An Example Communication interface pointers: ap_bus read/write data to/from memory scalars: register static during core s execution

FCUDA SoC Single core design 17

18 FCUDA SoC Single core design HLS IP

19 FCUDA SoC Single core design Zynq Processing System

20 FCUDA SoC Single core design AXI Interconnects IPs (data + control)

21 FCUDA SoC Multi-core design Fixed core design (original FCUDA design) Even workloads Odd workloads

22 FCUDA SoC Multi-core design Fixed core design (original FCUDA design) Each core is a different IP! different drivers Even threadblocks Odd threadblocks

23 FCUDA SoC Multi-core design Flexible core design Core is parameterized

24 FCUDA SoC Multi-core design Instantiate cores into the design with different core_id and same num_cores Issues: Cores will have separate control! sequential control of cores The number of AXI Interconnect increases Connection of AXI4M (data) Connection of AXI4LiteS (control) Solution: grouping all the cores into one single IP All cores share the same control signal num_cores=2, core_id = 0 num_cores=2, core_id = 1

25 FCUDA SoC Multi kernels Applications which have multiple dependent kernels KERNEL 1 Read Write Read Memory KERNEL 2 Write

26 FCUDA SoC Multi kernels Note: we do not use dynamic partial reconfiguration Problem: AXI4LiteS # AXI ports ~ resource consumption Kernel 2 s ports are unused when Kernel 1 is active (and vice versa) AXI4M AXI4LiteS AXI4M KERNEL 1 Wrapper KERNEL 1 Wrapper KERNEL 2 Wrapper KERNEL 2 Wrapper

27 FCUDA SoC Multi kernels Solution: merging 2 Kernels to share ports AXI4LiteS AXI4M KERNEL 1 Wrapper KERNEL 2 Wrapper

28 FCUDA SoC HW flow 5. Generation of HLS IP Find analytical max. number of cores based on HLS report of 1 core and kernels workloads 6. System Integration Zynq PS, AXI Interconnect, etc. 7. Perform Synthesis and Place & Route Reiterate until achieving max. frequency 8. Generate bitstream and configure FPGA

FCUDA SoC System integration 29

30 FCUDA SoC SW flow 9. Edit host code Remove CUDA's syntax Add driver control code 10. Compile and run on FPGA Bare-metal vs. OS

31 Experiments Two FPGA SoC devices: Z-7020 FF: 106400, LUT: 53200, BRAM: 140, DSP: 220 ARM Freq: 666 MHz Z-7045 FF: 218600, LUT: 437200, BRAM: 545, DSP: 900 ARM Freq: 800 MHz 16 CUDA benchmarks collected from NVIDIA SDK, Parboil suite, Rodinia suite Measure total execution time Compare against the CPU implementation (For GPU s comparison, please refer to our paper)

Speedup over CPU implementation s execution time 32 Experiments Performance Comparison of ARM + FPGA vs. ARM only on Z-7020 Numbers above the columns show the total instantiated cores in the design

Speedup over CPU implementation s execution time 33 Experiments Better than CPU

Speedup over CPU implementation s execution time 34 Experiments Worse than CPU + no shared memory + random access + limited board resource

Speedup over CPU implementation s execution time 35 Experiments Performance Comparison of ARM + FPGA vs. ARM only on Z-7020 (red) and Z-7045 (blue)

Speedup over CPU implementation s execution time 36 Experiments Performance Comparison of ARM + FPGA vs. ARM only on Z-7020 (red) and Z-7045 (blue)

Speedup over CPU implementation s execution time 37 Experiments Performance Comparison of ARM + FPGA vs. ARM only on Z-7020 (red) and Z-7045 (blue)

38 FCUDA SoC Best practices CPU control: Having a single IP --- keep things simple reduce management overhead Interconnect network consideration Partition signals between interfaces: data ports High Performance ports, control ports General Purpose ports Merge data ports of a core, shared AXI ports between kernels Good CUDA programming practices Coalescing memory access, local data reused, etc. Design Space Exploration should be carried out [FCCM 11] Find the best number of cores

39 Conclusion A design study of how to build a complete hardware/software system of a CUDA-to-FPGA flow targeting SoC FPGA Verified with 16 CUDA benchmarks Discussed key design decisions, best practices Our tool flow is open-source and can be found at: http://dchen.ece.illinois.edu/tools.html