Efficient Hardware Acceleration on SoC-FPGA using OpenCL


Efficient Hardware Acceleration on SoC-FPGA using OpenCL
Susmitha Gogineni
Advisor: Dr. Benjamin Carrion Schafer
30th August 2017

Presentation Overview
1. Objective & Motivation
2. Configurable SoC-FPGA
3. High Level Synthesis (HLS) & OpenCL
4. Hardware Acceleration on FPGA
5. Design Space Exploration (DSE)
6. Conclusions & Future Work

Motivation
The quest for performance- and power-efficient hardware platforms to meet processing needs.
Trends in hardware:
- Moore's Law led to increases in transistor density, frequency, and power.
- With the breakdown of Dennard scaling, it is no longer feasible to keep increasing frequency, due to power constraints.
What next? SoC-FPGA platforms:
- Performance is improved by offloading the computationally intensive parts of an application from the CPU to an FPGA.
- Intel's acquisition of Altera was largely motivated by this trend; Intel has recently announced that it is targeting around 30% of data-center servers being equipped with FPGAs by 2020.

Objective
1. Hardware acceleration using OpenCL
- Accelerate computationally intensive applications on a SoC-FPGA.
- Study the effects of different attributes on acceleration.
- Perform DSE using HLS.
2. An automated and faster Design Space Exploration (DSE) method
- DSE exploits the different micro-architectures obtainable from the parameters of interest.
- DSE is a multi-objective optimization problem; the search space is extremely large and compile time is constrained.
- A genetic-algorithm-based meta-heuristic is implemented to automate the DSE.

Configurable SoC-FPGA
SoC-FPGA devices integrate both processor and FPGA architectures into a single chip (on-chip integration, in contrast to off-chip integration of separate processor and FPGA devices). This yields:
- higher bandwidth for communication between processor and FPGA, and
- better performance and power efficiency.

High Level Synthesis (HLS)
What is HLS? The automatic conversion of behavioral, untimed descriptions into hardware that implements the behavior. Main steps:
1. Allocation
2. Scheduling
3. Binding
Why HLS?
- Raises the abstraction level of the design languages: less coding, less verification, fewer bugs.
- Improves design productivity and helps meet time-to-market.
- Enables design space exploration within the ASIC/FPGA design flow.

OpenCL (Open Computing Language)
OpenCL is a standard framework that allows parallel programming of heterogeneous systems. It provides:
- a language for programming the devices on the heterogeneous platform, and
- an Application Programming Interface (API) to control communication between the compute devices.
OpenCL Platform Model - hardware abstraction layers: Host -> OpenCL Device -> Compute Units -> Processing Elements.
An OpenCL application has two parts:
1. Host program: manages workload division and communication among the compute devices (GPU, CPU, DSP, FPGA).
2. Kernel program: executes the computational part of the OpenCL application.
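As a rough illustration, a minimal host-side setup might look like the sketch below (error handling omitted; the bitstream file "vector_add.aocx" and kernel name "vector_add" are placeholder names, not from this work):

    /* Minimal OpenCL host skeleton (sketch; error checks omitted). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id dev;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* FPGA flow: load a pre-compiled bitstream (.aocx) instead of
           building the kernel from source at run time as on CPUs/GPUs. */
        FILE *f = fopen("vector_add.aocx", "rb");
        fseek(f, 0, SEEK_END);
        size_t len = (size_t)ftell(f);
        rewind(f);
        unsigned char *bin = malloc(len);
        fread(bin, 1, len, f);
        fclose(f);

        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len,
                              (const unsigned char **)&bin, NULL, NULL);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        cl_kernel kern = clCreateKernel(prog, "vector_add", NULL);

        /* ... create buffers, set arguments, enqueue the kernel ... */
        return 0;
    }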

OpenCL Execution Model
Workload division: work-groups and work-items. The entire work space to be executed is decomposed into work-groups and work-items.
1. Work-item: each independent element of execution in the entire workload.
2. Work-group: a set of work-items grouped together. Each work-group is assigned to a compute unit, so the work-items of a work-group are executed by the same compute unit.
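A sketch of how a kernel sees this decomposition through the built-in ID functions (kernel and argument names are illustrative):

    __kernel void scale(__global const float *in, __global float *out, float k) {
        size_t gid = get_global_id(0);   /* position in the entire workload    */
        size_t lid = get_local_id(0);    /* position within this work-group    */
        size_t grp = get_group_id(0);    /* which work-group (-> compute unit) */
        /* lid and grp are shown only to illustrate the hierarchy */
        out[gid] = k * in[gid];
    }

For example, launching this kernel with a global size of 4096 and a work-group size of 64 produces 64 work-groups of 64 work-items each.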

OpenCL Memory Hierarchy
The memory hierarchy is structured to support data sharing, communication, and synchronization of the work-items. Moving from global to private memory, capacity decreases and latency decreases.

Memory Type    | Scope                  | Accessibility
Global memory  | Host and compute units | All work-groups and work-items
Local memory   | One compute unit       | All work-items in a work-group; not shared with other work-groups
Private memory | One processing element | Exclusive to a work-item; not shared with other work-items
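These three levels map directly onto OpenCL address-space qualifiers; a per-work-group reduction, as a sketch (names illustrative):

    __kernel void group_sum(__global const int *in,   /* global: all work-groups */
                            __global int *out,
                            __local  int *tile) {     /* local: one work-group   */
        uint lid = get_local_id(0);
        int x = in[get_global_id(0)];                 /* private: this work-item */
        tile[lid] = x;
        barrier(CLK_LOCAL_MEM_FENCE);                 /* sync within the group   */
        if (lid == 0) {
            int sum = 0;
            for (uint i = 0; i < get_local_size(0); ++i)
                sum += tile[i];
            out[get_group_id(0)] = sum;
        }
    }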

Hardware Acceleration on FPGA using OpenCL

System Description
1. System hardware: Terasic DE1-SoC board with an Altera Cyclone V SoC-FPGA (ARM host processor plus FPGA kernel fabric).
2. Software tools: Intel FPGA SDK for OpenCL, Altera Quartus II, and the Intel SoC-FPGA Embedded Development Suite (SoC EDS).

3. System memory model - host/kernel communication through read and write buffers:
- Global memory: off-chip DDR; high capacity and high latency; the host-to-kernel interface.
- Local memory: on-chip memory; higher bandwidth and lower latency than global memory.
- Private memory: registers and block RAM on the FPGA; the lowest latency, at the cost of area utilization.

4. Programming the system: kernel and host programs. Computationally intensive processing is off-loaded from the ARM processor to the FPGA.
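A sketch of this off-loading path on the host side (variable names are illustrative; ctx, queue, and kern are the objects created during host setup):

    /* Host <-> kernel data movement over global-memory buffers. */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);

    /* Write buffer: host -> FPGA global memory */
    clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, n * sizeof(float), h_in, 0, NULL, NULL);

    clSetKernelArg(kern, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &d_out);

    size_t global = n, local = 64;
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global, &local, 0, NULL, NULL);

    /* Read buffer: FPGA global memory -> host */
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n * sizeof(float), h_out, 0, NULL, NULL);
    clFinish(queue);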

Optimization of the Design for Acceleration
Timing metrics for acceleration:
- T_h : time spent in the part of the main code running on the host (ARM)
- T_f : time spent in the unaccelerated function (the part of the code that can be accelerated)
- T_c : time spent on communication between host and accelerator
- T_af : time spent in the accelerated function (ARM + FPGA)
System acceleration: A_S = T_S / T_AS, where T_S = T_h + T_f is the total unaccelerated time and T_AS = T_h + T_c + T_af is the total accelerated time.
1. Reduce T_af : increase the number of parallel operations.
- Data-level parallelism: duplicating processing units and launching them in parallel.
- Instruction-level parallelism: pipelining instructions and loop iterations.
- Task-level parallelism: running independent tasks in parallel on individual kernels.
2. Reduce T_c : reduce communication time.
- Cache frequently used data in local memory rather than global memory.
- Coalesce memory access patterns where possible.
- Use channels and pipes to communicate between kernels instead of global memory (sketched below).
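The channels technique in the last bullet is vendor-specific; a sketch using the Intel FPGA SDK channels extension (older Altera releases name the calls read_channel_altera / write_channel_altera):

    /* Two kernels connected by an on-chip FIFO, bypassing global memory. */
    #pragma OPENCL EXTENSION cl_intel_channels : enable

    channel int c0;

    __kernel void producer(__global const int *restrict in, int n) {
        for (int i = 0; i < n; ++i)
            write_channel_intel(c0, in[i]);
    }

    __kernel void consumer(__global int *restrict out, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = 2 * read_channel_intel(c0);
    }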

Kernel Architectures: Single Task Kernel & NDRange Kernel
1. Single Task Kernel
- The entire kernel executes as a single thread on a single compute unit.
- Loop pipelining: loop iterations are pipelined, and data dependencies are handled through data feedbacks built from logic cells and registers on the FPGA.
- Used when there are dependencies between work-items and no parallelism is possible, e.g. decimation, FIR:

    for (i = 1; i < n; i++)
        C[i] = C[i-1] + b[i];

- This loop pipelines with initiation interval II = 1; the dependence on C[i-1] is resolved through a register feedback seeded with C[0].
- Exec_Time = ((Num_Iterations * II) + Loop_Latency) * Clock_Period, where II is the initiation interval.
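A single work-item kernel version of this loop, as a sketch (the loop is restructured slightly so that the carried value stays in a register):

    /* Single-task kernel: one thread, pipelined loop. Keeping the C[i-1]
       feedback in a register rather than re-reading it from memory lets
       the compiler reach II = 1. */
    __kernel void running_sum(__global const int *restrict b,
                              __global int *restrict C, int n) {
        int carry = 0;
        for (int i = 0; i < n; ++i) {
            carry += b[i];
            C[i] = carry;
        }
    }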

2. NDRange Kernel
- Throughput is achieved through data-level parallelism: work-items are executed in parallel by replicating hardware units on the kernel (multiple compute units (CUs) and single-instruction-multiple-data (SIMD) units).
- Increases memory bandwidth and logic utilization.
- Used when there are few or no data or memory dependencies between the work-items, e.g. AES, matrix multiplication.
CUs vs. SIMDs:
- CUs work on different work-groups; both data paths and control paths are replicated; memory access patterns are scattered.
- SIMDs work on different work-items of the same work-group; only data paths are replicated and the control path is shared; memory accesses can be coalesced. SIMD cannot be used when work-items take different control paths.

Optimization Attributes & Pragmas
1. num_compute_units(N)
2. num_simd_work_items(N)
3. #pragma unroll N
4. max_work_group_size(N)
5. reqd_work_group_size(x, y, z)
(An annotated kernel sketch follows the table below.)

OpenCL Benchmarks
Six OpenCL applications; the kernel type was chosen based on each application. Each benchmark was run on both the unaccelerated system (ARM) and the accelerated system (ARM + FPGA) to compare performance.

Benchmark     | Kernel Type            | Pipeline Initiation Interval | Logic (%)
Sobel         | Single Task            | 2                            | 20
FIR           | Single Task            | 1                            | 20
ADPCM         | Single Task            | 40                           | 20
Decimation    | Single Task            | 1                            | 81
Interpolation | Single Task            | 1                            | 28
AES           | NDRange (CU=2, SIMD=2) | Not pipelined                | 84
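A sketch of how these attributes annotate an NDRange kernel (values illustrative; in the Intel FPGA SDK, num_simd_work_items requires reqd_work_group_size, and the work-group size must be divisible by the SIMD width):

    __attribute__((num_compute_units(2)))
    __attribute__((num_simd_work_items(2)))
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void vec_add(__global const float *restrict a,
                          __global const float *restrict b,
                          __global float *restrict c) {
        size_t gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];    /* no dependences between work-items */
    }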

Results: Acceleration
Plots of acceleration vs. data size for the six benchmarks. Acceleration tends to saturate at high data sizes: roughly 0.95x for FIR, 7.35x for Sobel, 2.8x for Decimation, Interpolation, and ADPCM, and 7x for AES.

Results: Communication Overhead
Most of the time on the accelerated system is spent on data communication between the host and the kernel. Initially, acceleration tends to increase with data size owing to growing computational complexity; it ceases to improve beyond a point because no data is immediately available for processing, due to the communication overhead and the limited communication data buffer size between host and kernel.

Observations: Acceleration Effects of Attributes
2. Plots of execution time (ms) and initiation interval (II) vs. loop unroll factor: for Sobel, execution time decreases as the unroll factor grows; for ADPCM, it increases.
In ADPCM, execution time increased with loop unrolling as a result of an increase in the initiation interval (II), which is caused by unrolling iterations that have dependencies or memory stalls.
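The unroll pragma involved here, in sketch form (array names illustrative):

    /* Unrolling independent iterations (as in Sobel) creates parallel
       hardware; unrolling iterations that share a dependence or contend
       for memory (as in ADPCM) can instead inflate the II. */
    int acc = 0;
    #pragma unroll 4
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];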

Observations: Acceleration Effects of Attributes
1. Acceleration of AES while varying the number of compute units (CU) and SIMD attributes across different data sizes. Number of work-groups = 2; number of work-items = (num_inputs)/2.

Compute Units | SIMD Units | Acceleration, 256 inputs | 512 inputs | 1024 inputs
1             | 1          | 2.5                      | 4.6        | 6.6
1             | 2          | 2.7                      | 5.4        | 6.6
1             | 4          | 2.4                      | 5.4        | 6.9
2             | 1          | 4.0                      | 4.6        | 6.9
2             | 2          | 2.6                      | 5.2        | 6.9

Each attribute carries its own trade-offs, affecting performance, memory access, bandwidth requirement, logic utilization, etc.; in particular, there is a trade-off between data-processing efficiency and bandwidth requirement as the data size varies.

Design Space Exploration: Exhaustive Search vs. Fast Heuristic

DSE by Exhaustive Search
Exhaustive-search DSE involves analyzing all possible search combinations. The Pareto-optimal set is the set of dominant solutions, for which no parameter can be improved without sacrificing at least one other parameter; area and execution time are the parameters used here.
Flow: 1. Add tunable attributes -> 2. Generate possible solutions -> 3. Compile solutions -> 4. Design space -> 5. Pareto-optimal solutions.
The main disadvantage is that the design space is typically large and grows exponentially with the number of exploration options.
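The dominance test behind the Pareto set, as a small C sketch (the struct and function names are mine):

    /* s1 dominates s2 if it is no worse in both objectives and strictly
       better in at least one (both area and time are minimized). */
    typedef struct { double area; double time; } Solution;

    int dominates(const Solution *s1, const Solution *s2) {
        int no_worse = (s1->area <= s2->area) && (s1->time <= s2->time);
        int better   = (s1->area <  s2->area) || (s1->time <  s2->time);
        return no_worse && better;
    }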

Example of a design space and its Pareto-optimal solutions: exhaustive DSE for Sobel, plotting execution time (ms) against logic utilization (%), with the exhaustive solutions shown alongside the Pareto-optimal ones.

DSE by Genetic Algorithm
1. Parent attribute selection
2. Random crossover & mutation (M)
3. Cost-function-based survival
4. Convergence criterion (N)
5. Termination criterion (P)
6. Repeat for various cost factors (a, b)
7. Find the Pareto-dominant trade-off curve

cost = a * (area / area_max) + b * (time / time_max)

The search is repeated for several cost factors (a, b) to trace the trade-off curve.
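A sketch of the cost function and the genetic operators over attribute configurations (the Config fields and the 10% mutation rate are assumptions for illustration, not taken from the slides):

    #include <stdlib.h>

    /* Genes: the HLS attributes being explored. */
    typedef struct { int unroll, cu, simd; } Config;

    /* Weighted, normalized cost; (a, b) are swept to trace the curve. */
    double cost(double area, double time, double area_max, double time_max,
                double a, double b) {
        return a * (area / area_max) + b * (time / time_max);
    }

    /* Random crossover: the child mixes attributes of the two parents. */
    Config crossover(Config p1, Config p2) {
        Config c = p1;
        if (rand() & 1) c.cu   = p2.cu;
        if (rand() & 1) c.simd = p2.simd;
        return c;
    }

    /* Mutation: occasionally perturb one gene. */
    void mutate(Config *c) {
        if (rand() % 100 < 10)
            c->unroll = 1 << (rand() % 4);   /* 1, 2, 4, or 8 */
    }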

Efficiency Metrics
Metrics to measure the quality of the solutions found by the genetic-algorithm-based heuristic:
1. Pareto Dominance (Dom): the fraction of the solutions in the Pareto set being evaluated that are also present in the reference Pareto set.
2. Average Distance from Reference Set (ADRS): the average distance between the heuristic's approximated front and the Pareto-optimal front.
3. Speedup: the speedup in compilation time to find the Pareto-dominant front, compared to the exhaustive search.
Example plot: a GA front (GA_1_5_10) against the Pareto optimum, giving Dominance = 2/4 = 0.5 and ADRS = 0.64.
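For reference, one common formulation of the first two metrics in the DSE literature (notation assumed here: P_ref is the reference Pareto front, P_H the heuristic front, with area A and time T as objectives):

    \mathrm{Dom} = \frac{|P_H \cap P_{ref}|}{|P_{ref}|}

    \mathrm{ADRS}(P_{ref}, P_H) = \frac{1}{|P_{ref}|} \sum_{r \in P_{ref}} \min_{h \in P_H} \delta(r, h),
    \quad \delta(r, h) = \max\left\{0,\ \frac{A_h - A_r}{A_r},\ \frac{T_h - T_r}{T_r}\right\}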

Comparison of Results: Exhaustive Search vs. Genetic Algorithm
System exploration trade-off curves for FIR, Sobel, Interpolation, and ADPCM (normalized area vs. normalized execution time), comparing the Pareto-optimal front of the exhaustive DSE with the Pareto-dominant front of the genetic algorithm (GA_1_5_10).

Results: Genetic Algorithm Efficiency Metrics
Summary: average Dominance = 0.7, average ADRS = 0.2, average speedup = 6x.
The genetic-algorithm heuristic can determine about 70% of the optimal dominant solutions, and its solutions lie within roughly a 20% range of the design space around the optimal solutions.

Conclusions
- We developed a set of OpenCL benchmarks to study the trends in acceleration produced by various attributes.
- A fast heuristic method to explore the design space was implemented, and its performance was analyzed and compared with the reference solution set.
- Based on the experiments, we observed an average dominance of 0.7 and an average ADRS of 0.2, at an average speedup of 6x compared to the exhaustive DSE search.
Future Work
- Experiment with a wider range of benchmarks.
- Port the benchmarks to multiple FPGA platforms.
- Apply other fast heuristic methods, such as simulated annealing or machine-learning algorithms, to the exploration of the design space.

Thank You!

Questions?