Higher Level Programming Abstractions for FPGAs using OpenCL

Similar documents
Altera SDK for OpenCL

FPGAs Provide Reconfigurable DSP Solutions

Cover TBD. intel Quartus prime Design software

Cover TBD. intel Quartus prime Design software

FPGAs in 2032: Challenges & Opportunities in the Next 20 Years

DSP Builder Handbook Volume 1: Introduction to DSP Builder

ASYNCHRONOUS SHADERS WHITE PAPER 0

Welcome. Altera Technology Roadshow 2013

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Modern Processor Architectures. L25: Modern Compiler Design

DSP Builder Handbook Volume 1: Introduction to DSP Builder

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

4K Format Conversion Reference Design

GPU Programming Using NVIDIA CUDA

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Nios II Performance Benchmarks

Parallel Computing. Parallel Computing. Hwansoo Han

White Paper Assessing FPGA DSP Benchmarks at 40 nm

Parallel Computing: Parallel Architectures Jin, Hai

WHY PARALLEL PROCESSING? (CE-401)

Ten Reasons to Optimize a Processor

AN 831: Intel FPGA SDK for OpenCL

Introduction to GPU computing

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Parallelism. CS6787 Lecture 8 Fall 2017

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Nios II Family of Configurable Soft-core Processors

FPGA for Software Engineers

Five Ways to Build Flexibility into Industrial Applications with FPGAs

Overview. Technology Details. D/AVE NX Preliminary Product Brief

Pactron FPGA Accelerated Computing Solutions

3-D Accelerator on Chip

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Digital Systems Design. System on a Programmable Chip

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

Take GPU Processing Power Beyond Graphics with Mali GPU Computing

! Readings! ! Room-level, on-chip! vs.!

The Future of GPU Computing

Practical Hardware Debugging: Quick Notes On How to Simulate Altera s Nios II Multiprocessor Systems Using Mentor Graphics ModelSim

Heterogeneous Computing

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Embedded Computing Platform. Architecture and Instruction Set

Copyright Khronos Group Page 1. Vulkan Overview. June 2015

Transaction-Level Modeling Definitions and Approximations. 2. Definitions of Transaction-Level Modeling

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

GPUs and Emerging Architectures

OpenCL on FPGAs - Creating custom accelerated solutions

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

NETWORK ON CHIP TO IMPLEMENT THE SYSTEM-LEVEL COMMUNICATION SIMPLIFIES THE DISTRIBUTION OF I/O DATA THROUGHOUT THE CHIP, AND IS ALWAYS

Unlocking FPGAs Using High- Level Synthesis Compiler Technologies

Simplify Software Integration for FPGA Accelerators with OPAE

«Real Time Embedded systems» Multi Masters Systems

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

Benefits of Programming Graphically in NI LabVIEW

Benefits of Programming Graphically in NI LabVIEW

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

Memory Systems IRAM. Principle of IRAM

Simplifying FPGA Design for SDR with a Network on Chip Architecture

Designing with Nios II Processor for Hardware Engineers

The S6000 Family of Processors

04 - DSP Architecture and Microarchitecture

Cadence SystemC Design and Verification. NMI FPGA Network Meeting Jan 21, 2015

Course Overview Revisited

Trends and Challenges in Multicore Programming

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

Tesla GPU Computing A Revolution in High Performance Computing

Applying the Benefits of Network on a Chip Architecture to FPGA System Design

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

PCI Express Multi-Channel DMA Interface

Intel Quartus Prime Pro Edition User Guide

Measurement of real time information using GPU

Full Linux on FPGA. Sven Gregori

Research Challenges for FPGAs

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

Chapter 18 - Multicore Computers

Embedded Systems. 7. System Components

Copyright Khronos Group Page 1

Maximum Performance. How to get it and how to avoid pitfalls. Christoph Lameter, PhD

Khronos Connects Software to Silicon

Arria 10 Transceiver PHY User Guide

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Introduction to Parallel Programming Models

Growth outside Cell Phone Applications

CS 2630 Computer Organization. What did we accomplish in 8 weeks? Brandon Myers University of Iowa

Overview. Think Silicon is a privately held company founded in 2007 by the core team of Atmel MMC IC group

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR

9. Functional Description Example Designs

Advanced CUDA Optimization 1. Introduction

AN 464: DFT/IDFT Reference Design

Transcription:

Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center

! Technology scaling favors programmability CPUs."#/0$*12'$-* 2 DSPs FPGAs!"#$%&'("#$)* +''(,-*

! Technology scaling favors programmability and parallelism CPUs."#/0$*12'$-* 3 DSPs Multi-Cores 3708%12'$-* 12('-$%&'("#$)* 159-*(#)*:.5-* Arrays 12('-$%&'("#$)* 3(--"4$0,* 5('(00$0* 5'26$--2'* +''(,-* FPGAs!"#$%&'("#$)* 3(--"4$0,* 5('(00$0* +''(,-*

Programmable and Sequential! Reaching the limit After four decades of success Program Instructions CPU 4

The End of Denial Architecture - William J. Dally [DAC 2009]! Single thread processors are in denial about parallelism and locality! They provide two illusions:! Serial execution Denies parallelism Tries to exploit parallelism with ILP limited scalability! Flat memory Denies locality Tries to provide illusion with caches very inefficient when working set doesn t fit in the cache

Programmable and Parallel! Exploit parallelism on a chip Take advantage of Moore s law Processors not getting faster, just wider Keep the power consumption down! Use more transistors for information processing Processor Memory Processor Memory Processor Memory Processor Memory Shared External Memory

FPGA : Ultimately Configurable Multicore Processor Memory! Many coarse-grained processors Different Implementation Options " Small soft scalar processor " or Larger vector processor " or Customized hardware pipeline Each with local memory Processor Processor Memory Memory! Each processor can exploit the fine grained parallelism of the FPGA to more efficiently implement it s program! Possibly heterogeneous Optimized for different tasks! Customizable to suit the needs of a particular application 7

Processor Possibilities Scalar Soft Proc + VLIW / Vector / TTA Soft Proc Data Memory Custom Pipeline NiosII C2H Load/Store Unit Load/Store Unit Integer ALU Integer ALU Float ALU load Arbiter Arbiter Integer RF Float RF Boolean RF Instruction Unit Immediate Unit Instruction Memory Program memory Data memory Data memory Traditional µprocessor Dedicated RTL Circuitry 8

Our challenges! Generally, programmers have difficulty using FPGAs as massive multi-core devices to accelerate parallel applications! Need a programming model that allows the designer to think about the FPGA as a configurable multi-core device! Today, the FPGA s programming model revolves around RTL ( VHDL / Verilog ) State machines, datapaths, arbitration, buffering, etc. 9

An ideal programming environment! Has the following characteristics: Based on a standard multicore programming model rather than something which is FPGA-specific Abstracts away the underlying details of the hardware " VHDL / Verilog are similar to assembly language programming Useful in rare circumstances where the highest possible efficiency is needed The price of abstraction is not too high " Still need to efficiently use the FPGA s resources to achieve high throughput / low area Allows for software-like compilation & debug cycles " Faster compile times " Profiling & user feedback 10

OPENCL : BRIEF INTRODUCTION 11

What is OpenCL! OpenCL is a programming model developed by the Khronos group to support silicon acceleration! An industry consortium creating open API standards! Enables software to leverage silicon acceleration! Commitment to royalty-free standards Making money from enabled products not from the standards themselves 12

OpenCL! OpenCL is a parallel language that provides us with two distinct advantages Parallelism is declared by the programmer " Data parallelism is expressed through the notion of parallel threads which are instances of computational kernels " Task parallelism is accomplished with the use of queues and events that allow us to coordinate the coarse grained control flow Data storage and movement is explicit " Hierarchical Memory model Registers Local Memory Global off-chip memory " It is up to the programmer to manage their memories and bandwidth efficiently 13

OpenCL Structure! Natural separation between the code that runs on accelerators* and the code that manages those accelerators The management or host code is pure software that can be executed on any sort of conventional microprocessor " Soft processor, Embedded hard processor, external x86 processor The kernel code is C with a minimal set of extensions that allows for the specification of parallelism and memory hierarchy " Likely only a small fraction of the total code in the application " Used only for the most computationally intensive portions * = Processor + Memory combo 14

OpenCL Host Program! Pure software written in standard C! Communicates with the Device via a set of library routines which abstract the communication between the host processor and the kernels Copy data from Host to FPGA Ask the FPGA to run a particular kernel Copy data from FPGA to Host!"#$%&' (' ''')*"+,+"-",.)/!,.#0*%'1'&2' '''!"$#$340"-*,+"-"%'1'&2' '''!"#$%&'&'()*+',&--')%'1'&2' '''!"#$%&'&'./01%15'!6,7*)$*05'1&2' ' '''!"#$%&'&'2'/3,&--')%'1'&2' '''+#830"6,)*840-,-/,48*)%'1'&2' 9' 15

OpenCL Kernels! Data-parallel function Defines many parallel threads of execution Each thread has an identifier specified by get_global_id Contains keyword extensions to specify parallelism and memory hierarchy! Executed by compute object CPU GPU -"6/+89/8:' -"6/+8978:' -"6/+89/$0;')8:' 441')$'"':/#+' 84!%445"67/"';/$8-'.0/"-'<"5' 445"67/"';/$8-'.0/"-'<=5' 445"67/"'.0/"-'<"$8>*)&' (' #$-'?#+'@'A*-,A0/="0,#+%B&2' "$8>*)C?#+D'@'"C?#+D'E'=C?#+D2' 9' 0 1 2 3 4 5 6 7 7 6 5 4 3 2 1 0 441')$'"':/#+'84!%'1'&2' 7 7 7 7 7 7 7 7 16

Altera supported system configurations External host Host CPU User program OpenCL runtime FPGA Embedded Embedded CPU User program OpenCL runtime FPGA 17

OpenCL FPGA Target DDR* PCIe Embedded Soft or Hard Processor External Memory Controller & PHY SOPC / QSYS Host Program Kernels / Datatapath Computation Processor Memory / Datatapath Computation Processor Memory / Datatapath Computation Processor Memory Interface IP 18 Application Specific External Protocols

OpenCL : FPGA Programming Model! OpenCL is a standard multi-core programming model that can be used to provide a higher-level layer of abstraction for FPGAs! Research challenges abound Need to collaborate with academics, third parties and members of the Khronos group " Require libraries, kernel compilers, debugging tools, pre-defined templates, etc. We have to consider that our competition is no longer just other FPGA vendors " A broad spectrum of programmable multi-core devices targeting different market segments 19

Thank You ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation and registered in the United States and are trademarks or registered trademarks in other countries.

OpenCL Driving Forces! Attempt at driving an industry standard parallel language across platforms CPUs, GPUs, Cell Processors, DSPs, and even FPGAs! So far driven by applications from: The consumer space " Image Processing & Video Encoding " 1080p video processing on mobile devices " Augmented reality & Computational Photography Game programming " More sophisticated rendering algorithms Scientific / High Performance Computing " Financial, Molecular Dynamics, Bioinformatics, etc. 21

Challenges! OpenCL s compute model targets an abstract machine that is not an FPGA Hierarchical array of processing elements Corresponding hierarchical memory structure! It is more difficult to target an OpenCL program to an FPGA than targeting more natural hardware platforms such as CPUs and GPUs These problems are research opportunities # 22