Coarse Grain Reconfigurable Arrays are Signal Processing Engines!


Advanced Topics in Telecommunications, Algorithms and Implementation Platforms for Wireless Communications, TLT-9707
Waqar Hussain, Researcher, waqar.hussain@tut.fi
Tampere University of Technology, Finland

Electronic Products
- Multifunction devices are becoming popular, in addition to being reliable and durable.
- Example: the mobile phone
- The key selling features of a cell phone are size, weight, long battery life, audio/video streaming and the ability to run several games.
- Adaptability to many communication standards
- Expectations for real-time performance
- No limits to human desire

Embedded Technology
- Embedded technology empowers a mobile phone to carry all these features.
- An embedded system is intended for a specific use and consists of hardware capable of performing a set of different tasks with the help of software.
- Example: Embedded System = RISC + Accelerator(s)

Why Coarse Grain Reconfigurable Arrays?
Answer: Computationally Intensive Kernels (CIKs) need to be accelerated in a signal processing system.
Examples of CIKs:
1. FIR Filtering
2. Encoding and Decoding
   a) Viterbi
   b) Reed-Solomon
3. Matrix-Vector Multiplication
4. Fast Fourier Transform
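To make the notion of a computationally intensive kernel concrete, here is a minimal Python sketch of the first example, direct-form FIR filtering. The names `fir_filter` and `mac_count` are illustrative and do not come from the slides; the point is only that every output sample costs one multiply-accumulate per filter tap, which is the kind of regular, repetitive arithmetic an accelerator can take over from the RISC core.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n - k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]  # one multiply-accumulate (MAC)
        out.append(acc)
    return out


def mac_count(num_samples, num_taps):
    """Upper bound on the MACs needed to filter num_samples input samples."""
    return num_samples * num_taps


if __name__ == "__main__":
    print(fir_filter([1, 2, 3, 4], [1, 1]))
    print(mac_count(1024, 16))
```

The inner loop iterations are independent of each other apart from the final accumulation, so they could in principle be spread across several arithmetic units working in parallel.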

Why Coarse Grain Reconfigurable Arrays?
Question: So why CGRAs, and not traditional accelerators?
- It is more desirable to use devices that can accelerate multiple kernels than traditional accelerators that were designed to accelerate only a single kernel.
- Thanks to reconfigurability!

Why are CGRAs Powerful Engines?
Answer: Because of their structure!
- CGRAs offer high parallelism and throughput due to their array-based structure.
- Algorithms containing parallelism are the most suitable to be mapped onto a CGRA.
- A CGRA can process large streams of data.
- The structural unit of a CGRA is an ALU, called a Processing Element (PE).
- Each PE is connected to other PEs using point-to-point interconnections or a Network on Chip (NoC).

CGRA in an Embedded System
- An example of an embedded system is RISC + Accelerator(s).
- RISC = COFFEE
- Accelerator = BUTTER
- Both COFFEE and BUTTER were designed at the Department of Computer Systems, Tampere University of Technology, Finland.

BUTTER
- A general-purpose Coarse Grain Reconfigurable Array (CGRA), which is a matrix of processing elements (PEs).
- Each PE is capable of performing a set of different tasks and is connected to the other PEs using point-to-point interconnections.
- BUTTER was capable of processing many computationally intensive kernels.

Problems with BUTTER!
- BUTTER's presence in the system is expensive if it is not used most of the time.
- BUTTER occupies a large number of hardware resources.
- A general-purpose CGRA requires a few million FPGA gates.

Solution: CREMA
A parameterized general-purpose CGRA to generate special-purpose accelerators.

Category of Interconnections

Processing Elements in CREMA
Processing Element Template:
- Two operand registers
- Decoder for operation selection
- Supports integer and floating-point operations
- LUT for logical operations
- Blocks with a dashed border are scalable and selectable for instantiation

CREMA-based System
- COFFEE for general-purpose processing
- CREMA-generated accelerator for CIKs
- Network of switched interconnections for faster data transfer between modules
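The processing element template described above can be pictured with a small behavioural model. The following Python sketch is an illustration only: the class name `ProcessingElement`, the opcode strings and the operation table are assumptions made for this example, since the slides do not give the actual CREMA opcode set or interfaces. It shows two operand registers, a decoder that selects the operation, and the idea that only the operations chosen at design time are instantiated.

```python
import operator

# Hypothetical operation table; the real CREMA operation set is not listed in the slides.
OPERATIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "and": operator.and_,
    "or": operator.or_,
}


class ProcessingElement:
    """Behavioural model of a PE with two operand registers and an opcode decoder."""

    def __init__(self, enabled_ops):
        unknown = set(enabled_ops) - set(OPERATIONS)
        if unknown:
            raise ValueError("unknown operations: {}".format(sorted(unknown)))
        # Only the selected blocks are "instantiated" in this PE.
        self.ops = {name: OPERATIONS[name] for name in enabled_ops}
        self.reg_a = 0
        self.reg_b = 0

    def load(self, a, b):
        """Write the two operand registers."""
        self.reg_a, self.reg_b = a, b

    def execute(self, opcode):
        """Decode the opcode and apply the selected operation to the operands."""
        if opcode not in self.ops:
            raise ValueError("operation not instantiated: {}".format(opcode))
        return self.ops[opcode](self.reg_a, self.reg_b)


if __name__ == "__main__":
    pe = ProcessingElement(["add", "mul"])
    pe.load(3, 4)
    print(pe.execute("add"), pe.execute("mul"))
```

A PE built with only `add` and `mul` rejects a `sub` request, which mirrors the trade-off of a template-based design: fewer resources per PE in exchange for a narrower set of supported kernels.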

CGRAs to be made Scalable

Scalability in Software
- Fixed hardware can be used to process an algorithm of variable length.
- For example, a single FFT butterfly can be used to process 4, 8, 16, 64, 128, 256 and higher-point FFTs.
- In this case the hardware (the FFT butterfly) is fixed, but the software can be scaled as required to process FFTs of different lengths.
- Another example is matrix-vector multiplication: the arithmetic resources required by a 4th-order matrix-vector multiplication can be used to process higher-order matrix-vector multiplications.
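Both examples of software scalability can be sketched in a few lines of Python. The code below is a software analogy under stated assumptions, not the CREMA mapping: it uses a radix-2 butterfly (the slides do not say which radix the single butterfly has), and all function names are invented for this illustration. A single `butterfly` routine is reused for any power-of-two FFT length, and a fixed 4x4 matrix-vector routine is reused tile by tile for larger matrices.

```python
import cmath


def butterfly(a, b, w):
    """One radix-2 butterfly: the only fixed arithmetic unit in this FFT sketch."""
    t = w * b
    return a + t, a - t


def fft(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    data = [0j] * n
    for i in range(n):  # bit-reversal permutation of the input
        rev = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        data[rev] = complex(x[i])
    size = 2
    while size <= n:  # log2(n) stages, each reusing the same butterfly
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                i, j = start + k, start + k + half
                data[i], data[j] = butterfly(data[i], data[j], w)
        size *= 2
    return data


def butterfly_count(n):
    """Number of butterfly invocations for an n-point radix-2 FFT."""
    return (n // 2) * (n.bit_length() - 1)


def matvec4(block, vec4):
    """Fixed 4x4 matrix-vector product: stands in for the fixed arithmetic resources."""
    return [sum(block[r][c] * vec4[c] for c in range(4)) for r in range(4)]


def matvec_tiled(matrix, vec):
    """Multiply an n x n matrix (n a multiple of 4) by reusing matvec4 on 4x4 tiles."""
    n = len(vec)
    if n % 4 or len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("expected an n x n matrix with n a multiple of 4")
    out = [0] * n
    for r0 in range(0, n, 4):
        for c0 in range(0, n, 4):
            block = [row[c0:c0 + 4] for row in matrix[r0:r0 + 4]]
            partial = matvec4(block, vec[c0:c0 + 4])
            for i in range(4):
                out[r0 + i] += partial[i]  # accumulate partial results per row
    return out
```

In both cases only the loop bounds change with the problem size. The cost is time: the number of passes through the fixed unit grows with the length of the FFT or the order of the matrix.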

Scalability in CGRA

Why Scale the Hardware?
An example: Wireless LAN

How to Scale the Hardware?
- The resources required by a set of applications can indicate how to scale the hardware.
- In short, the nature of the applications has to drive the dimensioning of the hardware.
- For a small set of applications this might be easy, but for a large set of applications it might be difficult.
- A method needs to be defined.

Case Study
Applications driving the dimensioning:
- Matrix-Vector Multiplication
- Radix-4 FFT Processing
Target platform under dimensioning:
- CREMA, a Coarse-Grain Reconfigurable Array consisting of 4x8 processing elements
Scaling order:
1. Matrix-Vector Multiplication: from a 4x8 to a 6x8 and a 4x16 PE CGRA
2. Radix-4 FFT Processing: from a 4x8 to a 9x8 and a 4x16 PE CGRA
Scaling influence on design strategies:
- Rapid prototyping and system integration
- Global optimum implementation for area and speed

Applications Mapped on CREMA and BUTTER
- Integer and floating-point matrix-vector multiplication
  - Execution time compared with RISC and DSP
- 2D low-pass image filtering based on an averaging window
- FFT
  - Satisfied the execution time constraints for SISO and MIMO OFDM applications
  - Resource utilization and execution time were compared with the state of the art
- W-CDMA cell search
  - Execution time compared with a RISC core
In all of the above applications, CREMA, as a template-based device, required fewer resources for its generated accelerator than BUTTER.

Thank You
Questions?