COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design

Lecture Objectives
Background; need for accelerators; accelerators and different types of parallelism; processor architectures and different approaches to acceleration; requirements of applications for hardware coprocessors; numeric coprocessors; various types of reconfigurable accelerators; the Milk coprocessor; the Butter accelerator.

How to improve the performance of a microprocessor system?
Choose a faster version of your microprocessor.
Add additional computational units that perform special functions: a standard component (graphics processor), a coprocessor (floating-point processor), an additional microprocessor, or a hardware accelerator.

Hardware Accelerator
If the overall performance of a uniprocessor system is too low, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator: a component that works together with the processor and executes key functions much faster than the processor.
An accelerator is NOT a coprocessor. A coprocessor is connected to the CPU and executes special instructions, which are dispatched by the CPU. An accelerator appears as a device on the bus.

Accelerators and different types of parallelism
One of the key properties that can be exploited is parallelism: instruction-level parallelism, loop-level parallelism, task-level parallelism, program-level parallelism, and data parallelism.

Processor architectures and different approaches to acceleration
DSP processors, RISC microprocessors, CISC microprocessors.
Applications and protocols change fast, so having a programmable core in the system is advisable to guarantee the general validity and flexibility of the platform.
One possible way of accelerating a programmable core is to exploit the instruction and/or data parallelism of applications by providing the processor with VLIW or SIMD extensions. Another way consists of adding special functional units (MAC circuits, barrel shifters, and other special components designed to speed up the execution of DSP algorithms) to the datapath of the programmable core.
The design and verification issues related to coprocessors can be addressed independently of those related to the main processor: this makes it possible to parallelize the design activities and save time.

Requirements of applications for hardware coprocessors
Different application domains call for different kinds of accelerators. For example, some applications require floating-point computation (robotics, automation, Dolby Digital audio, 3D graphics), which makes the insertion of an FPU very useful and sometimes even necessary.
A very effective way of solving this problem, widely accepted nowadays, is to make those architectures run-time reconfigurable. This means that the hardware is designed so that the datapath of the architecture can be changed by modifying the value of special bits, named configuration bits or configware.
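The idea of changing a datapath through configuration bits can be sketched in software. The following is a minimal illustration only: the 2-bit encoding, the function name `datapath` and the choice of operations are assumptions made for this example and are not taken from the lecture.

```python
# Hypothetical 2-bit configuration encoding (not defined in the slides).
MASK32 = 0xFFFFFFFF


def datapath(config_bits: int, a: int, b: int) -> int:
    """Return the 32-bit result of the operation selected by config_bits."""
    op = config_bits & 0b11
    if op == 0b00:
        return (a + b) & MASK32            # adder selected
    if op == 0b01:
        return (a * b) & MASK32            # multiplier selected (low 32 bits)
    if op == 0b10:
        return (a << (b & 31)) & MASK32    # shifter selected (left shift)
    return a & MASK32                      # pass-through


# The same "hardware" computes different functions as the configware changes.
for config in (0b00, 0b01, 0b10, 0b11):
    print(f"config={config:02b} -> {datapath(config, 3, 4)}")
```

Writing a new value into the configuration bits changes what the unit computes without changing the unit itself, which is the property that run-time reconfiguration relies on.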

Numeric coprocessors: floating-point units
Floating-point arithmetic is commonly required, but it leads to higher complexity. The area of FPUs is usually quite large, which has often discouraged designers from including them in their systems.
There are different types of existing FPUs: proprietary or open source, supporting the IEEE-754 standard or not, capable of single-precision or double-precision computation, and intended for use with CISC or RISC machines.

Numeric coprocessors: floating-point units [cont.]
For RISC cores, one of the most important examples is given by the FPUs for ARM, called VFP-9, VFP-10 and VFP-11: pipelined, powerful vector FPUs with some software-configurable functions, which also support double precision to enhance accuracy in calculation.
MEIKO is an FPU developed at Sun; it is used with the LEON processor, an open-source RISC core developed at Gaisler Research.
The FPU from Jidan Al-Eryani is a complete coprocessor that features hardware logic to handle denormal operands, even though it does not support parallel execution of instructions.

Various types of reconfigurable accelerators

Butter Co-Processor [overview]
An N x M array of reconfigurable processing elements (cells).
Each cell features integer and floating-point arithmetic operations, shift operations and LUT-based operations.
Flexible interconnect schemes between the cells provide nearest-neighbor and global communication. Nearest-neighbor interconnections are sufficient to implement the simplest DSP algorithms; the global ones are more useful for matrix multiplications and 3D graphics algorithms.
Dedicated input and output ports, in addition to the system bus (or network) interface, which is mainly used for configuration purposes.

Butter Accelerator
Butter is a coarse-grain reconfigurable accelerator.
Coarse-grained parallelism: infrequent data communication, after larger amounts of computation.
Goal: maximizing performance in the elaboration of multimedia, signal processing and 3D applications.
Implemented as a parametric VHDL model.
Impact: however, the mapping of VHDL onto standard-cell technologies implies more on-chip area and lower clock frequencies.

Butter Accelerator [cont.]
Execution of applications involves detecting the parts to be run on specialized hardware.
Butter is a coprocessor attached to the system bus.
Configuration bits are stored in a dedicated memory inside Butter, and can be written by the core or via DMA transfers.
Direct memory access (DMA) is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit.

Butter Processing Element: Cells
Butter is organized as a matrix of processing elements called cells.
Each cell has two input ports to read 32-bit wide operands and a 6-bit wide input port for the configuration bits. Reset and enable inputs are used to control the internal registers of the cell.
Each cell has two 32-bit output ports, which carry either the 64-bit result of a 32-bit multiplication or a generic 32-bit result coming from another functional unit.
Input registers inside the cells are used to sample the operands. They introduce the pipelining and can be disabled to avoid useless dynamic power consumption.
A special input register is used to keep constant values inside the cell, so that they can be used during the elaboration with no need to re-route them.

Butter Processing Element: Cells [cont.]
Inside each cell there are three functional units (a multiplier, an adder and a barrel shifter) and a small memory (4 entries, each 32 bits wide) used as a lookup table (LUT).
A special functional unit supports floating-point multiplications: 3D graphics benefit from fast, low-precision floating-point operations.
A dedicated block inside the cells takes the results produced by the adder and the multiplier and rounds them to be stored in the floating-point format. It has three portions: one calculates the number of leading zeros for each of the operands, one calculates the sign of the result, and one packs the internal number into the final format.

Internal Architecture of a Cell of Butter Accelerator
The first row of cells reads its operands from the global vertical interconnections, and the results of its elaboration are made available as outputs to the underlying rows. The final result can be read from outside Butter either from the last row at the bottom of the device or from the rightmost column: results can be accessed as soon as they are produced, with no need to wait for them to pass through all the rows.

Different kinds of interconnection inside Butter

Interconnections in Butter
Interleaved interconnections are useful, for example, to propagate the 64-bit result of multiplications by splitting their processing over two adjacent rows. They ease and enhance the mapping of some algorithms and reduce the number of cells used. Thanks to the interleaved connections it is possible to implement the FIR algorithm using only three rows of the array: the first row executes the multiplications, the second row adds the least significant bits of the products, and the third row adds the most significant bits.
Global interconnections connect the output of each cell to every input of the cells lying on the row below. They are useful for algorithms like matrix-matrix and matrix-vector multiplications.
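The three-row FIR mapping described above can be mimicked in software to show how a 64-bit sum of products is built from 32-bit halves. This is a sketch under stated assumptions: operands are treated as unsigned 32-bit values, and the carry out of the low-half additions is forwarded to the high-half row. The slides do not say how Butter handles that carry, and the function name `fir_three_rows` is illustrative.

```python
MASK32 = 0xFFFFFFFF


def fir_three_rows(samples, coeffs):
    """Compute sum(samples[i] * coeffs[i]) modulo 2**64 in three 'rows'."""
    # Row 1: 32 x 32-bit multiplications, each producing a 64-bit product
    # that is split into a low and a high 32-bit half.
    products = [(x & MASK32) * (c & MASK32) for x, c in zip(samples, coeffs)]
    lows = [p & MASK32 for p in products]
    highs = [p >> 32 for p in products]

    # Row 2: add the least significant halves of the products.
    low_sum = sum(lows)
    carry = low_sum >> 32  # assumption: carry is passed on to row 3

    # Row 3: add the most significant halves, plus the carry from row 2.
    high_sum = (sum(highs) + carry) & MASK32

    return (high_sum << 32) | (low_sum & MASK32)


print(fir_three_rows([1, 2, 3], [4, 5, 6]))
```

Splitting each product into halves is what lets every row work with 32-bit values, at the cost of one extra row for the upper half.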

Butter Co-Processor Requirements
Butter was synthesized on an FPGA, reaching an operating frequency of 57 MHz, and on a 90 nm standard-cell technology, reaching an operating frequency of 280 MHz.
Thanks to its wide datapath, high parallelism and pipelined nature, Butter can run algorithms in a very small number of clock cycles: for example, an FIR filter takes 16 cycles, a matrix-vector multiplication takes 4 cycles, and a 2D IDCT takes 54 cycles.
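The cycle counts and clock frequencies quoted above can be combined into execution times with time = cycles / frequency. The short script below does this arithmetic; the dictionary and function names are illustrative, and the result ignores any time spent configuring the array or moving data, which the slides do not quantify.

```python
# Figures quoted in the slide; names are chosen for this example only.
CYCLES = {"FIR filter": 16, "matrix-vector multiplication": 4, "2D IDCT": 54}
FREQ_MHZ = {"FPGA": 57, "90 nm standard cells": 280}


def exec_time_ns(cycles: int, freq_mhz: float) -> float:
    """Execution time in nanoseconds for a cycle count at a clock in MHz."""
    return cycles * 1000.0 / freq_mhz


for target, freq in FREQ_MHZ.items():
    for algorithm, cycles in CYCLES.items():
        print(f"{algorithm} on {target}: {exec_time_ns(cycles, freq):.1f} ns")
```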

Milk Coprocessor Design And Verification of a VHDL Model of a Floating-Point Unit for a RISC Microprocessor

Solutions to Improve Performance
Pipelined architecture, to deliver up to one result per clock cycle.
High parallelism: parallel elaboration of instructions. Different functional units can commit their results simultaneously, and a multi-port register file allows the concurrent write-back of their results.
Fast internal bus switching.
Hardware support for handling denormal operands: any non-zero number which is smaller in magnitude than the smallest normal number is denormal.
Scalability and adaptability: functional units can be inserted into or removed from the architecture straightforwardly, thanks to the modularity of the functional units.
Hardware logic for register locking and for stalling the core.
GCC compiler support. Parallel elaboration of instructions is arranged so that some fast instructions can run while a heavier one is still in progress; the compiler can then significantly improve the execution of algorithms by scheduling the instructions well, reducing unused clock cycles and increasing overall computation efficiency.
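The definition of a denormal operand given above can be made concrete on a bit pattern. The sketch below assumes the IEEE-754 single-precision encoding (1 sign bit, 8 exponent bits, 23 fraction bits), in which a denormal has an all-zero exponent field and a non-zero fraction. The helper names are illustrative and say nothing about how Milk detects denormals in hardware.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def is_denormal_f32(bits: int) -> bool:
    """True if the 32-bit pattern encodes a denormal (subnormal) number."""
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    fraction = bits & 0x7FFFFF       # 23-bit fraction field
    return exponent == 0 and fraction != 0


print(is_denormal_f32(f32_bits(1e-40)))   # below the smallest normal number
print(is_denormal_f32(0x00800000))        # smallest positive normal number
print(is_denormal_f32(0x00000000))        # zero is not denormal
```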

Milk co-processor external interface
The Coffee RISC core supports up to four coprocessors.
Interfacing pins:
1. wr_cop
2. rd_cop
3. c_index[1..0]
4. r_index[3..0]
5. cop_exc
6. data[31..0]
Two signals (c_index[1..0]) are used to select which coprocessor is currently being addressed.
wr_cop and rd_cop specify the data exchange direction (input or output).
r_index is a 4-bit address used for addressing the internal registers. When bit r_index[3] is logical high, a special register is being indexed (r_index[0] then selects between the status register and the control register). When bit r_index[3] is logical low, one of the eight general-purpose registers is being indexed.
Signal cop_exc indicates internal exceptions:
concurrent writes on the coprocessor register file by the internal functional units and the processor core;
arithmetical exceptions: overflow, underflow, inexact result, invalid operand, division by zero;
illegal instruction code (the current instruction is not supported by the coprocessor).
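The register addressing rule described above can be written as a small decoder. This is a sketch only: the slide does not state which value of r_index[0] selects the status register and which the control register, so the mapping used here (0 for status, 1 for control) is an assumption, as is the function name `decode_r_index`.

```python
def decode_r_index(r_index: int) -> str:
    """Name the Milk register addressed by a 4-bit r_index value."""
    if not 0 <= r_index <= 0xF:
        raise ValueError("r_index must be a 4-bit value")
    if r_index & 0b1000:
        # r_index[3] high: a special register is indexed.
        # Assumption: r_index[0] = 0 selects status, 1 selects control.
        return "control register" if r_index & 0b0001 else "status register"
    # r_index[3] low: the lower three bits pick one of eight registers.
    return f"general-purpose register {r_index & 0b0111}"


for value in (0b0000, 0b0101, 0b1000, 0b1001):
    print(f"{value:04b} -> {decode_r_index(value)}")
```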

Milk Coprocessor Internal Architecture

Milk Co-Processor Requirements
Milk requires 105 K gates and reaches an operating frequency of 400 MHz on a 90 nm standard-cell technology; on an Altera Stratix FPGA it takes 20 K logic elements and runs at 67 MHz.
It is capable of completing instructions in a very small number of clock cycles: 3 for multiplications, 5 for additions, 8 for square roots, 11 for divisions, 2 for conversions, and 1 for all other instructions.
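As a rough worked example of the latencies listed above, the cycle count of a short instruction sequence can be estimated by summing them. The sketch assumes that instructions execute strictly one after another with no overlap, which is pessimistic given that Milk can run fast instructions while a heavier one is in progress. The instruction sequence (for computing sqrt(x*x + y*y)) and all names are illustrative.

```python
# Latencies in clock cycles, as quoted in the slide.
LATENCY_CYCLES = {
    "mul": 3, "add": 5, "sqrt": 8, "div": 11, "conv": 2, "other": 1,
}


def sequential_cycles(instructions) -> int:
    """Sum of latencies, assuming no overlap between instructions."""
    return sum(LATENCY_CYCLES[op] for op in instructions)


# Hypothetical sequence for sqrt(x*x + y*y): two mul, one add, one sqrt.
norm_sequence = ["mul", "mul", "add", "sqrt"]
print(sequential_cycles(norm_sequence))  # prints 19
```

The figure is an upper bound under this no-overlap assumption; with the two multiplications overlapped or pipelined, the real count would be lower.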

QUESTIONS?