COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI] Processor Design
Lecture Objectives
- Background
- Need for accelerators
- Accelerators and different types of parallelism
- Processor architectures and different approaches to acceleration
- Requirements of applications for hardware coprocessors
- Numeric coprocessors
- Various types of reconfigurable accelerators
- Milk coprocessor
- Butter accelerator
How to improve the performance of a microprocessor system?
- Choose a faster version of your microprocessor
- Add computational units that perform special functions:
  - Standard component (graphics processor)
  - Coprocessor (floating-point processor)
  - Additional microprocessor
  - Hardware accelerator
Hardware Accelerator
- If the overall performance of a uniprocessor system is too low, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator.
- The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.
- An accelerator is NOT a coprocessor:
  - A coprocessor is connected to the CPU and executes special instructions, which are dispatched by the CPU.
  - An accelerator appears as a device on the bus.
Accelerators and different types of parallelism
- One of the key properties that can be exploited is parallelism:
  - Instruction-level parallelism
  - Loop-level parallelism
  - Task-level parallelism
  - Program-level parallelism
  - Data parallelism
Processor architectures and different approaches to acceleration
- DSP processors
- RISC microprocessors
- CISC microprocessors
- Applications and protocols change fast, so having a programmable core in the system is advisable to guarantee the general validity and flexibility of the platform.
- One way of accelerating a programmable core is to exploit the instruction and/or data parallelism of applications by providing the processor with VLIW or SIMD extensions; another is to add special functional units (MAC circuits, barrel shifters, and other components designed to speed up the execution of DSP algorithms) to the datapath of the programmable core.
- The design and verification issues related to coprocessors can be addressed independently of those related to the main processor, so the design activities can proceed in parallel, saving time.
Requirements of applications for hardware coprocessors
- Different application domains call for different kinds of accelerators.
- For example, some applications require floating-point computation (robotics, automation, Dolby Digital audio, 3D graphics), which makes including an FPU very useful and sometimes even necessary.
- A very effective and now widely accepted way of solving this problem is to make those architectures run-time reconfigurable.
- This means that the hardware is designed so that the datapath of the architecture can be changed by modifying the value of special bits, called configuration bits or configware.
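The idea of configuration bits steering a datapath can be illustrated with a minimal software sketch. The 2-bit encoding, the set of operations and the function name `datapath` are assumptions chosen for illustration only; they are not taken from any of the architectures discussed in this lecture.

```python
MASK32 = 0xFFFFFFFF  # keep results within a 32-bit datapath


def datapath(config_bits: int, a: int, b: int) -> int:
    """Return the 32-bit result of the operation selected by config_bits.

    The 2-bit encoding below is hypothetical: rewriting the configuration
    word changes what the same 'hardware' computes, with no other change.
    """
    if config_bits == 0b00:
        return (a + b) & MASK32          # adder
    if config_bits == 0b01:
        return (a * b) & MASK32          # multiplier (low 32 bits only)
    if config_bits == 0b10:
        return (a << (b & 31)) & MASK32  # left shifter
    if config_bits == 0b11:
        return a & b & MASK32            # bitwise AND
    raise ValueError("configuration word must fit in 2 bits")
```

Calling `datapath(0b00, 2, 3)` and then `datapath(0b01, 2, 3)` uses the same operands but a different configuration word, which is the software analogue of reconfiguring the datapath at run time.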
Numeric coprocessors: floating-point units
- Commonly required: floating-point arithmetic, which leads to higher complexity
- Note: the area of FPUs is usually quite large, which has often discouraged designers from including them in their systems
- There are many kinds of FPU: proprietary or open source, supporting the IEEE-754 standard or not, capable of single-precision or double-precision computation, for use with CISC or RISC machines
Numeric coprocessors: floating-point units [cont.]
- For RISC cores, one of the most important examples is the family of FPUs for ARM called VFP-9, VFP-10 and VFP-11: pipelined, powerful vector FPUs with some software-configurable functions, also supporting double precision to enhance accuracy
- MEIKO is an FPU developed at Sun; it is used with the LEON processor, an open-source RISC core developed at Gaisler Research
- The FPU from Jidan Al-Eryani is a complete coprocessor featuring hardware logic to handle denormal operands, although it does not support parallel execution of instructions
Various types of reconfigurable accelerators
Butter Co-Processor [overview]
- NxM array of reconfigurable processing elements (cells)
- Each cell features integer and floating-point arithmetic operations, shift and LUT-based operations
- Flexible interconnect schemes between the cells, providing nearest-neighbor and global communication
  - Nearest-neighbor interconnections are sufficient to implement the simplest DSP algorithms; the global ones are more useful for matrix multiplications and 3D graphics algorithms
- Dedicated input and output in addition to the system bus (or network!) interface, which is mainly used for configuration purposes
Butter Accelerator
- A coarse-grain reconfigurable accelerator
- Coarse-grained parallelism: infrequent data communication, after larger amounts of computation
- Aims at maximizing performance in the processing of multimedia, signal processing and 3D applications
- A parametric VHDL model
- Impact: however, mapping VHDL onto standard-cell technologies implies more chip area and lower clock frequencies
Butter Accelerator [cont.]
- execution of applications detecting the parts specialized hardware
- Butter is a coprocessor attached to the system bus
- Configuration bits are stored in a dedicated memory inside Butter, and can be written by the core or via DMA transfers
- Direct memory access (DMA) is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit
Butter Processing Element: Cells
- Butter is organized as a matrix of processing elements called cells
- Two input ports to read 32-bit wide operands
- A 6-bit wide input port (configuration bits)
- Reset and enable inputs are used to control the internal registers of the cell
- Two 32-bit output ports for each cell: they carry the 64-bit result of a 32-bit multiplication, or a generic 32-bit result coming from another functional unit
- Input registers inside the cells are used to sample the operands
  - They introduce pipelining
  - They can be disabled to avoid useless dynamic power consumption
- A special input register keeps constant values inside the cell, so that they can be used during processing with no need to re-route them
Butter Processing Element: Cells [cont.]
- Inside each cell there are three functional units: a multiplier, an adder and a barrel shifter
- A small memory (4 cells, 32 bits wide) used as a lookup table (LUT)
- A special functional unit (floating-point multiplications)
  - 3D graphics benefit from fast, low-precision floating-point operations
  - Results produced by the adder and the multiplier are rounded to be stored in the floating-point format
- A dedicated block inside the cells (in three portions) calculates the number of leading zeros for each of the operands, determines the sign of the result, and packs the internal number into the final format
Internal Architecture of a Cell of Butter Accelerator
- The first row of cells reads its operands from global vertical interconnections; the results are made available as outputs accessible from the rows below.
- The final result can be read from outside Butter either from the last row at the bottom of the device or from the rightmost column: results can be accessed as soon as they are produced, with no need to wait for them to pass through all the rows.
Different kinds of interconnection inside Butter
Interconnections in Butter
- The interleaved interconnection is useful, for example, to propagate the 64-bit result of multiplications by splitting their processing over two adjacent rows.
- Interleaved connections ease and enhance the mapping of some algorithms and reduce the number of cells used.
- Thanks to the interleaved connections it is possible to implement the FIR algorithm using only three rows of the array: the first row executes the multiplications, the second row the additions of the least significant bits of the products, the third row the additions of the most significant bits.
- Global interconnections connect the output of each cell to every input of the cells lying on the row below; they are useful for algorithms like matrix-matrix and matrix-vector multiplications.
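The three-row FIR mapping described above can be mimicked in software to show how the 64-bit products are split into two 32-bit halves. This is only a sketch under stated assumptions: operands are treated as unsigned 32-bit values, and the carry out of the low-half additions is fed into the high-half additions. The slides do not say how Butter handles signs or carries, and the function name `fir_three_rows` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def fir_three_rows(samples, coeffs):
    """Compute one FIR output sample in three stages, one per array row.

    Assumptions (not from the slides): unsigned 32-bit operands, and the
    carry of the low-half sum is passed on to the high-half sum.
    """
    # Row 1: 32x32-bit multiplications, each producing a 64-bit product
    products = [(x & MASK32) * (c & MASK32) for x, c in zip(samples, coeffs)]

    # Row 2: add the least significant 32 bits of every product
    low_sum = sum(p & MASK32 for p in products)

    # Row 3: add the most significant 32 bits, plus the carry from row 2
    high_sum = sum(p >> 32 for p in products) + (low_sum >> 32)

    # Recombine the two halves into a 64-bit result (wraps modulo 2**64)
    return ((high_sum & MASK32) << 32) | (low_sum & MASK32)
```

For small inputs the result is simply the dot product of the taps and the samples; for large inputs it equals that dot product modulo 2**64, which is what a 64-bit accumulator would hold.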
Butter Co-Processor Requirements
- Synthesized on FPGA: operating frequency 57 MHz
- 90 nm standard-cell technology: operating frequency 280 MHz
- Thanks to its wide datapath, high parallelism and pipelined nature, Butter can run algorithms in a very small number of clock cycles; for example, an FIR filter takes 16 cycles, a matrix-vector multiplication takes 4 cycles, and a 2D IDCT takes 54 cycles.
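The cycle counts and clock frequencies above can be combined into rough execution times using time = cycles / frequency. The sketch below assumes that the same cycle counts hold on both implementation targets and ignores configuration and data-transfer time, neither of which is stated on the slide; `run_time_ns` is a helper introduced here for illustration.

```python
def run_time_ns(cycles: int, freq_mhz: float) -> float:
    """Execution time in nanoseconds for a given cycle count and clock."""
    return cycles * 1000.0 / freq_mhz


# Figures quoted on the slide
CYCLES = {"FIR filter": 16, "matrix-vector multiplication": 4, "2D IDCT": 54}
FREQ_MHZ = {"FPGA": 57, "90 nm standard cells": 280}

for name, cycles in CYCLES.items():
    for target, freq in FREQ_MHZ.items():
        print(f"{name} on {target}: {run_time_ns(cycles, freq):.1f} ns")
```

Under these assumptions the 16-cycle FIR filter would take about 57 ns at 280 MHz and about 281 ns at 57 MHz.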
Milk Coprocessor
Design and Verification of a VHDL Model of a Floating-Point Unit for a RISC Microprocessor
Solutions to Improve Performance
- Pipelined architecture, to deliver up to one result per clock cycle
- Parallel execution of instructions
- High parallelism: different functional units can complete their operations simultaneously, and a multi-port register file allows the concurrent write-back of their results
- Fast internal bus switching
- Hardware support for handling denormal operands
- Scalability and adaptability: functional units can be inserted into or removed from the architecture easily
- Modularity of the functional units
- Hardware logic for register locking and for stalling the core
- GCC compiler support
- Parallel execution of instructions lets some fast instructions run while a heavier one is still in progress; the compiler can then significantly improve the execution of algorithms by scheduling the instructions well, reducing unused clock cycles and increasing overall computational efficiency.
- Any non-zero number which is smaller in magnitude than the smallest normal number is "denormal".
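The definition of a denormal number can be checked in software by inspecting the bit fields of a value. The sketch below assumes the IEEE 754 single-precision format (1 sign bit, 8 exponent bits, 23 fraction bits); the slides do not state which precision Milk uses, and `is_denormal_f32` is a hypothetical helper, not part of Milk or its toolchain.

```python
import struct


def is_denormal_f32(x: float) -> bool:
    """Return True if x, rounded to IEEE 754 single precision, is denormal.

    A single-precision value is denormal when its exponent field is all
    zeros and its fraction field is non-zero.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return exponent == 0 and fraction != 0
```

For example, `2.0 ** -126` is the smallest positive normal single-precision number, so it is not denormal, while `2.0 ** -127` is. Zero has an all-zero fraction and is therefore not classified as denormal either.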
Milk co-processor external interface
Interface pins:
1. wr_cop
2. rd_cop
3. c_index[1..0]
4. r_index[3..0]
5. cop_exc
6. data[31..0]
- The Coffee RISC core supports up to four coprocessors: two signals (c_index[1..0]) are used to select which coprocessor is currently being addressed.
- wr_cop and rd_cop specify the data exchange direction (input or output).
- r_index[3..0] is a 4-bit address used to address the internal registers:
  - bit r_index[3] logical high: a special register is being indexed (r_index[0] then selects between the status register and the control register)
  - bit r_index[3] logical low: one of the eight general-purpose registers is being indexed
- Signal cop_exc indicates internal exceptions:
  - concurrent writes to the coprocessor register file by the internal functional units and the processor core
  - arithmetic exceptions: overflow, underflow, inexact result, invalid operand, division by zero
  - illegal instruction code (the current instruction is not supported by the coprocessor)
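The register addressing rule described above can be written as a small decoder. This is a sketch only: the slide says that r_index[0] selects between the status and control registers but not which value selects which, so the mapping below (0 for status, 1 for control) is an assumption, as are the function name and the returned labels.

```python
def decode_r_index(r_index: int) -> str:
    """Decode the 4-bit r_index address into a register name."""
    if not 0 <= r_index <= 0b1111:
        raise ValueError("r_index is a 4-bit value")

    if r_index & 0b1000:
        # r_index[3] high: a special register is being indexed.
        # Assumption: r_index[0] == 0 -> status, r_index[0] == 1 -> control.
        return "control" if r_index & 0b0001 else "status"

    # r_index[3] low: one of the eight general-purpose registers,
    # selected by the three low-order bits.
    return f"gpr{r_index & 0b0111}"
```

With this encoding, `decode_r_index(0b0101)` names general-purpose register 5, while any address with bit 3 set falls into the special-register space.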
Milk Coprocessor Internal Architecture
MILK Co-Processor Requirements
- Requires 105 K gates, with an operating frequency of 400 MHz on a 90 nm standard-cell technology
- 20 K logic elements running at 67 MHz on an Altera Stratix FPGA
- Completes instructions in a very small number of clock cycles: 3 for multiplications, 5 for additions, 8 for square roots, 11 for divisions, 2 for conversions and 1 for all other instructions
QUESTIONS?