FPGA Co-Processing Architectures for Video Compression


Alex Soohoo
Altera Corporation
101 Innovation Drive, San Jose, CA 95054, USA
(408) 544-8063
asoohoo@altera.com

Overview

The push to roll out video and imaging equipment that supports high-definition video is creating numerous challenges for video system architects. The increased image resolution brings with it higher performance requirements for basic video data path processing and next-generation compression standards, outstripping what standalone digital signal processors (DSPs) can provide. In addition, system specifications require designers to support a range of standard and custom video interfaces and peripherals usually not supported by off-the-shelf DSPs. While it is possible to go the route of application specific integrated circuits (ASICs) or application specific standard products (ASSPs), these can be difficult and expensive alternatives that might require a compromised feature set. Furthermore, these choices can shorten the product life cycle and force yet another system redesign to meet varied and quickly changing market requirements.

Field programmable gate arrays (FPGAs) are an option that can bridge the flexibility gap in these types of designs. Additionally, with the increasing number of embedded hard multipliers and high memory bandwidth, the latest generation of FPGAs can enable customized designs for video systems while offering a manifold performance improvement over the fastest available stand-alone DSPs.

With state-of-the-art FPGA co-processor design flows, designers can now implement high-performance DSP video and image processing applications. This new generation of tools facilitates the design of a system architecture that is more scalable and powerful than traditional DSP-only designs while taking advantage of the price and performance benefits of FPGAs.

Design Flow

The emergence of these new DSP design flows has made the combined DSP processor and FPGA co-processor architecture an attractive option for video and image processing systems. What has made this possible is the FPGA co-processor flow, which merges the traditional C-language based development environments for programmable DSPs and the hardware description language (HDL) tools for FPGAs with powerful system integration capabilities (see Figure 1). Through careful system partitioning, designers can leverage a legacy code base for DSPs and offload the most computationally intensive blocks of an algorithm to an FPGA, creating systems optimized for both price/performance and time-to-market.

[Figure 1: Combined DSP Design Flow]
[Figure 2: FPGA Design Flow]
[Figures 1 and 2 block labels: DSP processor C-language integrated development environment; FPGA development flow; design entry; DSP algorithm simulation; HDL development tools; optimized IP functions; model-based design; system integration tools; RTL generation; RTL synthesis; RTL simulation; hardware programming; debug and verification.]

Software development environments for DSPs are quite mature, having been refined over many years to address the most common design bottlenecks. On the other hand, there are many options for designing and creating FPGA co-processors. The design of DSP systems with FPGAs can utilize both high-level algorithm and hardware description language (HDL) development tools, as seen in Figure 2.

The most straightforward approach is to create an entire design from scratch, writing custom DSP functions in HDL and then using standard FPGA design software. While it is possible to develop high-performance, optimized designs this way, it can be a time-consuming and labor-intensive effort.

FPGA suppliers and third-party vendors now offer highly optimized, parameterizable, off-the-shelf intellectual property (IP), typically covering the most common video and image processing functions and key video compression algorithm blocks. These IP cores, with well-defined high-speed interface wrappers, can be quickly integrated into a system design, enabling shorter design cycles and an accelerated time-to-market.

Model-based design environments such as The MathWorks Simulink allow designers to develop, simulate and verify a DSP processing data path for an FPGA co-processor. Models can be built using a mix of proprietary and off-the-shelf DSP building blocks. FPGA design software can integrate with this environment, combining its capabilities with standard HDL synthesis, simulation and customized development tools.

Finally, new system integration tools enable rapid development of custom FPGA co-processor solutions and the ability to leverage existing solutions to add new capabilities and improve system performance. By automating the integration of system components and peripherals, this design software lets users focus on system-level requirements instead of the mundane, manual task of integrating individual blocks with varying requirements. For example, creating and verifying the interface between an FPGA and a DSP can be complex. The newest system integration tools allow the designer to drop in a FIFO-based IP core and interface to an external processor without having to manage or consider the specific pin-out details. This can be critically important for a DSP software engineer with limited experience in FPGA design and hardware implementation.

Figure 3 and Figure 4 illustrate example DSP/FPGA co-processing architectures using the Texas Instruments external memory interface (EMIF) and the industry standard Serial RapidIO (SRIO) interface. These architectures can provide memory and peripheral expansion as well as the capability for increased processing performance. The latest generation of system integration tools can automatically generate a seamless bridge between the DSP and the FPGA, making it easier to implement algorithms defined at the block or component level without having to focus on the detailed device interface mapping.

[Figure 3: FPGA Co-Processing With DSP - EMIF. Labels: DSP; switch fabric; SDRAM; flash; peripheral; local.]
[Figure 4: FPGA Co-Processing With DSP - SRIO. Labels: DSP; 1x/4x SRIO, up to 12.5 Gbps; SRIO IP core; switch fabric; peripheral; local; SDRAM; flash.]

FPGA Co-Processing for High-Performance Video and Image Processing

The main justification for the FPGA co-processor design flow is the benefit of enhanced system price/performance. Properly architected designs can offload a DSP processor and execute computationally intensive blocks of a DSP algorithm in a more efficient parallel implementation on an FPGA. This is especially attractive for emerging video and image processing applications, where DSP performance requirements are growing at the fastest rates.

Consider the typical video compression (encoding/decoding) processing chains. By taking a closer look at the pre-processing and post-processing halves, it is possible to identify the types of algorithms that might be partitioned between DSP processors and FPGAs to implement a video data path. Multiply-accumulate (MAC) intensive algorithms such as color space conversion (CSC), noise reduction filtering, scaling and image mixing/blending have little or no control flow component. For that reason, the bulk of the video processing chain should be implemented completely on an FPGA. Video compression algorithms, which have a well-defined mix of control and processing operations, might be implemented in a DSP processor or split between a DSP and an FPGA, depending on the system requirements. The following examples highlight the challenges and rationale for FPGA co-processor architectures.

[Figure 5: Video Processing Chains. Pre-processing chain labels: scaler; alpha blending mixer; OSD; encoder. Post-processing chain labels: decoder; scaler; alpha blending mixer; OSD; output.]

A simple video noise reduction filtering example, seen in Figure 6, demonstrates the potential of the FPGA co-processor approach.
For video pre-processing in a high-definition encoding system, a 7x7 two-dimensional filter kernel is applied to broadcast HDTV 1080p video at 1920x1080 resolution, 30 frames per second, 24 bits per pixel. This operation requires over 9 giga multiply-accumulates per second (GMACs), more performance than the fastest commercially available DSP can offer. The same function can be implemented on a low-cost FPGA with headroom to spare.

[Figure 6: High Definition Video Encoding System. Labels: digital broadcast encoding system; FPGA pre-processing; H.264 HD encoder; network.]
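The "over 9 GMACs" figure can be checked with a back-of-the-envelope calculation. The sketch below assumes one multiply-accumulate per kernel tap and treats the 24 bits per pixel as three 8-bit color components that are filtered independently; the function name and its parameters are illustrative and do not come from any vendor tool.

```python
def filter_macs_per_second(width, height, fps, kernel_size, components):
    """Estimate MAC/s for a kernel_size x kernel_size 2D filter.

    Assumption: one MAC per kernel tap, per color component, per pixel.
    """
    pixels_per_second = width * height * fps
    taps = kernel_size * kernel_size
    return pixels_per_second * taps * components


# 1080p at 30 frames per second, 7x7 kernel, three 8-bit components (assumed)
macs = filter_macs_per_second(1920, 1080, 30, 7, 3)
print(f"{macs / 1e9:.2f} GMACs")  # prints "9.14 GMACs"
```

Under these assumptions the load comes to roughly 9.14 GMACs, which is consistent with the figure quoted above. A separable kernel would need fewer operations per pixel, but the text does not say whether the filter is separable, so the direct 49-tap form is used here.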

For video compression systems, FPGA co-processing architectures can create especially cost-effective solutions compared to platforms based on multiple DSPs. High-definition, broadcast-quality encoding utilizing the video codecs MPEG2, MPEG4 and H.264 can be implemented with a single FPGA and DSP.

[Figure 7: H.264 Encoding Co-Processing Partition]

Figure 7 shows an example FPGA co-processor partition of the H.264 encoding standard. The FPGA has absorbed the sections of the algorithm that require the most cycles on the DSP, including the motion estimation block, entropy coding and the deblocking filter. The DSP can execute the remaining parts of the algorithm, which are more control flow oriented and better mapped to a C-code implementation. Newer entropy coding techniques such as CAVLC and CABAC do not map well to a typical DSP instruction set and are best realized as hardware-accelerated blocks on the FPGA.

In the case of the latest video compression standards, the FPGA co-processor architecture provides a number of advantages. When a standard is relatively new or in flux, many system developers prefer that some degree of flexibility be built into the design. When the video compression community converges on the optimal algorithmic approach to the parts of the standard that have some room for enhancement, the hardware architecture can be preserved with only modifications to the programmable parts of the system. The motion estimation block, in particular, leaves room to incorporate a range of different techniques for motion vector search. From the equipment vendor's point of view, this flexibility allows for customization and differentiation that is not possible when the only choice is a fixed ASSP.

Conclusion

Performance requirements for video and image processing end equipment are growing in direct correlation with the new compression standards and higher resolution formats that are being adopted. FPGA co-processor system architectures, complemented by leading-edge design software, allow designers to implement these high-performance DSP algorithms in a cost-effective, efficient manner and realize significant benefits.

101 Innovation Drive San Jose, CA 95134 (408) 544-7000 http://www.altera.com Copyright 2005 Altera Corporation. All rights reserved. Altera, The Programmable Solutions Company, the stylized Altera logo, specific device designations, and all other words and logos that are identified as trademarks and/or service marks are, unless noted otherwise, the trademarks and service marks of Altera Corporation in the U.S. and other countries. All other product or service names are the property of their respective holders. Altera products are protected under numerous U.S. and foreign patents and pending applications, maskwork rights, and copyrights. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera Corporation. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.