Developing and Integrating FPGA Co-processors with the TI c6x Family of DSP Processors

Paul Ekas, DSP Engineering, Altera Corp.
pekas@altera.com, Tel: (408) 544-8388, Fax: (408) 544-6424
Altera Corp., 101 Innovation Dr., San Jose, Calif. 95134

Overview

Across a wide spectrum of applications, the growth in signal processing algorithm complexity is exceeding the processing capabilities of stand-alone digital signal processors. In some of these applications, software developers have used hardware co-processors to off-load a variety of algorithms including Viterbi decoding, Turbo encoding/decoding, butterfly processing, discrete cosine transforms (DCT), and 1D and 2D filters. In a few cases, DSP processors include on-chip hardware co-processors where the end application supports the expense of designing a market-specific solution. In 3rd generation wireless systems, the addition of the Turbo forward error correction algorithm had a huge impact on the amount of processing required per user data channel in a channel element card. Texas Instruments successfully utilized co-processors for Turbo and Viterbi processing to extend their leadership position in 3rd generation wireless infrastructure equipment.

Unfortunately, the high cost of implementation means that DSPs with end-market-specific co-processors are unavailable for most applications. For applications where no co-processors are available, Altera has developed design tools and methodologies that enable companies to develop their own co-processors using Altera's Stratix and Cyclone devices. These devices interface easily with a wide range of DSPs and general-purpose processors (GPPs), providing increased system performance and lower system costs.
This paper discusses the technical development and integration of FPGA co-processors, including:

- Profiling applications to identify high-load software algorithms suitable for off-loading to co-processors
- Development of custom co-processor blocks
- Viable co-processor system architectures
- Processor interface selection
- Hardware and software system integration
- FPGA co-processor development systems
- Cost and performance improvement attainable with FPGA co-processors

A design example that implements an FPGA co-processor for a TI DSP, increasing the performance and lowering the cost of an example modem system, is used throughout this article to highlight the methodology and application of FPGA co-processors. This article assumes the target system is initially implemented in software with no foresight into an optimal hardware/software partitioning.

CF-031605-1.0

Identifying Software that can be Off-loaded to a Co-processor

Often in DSP applications, 80% of the MIPS required are consumed by 20% of the program code. This 20% of the program code often requires time-consuming, error-prone, and difficult-to-maintain assembly coding to increase overall system performance. This code also becomes far less portable than the remaining 80% of the code, which is focused on initialization and system execution control. At the same time, that other 80% of the code reflects the majority of the system complexity. This creates a double challenge for DSP software engineers: reducing the processing load in 20% of the software and managing the complexity of the remaining 80% of the code. FPGA co-processing is well suited to addressing the 80% processing load caused by 20% of the algorithm code. The challenge is to identify what should be off-loaded from the DSP to a co-processor.

The key to identifying what should be off-loaded from a DSP to a co-processor is the set of profiling tools used by the software developer. Profiling tools analyze the program code and identify the percentage of processing consumed by each function and sub-routine. Most DSP software development systems include tools to profile the program code and identify which functions consume the majority of the processing MIPS. With code profiling, the functions that consume the majority of the MIPS can be identified, and the decision to accelerate them with a hardware co-processor can be made.

Not all functions are appropriate to off-load to a co-processor. First, the goal is to identify a group of algorithms which together occupy more than half of the processing load. Second, the identified group of algorithms should be clustered together so that once data has been sent to the co-processor, there is no processor dependency in the calculation until the processing is complete and the result can be returned to the DSP.
A third criterion is that the processing is straightforward to implement in hardware. The simplest way to describe this criterion is that the algorithm is heavily looped, implying a very repetitive computational structure.

The example system described in this article relies on a TI processor, although the principles apply to all DSP processors. The TI development tools are encapsulated in a product called Code Composer Studio (CCS). CCS includes a debugger, compiler, linker, assembler, code profiler, and other assorted capabilities that enable software developers to fully describe and develop their TI DSP program code in one environment. TI development systems can be purchased that include a TI development board, CCS, and application code examples.

The example system discussed in this article utilizes one of the application examples, modem.c, that comes with the TI development kits, specifically the TI c6x series of development systems. Modem.c implements a QAM modem entirely in software. When modem.c is compiled and executed on the TI c6711 development system, it takes 177,000 instruction cycles to execute.

Next, CCS was used to profile the modem.c example to identify what could be off-loaded to an FPGA co-processor. The analysis identified that the majority of the processing was required by the modem transmitter algorithm (modem tx), which consumed 96.5% of the processing MIPS. The modem tx is also very suitable for off-loading to a single FPGA co-processor that implements the modem tx dataflow. The modem tx consists of a shaping filter (82% of MIPS), modulation (8% of MIPS), a sine lookup (2.5% of MIPS), and a cosine lookup (3.5% of MIPS).

Figure 1: TI Modem.c Structure and Code Profile Results

FPGA Co-processor Block Development

Co-processors as defined by Altera include at least a data interface and a control interface. The control interface is used by the CPU to set up and monitor the operation of the co-processor. The data interfaces can communicate with memories, peripherals, or other co-processors, both as sources and sinks of data. To maximize system performance, each data interface is defined to include an integrated direct memory access (DMA) controller. These DMA controllers are programmed by the CPU through the control interface of the co-processor. In general, the operation of a co-processor is set up by the CPU and is then executed autonomously by the co-processor itself.

Many powerful capabilities are inherent in this architecture that yield extremely high performance systems. The first of these is that the co-processors can be set up to automatically source and sink data without dynamic interaction with the controlling CPU. This capability is enabled by flexibility in how the DMAs can be programmed, along with architectural selections made as part of the FPGA co-processing system definition. The DMAs can be controlled by a linked list of source or destination addresses, which enables the co-processors to execute continuously without CPU interaction. These source and destination locations can be memories that the CPU or other co-processors source or sink data to. They can also be peripherals such as UARTs, A/Ds, or D/As.

The overall architectural flexibility of FPGA co-processors enables a system definition that can be relatively tightly coupled to the master CPU, or a loosely coupled data processing plane that has only minimal setup and status interaction with the master CPU. This wide variation in capabilities makes FPGA co-processors suitable for systems with a wide range of performance and flexibility requirements.

There are several mechanisms available to build co-processors. The most powerful tool for building them is Altera's DSP Builder. DSP Builder is an add-on tool to the MathWorks MATLAB and Simulink toolset. It provides an integrated design environment for dataflow system design, verification, and implementation for Altera FPGAs, enabling designers to assemble parameterized building blocks into complex dataflow processing systems. The building blocks of DSP Builder include modular RTL building blocks and optional parameterized complex IP building blocks. One of the features of DSP Builder is the ability to package these dataflow systems into co-processing blocks. This enables the development of simple or complex co-processors implementing standard-specific or proprietary algorithm processing.
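
The linked-list DMA control described in this section can be pictured with a small software model. The sketch below is only an illustration: the `dma_desc` layout and the `dma_run_chain` function are hypothetical names invented here, not Altera's register format or API. It shows how a chain of source/destination descriptors lets transfers proceed one after another without the CPU stepping in between them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor layout, for illustration only. */
typedef struct dma_desc {
    const uint32_t  *src;    /* where this transfer reads from        */
    uint32_t        *dst;    /* where this transfer writes to         */
    size_t           words;  /* number of 32-bit words to move        */
    struct dma_desc *next;   /* next descriptor, or NULL to stop      */
} dma_desc;

/*
 * Software model of a DMA engine walking a descriptor chain.
 * In hardware this walk would run without CPU involvement; here it is
 * an ordinary loop so the behaviour can be inspected on any host.
 * The chain must be NULL-terminated in this model.
 * Returns the total number of words moved.
 */
size_t dma_run_chain(const dma_desc *d)
{
    size_t moved = 0;
    for (; d != NULL; d = d->next) {
        assert(d->src != NULL && d->dst != NULL);
        for (size_t i = 0; i < d->words; i++)
            d->dst[i] = d->src[i];
        moved += d->words;
    }
    return moved;
}
```

In this model the CPU's only job is to build the chain once, for example one descriptor per input buffer, and hand the head pointer to the engine. A real controller would typically also raise a status flag or interrupt at the end of the chain.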
The parameterized complex IP building blocks in DSP Builder are Altera's MegaCore components, which include finite impulse response (FIR) and infinite impulse response (IIR) filters, fast Fourier transforms (FFTs), forward error correction (FEC) cores, numerically controlled oscillators (NCOs), and other components. These parameterized IP blocks can be configured first algorithmically, then architecturally. The algorithmic configuration sets the type of filter, the coefficients, the number of coefficient and data bits, and many other algorithm-oriented parameters. The architectural configuration controls the implementation architecture to meet throughput and resource mapping constraints.

In many cases, a MegaCore may provide the entire functionality required of a co-processor. In these cases, the MegaCore can directly implement the co-processor without requiring DSP Builder. The co-processing block identified in the modem.c example requires the integration of a FIR filter, a modulator, and two look-up tables. In this case, DSP Builder has been used to assemble the design from the base library of DSP Builder and the FIR MegaCore.
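
To make the dataflow of such a block concrete, the following C sketch models a FIR shaping filter feeding a quadrature modulator driven by sine and cosine look-up tables. It is not TI's modem.c and not the DSP Builder design. The function names, the integer data types, and the four-entry tables (a carrier at one quarter of the sample rate) are assumptions chosen to keep the example small and exact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 4-entry tables: one carrier period every four samples. */
#define LUT_LEN 4
static const int16_t cos_lut[LUT_LEN] = { 1, 0, -1,  0 };
static const int16_t sin_lut[LUT_LEN] = { 0, 1,  0, -1 };

/* One FIR output: y[n] = sum over k of h[k] * x[n-k].
 * Samples before x[0] are treated as zero. */
static int32_t fir_sample(const int16_t *h, size_t taps,
                          const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t k = 0; k < taps && k <= n; k++)
        acc += (int32_t)h[k] * x[n - k];
    return acc;
}

/* Shaping filter on I and Q, then quadrature modulation:
 * out[n] = I_f[n] * cos[n] - Q_f[n] * sin[n]. */
void modem_tx_model(const int16_t *i_in, const int16_t *q_in, size_t len,
                    const int16_t *h, size_t taps, int32_t *out)
{
    assert(taps > 0);
    for (size_t n = 0; n < len; n++) {
        int32_t i_f = fir_sample(h, taps, i_in, n);
        int32_t q_f = fir_sample(h, taps, q_in, n);
        out[n] = i_f * cos_lut[n % LUT_LEN] - q_f * sin_lut[n % LUT_LEN];
    }
}
```

Every output sample repeats the same multiply-accumulate loop and table look-up, which is the heavily looped structure that maps well onto a hardware pipeline.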

Figure 2: Modem Co-Processor Captured in DSP Builder

Processor Interface Selection

When an FPGA co-processor is connected to a separate DSP or GPP, there must be an interface between the DSP and the FPGA co-processing sub-system. This interface depends on the interface specifications of the target processor. Most processors support a variety of standard and proprietary interfaces. The standard interfaces, today and in the future, include PCI (and its permutations), RapidIO, HyperTransport, and others. There are also many proprietary interfaces, including EMIF (TI), MPX (Motorola), Link-Port (ADI), and others. For any processor that links to an FPGA co-processing system, an FPGA interface IP block must be available or developed to support that bus interface.

The interface selection between the processor and the FPGA is driven by the application characteristics as well as the available interfaces on the processor. For example, the TI c6x DSPs support several different interfaces. The available interfaces include the 16/32/64-bit external memory interface (EMIF), the 16/32-bit host-port interface (HPI), the 32-bit/33 MHz PCI interface, and the multi-channel buffered serial ports (McBSPs). The configuration of these interfaces differs across the available devices, and in some cases the specific features of an interface are device specific.

For the example system, we chose the EMIF interface because it is common to all the c6x devices (with some minor variations in features and number of bits) and provides high performance (100 MHz or higher). EMIF has a variety of permutations, including support for 16-, 32-, or 64-bit transfers and asynchronous or synchronous signaling. For this example, we chose asynchronous signaling on the 32-bit interface.

FPGA Co-processor Architecture

When the DSP or GPP communicates with the co-processor, the efficiency of data movement often becomes the dominant factor in overall system performance. Today, high performance DSP processors rely on DMA controllers to minimize CPU overhead when communicating outside of the CPU core and its memory cache. Typically, the CPU core accesses cache memory as the primary memory in the core DSP algorithms, and the DMA engine is used to move data into and out of the cache memory. When interfacing to a co-processor, whether it is on-chip or on an adjacent FPGA, the co-processor must be interfaced to the cache memory via the DMA controller, thus freeing the CPU core to continue processing other tasks.

On the FPGA side, it is also advantageous to include a memory buffer that acts as a local cache for the co-processors. In this way, the DMA controller on the CPU side simply moves data from memory to memory, letting the CPU and the co-processors maintain a stronger independence. The modem example utilizes the FPGA co-processor defined in DSP Builder.
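
As a rough sketch of what this buffered, memory-mapped arrangement looks like from the DSP side, the C fragment below fills a co-processor buffer, writes a length, and sets a start bit. Every register name, offset, and bit definition is hypothetical and invented for this illustration. In a real system the base pointer would refer to an address in an EMIF chip-enable space, and the copy loop would normally be replaced by a DMA transfer as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register map, expressed as 32-bit word offsets. */
enum {
    COPRO_REG_CTRL   = 0,   /* write START to begin processing    */
    COPRO_REG_STATUS = 1,   /* DONE bit set by the co-processor   */
    COPRO_REG_LEN    = 2,   /* number of valid words in buffer    */
    COPRO_BUF_BASE   = 16,  /* first word of the local buffer     */
    COPRO_BUF_WORDS  = 64   /* buffer capacity in words           */
};
#define COPRO_CTRL_START  0x1u
#define COPRO_STATUS_DONE 0x1u

/* Copy n words into the co-processor buffer and start it.
 * Returns 0 on success, -1 if the data does not fit. */
int copro_submit(volatile uint32_t *base, const uint32_t *data, size_t n)
{
    assert(base != NULL && data != NULL);
    if (n > (size_t)COPRO_BUF_WORDS)
        return -1;
    for (size_t i = 0; i < n; i++)      /* a DMA would do this copy */
        base[COPRO_BUF_BASE + i] = data[i];
    base[COPRO_REG_LEN]  = (uint32_t)n;
    base[COPRO_REG_CTRL] = COPRO_CTRL_START;
    return 0;
}

/* Non-blocking check of the DONE status bit. */
int copro_done(const volatile uint32_t *base)
{
    return (base[COPRO_REG_STATUS] & COPRO_STATUS_DONE) != 0;
}
```

Passing the base address as a parameter, rather than hard-coding it, lets the same routines be exercised against an ordinary array on a host machine before the hardware is available.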

Figure 3: TI EMIF Interface to Modem FPGA Co-processor

Hardware/Software System Integration

Co-processors, by their very nature, change the software implementation from an algorithmic description to a data passing and function control description. The new function call initializes the co-processor and controls the flow of data to and from the co-processor. This interaction requires that hardware-specific information be made available to the software engineer, including addressing information for controlling the co-processor as well as source and destination address information. It also requires a description of the control structure of the co-processor. These capabilities can be pre-configured as software drivers that the software developer calls to control the FPGA co-processing dataflow.

SOPC Builder is a tool from Altera that can be used to integrate FPGA co-processing blocks into sub-systems that directly interface to standard processors. SOPC Builder can support a variety of IP types, including co-processors. Associated with each IP block is a predefined set of software routines used to configure and control that IP block. Within SOPC Builder, users identify which blocks to assemble and how they are parameterized and interconnected. SOPC Builder then automatically generates the hardware architecture as well as a software driver file called Excalibur.h. Excalibur.h includes all the software interfaces for the blocks in the system and automatically resolves them to the register and memory map defined by the user's architectural selections.

Figure 4: SOPC Builder Hardware and Software Integration Flow

SOPC Builder can include co-processors with both a parameterized hardware architecture definition and a full set of software routines to configure, communicate, and generate status information. When SOPC Builder is used to assemble a co-processing system, not only is the hardware architecture generated, but the software routines are also assembled into Excalibur.h. SOPC Builder can support external processors by implementing the targeted processor interface logic as an IP core that interfaces to the SOPC Builder Avalon bus [1]. Examples of this can include all the interfaces discussed above.

[1] The Avalon bus is a simple circuit-switched communication architecture supported by all Altera and 3rd party IP that supports SOPC Builder.

The modem example system utilizes SOPC Builder to integrate the DSP Builder transmit dataflow co-processor with the TI EMIF interface. When SOPC Builder executes, it creates the hardware for the Altera FPGA-based co-processor and the Excalibur.h software to control the co-processor from the attached CPU. The Excalibur.h file includes the addresses of all registers and memories inside the SOPC Builder system, as well as the associated software APIs for IP blocks that include APIs. This correct-by-construction file can accelerate system integration by months by eliminating error-prone and tedious manual development of the low-level software drivers. In addition, once blocks are integrated into SOPC Builder, they become easily reusable.

The development system enabling this kind of integration must have both a processor and an FPGA adjacent to each other, with the appropriate connections such that the FPGA can be integrated with the available processor busses. These development systems can be integrated onto a single board or be an integration of two or more development boards, each hosting a subset of the complete system components. For this example, Altera used its own DSP Development Kit, Stratix Edition, which includes a standard TI daughtercard connector allowing a direct connection to most of the TI development systems, including the standard kits for the c6x family of processors.

Conclusion

The modem.c example required 155,000 cycles to compute an iteration of the modem functionality. When the FPGA co-processor was added to the system architecture, the total dropped to 455 TI clock cycles. The modem co-processor consumes 6209 LEs, or about half of Altera's low-cost Cyclone EP1C12 device. Off-loading the modem to a co-processor enables an increase in channels, functionality, or performance, or a significant cost reduction through the use of a less expensive variant of the TI processor.

FPGA co-processing provides a powerful approach to increasing system performance and reducing costs without changing the software development environment or the DSP platform, except for the addition of a low-cost adjunct FPGA. In applications that are forced to use leading-edge DSPs for performance reasons, this approach can reduce costs by a factor of ten. This approach also provides a handy way to future-proof a system when future requirements may increase the processing performance demanded on a board.
This can be done by designing an empty FPGA socket onto the production boards that is not used until future evolutions of the system demand increased processing performance. Through straightforward software revisions and the inclusion of one or more FPGA co-processors, the overall system performance can be dramatically increased with minimal component cost increases to the system.
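
The cycle counts and resource figures quoted in the conclusion can be turned into ratios with a few lines of C. The helper names below are invented for this illustration. The 12,060 LE capacity used in the usage note for the EP1C12 is an assumption taken from the device family data sheet, not from this article.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of DSP cycles before and after off-loading. */
double speedup(uint32_t cycles_before, uint32_t cycles_after)
{
    assert(cycles_after > 0);
    return (double)cycles_before / (double)cycles_after;
}

/* Fraction of a device's logic elements used by a design. */
double le_fraction(uint32_t les_used, uint32_t les_available)
{
    assert(les_available > 0);
    return (double)les_used / (double)les_available;
}
```

With the article's numbers, `speedup(155000, 455)` is roughly 340, and `le_fraction(6209, 12060)` is a little over one half under the assumed device capacity.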