GENERAL-PURPOSE MICROPROCESSOR PERFORMANCE FOR DSP APPLICATIONS

J.N. Barkdull and S.C. Douglas
Department of Electrical Engineering
University of Utah
Salt Lake City, UT USA

(This research was supported by NSF Grant No. MIP.)

ABSTRACT

Digital signal processors (DSPs) have been used to realize real-time signal processing systems using hardware architectures and software instruction sets that are optimized for such applications. However, general-purpose microprocessors have risen in capability to the point that they can serve as alternative platforms for digital signal processing applications, particularly for audio-rate systems. This paper compares the capabilities of two general-purpose microprocessors, the Apple/IBM/Motorola PowerPC 604 and the Intel Pentium P5, with the popular Texas Instruments TMS320C40 DSP on a suite of three common signal processing subsystems: i) a finite-impulse-response (FIR) filter, ii) the least-mean-square (LMS) adaptive filter, and iii) the fast Fourier transform (FFT). Careful attention is paid to the architectures of the processors to obtain the most computationally-efficient realizations. The results indicate that general-purpose microprocessors are viable computational engines for audio-rate processing.

1. INTRODUCTION

Digital signal processing is a core technology for many of today's high-technology products in fields such as wireless communications, networking, and multimedia. One reason for the prevalence of digital signal processing technology has been the development of low-cost, powerful digital signal processors (DSPs) that provide engineers the reliable computing capability to implement these products cheaply and efficiently. Since the development of the first DSPs in the early 1980's, DSP architecture and design have evolved to the point where even sophisticated real-time processing of video-rate sequences can be performed. By contrast, general-purpose microprocessors serve as the computing engines for the personal computers and workstations that are in widespread use in business, education, and the home. Through the continual miniaturization of circuits brought about by improved semiconductor manufacturing and through clever architecture and bus design, engineers have improved the capabilities of microprocessors such that they are candidates for a wide range of applications, including products employing DSP technology.

In this paper, we compare the capabilities of a typical digital signal processor, the Texas Instruments TMS320C40, with two general-purpose microprocessors, the Apple/IBM/Motorola (AIM) PowerPC 604 and the Intel Pentium P5. Our particular application of interest is active noise control [1], although our results naturally extend to other DSP applications. Several key issues motivate our study:

Availability/Cost: General-purpose microprocessors are widely used in personal computers and are readily available at a progressively lower cost in the marketplace.

Performance: Manufacturers have begun to obtain extremely fast clock rates for their microprocessors. Moreover, they have begun to employ advanced computing architectures such that their performance has increased substantially in recent years. It should be noted that DSP designers have also advanced the capabilities of DSPs in recent years while maintaining the hardware efficiencies essential to fast processing of digital signals.
Coding/Maintainability of Code: While DSP subsystems usually allow a regular coding strategy, the combining of these subsystems with control logic into a larger system is better suited to the software tools available for microprocessor-based systems.

Novel Features: The use of a general-purpose microprocessor within a personal computer brings capabilities not previously available to a DSP-based system, such as the continuous display of coefficients and parameters on a monitor while the system is operating.

While our study mainly serves as a snapshot of the capabilities of the current technologies, we explore the issues that will influence the hardware design of DSP systems in the future.

The evaluation of computing architectures for DSP applications has a long history. An excellent review of the developments of DSPs is given in [2, 3], which discusses the architecture and hardware tradeoffs between various existing designs. In [4], the performances of several processors on DSP tasks are compared, and from this study, a hybrid RISC-DSP processor is proposed that combines the best features of the processors under study. It should be noted that some of the features of this hybrid processor have been incorporated into the latest generations of RISC and CISC processors, most notably the inclusion of several forms of the multiply-and-accumulate operation. In other cases, modern microprocessors incorporate design choices that differ from those used in the hybrid design, such as the superscalar execution of two instructions as opposed to a single parallel instruction performing a memory access and an arithmetic operation simultaneously.

Methods for benchmarking processors for DSP applications and a comparison of numerous processors on a large set of benchmark applications are described in [5]. From these tests, the choice of processor can be made according to the importance of a particular benchmarked algorithm in any given application. In our study, we focus on two general-purpose microprocessors not included in [5]. While our benchmark tests are not as extensive as those in [5], our results are directly applicable to active noise control and other audio-rate processing tasks.

The organization of the paper is as follows. After reviewing the major capabilities of DSPs and general-purpose microprocessors in the next section, we compare the performances of the three chosen processors on three tasks: i) the finite-impulse-response (FIR) filter, ii) the least-mean-square (LMS) adaptive filter, and iii) the fast Fourier transform (FFT). Careful attention has been paid to the processor architecture in each case to obtain the fastest floating-point implementation of each algorithm. Comparing the various implementations, we conclude that both the PowerPC 604 and Pentium P5 microprocessors are viable candidates for DSP applications. In particular, the floating-point performance of the PowerPC architecture makes it particularly desirable for FFT-based applications such as high-speed convolution and block adaptive filtering.

2. SYSTEM COMPARISON

2.1. General Features of the Processors

Perhaps the most distinguishing feature of all DSP architectures is the fast multiplier, which usually allows the computation of a fixed- or floating-point multiply in a single instruction cycle. More recently, the memory architectures of DSPs have been expanded to include multiple busses, and additional parallel operations have been included through the use of parallel and pipelined functional units, including parallel multiply-and-add and address generation units. The increased parallelism of general-purpose microprocessor designs has led to the development of superscalar processors that execute as many as four instructions in a single cycle in current designs. In addition, multiple execution units, large register files, and branch prediction units have been included within the architectures to increase their instruction execution rates. Recently, these designs have included single-cycle multiply and multiply-and-accumulate hardware units that are ideal for DSP implementations. Many of these enhancements now appear in modern RISC and CISC microprocessors.

Typically, modern DSPs and general-purpose microprocessors incorporate multiple execution and functional units to achieve their high performance; however, their parallel units are often utilized differently. All functional units on a typical DSP require a single instruction per clock cycle to execute their task. These units include the ALUs, such as the logic, adder, and multiplier units, as well as the address generation units (AGUs). In contrast, a superscalar microprocessor specifies the operation of each execution unit explicitly with one instruction per execution unit; i.e., floating-point, integer, load-store, and branch instructions are all separate from one another.
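In source terms, the multiply-and-accumulate operation that these units accelerate is simply acc + x*w. The sketch below is ours, for illustration only; it expresses the primitive using C99's fma() from <math.h>, which a compiler can map onto a single fused multiply-add instruction where the hardware provides one.

    #include <math.h>

    /* The MAC primitive that DSP hardware executes in one instruction
     * cycle, written in portable C (illustrative sketch, not from the
     * paper). fma(x, w, acc) computes acc + x*w in one rounding step;
     * an N-tap FIR filter is N of these operations back to back. */
    static double mac(double acc, double x, double w)
    {
        return fma(x, w, acc);
    }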
2.2. The Specific Processors Under Study

For our analysis, we have chosen the Texas Instruments TMS320C40 processor to represent a typical DSP. We compare this processor to the RISC-based AIM PowerPC 604 and the CISC-based Intel Pentium P5 microprocessors. Table 1 illustrates some of the architectural differences between these processors.

Note that the current clock rates of the two microprocessors are about twice as fast as that of the TMS320C40. In addition, both microprocessors execute one instruction per clock cycle, whereas the TMS320C40 executes one instruction every two clock cycles. Thus, the instruction rate of the TMS320C40 is about one-fourth that of the microprocessors. To avoid problems with differing semiconductor technology, we shall compare the number of instructions required by each processor to compute the various benchmarks, in addition to their processing times.

The TMS320C40 is a floating-point processor that is pipelined and highly parallel, thus allowing fast and efficient computations. Its use of several internal busses allows two memory and two register accesses per clock cycle, thus maintaining the throughput of its parallel multiplier and adder. The two address-generation units allow parallel data memory accesses and auxiliary register updates. The TMS320C40 also utilizes delayed branches, a loop counter, and circular buffers to remove looping overhead and the need to shift data samples in memory. For further information on this architecture, the reader is referred to [6].

The PowerPC 604 (PPC604) is a superscalar RISC microprocessor that contains six functional units that operate independently and in parallel. The PPC604 also includes 32 general-purpose registers, 32 double-precision floating-point registers, and several special-purpose registers. This processor is capable of issuing, out of order, up to four instructions per cycle. Its branch prediction unit enables zero-overhead branching, thus reducing extensive loop overhead and the need for full loop unrolling, which reduces the number of iterations in each loop by duplicating the operations to be performed in each loop iteration. Loop unrolling usually increases performance by reducing overhead caused by branching and by enabling more extensive reordering of instructions to better utilize the execution pipelines. The load/store unit (LSU) can load or store one operand per cycle, and the pipelined floating-point unit (FPU) allows a single-cycle multiply-and-accumulate operation. For further information, see [7].

Like the PPC604, the Pentium processor is superscalar and contains two pipelined processing units. Most instructions can be executed with single-cycle throughput and are hardwire-encoded, similar to a RISC processor. However, the Pentium also has more complex instructions requiring microcode ROM to execute; thus, it is a CISC microprocessor. Simple instructions such as adds and subtracts can be executed independently within the two pipelines. Floating-point operations can only be paired with a floating-point exchange (FXCH) instruction, however, a constraint that limits the floating-point performance of this processor. The Pentium contains eight general registers, several control and status registers, and eight floating-point registers that are each 80 bits long, for an overall accuracy that is better than that of double precision. The floating-point registers are used to implement a stack architecture, such that most operands come from the top of the stack and results are placed on the top of the stack. This architecture presents a top-of-stack bottleneck that can only be alleviated via an FXCH instruction. Since the Pentium does not use a load/store architecture, one of the floating-point operands can come from memory without a performance penalty.
This processor also uses branch prediction techniques for increased performance, with a minimum of one cycle per branch. For further information, see [8].
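Loop unrolling and instruction reordering recur throughout the implementations described below; the FIR loops are unrolled to two taps per iteration, and the Pentium LMS loop to three. As a concrete illustration (our C sketch, not code from the paper), a two-way unrolled FIR inner loop looks like this:

    /* Two-way unrolled FIR inner loop (illustrative sketch; the name
     * fir_unrolled is ours). Unrolling halves the branch overhead, and
     * the two independent accumulators break the dependence chain so a
     * pipelined FPU can start a new multiply-add every cycle. N is
     * assumed even. */
    static double fir_unrolled(const double *x, const double *w, int N)
    {
        double acc0 = 0.0, acc1 = 0.0;
        for (int i = 0; i < N; i += 2) {
            acc0 += x[i]     * w[i];
            acc1 += x[i + 1] * w[i + 1];
        }
        return acc0 + acc1;
    }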

                            TMS320C40               PowerPC 604              Pentium P5
    On-Chip Memory          8 KB D and I / 512 B I  16 KB I/D                8 KB I/D
    Floating-Point Unit     own format              IEEE                     IEEE
    Multiply-Accumulate     parallel                pipelined                no
    Data Paths from Memory  2                       1                        1
    Parallel Units          multiplier, acc./ALU,   3 integer, FPU, LSU,     2 pipelines; restricted:
                            2 addr. gen.            BPU; 4 instr/cycle       two simple integer, or
                                                    sustained                simple fp and FXCH
    Branch                  delay slots,            dynamic prediction       dynamic prediction
                            block repeat
    Clock Speed             33 MHz (66 MHz int.)    120 MHz                  120 MHz
    Miscellaneous           circular buffer,        register-index and       -
                            extensive addressing    address-update modes,
                            modes                   out-of-order exec.

Table 1: A summary of the architectural differences between the processors.

3. THE ALGORITHM SUITE

To quantitatively measure each processor's performance, we have chosen three benchmark tasks: a finite-impulse-response (FIR) filter, the least-mean-square (LMS) adaptive filter, and the fast Fourier transform (FFT).

An FIR filter is implemented using a series of multiply-and-accumulate (MAC) operations within a tight data loop. The MAC operation is a basic operation used in many DSP tasks, and it must be efficiently implemented within the processor for best performance. The LMS adaptive filter is widely used in many real-time applications such as active noise control, echo cancellation, system identification, and adaptive control. In this filter, the coefficients of an FIR filter are adjusted to reduce the magnitude of the error signal formed as the difference between the output of the filter and a desired signal. The decimation-in-frequency FFT is an efficient implementation of the discrete Fourier transform (DFT) of a finite-length sequence. The FFT is useful for performing fast convolution for FIR filters and block LMS adaptive filters. Since the FFT is implemented using the so-called butterfly structure, the optimization of the butterfly computations is critical to obtaining an efficient FFT implementation.

The implementation of each algorithm on each processor begins with a simple coding of the algorithm, i.e., a straightforward sequential coding of the algorithm's operations assuming no parallelism within the calculations. Data flow diagrams are then constructed from the code and are altered to take advantage of the parallel nature of each processor. The resulting optimized diagram is then used to derive the optimized code. Other optimizations such as loop unrolling and instruction reordering are used in this step to obtain the most efficient coding possible in each case.

We illustrate this method in realizing the LMS adaptive filter on the TMS320C40 DSP. The body of the loop is shown in Figure 1. In this code listing, we have assigned shortened names to each of the instructions for subsequent use.

    MPYF3   *AR1++(1)%,R4,R1    ; change in coeff
    ADDF3   *AR0,R1,R2          ; w(n+1) = w(n) + change
    STF     R2,*AR0--(1)        ; save w(n+1)
    MPYF3   *AR1,R2,R0          ; x*w
    ADDF3   R0,R3,R3            ; Acc = Acc + x*w
                                ; 5 cycles/tap

Figure 1: A straightforward implementation of the LMS adaptive filter on the TMS320C40 DSP, with shortened names given on the right.

Figure 2 depicts the optimized data flow obtainable after careful study of the processor architecture. In this diagram, time runs from top to bottom, and the execution is divided into two columns representing the two operations performed in parallel per instruction cycle. Figure 3 shows the complete optimized coding of the algorithm. Here, the optimized inner loop is obtained directly from the data flow diagram. In addition, we have used a delayed block repeat to remove loop overhead.
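For reference, the computation that Figures 1-3 realize is the standard LMS recursion. A plain C restatement (ours, written in the textbook two-pass form; the assembly fuses the coefficient update and the output accumulation into a single loop) is:

    /* One iteration of the true (non-delayed) LMS adaptive filter, as an
     * illustrative C sketch; the identifiers lms_step and mu are ours.
     * w holds the N coefficients, x the N most recent input samples,
     * and d the current desired-signal sample. */
    double lms_step(double *w, const double *x, int N, double d, double mu)
    {
        double y = 0.0;
        for (int i = 0; i < N; i++)
            y += w[i] * x[i];         /* filter output: y = w^T x */
        double e = d - y;             /* error against the desired signal */
        for (int i = 0; i < N; i++)
            w[i] += mu * e * x[i];    /* w(n+1) = w(n) + mu*e(n)*x(n) */
        return e;
    }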
Note that this implementation of the LMS adaptive filter provides a true LMS coefficient update, whereas the code provided in [6] implements the delayed LMS algorithm.

The implementation of the algorithms on the PowerPC followed a similar procedure. The optimized data flow diagram for the resulting implementation exploits the parallel execution units of the PPC604. In these optimizations, we have performed minimal loop unrolling and extensive instruction reordering to take complete advantage of this processor's floating-point pipeline. The straightforward implementations on the Pentium have also employed loop unrolling and instruction reordering to fully utilize this processor's pipelines. To avoid the top-of-stack bottleneck inherent in the Pentium's floating-point unit, FXCH instructions have been paired with other floating-point instructions.

4. PERFORMANCE COMPARISON

4.1. FIR Filter

The TMS320C40's parallel multiply-and-add operation enables this processor to implement an FIR filter efficiently. In addition, circular buffers are supported directly in the hardware. In the program, loop initialization and termination are explicitly implemented.
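The hardware circular addressing used here (the *ARn++(1)% operands of Figures 1 and 3) has no counterpart on the two microprocessors; where the discussion below notes that circular buffers must be implemented in software, the extra work amounts to an explicit index-wrap test such as this sketch of ours:

    /* Software circular-buffer index update, as required on the PPC604
     * and Pentium (illustrative sketch; circ_next is our name). The
     * TMS320C40 performs the equivalent wrap in hardware as a side
     * effect of its circular addressing mode. idx lies in [0, N). */
    static inline int circ_next(int idx, int N)
    {
        return (idx + 1 < N) ? idx + 1 : 0;  /* wrap without a divide */
    }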

Figure 2: The reorganized data flow diagram for the LMS adaptive filter implemented on the TMS320C40, obtained by manipulating the flow diagram generated from the naive code.

The C40 FIR loop consists of a single parallel multiply-add, such that the output can be accumulated; a pre-loop multiply and a post-loop add are also required. This implementation requires only O(N) clock cycles for an N-coefficient filter.

The PPC604 also performs a multiply-and-add operation in a single clock cycle due to the pipelined structure of its FPU. However, the LSU can only provide a single value per cycle to the FPU's register file. In addition, the circular buffers required for the FIR filter must be implemented in software. Due to the latency of the floating-point pipeline, the FIR loop was unrolled to calculate two MACs per iteration. This implementation requires O(2N) cycles for the filter.

As in the PPC604, the FIR loop for the Pentium must also be unrolled to perform two MACs per clock cycle due to the processor's pipeline. However, the Pentium lacks both the memory bandwidth and an efficient parallel or pipelined MAC operation. Since floating-point operations cannot be performed in parallel with the integer instructions in the two pipelines, the index and circular-index operations must be executed separately. Therefore, the FIR filter requires O(4N) cycles on this processor.

4.2. The LMS Adaptive Filter

The implementation of the LMS adaptive filter on the TMS320C40 is similar to that of the FIR filter. An additional MAC operation is needed to update the filter coefficients, which balances the memory-access and computation instruction counts. Five operations are required within the loop to implement this system; however, two parallel pairs of operations can be used to reduce the loop length to O(3N) cycles for the N-coefficient adaptive filter.

On the PPC604, the speed with which the LMS algorithm can be implemented is limited by the bandwidth of the LSU. As in the FIR filter, we have unrolled the main loop to calculate two taps per iteration so that the processor's pipeline is efficiently used. Because of its highly parallel architecture, this processor also requires O(3N) operations to implement an LMS adaptive filter.

    LMS:    RPTBD   LOOP                ; setup the delayed repeat block
            MPYF3   *AR1,R4,R1          ; change in coeff
            STF     R5,*AR1++(1)%       ; replace oldest with newest data
            ADDF3   *AR0,R1,R2          ; w(n+1) = w(n) + change
            NOP
    BLOCK:  MPYF3   *AR1++(1)%,R4,R1    ; change in coeff
    ||      STF     R2,*AR0--(1)        ; save w(n+1)
            MPYF3   *AR1,R2,R0          ; x*w
    ||      ADDF3   *AR0,R1,R2          ; w(n+1) = w(n) + change
    LOOP:   ADDF3   R0,R3,R3            ; Acc = Acc + x*w
            BUD     R11                 ; delayed return
            MPYF3   *AR1,R2,R0          ; x*w
    ||      STF     R2,*AR0--(1)        ; save w(n+1)
            ADDF3   R0,R3,R3            ; Acc = Acc + x*w
            NOP
                                        ; 3 cycles/tap

Figure 3: The optimized implementation of the LMS adaptive filter on the TMS320C40.

As for the Pentium, its lack of a parallel multiply-add in its floating-point pipeline and the load and store operations necessary to use the FPU hamper the implementation of the LMS adaptive filter on this processor. Use of the FXCH instruction prevents the top of the floating-point stack from becoming a bottleneck in the implementation. Similar to the PPC604 implementation, the main loop must be unrolled three times, where each loop contains 22 instructions. Thus, the Pentium requires O(7.333N) cycles to compute one iteration of the LMS adaptive filter.

4.3. The FFT Algorithm

For these implementations, we have employed in-place computations to calculate the decimation-in-frequency FFT.
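Each stage of this FFT is built from radix-2 butterflies. As a point of reference (our C sketch, not the paper's code), a decimation-in-frequency butterfly costs one complex add, one complex subtract, and one complex twiddle multiply, and the multiply disappears whenever the twiddle factor is unity, which is the optimization discussed next:

    #include <complex.h>

    /* Radix-2 decimation-in-frequency butterfly (illustrative sketch;
     * dif_butterfly is our name). The sum leg needs no multiply; the
     * difference leg is scaled by the twiddle factor W, and when W = 1
     * the multiply can be skipped entirely. */
    static void dif_butterfly(double complex *a, double complex *b,
                              double complex W)
    {
        double complex t = *a - *b;    /* difference leg */
        *a += *b;                      /* sum leg: add only */
        *b = (W == 1.0) ? t : t * W;   /* unity twiddle: no multiply */
    }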
In this case, we have optimized the FFT implementations on the TMS320C40 and Pentium processors to make use of unity-valued twiddle factors W^k = e^(-j 2 pi k / N) wherever possible. This optimization is not used in the PPC604 implementation, as the butterfly operation is limited by the processor's memory bandwidth. For an N-point FFT, where N is a power of two, the numbers of instruction cycles required for the implementations are

    N_FFT,TMS = 4.5 N log2(N) + 6N + 10 log2(N) - 8
    N_FFT,PPC = 4 N log2(N) + 7N + 4 log2(N) + 1
    N_FFT,P5  = 11.5 N log2(N) - N + 2 log2(N) + 2

Figure 4 plots the actual time required by each processor to complete the FFT as a function of the number of stages, log2(N). For large values of N, the dominating term in each case is of the form K N log2(N), where N is the size of the FFT and K is a scaling constant.
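As a concrete check of these expressions (our arithmetic, not from the paper): for N = 1024, so that log2(N) = 10, the formulas give approximately 52,316 cycles for the TMS320C40, 48,169 for the PPC604, and 116,758 for the Pentium P5. At the instruction rates of Table 1 (33 MHz, 120 MHz, and 120 MHz, respectively), these counts correspond to roughly 1.6 ms, 0.40 ms, and 0.97 ms, so the PPC604 needs the fewest cycles outright, while the Pentium's higher instruction rate still lets it finish well ahead of the TMS320C40 in actual time.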

5. RESULTS AND CONCLUSION

Table 2 shows the cycle counts for the various algorithms on the different processors, where N is the length of the FIR filter, the length of the LMS adaptive filter, and the size of the FFT, respectively.

    Processor      FIR      LMS         FFT
    TMS320C40      O(N)     O(3N)       O(4.5N log2 N)
    PowerPC 604    O(2N)    O(3N)       O(4N log2 N)
    Pentium P5     O(4N)    O(7.333N)   O(11.5N log2 N)

Table 2: Complexity of the three algorithms on the TMS320C40, PowerPC 604, and Pentium P5 processors. The table shows the number of instruction cycles for each algorithm; only the order of the number of cycles is given, where N refers to the number of filter taps and the length of the FFT, respectively.

Figure 4: Execution time in microseconds required to calculate the FFT on the three processors, plotted against the number of stages, log2(N).

As can be seen, the Pentium processor does not perform as well as the other two processors on any of the algorithms because it does not have nearly the level of parallelism of the other two processors. Note that the PowerPC 604 processor is as efficient in instruction cycles as the TMS320C40 processor in implementing the LMS adaptive filter, and it is more efficient in implementing the FFT. Moreover, if we consider the instruction cycle times of the three processors, the Pentium and PowerPC processors outperform the TMS320C40 DSP in terms of actual computing time on all three tasks.

The results indicate that general-purpose microprocessors are viable candidates for DSP applications, and for audio-rate systems in particular. Our results suggest that standard processors in personal computers can be employed to perform digital signal processing tasks with efficiencies that are comparable to standard DSP chips. Such implementations could benefit from the flexible programming environment of the personal computer while still providing a platform for real-time signal processing. Our results also indicate that the delivery of an advanced signal processing solution would only require a software program to be installed on the computer, so long as the hardware met minimal configuration requirements.

Although the focus of this paper has been on the timing performance of representative processors, an applications engineer must also consider other issues such as chip or board cost, the richness of the software development tools, and the availability of ready-to-use software routines when choosing a computing platform. Numerous development tools for general-purpose microprocessors are available from a number of vendors, but there are relatively few sources of optimized DSP software libraries for even the most popular microprocessors. It is likely that the nature of the end application will strongly dictate the choice of computing platform.

New generations of processors and architectures for both DSP and general-purpose computing could very well change the outcome of our evaluation in the near future. For example, faster Pentium and PowerPC chips have been released since this work began. Also, new multimedia chip designs under development by a number of companies are likely to include very-long-instruction-word (VLIW) instructions, hardware threads to remove pipeline latency, and numerous parallel execution units. From these developments, it is clear that the applications engineer will have numerous choices of processing platform upon which to develop DSP products in the future.

REFERENCES

[1] S.M. Kuo and D.R. Morgan, Active Noise Control Systems: Algorithms and DSP Implementations (New York: Wiley-Interscience, 1996).
[2] E.A. Lee, "Programmable DSP architectures: Part I," IEEE Signal Processing Magazine, vol. 5, no. 4, pp. 4-19, October 1988.
[3] E.A. Lee, "Programmable DSP architectures: Part II," IEEE Signal Processing Magazine, vol. 6, no. 1, pp. 4-14, January 1989.
[4] M.R. Smith, "How RISCy is DSP?" IEEE Micro, vol. 12, no. 6, December 1992.
[5] P. Lapsley and G. Blalock, "How to estimate DSP processor performance," IEEE Spectrum, vol. 33, no. 7, July 1996.
[6] TMS320C4x User's Guide (Dallas, TX: Texas Instruments, Inc., 1991).
[7] PowerPC 604 RISC Microprocessor User's Manual (Phoenix, AZ: Motorola, Inc., 1994).
[8] Pentium Processor Family Developer's Manual, vols. 1 and 2 (Mt. Prospect, IL: Intel Corporation, 1995).
