GENERAL-PURPOSE MICROPROCESSOR PERFORMANCE FOR DSP APPLICATIONS

J.N. Barkdull and S.C. Douglas
Department of Electrical Engineering
University of Utah
Salt Lake City, UT USA

(This research was supported by NSF Grant No. MIP.)

ABSTRACT

Digital signal processors (DSPs) have been used to realize real-time signal processing systems using hardware architectures and software instruction sets that are optimized for such applications. However, general-purpose microprocessors have risen in capability to the point that they can serve as alternative platforms for digital signal processing applications, particularly for audio-rate systems. This paper compares the capabilities of two general-purpose microprocessors, the Apple/IBM/Motorola PowerPC 604 and the Intel Pentium P5, with the popular Texas Instruments TMS320C40 DSP on a suite of three common signal processing subsystems: i) a finite-impulse-response (FIR) filter, ii) the least-mean-square (LMS) adaptive filter, and iii) the fast Fourier transform (FFT). Careful attention is paid to the architectures of the processors to obtain the most computationally-efficient realizations. The results indicate that general-purpose microprocessors are viable computational engines for audio-rate processing.

1. INTRODUCTION

Digital signal processing is a core technology for many of today's high-technology products in fields such as wireless communications, networking, and multimedia. One reason for the prevalence of digital signal processing technology has been the development of low-cost, powerful digital signal processors (DSPs) that provide engineers the reliable computing capability to implement these products cheaply and efficiently. Since the development of the first DSPs in the early 1980's, DSP architecture and design have evolved to the point where even sophisticated real-time processing of video-rate sequences can be performed. By contrast, general-purpose microprocessors serve as the computing engines for the personal computers and workstations that are in widespread use in business, education, and the home. Through the continual miniaturization of circuits brought about by improved semiconductor manufacturing and through clever architecture and bus design, engineers have improved the capabilities of microprocessors such that they are candidates for a wide range of applications, including products employing DSP technology.

In this paper, we compare the capabilities of a typical digital signal processor, the Texas Instruments TMS320C40, with two general-purpose microprocessors, the Apple/IBM/Motorola (AIM) PowerPC 604 and the Intel Pentium P5. Our particular application of interest is active noise control [1], although our results naturally extend to other DSP applications. Several key issues motivate our study:

Availability/Cost: General-purpose microprocessors are widely used in personal computers and are readily available at a progressively lower cost in the marketplace.

Performance: Manufacturers have begun to obtain extremely fast clock rates for their microprocessors. Moreover, they have begun to employ advanced computing architectures such that their performance has increased substantially in recent years. It should be noted that DSP designers have also advanced the capabilities of DSPs in recent years while maintaining the hardware efficiencies essential to fast processing of digital signals.
Coding/Maintainability of Code: While DSP subsystems usually allow a regular coding strategy, the combining of these subsystems with control logic into a larger system is better suited to the software tools available for microprocessor-based systems.

Novel Features: The use of a general-purpose microprocessor within a personal computer brings capabilities not previously available to a DSP-based system, such as the continuous display of coefficients and parameters on a monitor while the system is operating.

While our study mainly serves as a snapshot of the capabilities of the current technologies, we explore the issues that will influence the hardware design of DSP systems in the future.

The evaluation of computing architectures for DSP applications has a long history. An excellent review of the developments of DSPs is given in [2, 3], which discusses the architecture and hardware tradeoffs between various existing designs. In [4], the performances of several processors on DSP tasks are compared, and from this study, a hybrid RISC-DSP processor is proposed that combines the best features of the processors under study. It should be noted that some of the features of this hybrid processor have been incorporated into the latest generations of RISC and CISC processors, most notably the inclusion of several forms of the multiply-and-accumulate operation. In other cases, modern microprocessors incorporate design choices that differ from those used in the hybrid design, such as the superscalar execution of two instructions as opposed to a single parallel instruction performing a memory access and an arithmetic operation simultaneously.

Methods for benchmarking processors for DSP applications and a comparison of numerous processors on a large set of benchmark applications are described in [5]. From these tests, the choice of processor can be made according to the importance of a particular benchmarked algorithm in any given application. In our study, we focus on two general-purpose microprocessors not included in [5]. While our benchmark tests are not as extensive as those in [5], our results are directly applicable to active noise control and other audio-rate processing tasks.

The organization of the paper is as follows. After reviewing the major capabilities of DSPs and general-purpose microprocessors in the next section, we compare the performances of the three chosen processors on three tasks: i) the finite-impulse-response (FIR) filter, ii) the least-mean-square (LMS) adaptive filter, and iii) the fast Fourier transform (FFT). Careful attention has been paid to the processor architecture in each case to obtain the fastest floating-point implementation of each algorithm. Comparing the various implementations, we conclude that both the PowerPC 604 and Pentium P5 microprocessors are viable candidates for DSP applications. In particular, the floating-point performance of the PowerPC architecture makes it particularly desirable for FFT-based applications such as high-speed convolution and block adaptive filtering.

2. SYSTEM COMPARISON

2.1. General Features of the Processors

Perhaps the most distinguishing feature of all DSP architectures is the fast multiplier, which usually allows the computation of a fixed- or floating-point multiply in a single instruction cycle. More recently, the memory architectures of DSPs have been expanded to include multiple busses, and additional parallel operations have been included through the use of parallel and pipelined functional units, including parallel multiply-and-add and address generation units. The increased parallelism of general-purpose microprocessor designs has led to the development of superscalar processors that execute as many as four instructions in a single cycle in current designs. In addition, multiple execution units, large register files, and branch prediction units have been included within the architectures to increase their instruction execution rates. Recently, these designs have included single-cycle multiply and multiply-and-accumulate hardware units that are ideal for DSP implementations. Many of these enhancements now appear in modern RISC and CISC microprocessors.

Typically, modern DSPs and general-purpose microprocessors incorporate multiple execution and functional units to achieve their high performance; however, their parallel units are often utilized differently. All functional units on a typical DSP require a single instruction per clock cycle to execute their task. These units include the ALUs, such as the logic, adder, and multiplier units, as well as the address generation units (AGUs). In contrast, a superscalar microprocessor specifies the operation of each execution unit explicitly with one instruction per execution unit; i.e., floating-point, integer, load-store, and branch instructions are all separate from one another.
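In source terms, the multiply-and-accumulate operation that these units accelerate is simply acc + x*w. The sketch below is ours, for illustration only; it expresses the primitive using C99's fma() from <math.h>, which a compiler can map onto a single fused multiply-add instruction where the hardware provides one.

    #include <math.h>

    /* The MAC primitive that DSP hardware executes in one instruction
     * cycle, written in portable C (illustrative sketch, not from the
     * paper). fma(x, w, acc) computes acc + x*w in one rounding step;
     * an N-tap FIR filter is N of these operations back to back. */
    static double mac(double acc, double x, double w)
    {
        return fma(x, w, acc);
    }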
2.2. The Specific Processors Under Study

For our analysis, we have chosen the Texas Instruments TMS320C40 processor to represent a typical DSP. We compare this processor to the RISC-based AIM PowerPC 604 and the CISC-based Intel Pentium P5 microprocessors. Table 1 illustrates some of the architectural differences between these processors.

Note that the current clock rates of the two microprocessors are about twice as fast as that of the TMS320C40. In addition, both microprocessors execute one instruction per clock cycle, whereas the TMS320C40 executes one instruction every two clock cycles. Thus, the instruction rate of the TMS320C40 is about one-fourth that of the microprocessors. To avoid problems with differing semiconductor technology, we shall compare the number of instructions required by each processor to compute the various benchmarks, in addition to their processing times.

The TMS320C40 is a floating-point processor that is pipelined and highly parallel, thus allowing fast and efficient computations. Its use of several internal busses allows two memory and two register accesses per clock cycle, thus maintaining the throughput of its parallel multiplier and adder. The two address-generation units allow parallel data memory accesses and auxiliary register updates. The TMS320C40 also utilizes delayed branches, a loop counter, and circular buffers to remove looping overhead and the need to shift data samples in memory. For further information on this architecture, the reader is referred to [6].

The PowerPC 604 (PPC604) is a superscalar RISC microprocessor that contains six functional units that operate independently and in parallel. The PPC604 also includes 32 general-purpose registers, 32 double-precision floating-point registers, and several special-purpose registers. This processor is capable of issuing, out of order, up to four instructions per cycle. Its branch prediction unit enables zero-overhead branching, thus reducing extensive loop overhead and the need for full loop unrolling, which reduces the number of iterations in each loop by duplicating the operations to be performed in each loop iteration. Loop unrolling usually increases performance by reducing overhead caused by branching and by enabling more extensive reordering of instructions to better utilize the execution pipelines. The load/store unit (LSU) can load or store one operand per cycle, and the pipelined floating-point unit (FPU) allows a single-cycle multiply-and-accumulate operation. For further information, see [7].

Like the PPC604, the Pentium processor is superscalar and contains two pipelined processing units. Most instructions can be executed with single-cycle throughput and are hardwire-encoded, similar to a RISC processor. However, the Pentium also has more complex instructions requiring microcode ROM to execute; thus, it is a CISC microprocessor. Simple instructions such as adds and subtracts can be executed independently within the two pipelines. Floating-point operations can only be paired with a floating-point exchange (FXCH) instruction, however, a constraint that limits the floating-point performance of this processor. The Pentium contains eight general registers, several control and status registers, and eight floating-point registers that are each 80 bits long, for an overall accuracy that is better than that of double precision. The floating-point registers are used to implement a stack architecture, such that most operands come from the top of the stack and results are placed on the top of the stack. This architecture presents a top-of-stack bottleneck that can only be alleviated via an FXCH instruction. Since the Pentium does not use a load/store architecture, one of the floating-point operands can come from memory without a performance penalty.
This processor also uses branch prediction techniques for increased performance, with a minimum of one cycle per branch. For further information, see [8].
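Loop unrolling and instruction reordering recur throughout the implementations described below; the FIR loops are unrolled to two taps per iteration, and the Pentium LMS loop to three. As a concrete illustration (our C sketch, not code from the paper), a two-way unrolled FIR inner loop looks like this:

    /* Two-way unrolled FIR inner loop (illustrative sketch; the name
     * fir_unrolled is ours). Unrolling halves the branch overhead, and
     * the two independent accumulators break the dependence chain so a
     * pipelined FPU can start a new multiply-add every cycle. N is
     * assumed even. */
    static double fir_unrolled(const double *x, const double *w, int N)
    {
        double acc0 = 0.0, acc1 = 0.0;
        for (int i = 0; i < N; i += 2) {
            acc0 += x[i]     * w[i];
            acc1 += x[i + 1] * w[i + 1];
        }
        return acc0 + acc1;
    }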

                            TMS320C40               PowerPC 604              Pentium P5
    On-Chip Memory          8 KB D and I / 512 B I  16 KB I/D                8 KB I/D
    Floating-Point Unit     own format              IEEE                     IEEE
    Multiply-Accumulate     parallel                pipelined                no
    Data Paths from Memory  2                       1                        1
    Parallel Units          multiplier, acc./ALU,   3 integer, FPU, LSU,     2 pipelines; restricted:
                            2 addr. gen.            BPU; 4 instr/cycle       two simple integer, or
                                                    sustained                simple fp and FXCH
    Branch                  delay slots,            dynamic prediction       dynamic prediction
                            block repeat
    Clock Speed             33 MHz (66 MHz int.)    120 MHz                  120 MHz
    Miscellaneous           circular buffer,        register-index and       -
                            extensive addressing    address-update modes,
                            modes                   out-of-order exec.

Table 1: A summary of the architectural differences between the processors.

3. THE ALGORITHM SUITE

To quantitatively measure each processor's performance, we have chosen three benchmark tasks: a finite-impulse-response (FIR) filter, the least-mean-square (LMS) adaptive filter, and the fast Fourier transform (FFT).

An FIR filter is implemented using a series of multiply-and-accumulate (MAC) operations within a tight data loop. The MAC operation is a basic operation used in many DSP tasks, and it must be efficiently implemented within the processor for best performance. The LMS adaptive filter is widely used in many real-time applications such as active noise control, echo cancellation, system identification, and adaptive control. In this filter, the coefficients of an FIR filter are adjusted to reduce the magnitude of the error signal formed as the difference between the output of the filter and a desired signal. The decimation-in-frequency FFT is an efficient implementation of the discrete Fourier transform (DFT) of a finite-length sequence. The FFT is useful for performing fast convolution for FIR filters and block LMS adaptive filters. Since the FFT is implemented using the so-called butterfly structure, the optimization of the butterfly computations is critical to obtaining an efficient FFT implementation.

The implementation of each algorithm on each processor begins with a simple coding of the algorithm, i.e., a straightforward sequential coding of the algorithm's operations assuming no parallelism within the calculations. Data flow diagrams are then constructed from the code and are altered to take advantage of the parallel nature of each processor. The resulting optimized diagram is then used to derive the optimized code. Other optimizations such as loop unrolling and instruction reordering are used in this step to obtain the most efficient coding possible in each case.

We illustrate this method in realizing the LMS adaptive filter on the TMS320C40 DSP. The body of the loop is shown in Figure 1. In this code listing, we have assigned shortened names to each of the instructions for subsequent use.

    MPYF3   *AR1++(1)%,R4,R1    ; change in coeff
    ADDF3   *AR0,R1,R2          ; w(n+1) = w(n) + change
    STF     R2,*AR0--(1)        ; save w(n+1)
    MPYF3   *AR1,R2,R0          ; x*w
    ADDF3   R0,R3,R3            ; Acc = Acc + x*w
                                ; 5 cycles/tap

Figure 1: A straightforward implementation of the LMS adaptive filter on the TMS320C40 DSP, with shortened names given on the right.

Figure 2 depicts the optimized data flow obtainable after careful study of the processor architecture. In this diagram, time runs from top to bottom, and the execution is divided into two columns representing the two operations performed in parallel per instruction cycle. Figure 3 shows the complete optimized coding of the algorithm. Here, the optimized inner loop is obtained directly from the data flow diagram. In addition, we have used a delayed block repeat to remove loop overhead.
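For reference, the computation that Figures 1-3 realize is the standard LMS recursion. A plain C restatement (ours, written in the textbook two-pass form; the assembly fuses the coefficient update and the output accumulation into a single loop) is:

    /* One iteration of the true (non-delayed) LMS adaptive filter, as an
     * illustrative C sketch; the identifiers lms_step and mu are ours.
     * w holds the N coefficients, x the N most recent input samples,
     * and d the current desired-signal sample. */
    double lms_step(double *w, const double *x, int N, double d, double mu)
    {
        double y = 0.0;
        for (int i = 0; i < N; i++)
            y += w[i] * x[i];         /* filter output: y = w^T x */
        double e = d - y;             /* error against the desired signal */
        for (int i = 0; i < N; i++)
            w[i] += mu * e * x[i];    /* w(n+1) = w(n) + mu*e(n)*x(n) */
        return e;
    }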
Note that this implementation of the LMS adaptive filter provides a true LMS coefficient update, whereas the code provided in [6] implements the delayed LMS algorithm.

The implementation of the algorithms on the PowerPC followed a similar procedure. The optimized data flow diagram for the resulting implementation exploits the parallel execution units of the PPC604. In these optimizations, we have performed minimal loop unrolling and extensive instruction reordering to take complete advantage of this processor's floating-point pipeline. The straightforward implementations on the Pentium have also employed loop unrolling and instruction reordering to fully utilize this processor's pipelines. To avoid the top-of-stack bottleneck inherent in the Pentium's floating-point unit, FXCH instructions have been paired with other floating-point instructions.

4. PERFORMANCE COMPARISON

4.1. FIR Filter

The TMS320C40's parallel multiply-and-add operation enables this processor to implement an FIR filter efficiently. In addition, circular buffers are supported directly in the hardware. In the program, loop initialization and termination are explicitly implemented.
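The hardware circular addressing used here (the *ARn++(1)% operands of Figures 1 and 3) has no counterpart on the two microprocessors; where the discussion below notes that circular buffers must be implemented in software, the extra work amounts to an explicit index-wrap test such as this sketch of ours:

    /* Software circular-buffer index update, as required on the PPC604
     * and Pentium (illustrative sketch; circ_next is our name). The
     * TMS320C40 performs the equivalent wrap in hardware as a side
     * effect of its circular addressing mode. idx lies in [0, N). */
    static inline int circ_next(int idx, int N)
    {
        return (idx + 1 < N) ? idx + 1 : 0;  /* wrap without a divide */
    }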

Figure 2: The reorganized data flow diagram for the LMS adaptive filter implemented on the TMS320C40, obtained by manipulating the flow diagram generated from the naive code.

The C40 FIR loop consists of a single parallel multiply-add, such that the output can be accumulated; a pre-loop multiply and a post-loop add are also required. This implementation requires only O(N) clock cycles for an N-coefficient filter.

The PPC604 also performs a multiply-and-add operation in a single clock cycle due to the pipelined structure of its FPU. However, the LSU can only provide a single value per cycle to the FPU's register file. In addition, the circular buffers required for the FIR filter must be implemented in software. Due to the latency of the floating-point pipeline, the FIR loop was unrolled to calculate two MACs per iteration. This implementation requires O(2N) cycles for the filter.

As in the PPC604, the FIR loop for the Pentium must also be unrolled to perform two MACs per clock cycle due to the processor's pipeline. However, the Pentium lacks both the memory bandwidth and an efficient parallel or pipelined MAC operation. Since floating-point operations cannot be performed in parallel with the integer instructions in the two pipelines, the index and circular-index operations must be executed separately. Therefore, the FIR filter requires O(4N) cycles on this processor.

4.2. The LMS Adaptive Filter

The implementation of the LMS adaptive filter on the TMS320C40 is similar to that of the FIR filter. An additional MAC operation is needed to update the filter coefficients, which balances the memory-access and computation instruction counts. Five operations are required within the loop to implement this system; however, two parallel pairs of operations can be used to reduce the loop length to O(3N) cycles for the N-coefficient adaptive filter.

On the PPC604, the speed with which the LMS algorithm can be implemented is limited by the bandwidth of the LSU. As in the FIR filter, we have unrolled the main loop to calculate two taps per iteration so that the processor's pipeline is efficiently used. Because of its highly parallel architecture, this processor also requires O(3N) operations to implement an LMS adaptive filter.

    LMS:    RPTBD   LOOP                ; setup the delayed repeat block
            MPYF3   *AR1,R4,R1          ; change in coeff
            STF     R5,*AR1++(1)%       ; replace oldest with newest data
            ADDF3   *AR0,R1,R2          ; w(n+1) = w(n) + change
            NOP
    BLOCK:  MPYF3   *AR1++(1)%,R4,R1    ; change in coeff
    ||      STF     R2,*AR0--(1)        ; save w(n+1)
            MPYF3   *AR1,R2,R0          ; x*w
    ||      ADDF3   *AR0,R1,R2          ; w(n+1) = w(n) + change
    LOOP:   ADDF3   R0,R3,R3            ; Acc = Acc + x*w
            BUD     R11                 ; delayed return
            MPYF3   *AR1,R2,R0          ; x*w
    ||      STF     R2,*AR0--(1)        ; save w(n+1)
            ADDF3   R0,R3,R3            ; Acc = Acc + x*w
            NOP
                                        ; 3 cycles/tap

Figure 3: The optimized implementation of the LMS adaptive filter on the TMS320C40.

As for the Pentium, its lack of a parallel multiply-add in its floating-point pipeline and the load and store operations necessary to use the FPU hamper the implementation of the LMS adaptive filter on this processor. Use of the FXCH instruction prevents the top of the floating-point stack from becoming a bottleneck in the implementation. Similar to the PPC604 implementation, the main loop must be unrolled three times, where each loop contains 22 instructions. Thus, the Pentium requires O(7.333N) cycles to compute one iteration of the LMS adaptive filter.

4.3. The FFT Algorithm

For these implementations, we have employed in-place computations to calculate the decimation-in-frequency FFT.
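Each stage of this FFT is built from radix-2 butterflies. As a point of reference (our C sketch, not the paper's code), a decimation-in-frequency butterfly costs one complex add, one complex subtract, and one complex twiddle multiply, and the multiply disappears whenever the twiddle factor is unity, which is the optimization discussed next:

    #include <complex.h>

    /* Radix-2 decimation-in-frequency butterfly (illustrative sketch;
     * dif_butterfly is our name). The sum leg needs no multiply; the
     * difference leg is scaled by the twiddle factor W, and when W = 1
     * the multiply can be skipped entirely. */
    static void dif_butterfly(double complex *a, double complex *b,
                              double complex W)
    {
        double complex t = *a - *b;    /* difference leg */
        *a += *b;                      /* sum leg: add only */
        *b = (W == 1.0) ? t : t * W;   /* unity twiddle: no multiply */
    }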
In this case, we have optimized the FFT implementations on the TMS320C40 and Pentium processors to make use of unity-valued twiddle factors W^k = e^(-j 2 pi k / N) wherever possible. This optimization is not used in the PPC604 implementation, as the butterfly operation is limited by the processor's memory bandwidth. For an N-point FFT, where N is a power of two, the numbers of instruction cycles required for the implementations are

    N_FFT,TMS = 4.5 N log2(N) + 6N + 10 log2(N) - 8
    N_FFT,PPC = 4 N log2(N) + 7N + 4 log2(N) + 1
    N_FFT,P5  = 11.5 N log2(N) - N + 2 log2(N) + 2

Figure 4 plots the actual time required by each processor to complete the FFT as a function of the number of stages, log2(N). For large values of N, the dominating term in each case is of the form K N log2(N), where N is the size of the FFT and K is a scaling constant.
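As a concrete check of these expressions (our arithmetic, not from the paper): for N = 1024, so that log2(N) = 10, the formulas give approximately 52,316 cycles for the TMS320C40, 48,169 for the PPC604, and 116,758 for the Pentium P5. At the instruction rates of Table 1 (33 MHz, 120 MHz, and 120 MHz, respectively), these counts correspond to roughly 1.6 ms, 0.40 ms, and 0.97 ms, so the PPC604 needs the fewest cycles outright, while the Pentium's higher instruction rate still lets it finish well ahead of the TMS320C40 in actual time.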

5. RESULTS AND CONCLUSION

Table 2 shows the cycle counts for the various algorithms on the different processors, where N is the length of the FIR filter, the length of the LMS adaptive filter, and the size of the FFT, respectively.

    Processor      FIR      LMS         FFT
    TMS320C40      O(N)     O(3N)       O(4.5N log2 N)
    PowerPC 604    O(2N)    O(3N)       O(4N log2 N)
    Pentium P5     O(4N)    O(7.333N)   O(11.5N log2 N)

Table 2: Complexity of the three algorithms on the TMS320C40, PowerPC 604, and Pentium P5 processors. The table shows the number of instruction cycles for each algorithm; only the order of the number of cycles is given, where N refers to the number of filter taps and the length of the FFT, respectively.

Figure 4: Execution time in microseconds required to calculate the FFT on the three processors, plotted against the number of stages, log2(N).

As can be seen, the Pentium processor does not perform as well as the other two processors on any of the algorithms because it does not have nearly the level of parallelism of the other two processors. Note that the PowerPC 604 processor is as efficient in instruction cycles as the TMS320C40 processor in implementing the LMS adaptive filter, and it is more efficient in implementing the FFT. Moreover, if we consider the instruction cycle times of the three processors, the Pentium and PowerPC processors outperform the TMS320C40 DSP in terms of actual computing time on all three tasks.

The results indicate that general-purpose microprocessors are viable candidates for DSP applications, and for audio-rate systems in particular. Our results suggest that standard processors in personal computers can be employed to perform digital signal processing tasks with efficiencies that are comparable to standard DSP chips. Such implementations could benefit from the flexible programming environment of the personal computer while still providing a platform for real-time signal processing. Our results also indicate that the delivery of an advanced signal processing solution would only require a software program to be installed on the computer, so long as the hardware met minimal configuration requirements.

Although the focus of this paper has been on the timing performance of representative processors, an applications engineer must also consider other issues such as chip or board cost, the richness of the software development tools, and the availability of ready-to-use software routines when choosing a computing platform. Numerous development tools for general-purpose microprocessors are available from a number of vendors, but there are relatively few sources of optimized DSP software libraries for even the most popular microprocessors. It is likely that the nature of the end application will strongly dictate the choice of computing platform.

New generations of processors and architectures for both DSP and general-purpose computing could very well change the outcome of our evaluation in the near future. For example, faster Pentium and PowerPC chips have been released since this work began. Also, new multimedia chip designs under development by a number of companies are likely to include very-long-instruction-word (VLIW) instructions, hardware threads to remove pipeline latency, and numerous parallel execution units. From these developments, it is clear that the applications engineer will have numerous choices of processing platform upon which to develop DSP products in the future.

REFERENCES

[1] S.M. Kuo and D.R. Morgan, Active Noise Control Systems: Algorithms and DSP Implementations (New York: Wiley-Interscience, 1996).
[2] E.A. Lee, "Programmable DSP architectures: Part I," IEEE Signal Processing Magazine, vol. 5, no. 4, pp. 4-19, October 1988.
[3] E.A. Lee, "Programmable DSP architectures: Part II," IEEE Signal Processing Magazine, vol. 6, no. 1, pp. 4-14, January 1989.
[4] M.R. Smith, "How RISCy is DSP?" IEEE Micro, vol. 12, no. 6, December 1992.
[5] P. Lapsley and G. Blalock, "How to estimate DSP processor performance," IEEE Spectrum, vol. 33, no. 7, July 1996.
[6] TMS320C4x User's Guide (Dallas, TX: Texas Instruments, Inc., 1991).
[7] PowerPC 604 RISC Microprocessor User's Manual (Phoenix, AZ: Motorola, Inc., 1994).
[8] Pentium Processor Family Developer's Manual, vols. 1 and 2 (Mt. Prospect, IL: Intel Corporation, 1995).
