Design methodology for programmable video signal processors

Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts
Princeton University, Department of Electrical Engineering
Engineering Quadrangle, Princeton, New Jersey

ABSTRACT

This paper presents a design methodology for a high-performance, programmable video signal processor (VSP). The proposed methodology explores both technology-driven hardware tradeoffs and application-driven architectural tradeoffs to optimize cost and performance within a class of processor architectures. In particular, it allows these competing factors to be considered concurrently at different levels of design sophistication, from early design exploration to full processor simulation. We present the results of applying this methodology to an aggressive Very-Long-Instruction-Word (VLIW) video signal processor design and discuss its utility for other programmable signal processor designs.

Keywords: processor design, video signal processing, multimedia architecture, VLIW, video compression

1. INTRODUCTION

The extraordinary growth of the multimedia industry has generated considerable demand for digital video in today's applications. General-purpose processors and traditional DSPs cannot efficiently support the intense computational demands of digital video, so video signal processors (VSPs) have been developed to meet this need. Dedicated VSPs are now available for compression as well as other specialized video applications. Dedicated VSPs, however, cannot keep pace with the rapid evolution of new and existing video applications. Consequently, the need for greater functionality, together with increasing cost and time-to-market pressures, will push the video industry toward programmable video signal processors. Programmable VSPs offer the necessary flexibility for supporting multimedia applications at lower cost and higher performance than workstation microprocessor architectures.

Unlike microprocessor architecture design, the design of programmable VSPs is an immature field. Typical microprocessor design uses existing processor designs, CAD tools, and benchmarks to provide a starting point in the design process. Existing processor designs provide an architectural reference point from which modifications can be made. Then, using the application characteristics defined by the benchmarks and simulating potential designs with the CAD tools, it is possible to readily evaluate the merits of various architectural tradeoffs. With programmable VSPs, this is not the case. There are no design tools, no benchmarks, and only a small number of existing VSPs, few of which are programmable. This is a very limited foundation from which to begin a new design. Approaching the problem of programmable VSP design, or the design of processors in any new architectural domain, therefore requires a different strategy. One such strategy is our proposed design methodology.

The most accurate method of architectural assessment involves circuit-level timing simulation of a full processor layout and cycle-level simulation of full applications based on optimized, compiled code. Unfortunately, developing such a simulation model is extremely expensive and can only be done for a tiny subset of the design space. We require a more practical design strategy that obtains results over a larger design space.
We have devised such a design methodology, one that incorporates early exploration of both technology-driven design parameters, such as circuit performance, and instruction-level behavior. Its primary advantage is that it allows these competing factors to be considered concurrently throughout the design process, from early exploration to full processor simulation. Section 2 describes the proposed methodology in detail. Section 3 presents an example of how it has been applied to the design of an aggressive Very-Long-Instruction-Word (VLIW) video signal processor. Finally, Section 4 draws conclusions about the effectiveness of the method.

2. DESIGN METHODOLOGY

Our design methodology rests on an early exploration phase that generates the largest possible design space in which to search for the best candidate architectures. Early exploration provides an approximate model of the final design on which more detailed evaluation can then be performed. The primary steps in the initial exploration phase are:

1. Choose an architectural paradigm.
2. Perform detailed, parameterizable, transistor-level design of key modules. Use the resulting area and performance data to define a unique design space, construct candidate architectures based on module area and performance costs, and estimate the cycle time.
3. Evaluate the candidate architectures by hand scheduling key VSP kernels onto each architecture using a variety of well-known compilation strategies.

The results of these experiments refine the design space and direct the implementation of the compiler. Within the reduced design space, more detailed simulation is then performed on a smaller set of designs using a wide array of applications, a prototype compiler, and a full architectural simulator driven by real application data. This iterative sequence of progressively narrower and more detailed evaluations eventually yields a final programmable video signal processor design.

2.1. Architectural paradigm

The first step in the design of the programmable VSP is the choice of the basic processor framework. A number of architectural paradigms are available, but we are primarily interested in architectures that are easily programmable from a high-level language (HLL). Potential choices include superscalar, VLIW, and multiprocessor DSP. Superscalar architectures are generally unsuitable because they are not optimized for the defining characteristics of signal processing: high data throughput and real-time constraints. VLIW and multiprocessor DSP are the common choices in existing programmable VSPs [1, 2, 3, 4].

For aggressive architectures with high degrees of parallelism and deep sub-micron manufacturing processes, multiprocessor DSP architectures are likely to be less suitable because of their large memory-bandwidth requirements. The basic multiply-accumulate architecture of a DSP requires two memory reads and one memory write per cycle, i.e., three memory ports per functional unit; for multiple parallel functional units this leads to poor speed and area. Multiprocessor DSP architectures do, however, benefit from a mature existing compiler technology, and many early programmable video signal processor designs are based on this framework for that reason [1].

VLIW architectures are a popular alternative because of the high degree of parallelism and the high clock rates made possible by static scheduling. Because a VLIW machine is a load/store architecture, it places a smaller demand on memory than the multiprocessor approach. The primary drawback of the VLIW architecture is that it lacks the existing technology base of current DSP architectures. For both VLIW and multiprocessor architectures, high degrees of parallelism and fast clock rates make area and speed a concern: performance degrades as the number of ports on memories and register files grows, and the cost of pipeline bypassing increases.
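To make the port-count argument concrete, consider a generic fixed-point FIR inner loop written in C. This is a hypothetical sketch for illustration only, not code taken from our kernel set:

/* Hypothetical FIR inner loop, shown only to illustrate where the
 * memory traffic of a multiply-accumulate datapath comes from: each
 * MAC consumes two operand loads (x[] and h[]), and results stream
 * back to memory through stores. Keeping such a datapath busy every
 * cycle is what leads to the two-reads-plus-one-write-per-cycle
 * (roughly three ports per functional unit) requirement noted above. */
void fir(short *y, const short *x, const short *h, int taps, int n)
{
    for (int i = 0; i < n; i++) {
        int acc = 0;
        for (int t = 0; t < taps; t++)
            acc += x[i + t] * h[t];   /* two loads + one multiply-accumulate */
        y[i] = (short)(acc >> 15);    /* one store per output (Q15 scaling assumed) */
    }
}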
However, much of this cost can be minimized by paying careful attention to VLSI design and, if necessary, by using a distributed architecture with a high-speed interconnect, as will be seen with our example VLIW VSP in the next section.

2.2. Architectural assessment

After choosing the basic processor framework, an analysis is needed to determine the potential performance of the processor. Performance is limited primarily by the tradeoff between speed and area: speed determines how fast the processor modules will run, and area determines the number and type of modules that can fit on the chip. To meet the intense performance requirements of video signal processing, the design must achieve high clock rates while keeping area reasonable so that more on-chip resources can be provided. These tradeoffs could, in principle, be determined by detailed VLSI simulation of the entire processor design.

Unfortunately, full processor simulation is not practical during the initial stages of design. A more reasonable approach is to perform detailed transistor-level design of only the key parameterizable modules. This is considerably less expensive, yet still provides sufficient information for early design exploration. For a programmable VSP, the key modules include the high-bandwidth interconnect, high-connectivity register files, and high-speed local memories. Other important modules may include speed-critical functional units such as multipliers and dividers. This detailed design is particularly important when implementing an aggressive architecture in an advanced manufacturing process.

Area and performance data from these designs define a unique design space for the processor in a given implementation process. This design space provides critical information about technology-driven hardware tradeoffs and their effects on area and speed. Given target speed and area constraints, these data can be used to determine the maximum size and number of ports for memories and register files, the length of the pipeline and degree of bypassing, the size of the interconnect network, the number, type, and arrangement of functional units, and numerous other aspects of the design. Based on these module cost and performance data, the system designer constructs a number of potential architectures. These candidate architectures are then used in an application assessment to determine their utility for typical programmable VSP applications. Later, an automated design exploration tool based on analytical performance models is used to suggest additional interesting candidates for evaluation.

2.3. Application assessment

Once candidate architectures are available, key VSP kernels are hand scheduled onto these architectures using a variety of compilation strategies. This serves a two-fold purpose. First, it provides early performance estimates for each candidate architecture and an early indication of any potential architectural bottlenecks. Second, having skilled system designers perform the scheduling by hand allows us to assess the effectiveness of known compilation strategies on the proposed architectures; scheduling with a compiler can conceal insights that hand-coding reveals. While hand scheduling does not produce optimal results, a first-generation compiler is likely to have many inefficiencies of its own and to be no more accurate than hand-coding.

Unfortunately, as noted above, there is no standard set of VSP benchmarks, so we must devise our own set of VSP applications for scheduling. It is also impractical to hand-code full applications across numerous candidate architectures and numerous compilation methods, particularly before an optimizing compiler is available. Instead, we take a more modest approach and schedule only key VSP kernels. While kernels do not exactly reflect the characteristics of full applications, kernels dominate signal processing code to an even greater degree than general-purpose code, so we can expect reasonably accurate results. The choice of kernels for our first design was dictated primarily by the current trend in video applications toward compression and decompression.
The primary routines in these functions are motion estimation, the discrete cosine transform (DCT), variable bit rate (VBR) coding, and color-space transformation. While there are numerous other video kernels besides these four, together they exhibit many of the characteristics of video applications and are thus believed to provide a reasonable kernel set for early exploration. The choice of kernels and applications can be revised later, during more detailed evaluation of the processor architecture.

The early performance results from these hand-coding experiments are used to guide the implementation of the compiler. Coupled with the detailed VLSI area and performance data, they also refine the design space and target a more limited range of designs. Within the new design space, more detailed simulation is performed with a wide range of applications, a prototype compiler, and a full architectural simulator driven by real application data in order to determine the final processor architecture.
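To give a flavor of the kernels involved, the following C sketch shows a full-search motion-estimation loop for a single 16x16 macroblock. It is a simplified, hypothetical version for exposition, not the exact code used in our scheduling experiments:

/* Simplified full-search motion estimation for one 16x16 macroblock.
 * For every candidate vector in a +/-R search window it accumulates a
 * sum of absolute differences (SAD) and keeps the best candidate. The
 * caller is assumed to pass pointers such that all referenced pixels
 * are in bounds. */
#include <limits.h>
#include <stdlib.h>

void full_search(const unsigned char *cur, const unsigned char *ref,
                 int stride, int R, int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    for (int dy = -R; dy <= R; dy++) {
        for (int dx = -R; dx <= R; dx++) {
            int sad = 0;
            for (int y = 0; y < 16; y++)
                for (int x = 0; x < 16; x++)
                    sad += abs(cur[y * stride + x] -
                               ref[(y + dy) * stride + (x + dx)]);
            if (sad < best) {          /* keep the best candidate vector */
                best = sad;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
}

The regular, data-parallel inner SAD loop, dominated by loads, subtractions, and absolute values rather than multiplies, is typical of why motion estimation stresses memory bandwidth far more than the multipliers.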

3. TARGET PROCESSOR

This methodology is being used to design a real processor. The design targets tens of operations per cycle at clock rates in excess of 500 MHz. To enable such high speeds and to facilitate a high degree of parallelism, we have chosen a VLIW architecture and a 0.25µ manufacturing process. VLIW architectures have only recently become feasible on a single chip, and many architectural structures are therefore possible for the first time [5], creating an even more challenging implementation problem. We show below how our design methodology deals with the resulting design issues.

3.1. Architectural assessment

As specified by the design methodology, detailed transistor-level simulation follows the adoption of the basic processor framework. Using a 0.25µ process, we have produced detailed designs of parameterizable versions of the key modules, including a high-bandwidth interconnect, a high-connectivity register file, and a high-speed local memory [6, 7]. In the case of the register file and local memory, a single global register file and a single global memory simply cannot feed the functional units at such high speeds; to achieve speeds on the order of 500 MHz, register files and memories must be quite limited in both size and number of ports. We therefore selected a distributed cluster architecture, similar to that of Labrousse et al. [8], in which each cluster has its own register file and local memory and a high-speed interconnect ties the clusters together.

Since a VLIW processor is statically scheduled, the burden of efficiently scheduling code falls on the compiler. Architectural tradeoffs can, however, be made to ease the compiler's job; one common improvement is to provide higher connectivity among units within the processor. In this case, it is therefore desirable to make the clusters as large as possible and to include full pipeline bypassing, while still staying within acceptable area and performance constraints. From the detailed module designs, we found that clusters of up to 4 issue slots were feasible, with register files of up to 256 registers and single-ported local memories of up to 32 KB, while still maintaining a clock rate of 650 MHz. The processor could support up to 8 such clusters, with all issue slots connected by a single-cycle 32x32 crossbar interconnect. If we are willing to sacrifice connectivity for higher speed, we can instead obtain an 850 MHz machine with 16 clusters of 2 issue slots each, local memories of up to 16 KB, and register files of 64 registers. In this case, however, only a 16x16 crossbar can be supported, so only one slot per cluster connects to the crossbar.

We did not perform detailed designs of the functional units, but external sources [9, 10] report the performance of similar functional units in a 0.25µ process. From these we anticipate supporting 4-stage pipelines with full-function ALUs, 16-bit shifters, 8x8 multipliers, and load/store units with simple addressing modes (direct or register-indirect). Moving to a 5-stage pipeline would allow more complex addressing modes (indexed and base-displacement) as well as a 16x16 multiplier, at the cost of single-cycle load-use and multiply-use hazards.
These combined data on potential cluster sizes and arrangements and on pipeline lengths and characteristics become the design space for this programmable VSP, illustrating how detailed transistor-level simulation of key hardware modules is invaluable in defining a processor's design space. We identified seven likely candidate architectures based on three criteria from the design space. Of the seven, four used the higher-connectivity 8-cluster/4-issue-slot model, while the other three used the 16-cluster/2-issue-slot model. Three of the architectures used 4-stage pipelines and the other four used 5-stage pipelines. Finally, five of the models used 8x8 multipliers and the other two used 16x16 multipliers. More details on the candidate architectures may be found in Wolfe et al. [11]. A design-space exploration tool is also currently being tested to find other important design-space criteria and identify additional likely candidates.
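The candidate-construction step can be viewed as a simple enumeration over the module-level parameters reported above. The C sketch below is purely illustrative: the seven candidates were hand-picked rather than taken from a full cross product, and the actual exploration tool uses analytical performance models rather than a raw peak-ops screen. The sizing and clock figures are the ones quoted in the text:

/* Toy enumeration of the design space described above: two cluster
 * organizations from the module designs, two pipeline depths, and two
 * multiplier widths. Peak ops/s is used only as a crude screen; this
 * is not the actual design-space exploration tool. */
#include <stdio.h>

struct cluster_cfg { int clusters, slots, regs, local_kb, clock_mhz; };

int main(void)
{
    const struct cluster_cfg org[2] = {
        {  8, 4, 256, 32, 650 },   /* high-connectivity organization */
        { 16, 2,  64, 16, 850 },   /* high-clock-rate organization   */
    };
    const int pipe_stages[2] = { 4, 5 };   /* 5 stages: indexed addressing, 16x16 multiplier */
    const int mult_bits[2]   = { 8, 16 };

    for (int o = 0; o < 2; o++)
        for (int p = 0; p < 2; p++)
            for (int m = 0; m < 2; m++) {
                /* per the text, the 16x16 multiplier requires the 5-stage pipeline */
                if (mult_bits[m] == 16 && pipe_stages[p] == 4)
                    continue;
                int slots = org[o].clusters * org[o].slots;
                printf("%2d clusters x %d slots, %d-stage pipe, %dx%d mult: %.1f Gops peak\n",
                       org[o].clusters, org[o].slots, pipe_stages[p],
                       mult_bits[m], mult_bits[m],
                       slots * org[o].clock_mhz / 1000.0);
            }
    return 0;
}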

3.2. Application assessment

The seven candidate architectures drawn from the design space were then programmed with six kernels from common compression standards (e.g., JPEG, MPEG): full-search motion estimation, three-step-search motion estimation, traditional 2-D discrete cosine transform (DCT), row-column DCT, variable bit rate (VBR) coding, and RGB to 4:2:0 YCrCb color-space transformation. Each of these kernels has different application characteristics, allowing us to examine different aspects of the candidate architectures. For example, motion estimation requires considerable memory bandwidth, the DCT places a heavy load on the multipliers, and VBR coding has long dependency chains and limited parallelism.

These kernels were hand scheduled using a variety of compiler strategies. Initially, each kernel was scheduled onto each architecture sequentially (i.e., using only one of the 32 available slots) with only traditional scalar optimizations. This provides a reference point for comparison with the more aggressive parallel compilation methods. Where feasible, a sequential implementation with one level of unrolling was also produced, providing a better reference for the primary parallel compilation method, software pipelining. Other compilation methods included SIMD scheduling across clusters, predication, list scheduling, increasing degrees of unrolling, blocking, and the addition of special operations.

The results showed that all the candidate architectures were relatively balanced: no model was particularly deficient in load/store units, multipliers, or shifters. However, for two of the three criteria on which the candidates were based, one model proved clearly superior. In the case of the multipliers, the 8x8 multiplier model was considerably slower than the 16x16 multiplier model. The other case was more surprising: the lower-connectivity 16-cluster/2-issue-slot model typically outperformed the 8-cluster/4-issue-slot model. This can be attributed to its additional resources (with one multiplier, load/store unit, and shifter per cluster, it has twice as many of these limited resources), so it was less likely to be limited by resources and more likely to be limited by the number of issue slots, and to its roughly 30% faster clock. However, because hand scheduling was involved, we believe the higher-connectivity model is likely to show equivalent performance once a compiler performs the scheduling. More detailed results of the kernel scheduling can be found in Wolfe et al. [11].

The results were also invaluable for evaluating the various compiler strategies. While software pipelining was the dominant parallel compilation method, list scheduling often performed nearly as well when unrolling was used, particularly with multiple levels of unrolling. The strategy that proved most valuable overall was unrolling. We often used an SIMD approach, performing the same operations on different data in different clusters, but it was also feasible to split the operations across multiple clusters to shorten the code. This allowed multiple levels of unrolling while still keeping the code small enough to fit within the instruction cache.
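To illustrate what the dominant unrolling strategy looks like at the source level, the C sketch below unrolls a luma color-transform loop by four so that the independent iteration bodies can be distributed across clusters in SIMD fashion. It is a hypothetical, source-level view only; the coefficients are a standard 8-bit luma approximation, and n is assumed to be a multiple of four:

/* Original loop: one luma sample per iteration. */
void rgb_to_y(short *y, const short *r, const short *g, const short *b, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = (77 * r[i] + 150 * g[i] + 29 * b[i]) >> 8;
}

/* Unrolled by four: the four bodies are independent, so an SIMD-across-
 * clusters schedule can place each body in a different cluster and
 * software-pipeline the result. (Assumes n is a multiple of 4.) */
void rgb_to_y_unrolled(short *y, const short *r, const short *g,
                       const short *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        y[i]     = (77 * r[i]     + 150 * g[i]     + 29 * b[i])     >> 8;
        y[i + 1] = (77 * r[i + 1] + 150 * g[i + 1] + 29 * b[i + 1]) >> 8;
        y[i + 2] = (77 * r[i + 2] + 150 * g[i + 2] + 29 * b[i + 2]) >> 8;
        y[i + 3] = (77 * r[i + 3] + 150 * g[i + 3] + 29 * b[i + 3]) >> 8;
    }
}

Higher degrees of unrolling expose more parallelism at the cost of instruction-cache footprint, which is why code size had to be watched as noted above.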
These observations will prove useful in designing the compiler. With the detailed evaluations provided by the kernel results, we are now beginning to refine the design space and prepare for its more detailed evaluation. A design exploration tool is being designed to look for further candidate architectures, design of the compiler is under way, an RTL simulator is being built to allow more accurate evaluation of programs, and more extensive research is being done to assemble a more complete set of video applications. Eventually we intend to have sufficient design tools to evaluate our architectures completely using compiled code from full production applications.

4. CONCLUSION

While the proposed design methodology offers no guarantee of finding the optimal programmable video signal processor design, even with the limited foundation from which we began we were able to arrive at a reasonably successful design. Based on our results, by using processor resources efficiently we can sustain performance exceeding 15 GOPs for long periods. Even full-search motion estimation, generally considered the most time-consuming routine in video compression, can run in real time using only 33%-46% of the compute time at CCIR-601 resolution (720x480). We have therefore achieved a processor design that not only offers performance comparable to or better than today's best dedicated VSPs, but also provides the valuable benefit of programmability.
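As a rough, back-of-envelope consistency check on these figures (our own calculation from the Section 3.1 parameters, not a measured result): 16 clusters x 2 issue slots x 0.85 GHz gives about 27 Gops peak, and 8 clusters x 4 issue slots x 0.65 GHz gives about 21 Gops peak, so sustaining 15 GOPs corresponds to roughly 55%-72% utilization of the issue slots.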

Perhaps the most valuable aspect of this design methodology is that it is not specific to programmable video signal processors. By first choosing an architectural paradigm, then performing detailed simulation of key modules, and finally hand-scheduling key application kernels onto the candidate architectures identified within the resulting design space, it should be possible to generate a design for any type of signal processor, or potentially even a general-purpose processor.

5. REFERENCES

1. R. Gove, "The MVP: A Highly-Integrated Video Compression Chip," in Proc. of the Data Compression Conf., 1994.
2. C. Hansen, "MicroUnity's MediaProcessor Architecture," IEEE Micro, Aug. 1996.
3. P. Foley, "The Mpact Media Processor Redefines the Multimedia PC," Proc. of COMPCON '96, 1996.
4. S. Rathnam and G. Slavenburg, "An Architectural Overview of the Programmable Multimedia Processor, TM-1," Proc. of COMPCON '96, 1996.
5. J. Gray, A. Naylor, A. Abnous, and N. Bagherzadeh, "VIPER: A 25-MHz 100-MIPS Peak VLIW Microprocessor," Proc. of 1993 IEEE Custom Integrated Circuits Conf., 1993.
6. S. Dutta, "VLSI Issues for Video Signal Processing," Ph.D. Thesis, Princeton University.
7. S. Dutta, A. Wolfe, W. Wolf, and K. O'Connor, "Design Issues for a Very-Long-Instruction-Word VLSI Video Signal Processor," in VLSI Signal Processing IX, Oct. 1996.
8. J. Labrousse and G. Slavenburg, "A 50 MHz Microprocessor with a VLIW Architecture," Proc. Int. Solid-State Circuits Conf., San Francisco, 1990.
9. N. Ohkubo, et al., "A 4.4ns CMOS 54x54-b Multiplier Using Pass-Transistor Multiplexer," Proc. of 1994 IEEE Custom Integrated Circuits Conf., 1994.
10. M. Suzuki, et al., "A 1.5ns, 32b CMOS ALU in Double Pass-Transistor Logic," Int. Solid-State Circuits Conf., 1993.
11. A. Wolfe, J. Fritts, S. Dutta, and E. S. Fernandes, "Datapath Design for a VLIW Video Signal Processor," to appear in Third Int. Symp. on High-Performance Computer Architecture, Jan. 1997.
