Complexity-effective Enhancements to a RISC CPU Architecture


Jeff Scott, John Arends, Bill Moyer
Embedded Platform Systems, Motorola, Inc., West Parmer Lane, Building C, MD PL31, Austin, TX
{Jeff.Scott,John.Arends,Bill.Moyer}@motorola.com

Abstract

The M CORE(TM) RISC architecture has been developed to address the growing need for long battery life among today's embedded applications [4]. In this paper, we present several architectural enhancements to the M CORE M3 processor. Specifically, we discuss the burst mode memory enhancements, the instruction fetch enhancements, the selectable branch prediction implementation, and the improvements for software patching. These additions to the M CORE processor were carefully selected to increase performance at minimal cost and complexity, in order to meet the requirements of the portable, embedded marketplace.

1 Introduction

An increase in the number of portable computing devices requiring extended battery life has led to innovative architectural techniques that increase performance while minimizing both energy consumption and design complexity. Complexity-effective design techniques have proven to significantly reduce die size as well as enhance frequency of execution. In the M3 processor, the newest member of the M CORE family of RISC solutions, new techniques are directed toward these goals.

Due to the increased performance requirements of portable computing devices, memory system performance has been a focus for innovation. A commonly practiced technique of bursting sequential memory accesses is used to minimize memory access time. In addition, microprocessors have frequently adopted a Harvard architecture to increase instruction fetch bandwidth and reduce the performance penalty for simultaneous instruction and data memory requests.
The M3 processor achieves comparable performance through additional instruction buffering for pipelining load, store, and floating point operations, without requiring a Harvard architecture [5]. As a result, interfacing the M3 processor to burst mode memory devices involves providing newly defined memory interface signals to achieve enhanced performance.

Off-chip accesses in portable computing devices are expensive in terms of both performance degradation and energy consumption. Cost is also an important aspect of portable devices. Many cost-sensitive designs employ a single external memory device of a given type (SRAM, Flash, or ROM) in order to minimize size and cost, and these packages are often 16 bits in data width. Therefore, it is common for memory subsystems to implement a reduced bus width for off-chip/external accesses, while maintaining a wider internal bus width to performance-critical memories [3]. New techniques in the M3 processor offer the flexibility to maximize performance for both 16-bit external accesses and internal 32-bit memory accesses.

Increasingly longer RISC pipelines and instruction buffering dictate the use of branch prediction in order to minimize the performance degradation due to conditional branches [6]. Commonly used branch prediction techniques are expensive in terms of increased silicon die area and energy consumption. A new technique for selectable branch prediction will be introduced that has proven effective at optimizing memory access time while minimizing complexity overhead and power consumption.

Software developers often use software patching as a means of correcting already-installed ROM-based code. A similar technique is oftentimes used to modify data values dynamically. Most patching techniques require a context switch, or require the implementation to introduce logic into the processor's time-critical memory access path.
A new technique for both instruction and data patching has proven to be cost-effective in various embedded control applications.

The paper is organized as follows. Section 2 describes the burst mode enhancements, including the inefficiencies they address and their solutions. Section 3 describes the instruction fetch enhancements for 16-bit and 32-bit memory subsystems. Section 4 discusses selectable branch prediction, which optionally improves memory address setup time. Section 5 presents a software patching scheme for both instructions and data. Section 6 summarizes the paper.

2 Burst Mode Interface Enhancements

Many data processing systems include memories capable of burst mode operation. Burst mode memory devices
are capable of providing greater throughput and reduced latency compared to standard memory systems. Burst mode operation takes advantage of the fact that successive memory accesses are often to sequential addresses. After an initial latency for the first data item requested, subsequent burst mode accesses can be completed in fewer cycles than the initial access. Interruptions to a burst sequence by events such as change-of-flow instructions or interleaved instruction and data accesses cause the burst sequence to end, resulting in longer memory access latencies.

Traditionally, a processor designed to take advantage of burst mode memories asserted a single output signal signifying that a current or upcoming bus fetch was sequential to the previous bus fetch. Usually, to accommodate a wide variety of memory sub-systems, this signal was valid as early as possible in the bus cycle.

2.1 Burst Mode Inefficiencies Defined

Consider the example code for a 16-bit instruction set shown in Figure 1. This code will be used to illustrate how a typical system would assert a sequential address indicator (SEQ) to a memory sub-system.

1000 add  r1,r2     ; accumulate
1002 ld   r2,(2000) ; load
1004 sub  r1,r2     ; subtract
1006 cmp  r1,r2     ; compare
1008 beq  1030      ; branch if equal (taken)
100a mult r2,r3     ; multiply
  :
1030 add  r1,r2     ; accumulate
1032 cmp  r1,r2     ; compare
1034 beq  1050      ; branch if equal (not taken)
1036 and  r1,r2     ; and
1038 xor  r1,r2     ; exclusive or
103a not  r1        ; not

Figure 1: Example code for burst mode memory sequence

Figure 2 illustrates the memory access sequence for the code listed above, along with the SEQ signal value for each access.

Figure 2: Memory sequence for example code (table not reproducible from this transcription)

As can be seen in the figure, SEQ is negated for address 1038, even though it is indeed sequential to address 1036. This is due to the fact that SEQ is required to be valid as early in the clock cycle as possible to accommodate a wide range of memory systems. In order to meet this requirement, the accuracy of SEQ is sacrificed, causing SEQ to be negated unnecessarily in order to improve timing. SEQ is negated in this example because the processor encounters a branch instruction in the pipeline at address 1034. Since there is no branch prediction, the next address fetched (either 1038 or the branch target 1050) will depend on the resolution of a condition code. Typically this condition code is resolved late in the clock cycle as a result of the computation of the previous instruction (compare).

In addition, SEQ is negated for the data fetch to address 2000. This is necessary because the processor has no idea whether the data access is in the same memory as, or a separate memory from, the instruction space. If the data access is in a separate memory system from the instructions, then SEQ is once again unnecessarily negated at the cost of increased memory latency.

2.2 Burst Mode Solution

Our solution to these problems is to define a set of three signals that encompass all possible sequential scenarios. These signals, SEQ, ASEQ, and ISEQ, provide the system designer the greatest flexibility to maximize memory throughput without adding inordinate amounts of complexity. ASEQ stands for accurate SEQ, and asserts later in the clock cycle than SEQ. However, ASEQ is completely accurate, since it relies on the completion of branch resolution. For memory systems that can handle the later timing of ASEQ, a significant reduction in effective memory latency can be achieved. ISEQ stands for instruction SEQ.
This signal is used exclusively for instruction fetches and is a don't care for data fetches. It can be utilized by a memory subsystem dedicated to instruction memory, allowing interleaved data accesses to not interrupt a burstable sequence in the instruction memory system.

Figure 3 shows the same memory sequence as before, along with the newly defined signals. ASEQ is asserted for the fetch of address 1038, although it is not valid as early in the cycle as SEQ. Also, ISEQ is asserted for the instruction fetch of 1006, which follows the interleaved data access to 2000; this is because 1006 is a sequential instruction fetch with respect to 1004. Finally, ISEQ is asserted similarly to ASEQ for address 1038, at the cost of a later setup time for ISEQ compared to SEQ. The assertion of the sequential indicator signals allows the burst memory sequence to continue uninterrupted, thereby avoiding the penalty of a burst sequence re-start.
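The behavior of the three indicators can be sketched in software. The following Python model is an illustrative abstraction only: the access-trace format and the unresolved-branch flag are assumptions for demonstration, not the M3 pin-level protocol.

```python
def sequential_indicators(trace):
    """trace: list of (addr, is_insn, branch_unresolved) tuples, one per
    bus access, with 16-bit instructions (stride of 2 bytes).
    Returns a list of (SEQ, ASEQ, ISEQ) per access; None = don't care."""
    results = []
    prev_addr = None        # previous bus access of any kind
    prev_insn_addr = None   # previous *instruction* access only
    for addr, is_insn, branch_unresolved in trace:
        seq_actual = prev_addr is not None and addr == prev_addr + 2
        # SEQ must be valid early, so it is conservatively negated
        # whenever a pending conditional branch could redirect the fetch.
        seq = seq_actual and not branch_unresolved
        # ASEQ waits for branch resolution, so it is always accurate.
        aseq = seq_actual
        # ISEQ tracks only the instruction stream; don't care for data.
        if is_insn:
            iseq = prev_insn_addr is not None and addr == prev_insn_addr + 2
            prev_insn_addr = addr
        else:
            iseq = None
        prev_addr = addr
        results.append((seq, aseq, iseq))
    return results
```

On the example sequence, the model reproduces the two cases discussed above: ISEQ remains asserted for the instruction fetch that follows the interleaved data access, and ASEQ (but not SEQ) is asserted for the sequential fetch behind an unresolved branch.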

Figure 3: Memory sequence with ASEQ and ISEQ (table not reproducible from this transcription)

2.3 Burst Mode Interface Results

The addition of the two extra signals, ASEQ and ISEQ, results in a significant increase in performance at minimal cost. In embedded applications for the M CORE M3 processor, where change-of-flow instructions account for roughly 15% of the dynamic instruction stream and load/store instructions account for 20%, the addition of these signals has a significant impact on overall system performance, with a minimal impact on area. Powerstone benchmark analysis shows ASEQ signal assertion to be 40% greater than that of SEQ, resulting in a 40% decrease in burst sequence interruptions [4].

3 Instruction Fetch Enhancements

In many processors, pipeline throughput is improved by the addition of instruction buffers and wider paths to memory [5]. As buffers and wider datapaths are added, instruction fetch bandwidth is increased, allowing for more efficient pipeline utilization. If a CPU's instruction length is 16 bits, two instructions may be accessed each cycle from a 32-bit memory system. These instructions are stored in instruction buffers until needed, allowing for a surplus of instructions during periods of lower instruction memory throughput.

It has been shown that the addition of instruction buffers and the increase of instruction fetch size provide significant performance advantages. The 16-bit instruction set architecture M CORE M3 processor showed a 28% performance improvement from 32-bit memories with the addition of three instruction buffers and a doubling of the instruction fetch size [5]. However, these improvements are only seen when accessing a memory as wide as, or wider than, the instruction fetch size.
For a memory system narrower than the instruction fetch size, there is a performance penalty. This penalty is associated with the fetch of unused opcodes around change-of-flow instructions, because a narrower memory device is not capable of supplying a pair of instructions with the same latency as a single instruction. In order to gain performance in wider memory systems without degrading performance in cost-conscious narrower memory systems, a dynamic input signal was defined. The signal, IFSIZ, is asserted when 16-bit memory systems are accessed and negated when wider memory systems are accessed. This signal changes dynamically during program execution, depending on the memory width accessed, and is based simply on an address decoder. Upon a switch from one memory width to another, the processor detects the change in IFSIZ and dynamically changes the instruction fetch size at the next properly aligned boundary.

Figure 4 shows the transition from a 32-bit memory to a 16-bit memory for the M CORE M3 processor. The instruction fetch to address A0 uses a transfer size request (TSIZ) of 32 bits and returns instructions I0 and I1 the following cycle. When address A4 is fetched, an address decode indicates that the fetch is to a 16-bit memory, and IFSIZ is asserted accordingly. The processor recognizes this assertion and changes the instruction fetch size to 16 bits on the following access. Because the fetch to address A4 is to a 16-bit memory and the fetch size was still 32 bits, the processor must wait for the memory system to sequentially fetch both half-words (instructions I2 and I3) and return them to the processor. Once the initial fetch to the 16-bit memory system is complete, all subsequent fetches use a 16-bit TSIZ.

Figure 4: IFSIZ transition from 32-bit to 16-bit memory (timing diagram not reproducible from this transcription)

Figure 5 shows the case of a transition from 16-bit memory to 32-bit memory.
The instruction fetch to address A0 uses a TSIZ of 16 bits and returns instruction I0 the following cycle. When address A0 is fetched, an address decode indicates that the fetch is to a 32-bit memory, and IFSIZ is negated accordingly. The processor recognizes this negation and changes the instruction fetch size to 32 bits on the next properly aligned access. Because address A0 is a word-aligned address, the transfer size cannot be changed until both halves of the word have been fetched. Once the initial fetch (or fetches) to the 32-bit memory system is complete, all subsequent fetches use a 32-bit TSIZ, as shown in the figure.
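The fetch-size adjustment can be modeled abstractly. The Python sketch below is a simplification under stated assumptions: the address-decode predicate stands in for the system's decoder, and the decode result is assumed available before each access issues, whereas in the M3 the first access into a new region may still use the old transfer size.

```python
def fetch_sequence(start, count, is_16bit_region):
    """Return a list of (address, TSIZ) pairs for `count` sequential
    instruction fetches starting at `start`. Fetches narrow immediately
    in a 16-bit region, and widen to 32 bits only at properly aligned
    (word) boundaries, mirroring the behavior described above."""
    fetches = []
    addr = start
    for _ in range(count):
        if is_16bit_region(addr):
            tsiz = 16                        # IFSIZ asserted: 16-bit memory
        else:
            # IFSIZ negated: widen only once the address is word aligned
            tsiz = 32 if addr % 4 == 0 else 16
        fetches.append((addr, tsiz))
        addr += tsiz // 8                    # advance by 2 or 4 bytes
    return fetches
```

A region map such as `lambda a: a >= 0x100` (16-bit memory above address 0x100, an assumed layout) exercises both transitions.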

Figure 5: IFSIZ transition from 16-bit to 32-bit memory (timing diagram not reproducible from this transcription)

The addition of the IFSIZ input signal to the M CORE M3 processor requires a small state machine to detect the current memory region and the next memory region. This state machine generates control signals to properly assert the TSIZ output pin and to selectively add 2 or 4 to the previous address to generate the next address. Timing is not an issue, since at least a full cycle is available to make the transition, providing ample setup time. The addition of this flexibility in the system design resulted in an 11% performance improvement when tested using the Powerstone benchmark suite [4].

4 Selectable Branch Prediction

Branch prediction is one technique used to improve processor performance. In most instances, processors that predict the outcome of conditional branch instructions gain performance because they can make educated guesses and fetch the branch target instruction before the actual outcome of the branch is resolved. However, there is usually a performance penalty associated with mispredictions. Since branch instructions cause a change in the sequential instruction stream fetch pattern, an incorrect address speculation can result in lost processor cycles. These cycles are lost because, in the case of a misprediction, the incorrectly fetched instruction stream must be discarded and the correct instruction stream reloaded into the processor pipeline.

The M CORE M2 processor has an aggressive branch implementation in which a taken branch is performed in two clock cycles (assuming zero-wait-state memory) [2]. There is no branch prediction, meaning the correct address is driven only after resolution of the condition bit late in the decode cycle of the branch.
For the M CORE M3 processor, the addition of instruction buffers and the increase in the instruction fetch size reduced the penalty associated with a branch misprediction. Therefore, a simple branch prediction scheme was implemented in which all branches are predicted taken [5]. If this prediction turns out to be false, an ABORT signal is asserted to the memory system during the next cycle to abort the transfer. This results in a single bus cycle penalty for a misprediction. When executing code from a 32-bit memory system, this single bus cycle penalty does not affect overall performance in most cases because of the large surplus of sequential instructions available in the instruction buffers. The advantage of this scheme is the increase in setup time of the address bus to the memory system, allowing for a more cost-effective memory subsystem: branch resolution is no longer in the critical path of the next address calculation.

Figure 6 shows the case of a misprediction in the M3 processor. In the figure, BT is a branch-on-condition-true instruction.

Figure 6: Branch misprediction in the M3 CPU from 32-bit memory (timing diagram not reproducible from this transcription)

However, for 16-bit memory systems, this enhancement is costly. With the limited instruction fetch bandwidth, even a one-bus-cycle penalty will negatively affect overall processor performance (see Figure 7). The solution is to allow the system designer to selectively choose whether to use the branch prediction mechanism (and gain more memory access time) or not (and lose memory access time). Therefore, a signal called APRED (for address prediction) was added to the processor. This signal is tied during system integration and statically selects whether branch prediction is enabled.
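As a rough illustration of the trade-off that APRED selects, the toy model below counts extra bus cycles spent on conditional branches under each policy. The one-cycle penalty values are illustrative assumptions, not measured M3 numbers.

```python
def branch_penalty_cycles(branch_outcomes, apred):
    """branch_outcomes: list of booleans, True = branch actually taken.
    apred: True = predict-taken scheme enabled, False = no prediction.
    Returns total extra bus cycles attributable to conditional branches."""
    extra = 0
    for taken in branch_outcomes:
        if apred:
            # predict-taken: only a not-taken branch pays, via the ABORT
            # of the speculatively started target fetch
            extra += 0 if taken else 1
        else:
            # no prediction: every conditional branch waits for
            # condition-code resolution (assume one cycle each) before
            # the next address can be driven
            extra += 1
    return extra
```

Under these assumptions, predict-taken wins whenever most branches are taken, while a system with scarce fetch bandwidth may still prefer APRED negated to avoid wasting bus cycles on aborted fetches.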
Figure 7: Branch misprediction in the M3 CPU from 16-bit memory (timing diagram not reproducible from this transcription)

The addition of this signal turned out to be a very complexity-effective design decision. The chosen branch prediction methodology is very simple, yet improves address bus timing significantly. In fact, synthesis can be performed which completely eliminates all logic related to the mode that will not be used, saving area and eliminating timing false paths through the unused logic.

5 Software Patching Enhancements

Many systems today are Read-Only-Memory (ROM) based systems that require program code to be installed at the time silicon masks are created. Since the turnaround time from mask generation to silicon can be anywhere from three weeks to three months, it is desirable to have an effective means of modifying ROM code built into the hardware. There are also examples in certain automotive applications where tuning data values while a program is running is required. In both of these examples, a complexity-effective design is required in order to minimize silicon cost, design time, and software overhead, as well as to achieve a minimal performance penalty when software patching is needed.

5.1 Traditional ROM-based patching

Traditionally, ROM patching involves a set of programmable address or address-range comparators which cause an interrupt exception to occur on a match. Exceptions dictate that the processor be in a state where they can be handled. This requires the processor to alter the current execution context and then save and restore the processor state. Furthermore, during portions of code where exceptions are inhibited, no patching may occur if the patching mechanism is implemented via forcing an exception. Other traditional ROM patching schemes include opcode substitution, where a delay path is added to the time-critical data input path to force a substituted instruction such as an absolute jump. This method is used in the Motorola DSP56600 family [1]. Using this method, memory timing must be taken into account to avoid additional speed paths in the memory subsystem and memory control.

5.2 Software Patching Enhancements

The software patching implementation has several aspects. One aspect is the way in which program memory is defined.
Each portion of instructions or data to be patched corresponds to a patch code pointer, which is located in a patch pointer table as shown in Figure 8.

Figure 8: Program memory for patching (diagram not reproducible from this transcription; it shows program memory with the instructions to be patched, a patch pointer table of patch code pointers 0 through N-1 located by a patch base address and offset, and the patch code memory)

A patch pointer table includes patch code pointer 0 through patch code pointer N-1, where each patch code pointer may correspond to one instruction or group of instructions to be patched, or to substitute data values. Upon accessing an instruction address requiring a substitution, the processor utilizes the patch pointer table to locate the corresponding patch code. Patch code pointers provide a patch code address which redirects program flow to a patch code memory containing the actual patch code to be executed. At the end of the patch code, a flow redirection instruction may return flow back to program memory to continue normal execution of the code until the next patch pointer address is encountered. Likewise, in response to a data access requiring a substitution, the processor utilizes the patch pointer table to provide substitute data values.

The patch routines may each reside in separate memories or may even reside within program memory. Likewise, patch pointer tables may be stored within any memory and are defined to be user programmable. Each patch code pointer within the patch pointer table is referenced by a patch base address and a corresponding patch offset.

Another aspect of the software patching implementation is the hardware circuitry needed to implement the scheme, which is illustrated in Figure 9.
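The pointer-table lookup described above can be sketched as follows. The table layout (one 32-bit pointer per entry) and all addresses in the example are assumptions for illustration, not the actual M3 memory map.

```python
def patch_redirect(addr, comparators, memory, patch_base):
    """comparators: dict mapping patched addresses to their patch offsets
    (modeling the programmable address comparators and offset encoder).
    memory: dict modeling the patch pointer table (address -> word).
    Returns the patch code address for a matched access, otherwise the
    original address (no redirection)."""
    if addr not in comparators:
        return addr
    # combine the patch offset with the patch base address to locate
    # the patch code pointer within the patch pointer table
    entry_addr = patch_base + comparators[addr] * 4
    return memory[entry_addr]
```

For example, with a (hypothetical) table at 0x8000 whose entry 1 holds the patch code address 0x9040, an access to a patched address redirects there, while unpatched addresses pass through unchanged.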

Figure 9: Software Patching Implementation (block diagram not reproducible from this transcription; visible elements include address comparators with address masks, a patch base register, an offset request register and encoder, and a mux into the instruction register)

The hardware circuitry redirects program flow by generating an address comparison match when it identifies an address for which program execution should be redirected. The circuitry generates a control field having an offset specifically corresponding to the address. The instruction data is tagged with an identifying patch bit, which is recognized when the instruction is decoded by the data processor. The data processor receives the instruction but discards it prior to execution. The circuitry then creates a redirected address value by combining the patch offset with the patch base address, and implements redirection of program flow by utilizing the redirected address value.

The circuitry is also capable of redirecting a data access by generating an address comparison match when it identifies a data access for which a substitute data value should be provided. The circuitry generates a control field having the offset, and the data access is subsequently tagged with the data patch bit during the termination cycle of the data access. Once again, the processor discards the data prior to execution completion, and the circuitry creates a redirected address value by combining the patch offset with the patch base register. The data processor implements redirection of the data access by utilizing the redirected address value to access the substitute data value.

5.3 Software Patching Conclusion

In the software patching scheme, program flow redirection can be performed by utilizing a control field, without changing the current execution context through an exception and without providing the processor a substitute opcode such as an absolute jump.
This method may also be utilized for patching exception handlers, or any code that is executed in supervisor or user space.

6 Conclusion

The M CORE M3 architecture enhancements strike a delicate balance between the goals of increased performance and minimal cost and complexity. This balance is the foundation of the low-cost, low-power portable embedded marketplace. The improvements discussed in this paper resulted in significant performance increases without a substantial increase in cost or complexity.

7 References

[1] DSP56600 Digital Signal Processor Family Manual, Motorola Inc.
[2] M CORE Reference Manual, Motorola Inc.
[3] B. Moyer, J. Arends, "RISC Gets Small," Byte Magazine, February.
[4] J. Scott, L. Lee, J. Arends, B. Moyer, "Designing the Low-Power M CORE Architecture," Proc. Int'l. Symp. on Computer Architecture Power Driven Microarchitecture Workshop, Barcelona, Spain, July 1998.
[5] J. Scott, L. Lee, A. Chin, J. Arends, B. Moyer, "Designing the M CORE M3 CPU Architecture," Proc. IEEE Int'l. Conf. on Computer Design, Austin, Texas, October 1999.
[6] D. Patterson, J. Hennessy, Computer Architecture: A Quantitative Approach, 2nd ed., San Francisco: Morgan Kaufmann Publishers, Inc., 1996.

M CORE is a trademark of Motorola, Inc.


More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts Prof. Sherief Reda School of Engineering Brown University S. Reda EN2910A FALL'15 1 Classical concepts (prerequisite) 1. Instruction

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Cache Justification for Digital Signal Processors

Cache Justification for Digital Signal Processors Cache Justification for Digital Signal Processors by Michael J. Lee December 3, 1999 Cache Justification for Digital Signal Processors By Michael J. Lee Abstract Caches are commonly used on general-purpose

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

More information

Chapter 3 - Top Level View of Computer Function

Chapter 3 - Top Level View of Computer Function Chapter 3 - Top Level View of Computer Function Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Top Level View 1 / 127 Table of Contents I 1 Introduction 2 Computer Components

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A

More information

ELE 655 Microprocessor System Design

ELE 655 Microprocessor System Design ELE 655 Microprocessor System Design Section 2 Instruction Level Parallelism Class 1 Basic Pipeline Notes: Reg shows up two places but actually is the same register file Writes occur on the second half

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle? CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:

More information

15CS44: MICROPROCESSORS AND MICROCONTROLLERS. QUESTION BANK with SOLUTIONS MODULE-4

15CS44: MICROPROCESSORS AND MICROCONTROLLERS. QUESTION BANK with SOLUTIONS MODULE-4 15CS44: MICROPROCESSORS AND MICROCONTROLLERS QUESTION BANK with SOLUTIONS MODULE-4 1) Differentiate CISC and RISC architectures. 2) Explain the important design rules of RISC philosophy. The RISC philosophy

More information

V8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs

V8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs V8-uRISC 8-bit RISC Microprocessor February 8, 1998 Product Specification VAutomation, Inc. 20 Trafalgar Square Nashua, NH 03063 Phone: +1 603-882-2282 Fax: +1 603-882-1587 E-mail: sales@vautomation.com

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Wed. Aug 23 Announcements

Wed. Aug 23 Announcements Wed. Aug 23 Announcements Professor Office Hours 1:30 to 2:30 Wed/Fri EE 326A You should all be signed up for piazza Most labs done individually (if not called out in the doc) Make sure to register your

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Chapter 14 - Processor Structure and Function

Chapter 14 - Processor Structure and Function Chapter 14 - Processor Structure and Function Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 14 - Processor Structure and Function 1 / 94 Table of Contents I 1 Processor Organization

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

ECE550 PRACTICE Final

ECE550 PRACTICE Final ECE550 PRACTICE Final This is a full length practice midterm exam. If you want to take it at exam pace, give yourself 175 minutes to take the entire test. Just like the real exam, each question has a point

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK] Lecture 17 Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] SRAM / / Flash / RRAM / HDD SRAM / / Flash / RRAM/ HDD SRAM

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

EIE/ENE 334 Microprocessors

EIE/ENE 334 Microprocessors EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

More information

Micro-programmed Control Ch 15

Micro-programmed Control Ch 15 Micro-programmed Control Ch 15 Micro-instructions Micro-programmed Control Unit Sequencing Execution Characteristics 1 Hardwired Control (4) Complex Fast Difficult to design Difficult to modify Lots of

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

Machine Instructions vs. Micro-instructions. Micro-programmed Control Ch 15. Machine Instructions vs. Micro-instructions (2) Hardwired Control (4)

Machine Instructions vs. Micro-instructions. Micro-programmed Control Ch 15. Machine Instructions vs. Micro-instructions (2) Hardwired Control (4) Micro-programmed Control Ch 15 Micro-instructions Micro-programmed Control Unit Sequencing Execution Characteristics 1 Machine Instructions vs. Micro-instructions Memory execution unit CPU control memory

More information

Micro-programmed Control Ch 15

Micro-programmed Control Ch 15 Micro-programmed Control Ch 15 Micro-instructions Micro-programmed Control Unit Sequencing Execution Characteristics 1 Hardwired Control (4) Complex Fast Difficult to design Difficult to modify Lots of

More information

TMS320C5x Interrupt Response Time

TMS320C5x Interrupt Response Time TMS320 DSP DESIGNER S NOTEBOOK TMS320C5x Interrupt Response Time APPLICATION BRIEF: SPRA220 Jeff Beinart Digital Signal Processing Products Semiconductor Group Texas Instruments March 1993 IMPORTANT NOTICE

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

CS 101, Mock Computer Architecture

CS 101, Mock Computer Architecture CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically

More information

Hardwired Control (4) Micro-programmed Control Ch 17. Micro-programmed Control (3) Machine Instructions vs. Micro-instructions

Hardwired Control (4) Micro-programmed Control Ch 17. Micro-programmed Control (3) Machine Instructions vs. Micro-instructions Micro-programmed Control Ch 17 Micro-instructions Micro-programmed Control Unit Sequencing Execution Characteristics Course Summary Hardwired Control (4) Complex Fast Difficult to design Difficult to modify

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION

THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION Radu Balaban Computer Science student, Technical University of Cluj Napoca, Romania horizon3d@yahoo.com Horea Hopârtean Computer Science student,

More information

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

CSCE 5610: Computer Architecture

CSCE 5610: Computer Architecture HW #1 1.3, 1.5, 1.9, 1.12 Due: Sept 12, 2018 Review: Execution time of a program Arithmetic Average, Weighted Arithmetic Average Geometric Mean Benchmarks, kernels and synthetic benchmarks Computing CPI

More information

A First Look at Microprocessors

A First Look at Microprocessors A First Look at Microprocessors using the The General Prototype Computer (GPC) model Part 2 Can you identify an opcode to: Decrement the contents of R1, and store the result in R5? Invert the contents

More information

CN310 Microprocessor Systems Design

CN310 Microprocessor Systems Design CN310 Microprocessor Systems Design Micro Architecture Nawin Somyat Department of Electrical and Computer Engineering Thammasat University 28 August 2018 Outline Course Contents 1 Introduction 2 Simple

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

What is Pipelining? RISC remainder (our assumptions)

What is Pipelining? RISC remainder (our assumptions) What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA

More information

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions. MIPS Pipe Line 2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN ARM COMPUTER ORGANIZATION AND DESIGN Edition The Hardware/Software Interface Chapter 4 The Processor Modified and extended by R.J. Leduc - 2016 To understand this chapter, you will need to understand some

More information