The PowerPC™ 603 Microprocessor: A High Performance, Low Power, Superscalar RISC Microprocessor


Brad Burgess†, Mike Alexander, Ying-wai Ho, Suzanne Plummer Litch†, Soummya Mallick‡, Deene Ogden†, Sung-Ho Park‡, Jeff Slaton†
†Motorola, Inc., ‡International Business Machines Corporation
Somerset Design Center, 9737 Great Hills Trail, Austin TX

Abstract

The PowerPC™ 603 microprocessor is the second member of the PowerPC microprocessor family. The 603* is a superscalar implementation featuring low power operation of less than 3 watts while maintaining high performance of 75 SPECint92 (estimated) at 80 MHz. The 7.4 mm by 11.5 mm design is implemented in 0.5 µm, four level metal CMOS technology. The 603 features dual 8-Kbyte instruction and data caches and a 32/64-bit system bus. Peak instruction rates of 3 instructions per clock cycle give outstanding performance to notebook and portable applications.

* In this document, the terms PowerPC 603 Microprocessor and 603 are used to denote the second microprocessor from the PowerPC Architecture family. IBM, PowerPC, PowerPC Architecture, PowerPC 603 and PowerPC 601 are trademarks of International Business Machines Corporation. Motorola is a trademark of Motorola, Incorporated. Apple is a trademark of Apple Computer Corporation. SPEC, SPECint92 and SPECfp92 are trademarks of Standard Performance Evaluation Corporation. © 1994 IEEE.

Introduction

The 603 is the second design in the PowerPC series of microprocessors. The 603 is a 32-bit implementation and uses a 3.3 V, four level metal, 0.5 µm CMOS technology. It offers a peak instruction rate of 3 instructions per clock cycle with power dissipation under 3 watts at 80 MHz. The 1.6 million transistor, 85 mm² design was done at the Somerset design center and used the talents of engineers from Apple, IBM, and Motorola. The project took 18 months from the beginning of the microarchitecture definition to first tape out, and initial silicon was able to boot operating systems.

Instruction flow overview

A block diagram of the 603 is shown in Figure 1. Up to two instructions per clock cycle are fetched from the instruction cache and sent to both the instruction buffer and the branch unit. The branch unit decodes and executes branch instructions and deletes them from the instruction buffer (branch folding). The instruction buffer supplies instructions to a primary instruction decoder which can dispatch up to two instructions per clock cycle, in program order, to any of four function units: Fixed-point Unit, System Unit (SYSU), Load/Store Unit (LSU), and Floating-point Unit (FPU). Operands from the General Purpose Register (GPR) file and/or Floating Point Register (FPR) file are read during the dispatch cycle.

Figure 1: 603 Block Diagram

Each function unit has a reservation station so that instructions can be dispatched regardless of whether all operands are available. When an operand becomes available after instruction dispatch, it is forwarded to the unit and the instruction is executed. Register renaming alleviates anti-dependency delays and provides support for both precise exceptions and speculative execution. Once dispatched, instructions are allowed to execute out of order (in order within each function unit). A completion buffer mechanism is used to track execution and to control retirement of up to two instructions per clock cycle in program order. Including branch folding, the 603 can effectively retire three instructions in a given clock cycle.

Dispatch / completion

The primary instruction decoder decodes two instructions per clock cycle from the lowest two entries of the 6-entry instruction buffer FIFO. As part of instruction decode, the source operands are read from the register files or forwarded from rename registers. Destination rename registers are also allocated as part of dispatching the instruction. Instructions are dispatched to a function unit and also to the completion buffer. The five-entry completion buffer tracks the status of all dispatched instructions. As the function unit finishes execution of the instruction, the result is written to a rename register. The completion buffer tags the instruction as finished and the function unit is free to accept another instruction. Up to two finished instructions per clock cycle are retired in program order by the completion buffer, which controls the writing of rename registers to the architected registers. Several conditions can prevent instruction dispatch, such as function unit busy, completion buffer full, or rename registers unavailable.

The branch unit

As shown in Figure 2, the 603's branch unit decodes and executes branches independently from the main instruction decoder. Instructions are fetched two per clock cycle from the instruction cache and are forwarded to both the instruction buffer and the branch unit. (If the address is to the last word of a cache line, only a single instruction is returned.)

Figure 2: Branch Unit Block Diagram

The branch unit also decodes the opcode space for modifications to branch related resources such as the Count Register (ctr), Link Register (lr), or Condition Register field (cr-field).
This information is maintained with instructions through the sequencer pipeline. When a branch instruction is decoded, the pipeline is sequentially searched to determine whether the resources needed to execute the branch are available. If the needed link or count register is unavailable, the branch unit and instruction prefetching stall. If a needed cr-field is unavailable, the branch condition is saved. The branch is then speculatively executed along either the taken or fall-through path based upon static prediction. The address for the non-predicted path is saved and subsequent instructions are tagged as speculative. During succeeding clock cycles, the cr-bit condition is checked again until it is available. If the cr-bit matches the predicted value, speculative tags are cleared. Otherwise, the tagged instructions are flushed and fetching is initiated down the correct path.

When a branch instruction that does not modify the link or count registers is decoded, the instruction is deleted from the instruction buffer. Subsequent instructions are advanced in the instruction buffer, collapsing the bubble and recovering the branch's dispatch slot for other instructions. To support the precise exception model, branches that modify the link or count registers proceed through the pipeline so that the resource can be architecturally committed in program order.

In each clock cycle, the branch unit decodes two instructions, searches branch conditions, detects misprediction, checks exception processing, determines the number of words being returned by the Icache, and then generates a fetch address. The link and count registers are both shadowed in order to support the precise exception model. Data forwarding from the system unit is provided for move-to-link and move-to-count instructions. For subroutine returns, the link-shadow may also be forwarded. The branch unit also keeps a pipeline of addresses corresponding to each instruction in the machine.
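The branch folding rule described in this section can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the 603's logic: the `Instr` record, its field names, and the function `fold_branches` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    # Hypothetical instruction record used only for this sketch.
    name: str
    is_branch: bool = False
    updates_lr_or_ctr: bool = False  # branch forms that write the link or count register


def fold_branches(buffer):
    """Return the instruction buffer with foldable branches removed.

    A branch that does not modify the link or count register is deleted and
    later entries move up, recovering its dispatch slot. A branch that does
    modify one stays in the buffer so the update can commit in program order.
    """
    return [i for i in buffer if not (i.is_branch and not i.updates_lr_or_ctr)]


buf = [
    Instr("add"),
    Instr("b", is_branch=True),
    Instr("lwz"),
    Instr("bl", is_branch=True, updates_lr_or_ctr=True),
]
folded = fold_branches(buf)
print([i.name for i in folded])  # ['add', 'lwz', 'bl']
```

The model ignores timing and speculation tags; it only shows which entries leave the buffer and which remain.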
When an exception occurs, the branch unit's address pipeline makes the address of the faulting instruction readily available.

Fixed-point unit

All instructions executed by the fixed-point unit are handled by one of three major blocks: the Shift/Rotate block, the Add/Compare block and the Multiply/Divide block, as shown in Figure 3. Most fixed-point operations are single-cycle, but multi-cycle operations block dispatch of further fixed-point instructions. Arithmetic, logical, compare, and shift/rotate instructions are executed in a single cycle. The operands are latched into a reservation station during decode and are executed the following clock cycle, making these operations fully pipelined. The reservation station receives operands from either the GPRs or the GPR rename registers. The results of all fixed-point operations are written back to the rename registers.

Figure 3: Fixed-point Unit Block Diagram

The multiply/divide and the trap instructions are multi-cycle operations. The 603 provides a 32- by 8-bit, single-cycle multiply, but other operand widths take from one to four clock cycles. Integer divides take 37 clock cycles. The hardware multiplier is a two-stage, 4-2 carry-save adder with Booth recoding logic. Radix 4 modified Booth recoding is used to halve the number of partial products from 8 to 4.

Load/store unit

The LSU executes all load/store instructions, cache control operations, TLB control operations, and graphics load/store operations. The unit consists of all the resources necessary to calculate the effective address, handle data alignment to and from the cache, handle multi-access instructions such as strings and multiples, and perform normalization and denormalization of floating-point store data. The LSU also prioritizes data exceptions and updates the Data Address Register (DAR) and the Data Storage Interrupt Status Register (DSISR).

Since load instructions often provide data to subsequent instructions, load latency is more critical than store latency to the overall system performance. The 603 LSU is optimized to handle loads in a low-latency and fully pipelined manner. Loads, when free of dependencies, execute speculatively with a maximum throughput of one per clock cycle and a two clock cycle latency. Stores are held in the LSU until the completion buffer logic signals that the store can be committed. Stores take two clock cycles to execute but are not pipelined.

When an instruction is dispatched to the LSU, the two operands are latched into the reservation station as shown in Figure 4.
The operands can be taken directly from the register file, can be an immediate value embedded in the opcode, or can be taken from one of the five rename registers. In the cycle after both operands become valid, the LSU adds the two operands to generate the 32-bit effective address (EA). The EA is sent to the cache and MMU to calculate the physical address, determine if an exception occurred, and determine if the EA hits or misses in the cache. During this cycle, the LSU also checks the EA for misalignment and checks for certain exceptions that can be detected by the LSU, such as a misaligned load/store multiple instruction or a disabled graphics load/store instruction.

Misaligned addresses are broken into two cache accesses by the LSU. The Misalign register in Figure 4 is also used for other multi-access instructions such as load/store multiple and load/store string. The LSU gathers and aligns the two data transfers during a misaligned access between the register files and the cache.

The LSU contains two store queues: the Finished Store Queue (FSQ) and the Committed Store Queue (CSQ). Store EAs that have been looked up in the MMU are latched in the FSQ. Once the completion buffer logic signals that the store can be completed, its EA is latched into the CSQ and sent to the cache as soon as the cache is free. Load requests to the cache are allowed to bypass store requests as long as there are no hazards such as address aliasing or misalignment. If a load is not permitted to bypass a store, its EA is held in the Hold latch.

Figure 4: Load/Store Address Path
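The address generation and misalignment split described above can be sketched as follows. This is a rough model, not the LSU's actual logic. The function names are hypothetical, and the 8-byte `boundary` is an assumption: the text says misaligned addresses are broken into two cache accesses but does not state which boundary triggers the split.

```python
MASK32 = 0xFFFFFFFF


def effective_address(a, b):
    """Add the two latched operands and wrap to a 32-bit effective address."""
    return (a + b) & MASK32


def split_access(ea, size, boundary=8):
    """Return the (address, length) cache accesses needed for one load/store.

    `boundary` is an assumed split granularity (a doubleword), not a value
    taken from the paper. `size` is assumed to be at most `boundary` bytes,
    so an access is split into at most two pieces.
    """
    first_len = min(size, boundary - (ea % boundary))
    accesses = [(ea, first_len)]
    if first_len < size:
        accesses.append(((ea + first_len) & MASK32, size - first_len))
    return accesses


ea = effective_address(0x00001000, 6)
print(split_access(ea, 4))  # [(4102, 2), (4104, 2)]
print(split_access(0x1000, 4))  # [(4096, 4)]
```

An aligned access yields a single entry, while an access that crosses the assumed boundary yields two, matching the two data transfers the LSU gathers and aligns.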

There are two 64-bit unidirectional data buses between the LSU and the cache: one for load data to the LSU and one for store data from the LSU. The LSU handles multiplexing of misaligned data within a word and the cache handles word misalignment within a double word.

Single- and double-precision floating-point loads and stores are handled by the LSU. Floating-point values are stored in a specially tagged double-precision format which is invisible to the programmer. Some stores of denormalized numbers will require normalization or denormalization by the LSU. The LSU handles these infrequent cases one bit at a time, with a worst case latency of 23 clock cycles.

Floating-point unit

The FPU supports IEEE-754 standard single- and double-precision binary floating-point arithmetic operations. Most single-precision multiply-add instructions have a one clock cycle throughput and a three clock cycle latency, and most double-precision multiply-add instructions have a two clock cycle throughput and a four clock cycle latency. Since the most common use of a floating-point unit is for dot-product calculation, the FPU architecture is optimized to perform a multiply and an add operation in one floating-point instruction as shown below:

FRT = FRA * FRC + FRB    (1)

where FRT is the target operand. FRA, FRB and FRC are source operands. Each of the four operands can come from any one of the 32 FPRs. The floating-point move, add, subtract and multiply instructions are implemented from this multiply-add architecture by forcing FRC to the constant 1.0 or FRB to the constant 0.0. The FPU has hardware to support floating-point division, floating-point to integer conversion, and pre-normalization and post-denormalization operations for denormalized operands.

The FPU, as shown in Figure 5, is comprised of three independent pipeline stages: the Mult stage, the Carry Propagate Add (CPA) stage and the Write Back (WB) stage. Each stage generally requires only one clock cycle to execute.
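Equation (1) and the way simpler operations are derived from it can be illustrated with a short Python sketch. The function names are hypothetical, and Python floats stand in for FPR contents. The single rounding of a hardware multiply-add and sign-of-zero corner cases are not modelled.

```python
def fmadd(fra, frc, frb):
    """FRT = FRA * FRC + FRB, as in equation (1)."""
    return fra * frc + frb


def fadd(fra, frb):
    """Add: force FRC to the constant 1.0."""
    return fmadd(fra, 1.0, frb)


def fsub(fra, frb):
    """Subtract: force FRC to 1.0 and negate FRB (the negation is an assumption)."""
    return fmadd(fra, 1.0, -frb)


def fmul(fra, frc):
    """Multiply: force FRB to the constant 0.0."""
    return fmadd(fra, frc, 0.0)


def dot(xs, ys):
    """Dot product as a chain of multiply-adds, one per element pair."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fmadd(x, y, acc)
    return acc


print(fadd(1.5, 2.25), fsub(5.0, 2.0), fmul(3.0, 4.0))  # 3.75 3.0 12.0
print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

The dot product loop shows why the multiply-add form suits the workload the paper names: each element pair costs one instruction instead of a separate multiply and add.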
The FPU can accept floating-point instructions from the primary instruction decoder at the peak rate of one instruction per clock cycle. The FPU has read/write access to the FPRs and the four floating-point rename registers, which are used to provide temporary storage for the target operand. These rename registers also allow fast result forwarding for the next instruction and can eliminate some storage dependency problems. The LSU also uses the same rename registers.

The Mult stage performs the main multiply function of (FRA * FRC) using a 53- by 28-bit Booth recoding Wallace tree multiplier array to generate the accumulated partial product in the sum and carry format. Concurrently, the mantissa of FRB is right-shifted to align with the product of (FRA * FRC). Finally, the aligned FRB and the product (FRA * FRC) are compressed by a 3 to 2 carry save adder. To minimize area, the multiplier array is implemented with half of a full 53- by 53-bit array using three levels of 4 to 2 carry save adders. Double-precision multiply operations use the Mult stage for an extra clock cycle.

Figure 5: FPU Block Diagram

The second stage, CPA, performs a 161-bit one's complement add to generate the mantissa result of (FRA * FRC + FRB). Parts of the 161-bit adder are implemented with incrementers to save area. The single 88-bit fast carry propagate adder is implemented using a combination of carry look ahead and carry select adder schemes. The output of the adder is always in sign magnitude format and is sent to the Leading Zeros Detector for normalization shift count calculation.

The third stage, WB, performs the normalization left shift, rounding and status generation operations. The normalization shifter is implemented with a 63-bit left shifter. With the exception of mass cancellation in the mantissa calculation, the WB stage typically requires one clock cycle to execute. The rounder can round the result to either the single- or double-precision position according to the desired rounding mode. The final result is written to one of the four FPR rename registers and/or forwarded back to the Mult stage for the next instruction execution.

The bypass unit handles abnormal execution (i.e., abnormal operands such as NaN and infinity, and abnormal operations such as divide by zero). A default result bypasses the Mult and CPA stages to simplify the data path logic.

Figure 6 and Figure 7 illustrate the execute timing of three floating-point instructions. Instruction 1 and Instruction 2 have no operand dependency, so they can be tightly coupled. Instruction 3 is waiting for the result of Instruction 2 and starts executing immediately after the WB stage of Instruction 2.

Figure 6: Single-Precision Multiply Timing

Figure 7: Double-Precision Multiply Timing

The FPU employs a 2-bit non-restoring division algorithm which produces two correct mantissa bits per clock cycle. A normal single-precision divide requires 18 cycles to execute. The FPU also provides IEEE exception handlers and a non-IEEE mode which helps provide deterministic run times by forcing any IEEE denormalized results to zero. This eliminates the data dependent pre-normalization and post-denormalization operations.

System unit

The 603 system unit executes all of the move to/from special purpose register instructions and all condition register logical instructions. In general, SYSU instructions are dispatched but not executed until the instruction is ready to retire.

Caches

The dual 8-Kbyte instruction and data caches are two-way set-associative and use the Least-Recently-Used (LRU) algorithm for choosing replacement cache lines. Cache loads operate on 32-byte lines and are performed in four 8-byte (doubleword) transfers. During cache line fills, access to the cache is blocked.
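The cache geometry stated above fixes how an address is split into tag, set index and byte offset. The following sketch works through that arithmetic. The constant and function names are hypothetical, and the split assumes a conventional set-associative layout with a 32-bit address, which the paper does not spell out.

```python
CACHE_BYTES = 8 * 1024  # each cache is 8 Kbytes
WAYS = 2                # two-way set-associative
LINE_BYTES = 32         # 32-byte lines
TRANSFER_BYTES = 8      # line fills use 8-byte (doubleword) transfers

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
TRANSFERS_PER_LINE = LINE_BYTES // TRANSFER_BYTES


def decompose(addr):
    """Split a 32-bit address into (tag, set index, byte offset within the line)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset


print(SETS, TRANSFERS_PER_LINE)  # 128 4
print([hex(x) for x in decompose(0x12345678)])  # ['0x12345', '0x33', '0x18']
```

With 128 sets and 32-byte lines, the offset and index together occupy the low 12 bits of the address, leaving a 20-bit tag.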
As line fill data moves into the cache, the critical word is forwarded to minimize stalls due to memory latency. All cache tag entries are automatically invalidated at power-on. They can also be invalidated under software control through bits in a control register. Both caches may also be locked. While a cache is locked, none of the lines in the cache will be replaced due to misses. The locking function is provided to allow real time loops to have a fixed access latency. Accesses which miss in a locked cache are treated as cache inhibited transactions.

The 603 data cache maintains a fully coherent address space and is intended for use in single processor systems where coherency with DMA traffic is required. It is not intended for symmetric multiprocessing applications where significant data sharing occurs. Bus snooping is used to maintain a three state cache coherency protocol: valid clean, valid dirty and invalid. This coherency protocol is a compatible subset of the standard MESI protocol. For compatibility with the SHARED state of the MESI protocol, the 603 interprets all incoming snoops as if they are writes, preventing any other cache from containing the same data as a 603. For the same reason, all line fills by the 603 are marked as write misses (read with intent-to-modify).

The 603 also implements a type of bus transaction termed read with no-intent-to-cache, which is treated as a non-cacheable burst read or write. The 603 snoops this transaction, copies any modified data back to memory, and leaves the cache line valid and clean. The purpose of this transaction type is to decrease data thrashing by allowing non-caching devices (such as DMA controllers) to transfer data across the bus without invalidating a 603 cache entry. The granularity of snooping is one line (32 bytes). The data cache tags are single-ported, so a simultaneous load/store and snoop access represents a resource collision.
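The snoop behavior of the three state protocol can be summarized as a small state function. This is a sketch under stated assumptions: the state names follow the text, but the function `snoop` and its interface are hypothetical, and an ordinary snoop hit is modelled as a flush (copy back if dirty, then invalidate) because the 603 treats incoming snoops as writes.

```python
INVALID, CLEAN, DIRTY = "invalid", "valid clean", "valid dirty"


def snoop(state, no_intent_to_cache=False):
    """Return (next_state, copyback_needed) for a snoop that hits a line.

    Ordinary snoops are treated as writes, so a valid line is invalidated.
    A read with no-intent-to-cache copies modified data back but leaves the
    line valid and clean.
    """
    if state == INVALID:
        return INVALID, False
    copyback = state == DIRTY
    if no_intent_to_cache:
        return CLEAN, copyback
    return INVALID, copyback


for s in (INVALID, CLEAN, DIRTY):
    print(s, "->", snoop(s), snoop(s, no_intent_to_cache=True))
```

The second column of the output shows the benefit named in the text: a non-caching device such as a DMA controller can read a line without evicting it from the 603's cache.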
When a load or store collides with a snoop in the data cache tags, the snoop access has highest priority and is given first access to the tags. The stalled load or store proceeds on the clock cycle following the snoop. The 603 implements a one line write-back buffer and a one line snoop copyback buffer. These buffers minimize the latency of the blocking cache protocol.

As part of the PowerPC Architecture™, a number of specialized cache control instructions are supported by the 603 to facilitate software control of cache contents. These instructions provide the ability to preload data, invalidate a line, flush a line and allocate/zero-fill a line in the data cache, and to flush a line in the instruction cache. Only the data cache block zero-fill is broadcast to the external bus.

Memory management

The 603 implements a virtual memory management mechanism which provides a 32-bit physical address space and a 52-bit virtual address space. The 32-bit effective address space is extended to the 52-bit virtual address through the use of segment registers. The segment registers allow the programmer to partition the effective address space into sixteen 256-Mbyte blocks which are then mapped anywhere within a 4 petabyte (2^52 byte) address space. The virtual address is then converted to a physical address through the Translation Lookaside Buffer (TLB). The effective address may also be translated to a physical address using Block Address Translation (BAT).

Both the instruction and the data MMUs incorporate a 64-entry, two-way set associative TLB array. These TLB arrays are used to cache entries from the memory-resident page translation tables. The page size is 4 Kbytes and replacement TLB entries are selected using an LRU algorithm. The tlbie (TLB-invalidate-entry) instruction allows the operating system to invalidate individual TLB entries to speed up context switching.

Each MMU also incorporates a four-entry BAT array. This CAM-based address translator allows the user to create four large memory partitions. The size of these partitions is software programmable from 128 Kbytes to 256 Mbytes. BAT translation takes priority over page translation.

The 603 uses software tablewalks to perform TLB reloads. Hardware assistance is provided to help reduce software tablewalk latencies. When a miss exception is taken, support logic generates the address pointers to the page table and provides a preconstructed copy of the first word of the desired page table entry. This reduces the main tablewalk routine to approximately sixteen instructions (two line fills).
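The segment register step of the translation described in this section can be sketched as follows. The bit widths follow from the figures in the text (sixteen 256-Mbyte segments, 4-Kbyte pages, 52-bit virtual addresses). The function names and the `segment_ids` table are hypothetical, and the 24-bit segment identifier width is derived here as 52 minus 28 rather than quoted from the paper.

```python
PAGE_BITS = 12                       # 4-Kbyte pages
SEGMENT_BITS = 28                    # 256-Mbyte segments
VIRTUAL_BITS = 52
SEGMENT_ID_BITS = VIRTUAL_BITS - SEGMENT_BITS  # 24 bits supplied by a segment register


def virtual_address(ea, segment_ids):
    """Extend a 32-bit effective address to a 52-bit virtual address.

    The top 4 bits of the EA select one of sixteen segment registers; the
    selected identifier replaces those bits.
    """
    sr = ea >> SEGMENT_BITS
    seg_id = segment_ids[sr] & ((1 << SEGMENT_ID_BITS) - 1)
    return (seg_id << SEGMENT_BITS) | (ea & ((1 << SEGMENT_BITS) - 1))


def split_page(va):
    """Split a virtual address into (virtual page number, byte offset)."""
    return va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)


segment_ids = [0] * 16
segment_ids[3] = 0xABCDEF            # made-up identifier for illustration
va = virtual_address(0x30012345, segment_ids)
print(hex(va))                        # 0xabcdef0012345
print([hex(x) for x in split_page(va)])  # ['0xabcdef0012', '0x345']
```

The virtual page number produced here is what a TLB lookup or a software tablewalk would then translate to a physical page; that step is omitted.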
Four shadow registers and an automatic copy of a condition code register also eliminate the overhead associated with context save/restores.

Bus interface unit

The 603 features a non-multiplexed 32-bit address and 64-bit data bus interface that is capable of supporting a wide range of system implementations. The bus interface, shown in Figure 8, is based on the PowerPC 60x Microprocessor Interface Specification. It features independent address and data bus control which allows for advanced bus features including address pipelining and split bus transactions. The bus interface also supports bus snooping, transaction retry, and snoop copyback operations in order to provide full copyback cache coherency support in hardware. These features allow for system implementations ranging from traditional non-pipelined bus interfaces that may or may not require hardware snoop support, to pipelined interfaces that minimize memory access latency, to advanced pipelined and cache coherent bus interfaces.

Figure 8: External Signals

The bus interface also supports a 32-bit data bus mode which is configurable at start-up. In addition, the bus interface can be configured to operate at a lower (integer divisible) clock frequency with respect to the internal processing units of the chip, thereby allowing the bus to operate at one frequency while the internal units operate at 1X, 2X, 3X, or 4X the bus frequency. The bus can operate at frequencies from 16 to 66 MHz. These features allow low cost system implementations which use affordable interface frequencies or a smaller data bus size, while still taking advantage of the higher internal processing rates of the 603 core.

The data bus supports single-beat data transfers for non-cacheable and write-thru cache operations, and 4-beat (burst) data transfers for cache line operations to and from the on-chip caches.
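The clock ratio and burst figures above reduce to simple arithmetic, sketched below. The helper names are hypothetical. The sketch only combines numbers given in the text (1X to 4X ratios, a 16 to 66 MHz bus range, 32-byte lines on a 64-bit data bus) and makes no claim about which configurations were actually shipped.

```python
LINE_BYTES = 32                      # cache line size
BUS_BYTES = 8                        # 64-bit data bus
BUS_MIN_MHZ, BUS_MAX_MHZ = 16, 66    # bus frequency range from the text
RATIOS = (1, 2, 3, 4)                # core-to-bus clock multipliers


def beats_per_line():
    """Number of data beats needed to move one cache line over the 64-bit bus."""
    return LINE_BYTES // BUS_BYTES


def bus_options(core_mhz):
    """Map each clock ratio to its bus frequency, keeping only in-range ones."""
    options = {}
    for ratio in RATIOS:
        bus_mhz = core_mhz / ratio
        if BUS_MIN_MHZ <= bus_mhz <= BUS_MAX_MHZ:
            options[ratio] = bus_mhz
    return options


print(beats_per_line())       # 4
print(sorted(bus_options(80)))  # [2, 3, 4]
```

For an 80 MHz core, the 1X ratio falls outside the stated bus range, so only the 2X, 3X and 4X ratios remain in this model.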
The bus interface also supports address and data bus parity, address retry capability, normal and error data transfer termination, and a data retry capability for late error correction by the memory system. To eliminate bus request latency, the bus interface also supports bus parking and pipelined bus granting.

The separation of the address and data bus controls allows operations such as address pipelining, which helps maximize data transfer throughput. The bus interface can pipeline one or more additional addresses on the bus while the current data tenure is running. These pipelined transactions may be used to service the on-chip branch unit, load/store unit, or caches. In addition, any number of addresses may be pipelined on the bus as a result of multiple bus masters, depending on the system implementation. Figure 9 illustrates the flow of address pipelined bus transactions.

Figure 9: Address Bus Pipelining

Address pipelining also allows split transactions to be performed by the 603. Address tenure for a split transaction is terminated before data tenure begins. Using split transactions, the bus interface also allows limited out-of-order transfer capability with respect to the order of data tenures, and allows for enveloped transactions. Typically, an enveloped transaction is used to cause the processor to perform a complete snoop copyback transaction on a pipelined bus between the address and data tenures of a pending read transaction. This prevents certain deadlock scenarios that could occur when interfacing through a gateway to a second bus that cannot be retried.

The bus snooping, transaction retry, and snoop copyback capabilities of the bus interface allow the 603 to operate its on-chip data cache in write-thru or copyback mode and still maintain coherency with DMA devices or other caching masters on the bus. Whether or not snooping of a transaction is performed is determined by the global address attribute, which is transferred with the address on the bus. This permits global and local sections of the address map to be defined and prevents unnecessary bus snooping. The 603 always fetches instructions as non-global since hardware coherency support is not required for the instructions.

The bus snooping protocol allows non-caching, caching 3-state (MEI) and caching 4-state (MESI) bus masters. When the 603 performs a global read, it requires other masters to flush the referenced line. Likewise, when another master performs a global read, the 603 will snoop its data cache and flush any referenced line. The bus interface also provides cache-inhibit, write-thru, and cache-way attributes with the bus address to support a second level cache.

Development and production support

The 603 Common On-chip Processor (COP) supports debug and test features on the 603. The COP uses the IEEE JTAG standard protocol to communicate with an external system debugger or an In-Circuit Emulation (ICE) system. The COP supports the three JTAG public instructions: Bypass, Sample/Preload and Extest for boundary scan testing.
In addition, 37 COP instructions are provided for debug purposes. The COP debug support uses the serial scan chains, in which all internal latches can be connected and scanned as one long data register. Before the internal state can be scanned out, the processor must be stopped through a special soft stop or hard stop exception which is initiated by a single step, instruction address breakpoint, branch trace, or an external trigger. A soft stop exception freezes the processor after allowing it to reach a recoverable state. Internal state may have advanced, but the processor can resume normal operation after scanning. A hard stop freezes the processor immediately and the processor state may not be recoverable. In addition to these mechanisms, a 16-bit counter can be used to stop the processor at a particular clock cycle. The content of embedded arrays can also be scanned out. The COP has the ability to read and write external system memory to examine external memory contents and to download instructions or data.

The COP also supports production testing through scan. It controls the Array Built in Self Test (ABIST) which is used to test the caches and tags. During the ABIST test sequence, all four arrays undergo an exhaustive read-write-read test sequence to detect stuck-at faults, speed sensitivities, and failures due to capacitive coupling.

Power management

During idle periods, software may initiate one of three power down modes implemented in the 603. In these modes, instruction processing is suspended and major portions of the processor are disabled to reduce power consumption. In Doze mode, the timebase and decrementer remain active and data cache coherency is maintained through snooping. All other activity in the processor is stopped. In Nap mode, only the timebase and the decrementer are active, and in Sleep mode all functionality of the processor is stopped except for the clock PLL, which can further be externally disabled.
An external interrupt or decrementer interrupt (disabled in Sleep mode) is used to bring the processor out of these modes. The 603 also incorporates logic that, on a clock-by-clock basis, powers down idle sections of the processor. This is done in a manner that reduces the average power consumption, but does not impact performance [1].

Performance

The 603's performance at 80 MHz with a 1-Mbyte L2 cache is estimated to be 75 SPECint92 and 85 SPECfp92. This performance was estimated by running sampled traces on an architectural simulator with code generated by a PowerPC compiler [2].

Conclusion

The 603 is a high performance, low cost, RISC microprocessor. These features coupled with low power consumption make the 603 an ideal solution for the needs of the laptop and entry level desktop markets.

References

[1] Gary, S., et al., "The PowerPC™ 603 Microprocessor: A Low-Power Design for Portable Applications," Proceedings of COMPCON 1994, February 1994.
[2] Poursepanj, A., et al., "The PowerPC™ 603 Microprocessor: Performance Analysis and Design Trade-offs," Proceedings of COMPCON 1994, February 1994.

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information
