The PowerPC™ 603 Microprocessor: A High Performance, Low Power, Superscalar RISC Microprocessor

Brad Burgess†, Mike Alexander, Ying-wai Ho, Suzanne Plummer Litch†, Soummya Mallick‡, Deene Ogden†, Sung-Ho Park‡, Jeff Slaton†
†Motorola, Inc., ‡International Business Machines Corporation
Somerset Design Center, 9737 Great Hills Trail, Austin TX

Abstract

The PowerPC 603™ microprocessor is the second member of the PowerPC microprocessor family. The 603* is a superscalar implementation featuring low power operation of less than 3 watts while maintaining high performance of 75 SPECint92 (estimated) at 80 MHz. The 7.4 mm by 11.5 mm design is implemented in 0.5 µm, four level metal CMOS technology. The 603 features dual 8-Kbyte instruction and data caches and a 32/64-bit system bus. Peak instruction rates of 3 instructions per clock cycle give outstanding performance to notebook and portable applications.

*In this document, the terms PowerPC 603 Microprocessor and 603 are used to denote the second microprocessor from the PowerPC Architecture family. IBM, PowerPC, PowerPC Architecture, PowerPC 603 and PowerPC 601 are trademarks of International Business Machines Corporation. Motorola is a trademark of Motorola, Incorporated. Apple is a trademark of Apple Computer Corporation. SPEC, SPECint92 and SPECfp92 are trademarks of Standard Performance Evaluation Corporation.

© 1994 IEEE

Introduction

The 603 is the second design in the PowerPC series of microprocessors. The 603 is a 32-bit implementation and uses a 3.3 V, four level metal, 0.5 µm CMOS technology. It offers a peak instruction rate of 3 instructions per clock cycle with power dissipation under 3 watts at 80 MHz. The 1.6 million transistor, 85 mm² design was done at the Somerset design center and used the talents of engineers from Apple, IBM, and Motorola. The project took 18 months from the beginning of the microarchitecture definition to first tape out, and initial silicon was able to boot operating systems.

Instruction flow overview

A block diagram of the 603 is shown in Figure 1. Up to two instructions per clock cycle are fetched from the instruction cache and sent to both the instruction buffer and the branch unit. The branch unit decodes and executes branch instructions and deletes them from the instruction buffer (branch folding). The instruction buffer supplies instructions to a primary instruction decoder which can dispatch up to two instructions per clock cycle, in program order, to any of four function units: Fixed-point Unit, System Unit (SYSU), Load/Store Unit (LSU), and Floating-point Unit (FPU). Operands from the General Purpose Register (GPR) file and/or Floating Point Register (FPR) file are read during the dispatch cycle.

[Figure 1: 603 Block Diagram]

Each function unit has a reservation station so that instructions can be dispatched regardless of whether all operands are available. When an operand becomes available after instruction dispatch, it is forwarded to the unit and the instruction is executed. Register renaming alleviates anti-dependency delays and provides support for both precise exceptions and speculative execution. Once dispatched, instructions are allowed to execute out of order (in order within each function unit). A completion buffer mechanism is used to track execution and to control retirement of up to two instructions per clock cycle in program order. Including branch folding, the 603 can effectively retire three instructions in a given clock cycle.

Dispatch / completion

The primary instruction decoder decodes two instructions per clock cycle from the lowest two entries of the 6-entry instruction buffer FIFO. As part of instruction decode, the source operands are read from the register files or forwarded from rename registers. Destination rename registers are also allocated as part of dispatching the instruction. Instructions are dispatched to a function unit and also to the completion buffer.

The five-entry completion buffer tracks the status of all dispatched instructions. As the function unit finishes execution of the instruction, the result is written to a rename register. The completion buffer tags the instruction as finished and the function unit is free to accept another instruction. Up to two finished instructions per clock cycle are retired in program order by the completion buffer, which controls the writing of rename registers to the architected registers. Several conditions can prevent instruction dispatch, such as a busy function unit, a full completion buffer, or unavailable rename registers.

The branch unit

As shown in Figure 2, the 603's branch unit decodes and executes branches independently from the main instruction decoder. Instructions are fetched two per clock cycle from the instruction cache and are forwarded to both the instruction buffer and the branch unit. (If the address is to the last word of a cache line, only a single instruction is returned.) The branch unit also decodes the opcode space for modifications to branch related resources such as the Count Register (CTR), Link Register (LR), or Condition Register field (CR-field).

[Figure 2: Branch Unit Block Diagram]
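The fetch rule described above (two instructions per clock cycle, but only one when the fetch address is the last word of a cache line) can be illustrated with a minimal sketch. It assumes the 32-byte line size given in the Caches section and 4-byte instruction words; the function name is hypothetical and the model ignores cache misses and stalls.

```python
LINE_BYTES = 32   # cache line size, taken from the Caches section
INSTR_BYTES = 4   # assumption: every instruction is one 32-bit word


def instructions_fetched(fetch_addr: int) -> int:
    """Return how many instructions one fetch cycle delivers (1 or 2).

    Hypothetical helper: models only the end-of-line rule, not misses.
    """
    if fetch_addr % INSTR_BYTES != 0:
        raise ValueError("instruction addresses must be word aligned")
    offset_in_line = fetch_addr % LINE_BYTES
    last_word_offset = LINE_BYTES - INSTR_BYTES
    # A fetch that starts at the last word cannot supply a second
    # instruction from the same line, so only one is returned.
    return 1 if offset_in_line == last_word_offset else 2
```

For example, a fetch at 0x1018 returns two instructions, while a fetch at 0x101C (the last word of its line) returns one.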
This information is maintained with instructions through the sequencer pipeline. When a branch instruction is decoded, the pipeline is sequentially searched to determine whether the resources needed to execute the branch are available. If the needed link or count register is unavailable, the branch unit and instruction prefetching stall. If a needed CR-field is unavailable, the branch condition is saved. The branch is then speculatively executed along either the taken or fall-through path based upon static prediction. The address for the nonpredicted path is saved and subsequent instructions are tagged as speculative. During succeeding clock cycles, the CR-bit condition is checked again until it becomes available. If the CR-bit matches the predicted value, speculative tags are cleared. Otherwise, the tagged instructions are flushed and fetching is initiated down the correct path.

When a branch instruction that does not modify the link or count registers is decoded, the instruction is deleted from the instruction buffer. Subsequent instructions are advanced in the instruction buffer, collapsing the bubble and recovering the branch's dispatch slot for other instructions. To support the precise exception model, branches that modify the link or count registers proceed through the pipeline so that the resource can be architecturally committed in program order.

In each clock cycle, the branch unit decodes two instructions, searches branch conditions, detects misprediction, checks exception processing, determines the number of words being returned by the instruction cache, and then generates a fetch address.

The link and count registers are both shadowed in order to support the precise exception model. Data forwarding from the system unit is provided for move-to-link and move-to-count instructions. For subroutine returns, the link-shadow may also be forwarded. The branch unit also keeps a pipeline of addresses corresponding to each instruction in the machine.
When an exception occurs, the address of the faulting instruction is readily available.

Fixed-point unit

All instructions executed by the fixed-point unit are handled by one of three major blocks: the Shift/Rotate block, the Add/Compare block and the Multiply/Divide block, as shown in Figure 3. Most fixed-point operations are single-cycle, but multi-cycle operations block dispatch of further fixed-point instructions.

[Figure 3: Fixed-point Unit Block Diagram]

Arithmetic, logical, compare, and shift/rotate instructions are executed in a single cycle. The operands are latched into a reservation station during decode and are executed the following clock cycle, making these operations fully pipelined. The reservation station receives operands from either the GPRs or the GPR rename registers. The results of all fixed-point operations are written back to the rename registers.

The multiply/divide and the trap instructions are multi-cycle operations. The 603 provides a 32- by 8-bit, single-cycle multiply, but other operand widths take from one to four clock cycles. Integer divides take 37 clock cycles. The hardware multiplier is a two-stage, 4-to-2 carry-save adder with Booth recoding logic. Radix-4 modified Booth recoding is used to halve the number of partial products from 8 to 4.

Load/store unit

The LSU executes all load/store instructions, cache control operations, TLB control operations, and graphics load/store operations. The unit consists of all the resources necessary to calculate the effective address, handle data alignment to and from the cache, handle multi-access instructions such as strings and multiples, and perform normalization and denormalization of floating-point store data. The LSU also prioritizes data exceptions and updates the Data Address Register (DAR) and the Data Storage Interrupt Status Register (DSISR).

Since load instructions often provide data to subsequent instructions, load latency is more critical than store latency to the overall system performance. The 603 LSU is optimized to handle loads in a low-latency and fully pipelined manner. Loads, when free of dependencies, execute speculatively with a maximum throughput of one per clock cycle and a two clock cycle latency. Stores are held in the LSU until the completion buffer logic signals that the store can be committed. Stores take two clock cycles to execute but are not pipelined.

When an instruction is dispatched to the LSU, the two operands are latched into the reservation station as shown in Figure 4.
The operands can be taken directly from the register file, can be an immediate value embedded in the opcode, or can be taken from one of the five rename registers. In the cycle after both operands become valid, the LSU adds the two operands to generate the 32-bit effective address (EA). The EA is sent to the cache and MMU to calculate the physical address, determine if an exception occurred, and determine if the EA hits or misses in the cache. During this cycle, the LSU also checks the EA for misalignment and checks for certain exceptions that can be detected by the LSU, such as a misaligned load/store multiple instruction or a disabled graphics load/store instruction.

Misaligned addresses are broken into two cache accesses by the LSU. The Misalign register in Figure 4 is also used for other multi-access instructions such as load/store multiple and load/store string. The LSU gathers and aligns the two data transfers during a misaligned access between the register files and the cache.

The LSU contains two store queues: the Finished Store Queue (FSQ) and the Committed Store Queue (CSQ). Store EAs that have been looked up in the MMU are latched in the FSQ. Once the completion buffer logic signals that the store can be completed, its EA is latched into the CSQ and sent to the cache as soon as the cache is free. Load requests to the cache are allowed to bypass store requests as long as there are no hazards such as address aliasing or misalignment. If a load is not permitted to bypass a store, its EA is held in the Hold latch.

[Figure 4: Load/Store Address Path]
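The address generation and misalignment handling described above can be sketched as follows. This is an illustration under stated assumptions rather than the 603's actual logic: it assumes an access is split when it crosses an 8-byte doubleword boundary, and the function names are hypothetical.

```python
from typing import List, Tuple

DOUBLEWORD = 8  # bytes; assumed span of a single cache access


def effective_address(op_a: int, op_b: int) -> int:
    """Add the two latched operands and wrap to a 32-bit EA."""
    return (op_a + op_b) & 0xFFFFFFFF


def cache_accesses(ea: int, size: int) -> List[Tuple[int, int]]:
    """Split an access of `size` bytes at `ea` into (address, length) pieces.

    Assumption: an access that crosses a doubleword boundary is broken
    into two cache accesses; anything else needs only one.
    """
    if not 1 <= size <= DOUBLEWORD:
        raise ValueError("size must be between 1 and 8 bytes")
    first_len = min(size, DOUBLEWORD - ea % DOUBLEWORD)
    pieces = [(ea, first_len)]
    if first_len < size:
        # Second access starts at the next doubleword boundary.
        pieces.append(((ea + first_len) & 0xFFFFFFFF, size - first_len))
    return pieces
```

Under this assumption a 4-byte load at 0x1006 becomes two accesses of 2 bytes each, at 0x1006 and 0x1008, while the same load at 0x1000 needs a single access.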

There are two 64-bit unidirectional data buses between the LSU and the cache: one for load data to the LSU and one for store data from the LSU. The LSU handles multiplexing of misaligned data within a word and the cache handles word misalignment within a double word.

Single- and double-precision floating-point loads and stores are handled by the LSU. Floating-point values are stored in a specially tagged double-precision format which is invisible to the programmer. Some stores of denormalized numbers will require normalization or denormalization by the LSU. The LSU handles these infrequent cases one bit at a time, with a worst case latency of 23 clock cycles.

Floating-point unit

The FPU supports IEEE-754 standard single- and double-precision binary floating-point arithmetic operations. Most single-precision multiply-add instructions have a one clock cycle throughput and a three clock cycle latency, and most double-precision multiply-add instructions have a two clock cycle throughput and a four clock cycle latency.

Since the most common use of a floating-point unit is for dot-product calculation, the FPU architecture is optimized to perform a multiply and an add operation in one floating-point instruction:

    FRT = FRA * FRC + FRB    (1)

where FRT is the target operand and FRA, FRB and FRC are source operands. Each of the four operands can come from any one of the 32 FPRs. The floating-point move, add, subtract and multiply instructions are implemented from this multiply-add architecture by forcing FRC to the constant 1.0 or FRB to the constant 0.0. The FPU has hardware to support floating-point division, floating-point to integer conversion, and pre-normalization and post-denormalization operations for denormalized operands.

The FPU, as shown in Figure 5, is comprised of three independent pipeline stages: the Mult stage, the Carry Propagate Add (CPA) stage and the Write Back (WB) stage. Each stage generally requires only one clock cycle to execute.
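As an illustration of how add and multiply fall out of equation (1) when one operand is forced to a constant, consider the sketch below. It is a software analogy only: Python evaluates `fra * frc + frb` with two roundings, whereas a hardware multiply-add datapath need not round the intermediate product. The helper names are hypothetical, and expressing subtract by negating FRB is an assumption that the text does not spell out.

```python
def fmadd(fra: float, frc: float, frb: float) -> float:
    """Equation (1): FRT = FRA * FRC + FRB (unfused in this sketch)."""
    return fra * frc + frb


def fadd(fra: float, frb: float) -> float:
    # Add: force FRC to the constant 1.0.
    return fmadd(fra, 1.0, frb)


def fsub(fra: float, frb: float) -> float:
    # Subtract: assumed here to be an add with FRB negated.
    return fmadd(fra, 1.0, -frb)


def fmul(fra: float, frc: float) -> float:
    # Multiply: force FRB to the constant 0.0.
    return fmadd(fra, frc, 0.0)


def dot(xs, ys) -> float:
    """Dot product as a chain of multiply-adds, one per element pair."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fmadd(x, y, acc)
    return acc
```

The `dot` helper shows why a single multiply-add instruction suits dot products: each element pair costs one instruction instead of a separate multiply and add.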
The FPU can accept floating-point instructions from the primary instruction decoder at the peak rate of one instruction per clock cycle. The FPU has read/write access to the FPRs and the four floating-point rename registers, which are used to provide temporary storage for the target operand. These rename registers also allow fast result forwarding for the next instruction and can eliminate some storage dependency problems. The LSU also uses the same rename registers.

[Figure 5: FPU Block Diagram]

The Mult stage performs the main multiply function of (FRA * FRC) using a 53- by 28-bit Booth recoding Wallace tree multiplier array to generate the accumulated partial product in sum and carry format. Concurrently, the mantissa of FRB is right-shifted to align with the product of (FRA * FRC). Finally, the aligned FRB and the product (FRA * FRC) are compressed by a 3-to-2 carry save adder. To minimize area, the multiplier array is implemented with half of a full 53- by 53-bit array using three levels of 4-to-2 carry save adders. Double-precision multiply operations use the Mult stage for an extra clock cycle.

The second stage, CPA, performs a 161-bit one's complement add to generate the mantissa result of (FRA * FRC + FRB). Parts of the 161-bit adder are implemented with incrementers to save area. The single 88-bit fast carry propagate adder is implemented using a combination of carry look ahead and carry select adder schemes. The output of the adder is always in sign magnitude format and is sent to the Leading Zeros Detector for normalization shift count calculation.

The third stage, WB, performs the normalization left shift, rounding and status generation operations. The normalization shifter is implemented with a 63-bit left shifter. With the exception of mass cancellation in the mantissa calculation, the WB stage typically requires one clock cycle to execute. The rounder can round the result to either the single- or double-precision position according to the desired rounding mode. The final result is written to one of the four FPR rename registers and/or forwarded back to the Mult stage for the next instruction execution.

The bypass unit handles abnormal execution (i.e., abnormal operands such as NaN and infinity, and abnormal operations such as divide by zero). A default result bypasses the Mult and CPA stages to simplify the data path logic.

Figure 6 and Figure 7 illustrate the execute timing of three floating-point instructions. Instruction 1 and Instruction 2 have no operand dependency, so they can be tightly coupled. Instruction 3 is waiting for the result of Instruction 2 and starts executing immediately after the WB stage of Instruction 2.

[Figure 6: Single-Precision Multiply Timing]

[Figure 7: Double-Precision Multiply Timing]

The FPU employs a 2-bit non-restoring division algorithm which produces two correct mantissa bits per clock cycle. A normal single-precision divide requires 18 cycles to execute.

The FPU also provides IEEE exception handlers and a non-IEEE mode which helps provide deterministic run times by forcing any IEEE denormalized results to zero. This eliminates the data dependent pre-normalization and post-denormalization operations.

System unit

The 603 system unit executes all of the move to/from special purpose register instructions and all condition register logical instructions. In general, SYSU instructions are dispatched but not executed until the instruction is ready to retire.

Caches

The dual 8-Kbyte instruction and data caches are two-way set-associative and use the Least-Recently-Used (LRU) algorithm for choosing replacement cache lines. Cache loads operate on 32-byte lines and are performed in four 8-byte (doubleword) transfers. During cache line fills, access to the cache is blocked.
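The cache parameters above (8 Kbytes, two-way set-associative, 32-byte lines, 8-byte fill transfers) fix the rest of the geometry by arithmetic. The sketch below works it out; the tag/index/offset split is the conventional one for a set-associative cache and is an assumption here, since the text does not say which address bits index the array. All names are hypothetical.

```python
CACHE_BYTES = 8 * 1024   # per cache, from the text
WAYS = 2                 # two-way set-associative
LINE_BYTES = 32          # line size
TRANSFER_BYTES = 8       # doubleword transfers during a line fill

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)     # number of sets
FILL_TRANSFERS = LINE_BYTES // TRANSFER_BYTES  # transfers per line fill


def split_address(addr: int):
    """Split a 32-bit address into (tag, set index, byte offset).

    Assumes the conventional layout: low bits select the byte within
    the line, the next bits select the set, the rest form the tag.
    """
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset
```

With these figures each cache has 128 sets, so a 32-bit address divides into a 20-bit tag, a 7-bit set index and a 5-bit byte offset, and a line fill takes the four transfers stated in the text.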
As line fill data moves into the cache, the critical word is forwarded to minimize stalls due to memory latency. All cache tag entries are automatically invalidated at power-on. They can also be invalidated under software control through bits in a control register. Both caches may also be locked. While a cache is locked, none of the lines in the cache will be replaced due to misses. The locking function is provided to allow real time loops to have a fixed access latency. Accesses which miss in a locked cache are treated as cache inhibited transactions.

The 603 data cache maintains a fully coherent address space and is intended for use in single processor systems where coherency with DMA traffic is required. It is not intended for symmetric multiprocessing applications where significant data sharing occurs. Bus snooping is used to maintain a three state cache coherency protocol: valid clean, valid dirty and invalid. This coherency protocol is a compatible subset of the standard MESI protocol. For compatibility with the SHARED state of the MESI protocol, the 603 interprets all incoming snoops as if they are writes, preventing any other cache from containing the same data as a 603. For the same reason, all line fills by the 603 are marked as write misses (read with intent-to-modify).

The 603 also implements a type of bus transaction termed read with no-intent-to-cache, which is treated as a non-cacheable burst read or write. The 603 snoops this transaction, copies any modified data back to memory, and leaves the cache line valid and clean. The purpose of this transaction type is to decrease data thrashing by allowing non-caching devices (such as DMA controllers) to transfer data across the bus without invalidating a 603 cache entry. The granularity of snooping is one line (32 bytes).

The data cache tags are single-ported, so a simultaneous load/store and snoop access represents a resource collision.
In this situation, the snoop access has highest priority and is given first access to the tags. The stalled load or store proceeds on the clock cycle following the snoop.

The 603 implements a one line write-back buffer and a one line snoop copyback buffer. These buffers minimize the latency of the blocking cache protocol.

As part of the PowerPC Architecture™, a number of specialized cache control instructions are supported by the 603 to facilitate software control of cache contents. These instructions provide the ability to preload data, invalidate a line, flush a line and allocate/zero-fill a line in the data cache, and to flush a line in the instruction cache. Only the data cache block zero-fill is broadcast to the external bus.

Memory Management

The 603 implements a virtual memory management mechanism which provides a 32-bit physical address space and a 52-bit virtual address space. The 32-bit effective address space is extended to the 52-bit virtual address through the use of segment registers. The segment registers allow the programmer to partition the effective address space into sixteen 256-Mbyte blocks which are then mapped anywhere within a 4-petabyte (2^52-byte) address space. The virtual address is then converted to a physical address through the Translation Lookaside Buffer (TLB). The effective address may also be translated to a physical address using Block Address Translation (BAT).

Both the instruction and the data MMUs incorporate a 64-entry, two-way set associative TLB array. These TLB arrays are used to cache entries from the memory-resident page translation tables. The page size is 4 Kbytes and replacement TLB entries are selected using an LRU algorithm. The tlbie (TLB-invalidate-entry) instruction allows the operating system to invalidate individual TLB entries to speed up context switching.

Each MMU also incorporates a four-entry BAT array. This CAM-based address translator allows the user to create four large memory partitions. The size of these partitions is software programmable from 128 Kbytes to 256 Mbytes. BAT translation takes priority over page translation.

The 603 uses software tablewalks to perform TLB reloads. Hardware assistance is provided to help reduce software tablewalk latencies. When a miss exception is taken, support logic generates the address pointers to the page table and provides a preconstructed copy of the first word of the desired page table entry. This reduces the main tablewalk routine to approximately sixteen instructions (two line fills).
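The segment-register step described above can be sketched numerically. The field widths used here follow from the figures in the text: sixteen 256-Mbyte segments mean the top 4 bits of the effective address select a segment register, and a 52-bit virtual address leaves 24 bits for the segment identifier above the 28-bit segment offset. The function names are hypothetical, and the sketch ignores any control or protection bits a real segment register carries.

```python
SEGMENT_BITS = 28                     # 256-Mbyte segments
PAGE_BITS = 12                        # 4-Kbyte pages
VSID_BITS = 52 - SEGMENT_BITS         # 24-bit segment identifier


def virtual_address(ea: int, segment_registers) -> int:
    """Extend a 32-bit effective address to a 52-bit virtual address.

    `segment_registers` is a hypothetical list of 16 integers holding
    only the segment identifier for each 256-Mbyte block.
    """
    sr_index = (ea >> SEGMENT_BITS) & 0xF           # top 4 bits of the EA
    vsid = segment_registers[sr_index] & ((1 << VSID_BITS) - 1)
    offset_in_segment = ea & ((1 << SEGMENT_BITS) - 1)
    return (vsid << SEGMENT_BITS) | offset_in_segment


def page_fields(va: int):
    """Split a virtual address into (virtual page number, byte offset)."""
    return va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)
```

For instance, with segment register 3 holding 0xABCDEF, the effective address 0x30001234 maps to the virtual address 0xABCDEF0001234, whose virtual page number is what the TLB would then look up.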
Four shadow registers and an automatic copy of a condition code register further eliminate the overhead associated with context save/restores.

Bus interface unit

The 603 features a non-multiplexed 32-bit address and 64-bit data bus interface that is capable of supporting a wide range of system implementations. The bus interface, shown in Figure 8, is based on the PowerPC 60x Microprocessor Interface Specification. It features independent address and data bus control which allows for advanced bus features including address pipelining and split bus transactions. The bus interface also supports bus snooping, transaction retry, and snoop copyback operations in order to provide full copyback cache coherency support in hardware. These features allow for system implementations ranging from traditional nonpipelined bus interfaces that may or may not require hardware snoop support, to pipelined interfaces that minimize memory access latency, to advanced pipelined and cache coherent bus interfaces.

[Figure 8: External Signals]

The bus interface also supports a 32-bit data bus mode which is configurable at start-up. In addition, the bus interface can be configured to operate at a lower (integer divisible) clock frequency with respect to the internal processing units of the chip, thereby allowing the bus to operate at one frequency while the internal units operate at 1X, 2X, 3X, or 4X the bus frequency. The bus can operate at frequencies from 16 to 66 MHz. These features allow low cost system implementations which use affordable interface frequencies or a smaller data bus size, while still taking advantage of the higher internal processing rates of the 603 core.

The data bus supports single-beat data transfers for non-cacheable and write-thru cache operations, and 4-beat (burst) data transfers for cache line operations to and from the on-chip caches.
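The relationship between line size, data bus width and burst length, and between bus clock and core clock, is simple arithmetic and can be sketched as below. The 4-beat figure for the 64-bit bus comes from the text; the 8-beat figure that the same arithmetic gives for the 32-bit bus mode is an inference, not something the text states. The function names are hypothetical, and the range check only reflects the stated 16 to 66 MHz bus range, not which bus/core combinations a real part supports.

```python
LINE_BYTES = 32  # cache line size from the Caches section


def beats_per_line(bus_bits: int) -> int:
    """Data beats needed to move one cache line over the data bus."""
    if bus_bits not in (32, 64):
        raise ValueError("the data bus is 64 bits wide, or 32 in 32-bit mode")
    return LINE_BYTES // (bus_bits // 8)


def core_mhz(bus_mhz: float, multiplier: int) -> float:
    """Internal clock for a given bus clock and 1X to 4X multiplier."""
    if multiplier not in (1, 2, 3, 4):
        raise ValueError("multiplier must be 1, 2, 3 or 4")
    if not 16 <= bus_mhz <= 66:
        raise ValueError("bus frequency must be between 16 and 66 MHz")
    return bus_mhz * multiplier
```

As a usage example, a 40 MHz bus with the 2X setting corresponds to the 80 MHz core clock quoted elsewhere in the paper.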
The bus interface also supports address and data bus parity, address retry capability, normal and error data transfer termination, and a data retry capability for late error correction by the memory system. To eliminate bus request latency, the bus interface also supports bus parking and pipelined bus granting.

The separation of the address and data bus controls allows operations such as address pipelining, which helps maximize data transfer throughput. The bus interface can pipeline one or more additional addresses on the bus while the current data tenure is running. These pipelined transactions may be used to service the on-chip branch unit, load/store unit, or caches. In addition, any number of addresses may be pipelined on the bus as a result of multiple bus masters, depending on the system implementation. Figure 9 illustrates the flow of address pipelined bus transactions.

[Figure 9: Address Bus Pipelining]

Address pipelining also allows split transactions to be performed by the 603. Address tenure for a split transaction is terminated before data tenure begins. Using split transactions, the bus interface also allows limited out-of-order transfer capability with respect to the order in which data tenures are run, and allows for enveloped transactions. Typically, an enveloped transaction is used to cause the processor to perform a complete snoop copyback transaction on a pipelined bus between the address and data tenures of a pending read transaction. This prevents certain deadlock scenarios that could occur when interfacing through a gateway to a second bus that cannot be retried.

The bus snooping, transaction retry, and snoop copyback capabilities of the bus interface allow the 603 to operate its on-chip data cache in write-thru or copyback mode and still maintain coherency with DMA devices or other caching masters on the bus. Whether or not snooping of a transaction is performed is determined by the global address attribute which is transferred with the address on the bus. This permits global and local sections of the address map to be defined and prevents unnecessary bus snooping. The 603 always fetches instructions as non-global since hardware coherency support is not required for the instructions.

The bus snooping protocol allows noncaching, caching 3-state (MEI) and caching 4-state (MESI) bus masters. When the 603 performs a global read, it requires other masters to flush the referenced line. Likewise, when another master performs a global read, the 603 will snoop its data cache and flush any referenced line. The bus interface also provides cache-inhibit, write-thru, and cache-way attributes with the bus address to support a second level cache.

Development and production support

The 603 Common On-chip Processor (COP) supports debug and test features on the 603. The COP uses the IEEE 1149.1 (JTAG) standard protocol to communicate with an external system debugger or an In-Circuit Emulation (ICE) system. The COP supports the three JTAG public instructions: Bypass, Sample/Preload and Extest for boundary scan testing.
In addition, 37 COP instructions are provided for debug purposes.

The COP debug support uses serial scan chains in which all internal latches can be connected and scanned as one long data register. Before the internal state can be scanned out, the processor must be stopped through a special "soft stop" or "hard stop" exception which is initiated by a single step, instruction address breakpoint, branch trace, or an external trigger. A soft stop exception freezes the processor after allowing it to reach a recoverable state. Internal state may have advanced, but the processor can resume normal operation after scanning. A hard stop freezes the processor immediately and the processor state may not be recoverable. In addition to these mechanisms, a 16-bit counter can be used to stop the processor at a particular clock cycle. The content of embedded arrays can also be scanned out. The COP has the ability to read and write external system memory to examine external memory contents and to download instructions or data.

The COP also supports production testing through scan. It controls the Array Built in Self Test (ABIST) which is used to test the caches and tags. During the ABIST test sequence, all four arrays undergo an exhaustive read-write-read test sequence to detect stuck-at faults, speed sensitivities, and failures due to capacitive coupling.

Power management

During idle periods, software may initiate one of three power down modes implemented in the 603. In these modes, instruction processing is suspended and major portions of the processor are disabled to reduce power consumption. In Doze mode, the timebase and decrementer remain active and data cache coherency is maintained through snooping. All other activity in the processor is stopped. In Nap mode, only the timebase and the decrementer are active, and in Sleep mode all functionality of the processor is stopped except for the clock PLL, which can further be externally disabled.
An external interrupt or decrementer interrupt (disabled in Sleep mode) is used to bring the processor out of these modes.

The 603 also incorporates logic that, on a clock-by-clock basis, powers down idle sections of the processor. This is done in a manner that reduces the average power consumption but does not impact performance. [1]

Performance

The 603's performance at 80 MHz with a 1-Mbyte L2 cache is estimated to be 75 SPECint92 and 85 SPECfp92. This performance was estimated by running sampled traces on an architectural simulator with code generated by a PowerPC compiler. [2]

Conclusion

The 603 is a high performance, low cost, RISC microprocessor. These features, coupled with low power consumption, make the 603 an ideal solution for the needs of the laptop and entry level desktop markets.

References

[1] Gary, S., et al., "The PowerPC™ 603 Microprocessor: A Low-Power Design for Portable Applications," Proceedings of COMPCON 1994, February 1994.
[2] Poursepanj, A., et al., "The PowerPC™ 603 Microprocessor: Performance Analysis and Design Trade-offs," Proceedings of COMPCON 1994, February 1994.
