
1254 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 11, NOVEMBER 1995

A 14-Port 3.8-ns 116-Word 64-b Read-Renaming Register File

Creigton Asato

Abstract: A 116-word by 64-b register file for a 154-MHz four-issue superscalar processor renames read addresses and reads data in a single operation. A 10-port, 116-word tag comparison unit and a rename logic unit use static-bit-line techniques in the comparison logic. Pulsed-power sense amplifiers achieve a 3.8-ns read delay while dissipating 31% less power than a nonpulsed circuit.

Manuscript received May 2, 1995; revised September 9, 1995. The author is with HAL Computer Systems, Inc., A Fujitsu™ Company, Campbell, CA 95008 USA. (HAL is a registered trademark of HAL Computer Systems, Inc. Fujitsu is a trademark of Fujitsu, Limited.) IEEE Log Number 9415591. 0018-9200/95$04.00 © 1995 IEEE

I. INTRODUCTION

THE latest generation of microprocessors uses a combination of register renaming and superscalar instruction issuance. Register renaming requires a complicated address-decoding scheme that can change the assignment of addresses to storage locations. Issuing multiple instructions per cycle requires additional area to implement an increased number of register file read and write ports.

A 116-word, 64-b register file meets the requirements of a high-performance 64-b processor [1]. For reads, a standard address decoder was replaced by a matching circuit from a content-addressable memory (CAM) to allow the direct access of renamed registers. To support the issue of up to two three-operand store instructions and two two-operand arithmetic instructions per clock cycle, the register file was designed with ten 64-b read ports and four 64-b write ports. Special attention was paid to the layout, achieving a density of more than 15 000 transistors per mm² in a 0.4-µm CMOS process, and novel circuit design techniques were used to reduce the power dissipation. A read access time of 3.8 ns was achieved, fast enough to meet a 6.5-ns clock cycle time.

A micrograph of the register file is shown in Fig. 1. The register file measures approximately 7.5 mm × 3.2 mm in the 0.4-µm CMOS process summarized in Table I. It contains 371 000 transistors and dissipates approximately 3.6 W at 154 MHz.

Fig. 1. Register file micrograph.

TABLE I
PROCESS PARAMETERS

VDD                       3.3 V
Gate Length, drawn        0.4 µm
Gate Length, effective    0.35 µm
Metal 1 Pitch             1.8 µm
Metal 2 Pitch             2.1 µm
Metal 3 Pitch             2.5 µm
Metal 4 Pitch             12.0 µm

II. REGISTER FILE BLOCK DIAGRAM

The register file consists of four blocks, as shown in Fig. 2: a 116-word by 64-b Data Storage Array (DSA), ten 116-word by 7-b ROMs, a 10-port, 116-word Tag Comparison Unit (TCU), and a Rename Logic Unit (RLU). The DSA has four standard write decoders but no read decoders; its read-enable word lines are generated in the TCU. The ROMs are programmed to output the addresses of the read-enable word lines activated by the TCU. The TCU compares the contents of 116 7-b tag-match registers with ten 7-b read address ports; a match between a read address and tag-match register N activates read-enable word line N for that read port. The RLU manages the writing of the tag-match registers and the validation of their contents; up to four rename mappings may be changed per clock cycle.

The register file can also be viewed as consisting of 116 nearly identical words. Each word has 64 b of data storage, a 7-b tag-match register, ten 7-b comparators, an Address Valid register bit, and portions of the various ROMs and decoders. Each word has a unique physical address in the range from 0 to 115. The physical address identifies the word and the storage located within it. For example, changing the tag-match register at a given physical address changes the read address needed to access the data stored in that word; this mechanism is needed for the register-renaming scheme implemented within the processor.

III. REGISTER RENAMING

Register renaming is a technique used to dynamically convert standard microprocessor instructions into data-flow instructions; it was first described in a paper by R. M. Tomasulo
[2]. Register renaming requires the association of each general-purpose register that can be specified in an instruction with a tag. Inside the processor, each time a register appears in the destination field of an instruction, it is assigned a new tag that corresponds to the physical address of a word in the register file. In this way, the results of the instruction can be written directly into the register file using the tag as an address. Circuitry external to the register file ensures that words being used to hold register data needed by the processor are not reassociated with new registers.

Fig. 2. Register file block diagram.

Fig. 3. Renamed-read critical path.

References to registers in issued instructions are converted by the processor into unique 7-b numbers called logical addresses. For each register that needs to be renamed, the RLU writes the register's logical address into the tag-match register located at the physical address of the new tag. The RLU also sets the Address Valid bit at the physical address of the new tag and clears the Address Valid bit at the physical address of the old tag. This enables comparisons using the tag-match register located at the new physical address and disables comparisons using the old location. Future reads to the renamed register will access the new storage word and ignore the contents of the old one.

A logical address presented to a read port of the register file will match the contents of one of the tag-match registers. The matching register activates a read-enable word line, and data is read out from that word. Since the physical address is also used as a rename tag, the read operation uses the ROMs to output the current tag associated with a logical address. Both the data and the rename tag are sent to the processor's dataflow machine, which uses the data and tag to resolve data dependencies and initiate the execution of instructions.

IV. RENAMED-READ CRITICAL PATH

The critical path through the register file is the renamed-read path shown in Fig. 3. Read addresses are buffered into true and complement forms and compared against all 116 7-b tag-match registers. For each read address, at most one match occurs; this is signaled by a rising voltage on the low voltage-swing node. A single-ended sense amplifier level-shifts this voltage into a full-swing read-enable word line. This accesses a word within the ROM, pulling down ROM bit lines. A single-ended sense amplifier converts these low-voltage-swing signals into full-swing CMOS outputs. Since the output of the ROM is a rename tag, these outputs are used to detect whether data currently being written into the register file is needed by an instruction that is reading the register file; if this is the case, the write data is multiplexed or bypassed in place of the data being read from the register file. For this reason, the ROM path has been optimized at the expense of the read data outputs. Under typical conditions and with 1.8-pF output loading, the delay from read addresses to ROM outputs is 3.0 ns.

Fig. 4. 14-port data storage bit cell.

One optimization to decrease the delay of the ROM outputs was the insertion of buffers between the read-enable word lines of the 7-b ROMs and the 64-b DSA; this decreased the loading on the word-line sense amplifiers. Once buffered, the read-enable word lines (RE0-RE9) activate bit-line pull-downs in data storage cells such as the one shown in Fig. 4. The bit lines are then level-shifted and buffered by the single-ended sense amplifiers on the outputs. Under typical conditions and with an output loading of 1.0 pF, this path was simulated to be 3.8 ns.

V. SENSE AMPLIFIERS

Although using differential sense amplifiers could have yielded a faster circuit, single-ended sense amplifiers were used throughout the design to reduce area by 20% overall. The use of precharged circuitry was rejected because of the uncertainty in the arrival times of the read addresses and because of its susceptibility to noise. A self-timed register file would have been less sensitive to arrival times, but the area of the TCU was wire-limited, so any additional wires used for differential sensing or dummy bit lines would directly increase the height of the register file; this was judged to be unacceptable.

The register file was implemented using modified static-load sense amplifiers, such as the one shown in Fig. 5. These
sense amplifiers were used in pseudo-NMOS circuits, such as the tag-match comparison circuit shown in Fig. 6. Transistors such as are used to form a comparator circuit; if any bit in the tag register is unequal to the corresponding bit in the read address, node is pulled down. Also, if the Address Valid register bit for this word is cleared, it disables a match by pulling down using transistor.

Fig. 5. Sense amplifier.

Fig. 6. Tag match circuit.

Fig. 7. Sense amplifier equivalent circuit.

Fig. 8. Noise margin simulation.

In normal operation, the TEST input of each sense amplifier is set to, yielding an equivalent circuit shown in Fig. 7. From node, transistor appears to be a 450-µA current source. Current may flow from to only if transistor is conducting. The inverter formed by and has a switching threshold set to 900 mV; this means that as soon as rises to 900 mV, transistor is cut off, preventing from rising further. When this occurs, the current through charges node up to the 3.3-V supply level, raising read-enable word line output RE to and turning off.

When pulled down, node may range from 475 mV to 0 V, depending on how many pull-downs are conducting. The highest voltage is set by the resistive divider formed by transistors and and must be below the switching threshold of the inverter formed by transistors and. At these voltage levels, current is pulled from into the pull-down network, lowering s voltage and setting RE to. Since all of the current flowing through empties into, as soon as all pull-down transistors are shut off, the voltage on will rise as fast as the current can charge the 0.25-pF wiring load on. When reaches 900 mV, shuts off, and begins to charge. If is pulled down again, the delay through the inverter allows to be partially discharged before is turned on again, improving the speed of the falling transition.

An important feature of the sense amplifier design is the use of the TEST signal and transistors and. When TEST is set to, the current through is shut off, and node is clamped to. In this state, all internal nodes sit at or, and all static power dissipation is turned off, allowing quiescent-current testing to locate manufacturing defects by their current draw.

The speed of the circuit depends on the rate at which the current flowing through can charge from to the threshold of the inverter. The noise margin of the sense amplifier depends on the difference between this input high level and the maximum input low level, set by the voltage divider of and. Making and wider decreases the input low level and increases the noise margin, but it also increases the capacitance on. Raising the threshold of the inverter also improves the noise margin but increases the time needed to charge. There is therefore a trade-off between performance and noise margin.

The noise margin of the sense amplifier was tested using the circuit shown in Fig. 8. The injection of noise was simulated by inserting a noise voltage into the supply of the pull-down network; demonstrating that the sense amplifier could be switched by the pull-down network in all process, voltage, and temperature corners proved that the circuit had a noise margin at least equal to the noise voltage. The sense amplifier circuit itself was highly immune to noise voltages inserted into its power supplies because of the feedback between and the inverter.

VI. PULSED-POWER SENSE AMPLIFIER

The speed of the sense amplifier depends on the amount of current that can source. The amount of current, however, only makes a difference at the times when is switching. If switches at predictable times within a clock cycle, the current may be reduced once has reached its final value
without any effect on the switching speed; this reduces the total power dissipation of the sense amplifier. The circuit can be sped up by adding a current-sourcing transistor to the sense amplifier, shown as in Fig. 9. The high-power enable signal HP is pulsed to only during the part of the cycle where could be switching.

Fig. 9. Pulsed-power sense amplifier.

Fig. 10. Waveforms of a pulsed-power tag match circuit.

Fig. 11. Variable-width pulse generator.

Fig. 12. Three-stage programmable delay.

An example of the use of pulsed-power techniques in the tag match circuit is shown in the waveforms of Fig. 10. The active-low high-power enable signal HP falls to at the same time that the address input changes. In 1.8 ns, rises from 0 V to the 900-mV switching threshold of the inverter. At this time, is shut off, and all of the current sourced by is used to charge, which triggers a rise in the read-enable word line RE, 2.0 ns after the input transitions. After this point in time, the sense amplifier has reached its final value, so may be turned off by deasserting HP.

By pulsing HP as shown in the waveforms of Fig. 10, the power dissipation of a sense amplifier such as the one shown in Fig. 5 is reduced from 2.22 mW to 1.70 mW. Since the TCU simultaneously compares 116 tag registers against 10 read addresses, it contains a total of 1160 pulsed-power sense amplifiers; the use of pulsed-power techniques therefore saves approximately 600 mW of power.

VII. VARIABLE-WIDTH PULSE GENERATOR

The high-power enable signal HP is generated by a programmable pulse generator shown in Fig. 11. The heart of the pulse generator is a four-stage variable-delay buffer similar to the three-stage one shown in Fig. 12. The variable delay was created by using multiplexers to choose paths through different numbers of delay elements. For example, if, then the path from input to output passes through no extra delay elements; but if, the path selected passes through five extra delay elements. The four-stage variable pulse-width generator used in the register file can create delays from 1.75 ns to 5.5 ns in 0.25-ns steps.

The control signals for the pulse generator are set through the scan chain used to test and initialize the processor. Different settings are used to test for manufacturing defects and compensate for process variations. For example, the control signal LO, when set to, turns off all pulsed-power transistors; in this state, a defect in a transistor such as can be detected. Similarly, signal HI, when set to, allows the register file to operate at its highest power dissipation and its highest possible speed. Ideally, the pulse width should be set to its narrowest setting with the processor operating at its fastest clock rate in order to minimize the power dissipation without slowing down the rest of the chip.

VIII. CONCLUSION

Superscalar microprocessors require that their register files have many read and write ports to allow multiple instructions to be issued and multiple results to be written in each clock cycle. The high clocking rates of high-performance microprocessors require that these same register files have short delays. This challenges register-file designers to create large layouts that can operate at very high speeds. Small, fast layouts can be designed using NMOS-like circuitry, low voltage-swing bit lines, and single-ended sense amplifiers. Modulating the power dissipation of sense amplifiers can help reduce the overall power dissipation without any loss in speed; the resultant reduction in supply current may also allow narrower power bus wires to be used. Finally, replacing a standard register file architecture with a content-addressable memory matching circuit allows registers to be renamed and read in the same clock cycle, which can reduce the depth of a processor's pipeline.
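
The rename-and-read behavior described in Section III can be illustrated with a small behavioral model. The following Python sketch is not the circuit described in this paper; it only mimics the visible behavior of the tag-match registers, the Address Valid bits, and the ROM output of the physical address. The class and method names are hypothetical, and the search for the old mapping inside `rename` is an assumption made for brevity (in the design above, the RLU is given the physical address of the old tag).

```python
class ReadRenamingRegisterFile:
    """Behavioral sketch of a read-renaming register file (illustrative only)."""

    N_WORDS = 116            # physical addresses 0..115, as in the paper
    DATA_MASK = (1 << 64) - 1

    def __init__(self):
        self.data = [0] * self.N_WORDS        # 64-b data storage array
        self.tag_match = [0] * self.N_WORDS   # 7-b tag-match registers
        self.valid = [False] * self.N_WORDS   # Address Valid bits

    def rename(self, logical, new_phys):
        """Associate a logical address with the word at new_phys."""
        # ASSUMPTION: the old mapping is found by searching; the real RLU
        # receives the old physical address from the processor.
        for phys in range(self.N_WORDS):
            if self.valid[phys] and self.tag_match[phys] == logical:
                self.valid[phys] = False      # disable the old location
        self.tag_match[new_phys] = logical
        self.valid[new_phys] = True           # enable the new location

    def write(self, phys, value):
        """Write a result using the rename tag (physical address) directly."""
        self.data[phys] = value & self.DATA_MASK

    def read(self, logical):
        """Return (data, rename tag) for the word whose valid tag matches."""
        matches = [p for p in range(self.N_WORDS)
                   if self.valid[p] and self.tag_match[p] == logical]
        if len(matches) != 1:
            raise LookupError("expected exactly one matching word")
        phys = matches[0]
        return self.data[phys], phys          # the ROM supplies phys as the tag


rf = ReadRenamingRegisterFile()
rf.rename(logical=5, new_phys=10)
rf.write(10, 0xAB)
first = rf.read(5)      # data and tag from word 10
rf.rename(logical=5, new_phys=42)
rf.write(42, 7)
second = rf.read(5)     # the same logical address now reads word 42
```

A read never consults a separate mapping table: the comparison against the tag-match registers selects the word, and the matched physical address is returned alongside the data.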

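
The power and delay figures quoted in Sections VI and VII can be checked with a few lines of arithmetic. The sketch below reproduces the pulsed-power saving from the per-amplifier figures and enumerates the pulse-generator delays. The binary stage weights (1, 2, 4, 8 delay elements) are an assumption: they are consistent with the quoted 1.75-ns to 5.5-ns range in 0.25-ns steps, but the paper does not state the per-stage delays. All function and constant names are hypothetical.

```python
from itertools import product

# Figures quoted in the paper
N_WORDS = 116
N_READ_PORTS = 10
P_NONPULSED_MW = 2.22    # per sense amplifier, HP not pulsed
P_PULSED_MW = 1.70       # per sense amplifier, HP pulsed
BASE_DELAY_NS = 1.75     # shortest programmable delay
STEP_NS = 0.25           # delay step

# ASSUMPTION: each of the four stages adds a binary-weighted number of
# delay elements when selected; the paper does not give these weights.
STAGE_WEIGHTS = (1, 2, 4, 8)


def tcu_power_saving_mw():
    """Total saving across all tag-comparison sense amplifiers, in mW."""
    amplifiers = N_WORDS * N_READ_PORTS           # 1160 amplifiers
    return amplifiers * (P_NONPULSED_MW - P_PULSED_MW)


def pulse_delay_ns(select_bits):
    """Delay for one setting of the four stage-select bits."""
    if len(select_bits) != len(STAGE_WEIGHTS):
        raise ValueError("expected one select bit per stage")
    extra = sum(w for w, s in zip(STAGE_WEIGHTS, select_bits) if s)
    return BASE_DELAY_NS + STEP_NS * extra


def all_pulse_delays_ns():
    """Every delay reachable with the assumed stage weights, sorted."""
    settings = product((0, 1), repeat=len(STAGE_WEIGHTS))
    return sorted(pulse_delay_ns(bits) for bits in settings)
```

Under these assumptions the saving comes to about 603 mW, in line with the approximately 600 mW quoted in Section VI, and the sixteen select settings span 1.75 ns to 5.5 ns.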
ACKNOWLEDGMENT

The author would like to thank M. Shebanow, R. Montoye, J. Gmuender, W. Simmons, T. Patil, A. Ike, J. Zasio, M. Chiang, T. Lu, D. Tovey, M. Simone, H. Nguyen, S. Hale, and W. Walker for their technical contributions. The author would also like to thank Fujitsu Limited, of Kawasaki, Japan, for their significant contribution to the Fujitsu SPARC64¹ processor project.

¹ SPARC64 is a trademark of SPARC International, Inc., licensed by SPARC International, Inc., to HAL Computer Systems, Inc.

REFERENCES

[1] T. Williams et al., "SPARC64: A 64-b 64-active-instruction out-of-order-execution MCM processor," this issue, pp. 1215–1226.
[2] R. M. Tomasulo, "An efficient algorithm for exploiting multiple arithmetic units," IBM J., vol. 11, pp. 25–33, Jan. 1967.

Creigton Asato received the B.S. degree in electrical engineering from the California Institute of Technology in 1985 and the M.S. degree in computer science from the Leland Stanford Junior University in 1988. From 1985 to 1992, he worked on standard cell and data-path compiler libraries at VLSI Technology, Inc., San Jose, CA. He joined HAL Computer Systems, Inc., A Fujitsu Company, located in Campbell, CA, in 1992 and has worked on register files and other large circuit blocks.