IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 11, NOVEMBER 1995, p. 1254

A 14-Port 3.8-ns 116-Word 64-b Read-Renaming Register File

Creigton Asato

Abstract: A 116-word by 64-b register file for a 154-MHz four-issue superscalar processor renames read addresses and reads data in a single operation. A 10-port, 116-word tag comparison unit and a rename logic unit use static-bit-line techniques in the comparison logic. Pulsed-power sense amplifiers achieve a 3.8-ns read delay while dissipating 31% less power than a nonpulsed circuit.

Manuscript received May 2, 1995; revised September 9, 1995. The author is with HAL Computer Systems, Inc., A Fujitsu™ Company, Campbell, CA 95008 USA. (HAL is a registered trademark of HAL Computer Systems, Inc. Fujitsu is a trademark of Fujitsu, Limited.) IEEE Log Number 9415591. 0018-9200/95$04.00 © 1995 IEEE

I. INTRODUCTION

The latest generation of microprocessors uses a combination of register renaming and superscalar instruction issuance. Register renaming requires a complicated address-decoding scheme that can change the assignment of addresses to storage locations. Issuing multiple instructions per cycle requires additional area to implement an increased number of register file read and write ports.

A 116-word, 64-b register file meets the requirements of a high-performance 64-b processor [1]. For reads, a standard address decoder was replaced by a matching circuit from a content-addressable memory (CAM) to allow the direct access of renamed registers. To support the issue of up to two three-operand store instructions and two two-operand arithmetic instructions per clock cycle, the register file was designed with ten 64-b read ports and four 64-b write ports. Special attention was paid to the layout, achieving a density of more than 15 000 transistors per mm² in a 0.4-µm CMOS process, and novel circuit design techniques were used to reduce the power dissipation. A read access time of 3.8 ns was achieved, fast enough to meet a 6.5-ns clock cycle time.

A micrograph of the register file is shown in Fig. 1. The register file size is approximately 7.5 mm × 3.2 mm in the 0.4-µm CMOS process summarized in Table I. It contains 371 000 transistors and dissipates approximately 3.6 W at 154 MHz.

Fig. 1. Register file micrograph.

TABLE I
PROCESS PARAMETERS

Parameter                 Value
VDD                       3.3 V
Gate Length, drawn        0.4 µm
Gate Length, effective    0.35 µm
Metal 1 Pitch             1.8 µm
Metal 2 Pitch             2.1 µm
Metal 3 Pitch             2.5 µm
Metal 4 Pitch             12.0 µm

II. REGISTER FILE BLOCK DIAGRAM

The register file consists of four blocks, as shown in Fig. 2: a 116-word by 64-b Data Storage Array (DSA), ten 116-word by 7-b ROMs, a 10-port, 116-word Tag Comparison Unit (TCU), and a Rename Logic Unit (RLU). The DSA has four standard write decoders but no read decoders; its read-enable word lines are generated in the TCU. The ROMs are programmed to output the addresses of the read-enable word lines activated by the TCU. The TCU compares the contents of 116 7-b tag-match registers with 10 7-b read address ports; a match between a read address and tag-match register N activates read-enable word line N for that read port. The RLU manages the writing of the tag-match registers and the validation of their contents; up to four rename mappings may be changed per clock cycle.

Fig. 2. Register file block diagram.

The register file can also be viewed as consisting of 116 nearly identical words. Each word has 64 b of data storage, a 7-b tag-match register, ten 7-b comparators, an Address Valid register bit, and portions of the various ROMs and decoders. Each word has a unique physical address, which lies in the range from 0 to 115. The physical address is used to identify the word and the storage located within it. For example, changing the tag-match register at a given physical address will change the read address needed to access the data stored in this word; this mechanism is needed for the register-renaming scheme implemented within the processor.

III. REGISTER RENAMING

Register renaming is a technique used to dynamically convert standard microprocessor instructions into data-flow instructions; it was first described in a paper by R. M. Tomasulo [2]. Register renaming requires the association of each general-purpose register that can be specified in an instruction with a tag. Inside the processor, each time a register appears in the destination field of an instruction, it is assigned a new tag that corresponds to the physical address of a word in the register file. In this way, the results of the instruction can be written directly into the register file using the tag as an address. Circuitry external to the register file ensures that words being used to hold register data needed by the processor are not reassociated with new registers.

References to registers in issued instructions are converted by the processor into unique 7-b numbers called logical addresses. For each register that needs to be renamed, the RLU writes the register's logical address into the tag-match register located at the physical address of the new tag. The RLU also sets the Address Valid bit at the physical address of the new tag and clears the Address Valid bit at the physical address of the old tag. This enables comparisons using the tag-match register located at the new physical address and disables comparisons using the old location. Future reads to the renamed register will access the new storage word and ignore the contents of the old one.

A logical address presented to a read port of the register file will match the contents of one of the tag-match registers. The matching register activates a read-enable word line, and data is read out from that word. Since the physical address is also used as a rename tag, the read operation uses the ROMs to output the current tag associated with a logical address. Both the data and the rename tag are sent to the processor's dataflow machine, which uses the data and tag to resolve data dependencies and initiate the execution of instructions.

IV. RENAMED-READ CRITICAL PATH

The critical path through the register file is the renamed-read path shown in Fig. 3. Read addresses are buffered into true and complement forms and compared against all 116 7-b tag-match registers. For each read address, at most one match occurs; this is signaled by a rising voltage on the low voltage-swing node. A single-ended sense amplifier level-shifts this voltage into a full-swing read-enable word line. This accesses a word within the ROM, pulling down ROM bit lines. A single-ended sense amplifier converts these low-voltage-swing signals into full-swing CMOS outputs. Since the output of the ROM is a rename tag, these outputs are used to detect whether data currently being written into the register file is needed by an instruction that is reading the register file; if this is the case, the write data is multiplexed or bypassed in place of the data being read from the register file. For this reason, the ROM path has been optimized at the expense of the read data outputs. Under typical conditions and with 1.8 pF output loading, the delay from read addresses to ROM outputs is 3.0 ns.

Fig. 3. Renamed-read critical path.

One optimization to decrease the delay of the ROM outputs was the insertion of buffers between the read-enable word lines of the 7-b ROMs and the 64-b DSA; this decreased the loading on the word-line sense amplifiers. Once buffered, the read-enable word lines (RE0-RE9) activate bit-line pull-downs in data storage cells such as the one shown in Fig. 4. The bit lines are then level-shifted and buffered by the single-ended sense amplifiers on the outputs. Under typical conditions and with an output loading of 1.0 pF, this path was simulated to be 3.8 ns.

Fig. 4. 14-port data storage bit cell.

V. SENSE AMPLIFIERS

Although using differential sense amplifiers could have yielded a faster circuit, single-ended sense amplifiers were used throughout the design to reduce area by 20% overall.
The use of precharged circuitry was rejected due to the uncertainty in the arrival times of the read addresses and due to its susceptibility to noise. A self-timed register file would have been less sensitive to arrival times, but the area of the TCU was wire-limited, so any additional wires used for differential sensing or dummy bit lines would directly increase the height of the register file; this was judged to be unacceptable.

The register file was implemented using modified static-load sense amplifiers, such as the one shown in Fig. 5. These sense amplifiers were used in pseudo-nMOS circuits, such as the tag-match comparison circuit shown in Fig. 6. Transistors such as […] are used to form a comparator circuit; if any bit in the tag register is unequal to the corresponding bit in the read address, node […] is pulled down. Also, if the Address Valid register bit for this word is cleared, it disables a match by pulling down […] using transistor […].

Fig. 5. Sense amplifier.

Fig. 6. Tag match circuit.

Fig. 7. Sense amplifier equivalent circuit.

Fig. 8. Noise margin simulation.

In normal operation, the TEST input of each sense amplifier is set to […], yielding an equivalent circuit shown in Fig. 7. From node […], transistor […] appears to be a 450 µA current source. Current may flow from […] to […] only if transistor […] is conducting. The inverter formed by […] and […] has a switching threshold set to 900 mV; this means that as soon as […] rises to 900 mV, transistor […] is cut off, preventing […] from rising further. When this occurs, the current through […] charges node […] up to the 3.3-V […], raising read-enable word line output RE to […] and turning […] off.

When pulled down, node […] may range from 475 mV to 0 V, depending on how many pull-downs are conducting. The highest voltage is set by the resistive divider formed by transistors […] and […] and must be below the switching threshold of the inverter formed by transistors […] and […]. At these voltage levels, current is pulled from […] into the pull-down network, lowering […]'s voltage and setting RE to […]. Since all of the current flowing through […] empties into […], as soon as all pull-down transistors are shut off, the voltage on […] will rise as fast as the current can charge the 0.25 pF wiring load on […]. When […] reaches 900 mV, […] shuts off, and […] begins to charge. If […] is pulled down again, the delay through the inverter allows […] to be partially discharged before […] is turned on again, improving the speed of the falling transition.
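
As a rough, first-order illustration of the charging behavior described above, the time for a constant current to move a capacitive node is t = C·ΔV / I. The sketch below applies this to the quoted 0.25 pF wiring load, 900 mV inverter threshold, and 475 mV maximum low level. It assumes that the 450 µA source is the current that charges this node and that the current is ideal and constant; both are assumptions made for illustration, and the function name is hypothetical. The result is an idealized estimate, not a substitute for the simulated delays reported in the paper.

```python
def charge_time_ns(cap_pf: float, delta_v: float, current_ua: float) -> float:
    """Time in ns for a constant current to change a capacitor voltage by delta_v.

    t = C * dV / I. With C in pF and I in uA the quotient is in microseconds,
    so it is scaled by 1000 to give nanoseconds.
    """
    return cap_pf * delta_v / current_ua * 1000.0


# Values quoted in the text; the pairing of the 450 uA source with the
# 0.25 pF node is an assumption of this sketch.
C_WIRE_PF = 0.25      # wiring load on the sensed node
I_SOURCE_UA = 450.0   # apparent current-source value
V_THRESHOLD = 0.900   # inverter switching threshold, volts
V_LOW_MAX = 0.475     # highest pulled-down level, volts

t_from_zero = charge_time_ns(C_WIRE_PF, V_THRESHOLD - 0.0, I_SOURCE_UA)
t_from_max_low = charge_time_ns(C_WIRE_PF, V_THRESHOLD - V_LOW_MAX, I_SOURCE_UA)

print(f"0 V -> 900 mV:    {t_from_zero:.3f} ns")
print(f"475 mV -> 900 mV: {t_from_max_low:.3f} ns")
```

The two cases bracket the starting voltage of the node, which depends on how many pull-downs were conducting before they shut off.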
An important feature of the sense amplifier design is the use of the TEST signal and transistors […] and […]. When TEST is set to […], the current through […] is shut off, and node […] is clamped to […]. In this state, all internal nodes sit at […] or […], and all static power dissipation is turned off, allowing quiescent-current testing to locate manufacturing defects by their current draw.

The speed of the circuit depends on the rate at which the current flowing through […] can charge […] from […] to the threshold of the inverter. The noise margin of the sense amplifier depends on the difference between this input high level and the maximum input low level, set by the voltage divider of […] and […]. Making […] and […] wider decreases the input low level and increases the noise margin, but it also increases the capacitance on […]. Raising the threshold of the inverter will also improve the noise margin but would also increase the amount of time needed to charge […]. Clearly, there is a trade-off between performance and noise margin.

The noise margin of the sense amplifier was tested using the circuit shown in Fig. 8. The injection of noise was simulated by inserting a noise voltage into the supply of the pull-down network; demonstrating that the sense amplifier could be switched by the pull-down network in all process, voltage, and temperature corners proved that the circuit had a noise margin that was at least equal to the noise voltage. The sense amplifier circuit itself was highly immune to noise voltages inserted into its power supplies due to the feedback between […] and the inverter.

VI. PULSED-POWER SENSE AMPLIFIER

The speed of the sense amplifier depends on the amount of current that […] can source. The amount of current, however, only makes a difference at the times when […] is switching. If […] switches at predictable times within a clock cycle, the current may be reduced once […] has reached its final value without any effect on the switching speed; this reduces the total power dissipation of the sense amplifier.

The circuit can be sped up by adding an additional current-sourcing transistor to the sense amplifier, shown as […] in Fig. 9. The high-power enable signal HP is pulsed to […] only during the part of the cycle where […] could be switching.

Fig. 9. Pulsed-power sense amplifier.

Fig. 10. Waveforms of a pulsed-power tag match circuit.

An example of the use of pulsed-power techniques in the tag match circuit is shown in the waveforms of Fig. 10. The active-low high-power enable signal HP falls to […] at the same time that the address input changes. In 1.8 ns, […] rises from 0 V to the 900 mV switching threshold of the inverter. At this time, […] is shut off, and all of the current sourced by […] is used to charge […], which triggers a rise in the read-enable word line RE, 2.0 ns after the input transitions. After this point in time, the sense amplifier has reached its final value, so […] may be turned off by deasserting HP.

By pulsing HP as shown in the waveforms of Fig. 10, the power dissipation of a sense amplifier such as the one shown in Fig. 5 is reduced from 2.22 mW to 1.70 mW. Since the TCU simultaneously compares 116 tag registers against 10 read addresses, it contains a total of 1160 pulsed-power sense amplifiers; the use of pulsed-power techniques therefore saves a total of 600 mW of power.

VII. VARIABLE-WIDTH PULSE GENERATOR

The high-power enable signal HP is generated by a programmable pulse generator shown in Fig. 11. The heart of the pulse generator is a four-stage variable-delay buffer similar to the three-stage one shown in Fig. 12. The variable delay was created by using multiplexers to choose paths through different numbers of delay elements.

Fig. 11. Variable-width pulse generator.

Fig. 12. Three-stage programmable delay.
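
The multiplexer-based delay selection can be sketched behaviorally. The model below assumes, purely for illustration, that stage k either bypasses or inserts 2^k identical delay elements and that one element corresponds to one 0.25-ns step; the paper does not state the per-stage weights, and all names here are hypothetical. The base delay and step are the 1.75 ns and 0.25 ns figures quoted in this section for the four-stage generator.

```python
from itertools import product
from typing import Sequence

BASE_DELAY_NS = 1.75  # shortest delay quoted for the four-stage generator
STEP_NS = 0.25        # quoted step size, taken here as one delay element


def extra_delay_elements(select: Sequence[int]) -> int:
    """Count the delay elements on the path chosen by the multiplexer selects.

    Assumption: stage k adds 2**k elements when select[k] is 1 and none
    when it is 0. The actual stage weights are not given in the paper.
    """
    return sum(bit << k for k, bit in enumerate(select))


def pulse_delay_ns(select: Sequence[int]) -> float:
    """Total delay for a given select setting under the assumed weighting."""
    return BASE_DELAY_NS + STEP_NS * extra_delay_elements(select)


# Enumerate every setting of a three-stage and a four-stage buffer.
three_stage_counts = sorted(extra_delay_elements(s) for s in product((0, 1), repeat=3))
four_stage_delays = sorted(pulse_delay_ns(s) for s in product((0, 1), repeat=4))

print(three_stage_counts)
print(four_stage_delays[0], four_stage_delays[-1], len(four_stage_delays))
```

Under this assumed weighting, a three-stage buffer can select anywhere from zero to seven extra elements, and four stages give 16 settings spanning 1.75 ns to 5.5 ns, which is consistent with the quoted range.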
For example, if […], then the path from input to output passes through no extra delay elements; but if […], the path selected passes through five extra delay elements. The four-stage variable pulse-width generator used in the register file can create delays from 1.75 ns to 5.5 ns in 0.25-ns steps.

The control signals for the pulse generator are set through the scan chain used to test and initialize the processor. Different settings are used to test for manufacturing defects and compensate for process variations. For example, the control signal LO, when set to […], turns off all pulsed-power transistors; in this state, a defect in a transistor such as […] can be detected. Similarly, signal HI, when set to […], allows the register file to operate at its highest power dissipation and its highest possible speed. Ideally, the pulse width should be set to its narrowest setting with the processor operating at its fastest clock rate in order to minimize the power dissipation without slowing down the rest of the chip.

VIII. CONCLUSION

Superscalar microprocessors require that their register files have many read and write ports to allow multiple instructions to be issued and multiple results to be written in each clock cycle. The high clocking rates of high-performance microprocessors require that these same register files have short delays. This challenges register-file designers to create large layouts that can operate at very high speeds. Small, fast layouts can be designed using NMOS-like circuitry, low voltage-swing bit lines, and single-ended sense amplifiers. Modulating the power dissipation of sense amplifiers can help reduce the overall power dissipation without any loss in speed; the resultant reduction in supply current may also allow narrower power bus wires to be used.
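
The pulsed-power saving quoted in Section VI can be reproduced with simple arithmetic. The sketch below multiplies the per-amplifier saving by the number of TCU sense amplifiers and then, assuming the saving is drawn entirely from the 3.3-V supply listed in Table I, converts it to an average supply-current reduction. The variable names are illustrative, and the current figure is an estimate derived here rather than a value reported in the paper.

```python
WORDS = 116          # tag-match registers in the TCU
READ_PORTS = 10      # read addresses compared per cycle
P_NONPULSED_MW = 2.22  # per-amplifier dissipation without pulsing
P_PULSED_MW = 1.70     # per-amplifier dissipation with pulsed HP
VDD_V = 3.3            # supply voltage from Table I

# One sense amplifier per word per read port.
amplifiers = WORDS * READ_PORTS

saving_per_amp_mw = P_NONPULSED_MW - P_PULSED_MW
total_saving_mw = amplifiers * saving_per_amp_mw

# Assumption: the saved power comes entirely from the VDD supply.
current_saving_ma = total_saving_mw / VDD_V

print(f"{amplifiers} amplifiers save about {total_saving_mw:.0f} mW")
print(f"average supply current falls by roughly {current_saving_ma:.0f} mA")
```

The total of about 603 mW agrees with the rounded 600 mW figure given in Section VI.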
Finally, replacing a standard register file architecture with a content-addressable memory matching circuit allows registers to be renamed and read in the same clock cycle, which can reduce the depth of a processor's pipeline.
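
A behavioral sketch may help make the renamed-read mechanism concrete. The Python model below follows the description in Sections II to IV: each of the 116 words holds a tag-match register and an Address Valid bit, a read compares the logical address against every valid tag, and the matching word returns both its data and its physical address (the rename tag that the ROMs output). The class and method names are hypothetical, and the model ignores ports, timing, and the write-bypass path; returning None when nothing matches is a modeling choice made here.

```python
from typing import List, Optional, Tuple

NUM_WORDS = 116          # physical addresses 0..115
TAG_MASK = 0x7F          # 7-b logical addresses
DATA_MASK = (1 << 64) - 1  # 64-b data words


class RenamingRegisterFile:
    """Behavioral model of a CAM-addressed, read-renaming register file."""

    def __init__(self) -> None:
        self.data: List[int] = [0] * NUM_WORDS
        self.tag: List[int] = [0] * NUM_WORDS
        self.valid: List[bool] = [False] * NUM_WORDS

    def rename(self, logical: int, new_phys: int,
               old_phys: Optional[int] = None) -> None:
        """RLU action: write the tag, set the new valid bit, clear the old one."""
        self.tag[new_phys] = logical & TAG_MASK
        self.valid[new_phys] = True
        if old_phys is not None:
            self.valid[old_phys] = False

    def write(self, phys: int, value: int) -> None:
        """Results are written directly using the physical address as the tag."""
        self.data[phys] = value & DATA_MASK

    def read(self, logical: int) -> Optional[Tuple[int, int]]:
        """Compare against all valid tags; return (data, rename tag) on a match."""
        matches = [n for n in range(NUM_WORDS)
                   if self.valid[n] and self.tag[n] == (logical & TAG_MASK)]
        if not matches:
            return None
        assert len(matches) == 1, "at most one word may match a read address"
        n = matches[0]
        return self.data[n], n


rf = RenamingRegisterFile()
rf.rename(logical=5, new_phys=20)
rf.write(20, 111)
rf.rename(logical=5, new_phys=21, old_phys=20)  # remap logical register 5
rf.write(21, 222)
result = rf.read(5)
print(result)
```

After the second rename, the read of logical address 5 reaches word 21 and ignores the stale contents of word 20, which is the behavior the Address Valid bit is described as providing.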

ACKNOWLEDGMENT

The author would like to thank M. Shebanow, R. Montoye, J. Gmuender, W. Simmons, T. Patil, A. Ike, J. Zasio, M. Chiang, T. Lu, D. Tovey, M. Simone, H. Nguyen, S. Hale, and W. Walker for their technical contributions. The author would also like to thank Fujitsu Limited, of Kawasaki, Japan, for their significant contribution to the Fujitsu SPARC64¹ processor project.

¹ SPARC64 is a trademark of SPARC International, Inc., licensed by SPARC International, Inc., to HAL Computer Systems, Inc.

REFERENCES

[1] T. Williams et al., "SPARC64: A 64-b 64-active-instruction out-of-order-execution MCM processor," this issue, pp. 1215-1226.
[2] R. M. Tomasulo, "An efficient algorithm for exploiting multiple arithmetic units," IBM J., vol. 11, pp. 25-33, Jan. 1967.

Creigton Asato received the B.S. degree in electrical engineering from the California Institute of Technology in 1985 and the M.S. degree in computer science from the Leland Stanford Junior University in 1988. From 1985 to 1992, he worked on standard cell and data-path compiler libraries at VLSI Technology, Inc., San Jose, CA. He joined HAL Computer Systems, Inc., A Fujitsu Company, located in Campbell, CA, in 1992 and has worked on register files and other large circuit blocks.