LPRAM: A Novel Low-Power High-Performance RAM Design With Testability and Scalability. Subhasis Bhattacharjee and Dhiraj K. Pradhan, Fellow, IEEE


Abstract—To date, all of the proposals for low-power designs of RAMs essentially focus on circuit-level solutions. What we propose here is a novel architecture-level (high-level) solution. Our methodology provides a systematic tradeoff between power and area. It also allows a tradeoff between test time and the power consumed in test mode. Significantly, the proposed design has the potential to achieve performance improvements while simultaneously reducing power. In this respect, it stands apart from other approaches, where power reduction results in speed reduction. The basic approach divides the RAM into modules and interconnects these modules in a binary tree, where the tree can be reconfigured dynamically during normal operation and during test mode. Furthermore, during test mode, most of the RAM can be switched off, which provides a major power reduction while test-application time is also reduced. The aspect ratio of the modules is allowed to vary as a design parameter; the chosen module aspect ratio affects the power/access-time/area tradeoffs. Such novel features give the proposed methodology potential practical significance. Also, a design tool is developed which takes as input various parameters, such as the desired power/performance, and outputs basic design parameters, such as the needed number of modules, the area overhead, and the resulting test speed-up.

Index Terms—Embedded RAM, leakage power, low power, low-power RAM (LPRAM), low-power testing, memory architecture, RAM, testable RAM.

I. INTRODUCTION

FURTHER progress in low-power very large scale integration (VLSI) technology, including low-power RAM designs, is crucial for the semiconductor industry. Additionally, the success of future system-on-a-chip (SOC) designs depends heavily on innovations in low-power embedded RAM design. All previous works on RAM focus on circuit-level solutions. There are mainly three directions in which research has targeted the design of low-power RAM [2], [3], [7], [8]: specifically, reduction in 1) charging capacitance; 2) operating voltage; and 3) static current. The methodology proposed here departs radically from all of these and provides an architectural, high-level solution. It does not preclude the additional application of circuit-level techniques for low-power design: any existing circuit-level technique can also be applied to our proposed methodology to achieve further power savings. However, a unique feature of our design that cannot be accomplished through the circuit approach is that power reduction is achieved with potential performance and test improvements.

Manuscript received December 12, 2002; revised March 28. This work was supported in part by EPSRC (U.K.) and is based on D. K. Pradhan's "A Low Power RAM Design" (patent filed). This paper was recommended by Associate Editor K. Chakrabarty. The authors are with the University of Bristol, Bristol BS8 1UB, U.K. (e-mail: pradhan@cs.bris.ac.uk).

Fig. 1. On-chip RAM.

The simultaneous reduction in delay and power is achieved by reducing the length of both the word and bit lines. Conventional wisdom dictates that any power reduction must also result in speed reduction. What we propose here stands apart in that, while the power is reduced, the speed is actually increased.
The overhead here is in terms of increased area. In particular, the proposed design has significant potential for application in the design of on-chip memories as shown in Fig. 1. Here, both power and test concerns pose major challenges. Also, our proposed design provides certain speed advantages. It has the potential to achieve higher speed and, also significant, it guarantees uniform access to all the cells. This is a byproduct of our novel layout strategy for the cell arrays. The power reduction targets normal operation of the RAM as well as during the testing of the RAM. The proposed methodology allows for systematic tradeoff between area, power, and performance. In addition, our design differs from all existing approaches in its unique ability for power reduction during both normal operation and testing. The speed of testing can also be varied, allowing varying levels of power dissipations. Another unique feature of our design methodology is that, unlike conventional design, the speed is improved while power is reduced, the tradeoff here being the area. There is an area increase over conventional RAM designs. The design methodology is recursive and has the unique feature of being scalable in that one can synthesize larger designs using smaller designs. A power estimation model for the proposed design is developed. This model demonstrates significant power savings. Also developed here is a model for area estimates. This is used to estimate the area increase for proposed design over traditional design. What is apparent is that the proposed design allows for smooth trade off between area, power, and performance. This paper is organized into Sections II X. Section II reviews the previous works on low-power RAM. The proposed architecture is discussed in Section III, followed by its design methodology, described in Section IV, with detailed discussion on its various modes of operations in Section V. Estimates of power, area, and performance, along with a comparison to the traditional RAM, are discussed in Sections VI VIII, respectively. Section X discusses the testing procedure with the test structure proposed in Section IV. A case study addressing all these issues is discussed in Section IX, showing the effect of aspect ratio on power, performance, and area /04$ IEEE

II. REVIEW

One of the popular techniques for low-power design is the reduction of the supply voltage [8], [10], [16]. However, there are limits to this approach. Decreasing the supply voltage requires a corresponding reduction of the threshold voltage. Also, noise and dc-current considerations prevent the supply voltage from being reduced arbitrarily. Another approach for decreasing power is to reduce the charging capacitance of RAMs [7], [10]. The charging capacitance can be reduced by partial activation of a multidivided data line (DDL) and/or a multidivided word line (DWL) [7]. The approach proposed here provides a systematic technique for reducing word-line capacitance in a manner that is useful in both operational and test modes. Further, our approach differs from earlier approaches [6] in that we do not share decoders between cell-array partitions, providing additional power savings. However, the word- and data-line segmentation techniques proposed earlier [2], [3], [7], [8] can still be applied to our modular partitioning technique, providing additional power savings. Our technique also allows multibanking within modules to attain further power reduction.

Other prior research on reducing power focused on techniques to reduce the charging capacitance during the data-retention period, as well as lowering the refresh frequency [12]. Keeping the refresh busy rate proportionately low for a large RAM increases the charging capacitance of a word line, along with the maximum refresh time of the cell. Doubling the maximum refresh time at each generation reduces power during refresh mode. However, this approach can be cumbersome and is limited by the cell leakage current. In [2], another scheme is proposed that utilizes a long word line for the refresh operation and a divided word line for normal operation, to attain reduced power during normal operation. The proposed technique allows a reduction in retention power as well, and differs from prior approaches in providing a higher level solution for low-power design.

III. PROPOSED ARCHITECTURE

The proposed architecture partitions the RAM into a number of modules, where each is a smaller RAM module with its own decoder and refresh circuitry. The modules are then interconnected by an H-tree [1], which provides for a planar layout as well as the incorporation of a particular built-in self-test technique. A new feature of our design is that modules are allowed to have an arbitrary aspect ratio. As demonstrated here, this allows major power/performance tradeoffs. Another new feature now proposed is the switching off of portions of the RAM during normal operation as well as during testing. Such a dynamic reconfiguration capability allows for a smooth tradeoff of test-application time and power dissipation during test. Importantly, the built-in-test structure proposed here differs significantly from earlier designs. Rather than activating all modules for parallel read and write, we allow parallel read/write to a group of only a small number of modules simultaneously. Because the rest of the RAM is switched off during testing, test power is drastically reduced. During the normal mode, major power savings are achieved because the modular design explicitly reduces the length of the word line activated.

Fig. 2. Conceptual schematic of LPRAM architecture.
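The power argument of this section can be made concrete with a small sketch: in the tree interconnect, one address bit is resolved per level, so an access exercises only the switch points on the root-to-leaf path of the selected module, while every other subtree stays idle (the switch nodes themselves are detailed in Section IV). The indexing convention and names below are ours, for illustration only.

```python
def active_switch_nodes(module_index: int, tree_depth: int):
    """Return the switch nodes that toggle when one leaf module is accessed.

    One address bit is consumed per level, so only the root-to-leaf path of the
    selected module sees any activity; all other subtrees remain quiet.  The
    root is at level 1 and the memory nodes sit at level tree_depth.
    """
    path, node = [], 0                      # node 0 is the root switch node
    for level in range(1, tree_depth):      # levels 1 .. tree_depth-1 are switch nodes
        bit = (module_index >> (tree_depth - 1 - level)) & 1
        path.append((level, node))
        node = 2 * node + 1 + bit           # descend into the chosen child
    return path

# Accessing module 5 of a 16-module LPRAM (tree depth 5) activates only 4 switch nodes.
print(active_switch_nodes(5, 5))
```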
An integral feature of the proposed methodology is the ability to tradeoff both power and performance with area, as described. Also, the ability to tradeoff test power with test-application time, as described, is of significant importance. IV. DESIGN OVERVIEW Our design for low-power RAM assumes cells divided into equal-sized modules, representing the size of the RAM in bits and, the number of address lines (assuming an bit organization). These modules appear as leaf nodes in a complete binary tree (Fig. 2). The depth of the tree and the number of modules or leaf nodes are related by. The size of each node is, where. Note that the root node is at level one. The parameters and define the properties of this architecture. A large means a higher granularity, a higher degree of power saving, speed-up, and testability, with increased chip size. (The design can also be configured using a -way tree, where is a power of two). Our low-power design relies on making modules of a different geometry than the earlier testable version. Also, major innovation in the test and refresh circuitry is proposed. The following highlights the key differences between the traditional approach to low power for a RAM that is built, using multiple cell array partitions, versus the proposed LPRAM. 1. The traditional cell array partition does not use any H-tree layout. Ours uses H-tree layout for laying out different cell arrays. This assures that, independent of the number cell arrays (modules) and independent of the size of the modules, the cells are generally equidistant from the read/write port. This will allow more predictability of the delays, because the delays are equally balanced in embedded RAM design. 2. In our approach, each cell array (module) has an independent refresh and decoder circuit. In the conventional cell-array portioning, these are shared. What we show is that this feature helps to achieve performance improvement during normal operation. 3. Also, the proposed design allows power reduction during refresh. Each module has fewer words so the

3 BHATTACHARJEE AND PRADHAN: LPRAM: NOVEL METHODOLOGY FOR LOW-POWER HIGH-PERFORMANCE RAM DESIGN 639 words can be refreshed at a slower speed. This coupled with the fact that the number of bits in each word is smaller we obtain a quadratic effect. However, since all nodes have to be refreshed in parallel, this quadratic savings reduces to a linear factor. It should be noted that the total energy required to refresh stays the same. 4. Also, independent decoding and refreshing is essential for parallel testing. This traditional cell-array partitioning can suffer from correlated failures reducing fault coverage. 5. The partitioned approach we have allows for an additional low-power mode, by being able to switch off portions of the RAM, at ease. This can be a major advantage when battery power is a concern. Although this can also be done in traditional cell-array portioning, this additional low-power mode in our LPRAM is much more flexible and versatile. 6. The H-tree layout also has the advantage of being able to pipeline multiple bits, through the H-tree, providing an additional bandwidth potential. This is not possible in traditional cell-array partitions. As shown in our paper, different kinds of address mapping is possible, because of this modular approach we have taken. 7. The H-tree circuit, itself, can be built with wider and faster buses, making the delay in the H-tree negligible. This particular H-tree has decoders which are very simple, and can be built differently than the cell arrays, for additional speed. 8. Unlike cell array partitions, we have the potential to achieve significant speed advantages, BOTH during normal operation as well as during testing. Testing can be a major concern and our low-power RAM (LPRAM) achieves higher test speed, while reducing the power consumption. 9. Although the comparisons done here are done assuming only four cell-array partitions within our module, there is no reason why more partitions cannot be used, within each module providing greater savings in power and higher speed. 10. Our design approach is RECURSIVE by nature. This has the advantage of design reuse, using a thoroughly optimized, smaller RAM design to build a larger one. As we progress through the generations of RAM design, the ability to use a recursive approach can be of significant advantage in speeding up design, and verifying the design. Our design methodology has the unique feature of being scalable as one can build larger RAMs using smaller RAMs. A Simplified Model of RAM for Comparisons: In this paper, we assume a simplified model of RAM, as shown in Fig. 3. This model is used here for both conventional RAM and for the modules used in the proposed architecture. The simplified model is used because it admits developing simple and accurate expressions for comparing power, area, and performance estimates, as shown later. Since we are using this model for the basis of comparison, this does not compromise the basic results and conclusions obtained. Fig. 3. Simplified model of RAM architecture. Based on our simplified model of RAM, we observe that a conventional RAM (Fig. 3) can be thought as a special case of low-power high-performance RAM with where cells are arranged in four quadrants, each holding cells arranged in a two-dimensional (2-D) matrix of rows and columns. The address bus is divided into two equal (near equal, when is odd) parts, one half used to decode the row, and the other to select the column. 
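A rough sketch of the sizing relations just described, with the modules as leaves of a complete binary tree and each module organized as four quadrants. The symbol names here are our own, so treat this as an illustrative reconstruction rather than the paper's design tool.

```python
def lpram_sizing(n: int, tree_depth: int, node_ari: int = 0):
    """Illustrative LPRAM sizing: a RAM of N = 2**n one-bit cells split over
    Q = 2**(tree_depth - 1) leaf modules (root at level one), each module laid
    out as four quadrants whose row/column split follows the requested
    aspect-ratio index (AR = 2**node_ari, width over height)."""
    N = 2 ** n
    Q = 2 ** (tree_depth - 1)
    cells_per_module = N // Q
    cells_per_quadrant = cells_per_module // 4
    q_bits = cells_per_quadrant.bit_length() - 1
    rows_log2 = (q_bits - node_ari) // 2          # remaining address bits go to the columns
    cols_log2 = q_bits - rows_log2
    return {"modules Q": Q,
            "cells per module": cells_per_module,
            "quadrant rows": 2 ** rows_log2,
            "quadrant cols": 2 ** cols_log2,
            "quadrant aspect ratio": 2 ** (cols_log2 - rows_log2)}

# Example: a 16M-cell RAM with 16 modules and unit-aspect quadrants.
print(lpram_sizing(n=24, tree_depth=5, node_ari=0))
```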
For the sake of comparison, we assume a four-quadrant architecture, but the architecture allows each module to be built out of more numbers of cell array partitions. Basically, two types of nodes are used in our design: memory nodes and switch nodes. Memory nodes have the cell array based on the traditional multisubarray (for example, four quadrant) organization with independent control units, refresh circuitry, and certain built-in test circuitry. Each module itself can also be designed with a larger number of subarrays, as in current designs. For the sake of modeling, we propose that each module containing cells is arranged in four quadrants, each quadrant holding cells. Each quadrant is a 2-D array of memory cells arranged in rows, each row containing cells. But, unlike conventional RAM, we divide the address bus ( address lines) into two parts and respectively, to give preferably a nonunit aspect ratio. These and address lines are separately decoded in the row and column decoders, respectively, to give rows and columns. We define, the aspect ratio of each quadrant in LPRAM. Additionally, each memory node contains some tristate switches on the runs of power line(s), to cut it off from the power source when required. The number of such switches will depend on the maximum number of elements active at any time and on the power-line layout. The control of these switches is discussed in the latter part of this section. The switch nodes are simple 1-out-of-2 decoders with buffers. As Fig. 4 shows, the memory nodes are connected hierarchically, using the switch nodes, and laid out in an H-tree layout. Let each memory node be identified by, where. Therefore, as shown in Fig. 4 (for ), the nodes are num-

4 640 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 23, NO. 5, MAY 2004 Fig. 4. H-tree of LPRAM architecture. bered, consecutively numbered nodes adjacent to each other in the layout. We consider the laying of the memory nodes in rows and columns in the H-tree layout. We define the aspect ratio of H-tree to be equal to. For this initial example, the aspect ratio of the memory node and the H-tree will be assumed to be 1:1, that is and. The address/data/control bus is connected to the root, a switch node. The most significant bit is decoded, generating a left subtree or a right subtree select. The other signals are buffered and propagated down the tree. This action occurs repeatedly at each level until a single memory node is selected. At this point, the remaining address bits are latched into the address buffers of the selected memory node only, and are then used to select a cell within the node. The address buffers of all other nonselected nodes remain completely unchanged, thereby nullifying any possibility of activity within them (other than normal refresh activity). Each cell is identified by the address, where (node address) and (address within a node). Aspect Ratio: As defined earlier, is the aspect ratio of each quadrant of a memory node in LPRAM and is equal to. Since all four quadrants of a memory node are identical in structure, we see that the aspect ratio of a memory node is almost the same as the aspect ratio of the quadrant. This is also shown in Fig. 5. So, we define the aspect ratio of a memory node, and only use for discussion. As both and are powers of two, so is ; i.e., for some to be referred to as the aspect ratio index (ARI) of a memory node. The aspect ratio of LPRAM depends on: 1) the aspect ratio of the individual module and 2) the aspect ratio of the H-tree layout (Fig. 5). This figure depicts an LPRAM with 16 memory nodes, shown to have a chip aspect of 2:1, where the individual memory node has an aspect ratio of 1:2, and the aspect ratio of the H-Tree layout is 4:1. We define to be the ARI of the H-tree layout, such that, the aspect ratio of the H-tree layout. The ARI of the LPRAM, and the corresponding aspect ratio of the LPRAM is. It should be noted that the aspect ratio of the RAM chip (denoted as ) is defined as the ratio of the two sides. Since we do not make any distinction between width and height at line chip level always. All other s are defined as the ratio of width divided by height. We illustrate the difference between and using Figs Fig. 6 shows the layout of a LPRAM with 16 modules, where and, producing. Whereas, Fig. 7 shows another layout of the same LPRAM with 16 modules, where and, producing. However, the chip aspect ratios of all of them are the same and equal to 2:1. It should be noted that the conventional tradeoff is the lower the power the lower the speed. However, the proposed design methodology achieves power savings, while at the same time achieving higher speed up to a certain point as shown later. The tradeoff here is with the area. The proposed design increases the area. It is important to note that the traditional relationship between power and performance does not hold here. Normally, as power reduces, speed also reduces. However, in the proposed methodology, the reduced power design has improved performance. However, there is an increase in area as the power is reduced. It will be simpler to discuss the various properties and estimates with respect to ARIs, rather than to the aspect ratio. 
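Since every aspect ratio in this scheme is a power of two, converting between an ARI and the corresponding ratio, and composing the node and H-tree ARIs into the chip-level ARI, is straightforward. A minimal sketch follows; the variable names are our own, and the additive composition of the two indices is our reading of the Fig. 5-7 examples.

```python
def ratio(ari: int) -> str:
    """Aspect ratio corresponding to an aspect-ratio index: AR = 2**ari,
    with negative indices read as 1:2**(-ari)."""
    return f"{2 ** ari}:1" if ari >= 0 else f"1:{2 ** (-ari)}"

# Fig. 5: a node ARI of -1 (1:2) combined with an H-tree ARI of +2 (4:1)
# yields a chip-level ARI of +1, i.e. the 2:1 chip.
node_ari, htree_ari = -1, 2
print(ratio(node_ari), ratio(htree_ari), ratio(node_ari + htree_ari))
```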
Therefore, from this point, we will focus only on ARIs. The corresponding aspect ratio can be easily computed from the known ARI. In summary: the aspect ratio of the chip is ; the aspect ratio of a memory node is ; the aspect ratio of the H-tree layout is ; the aspect ratio of the LPRAM layout is. For a given size of RAM, in a traditional model as well as in a LPRAM model, we cannot get any arbitrary aspect ratio. For example, it can be easily seen that if we are to design a RAM of size or 4 M, we cannot get any configuration of rows and columns, such that the resulting chip has the aspect ratio of 2:1. However, for any given specified size of the RAM, one can realize the RAM with various aspect ratios, such as 1:1, 4:1, or 16:1, etc., which correspond to ARIs 0, 2, 4, etc., respectively. So, even if a traditional RAM is designed with a nonunit aspect ratio, the ARIs (and, correspondingly, the aspect ratios) are restricted by ; i.e., ARI is even if and only if is even, and it is odd if and only if is odd. It is also easy to see from the argument above that if and only if is odd (i.e., is even and modules), then can have only even values. Similarly, can take even values if and only if is even. We know for an LPRAM, and, thus, if is odd, is even and vice versa. Lemma 1: For a given size of RAM,, the ARI of any LPRAM is odd (even) if and only if is odd (even). Proof is given in the Appendix. Lemma 2: For a given RAM of size and the given chip aspect ratio, such that either both and

5 BHATTACHARJEE AND PRADHAN: LPRAM: NOVEL METHODOLOGY FOR LOW-POWER HIGH-PERFORMANCE RAM DESIGN 641 Fig. 5. LPRAM with chip aspect ratio 2:1 and node aspect ratio is 1:2 for low power and higher speed. Fig. 6. LPRAM with chip aspect ratio = 2:1, high power, and low speed. Fig. 8. Test structure of LPRAM with four-way built-in comparison. Fig. 7. LPRAM with chip aspect ratio = 2:1, low power, and high speed. are odd numbers or both are even numbers, then there are exactly ways to construct the LPRAM to meet the given aspect ratio. Proof is given in the Appendix. As illustration, consider a 64 M DRAM. This has cells and let the of the chip be 2. This provides altogether distinct possible architectures. So, we have a large flexibility in power/performance tradeoffs. However, all these H-Tree layouts are not advantageous, with respect to wire length and, as wire length increases, performance decreases due to longer critical path length. So, only a particular range of variation in the aspect ratio of the H-Tree layout, say, is of practical importance. To keep the formula simple, we assume the ARI of the H-Tree layout and the memory nodes to be 0 (i.e., aspect ratio 1:1) while deducing power and performance estimates. It has been verified that little variation in the aspect ratio of the H-Tree layout produces little difference in the power estimates and performance. Instead, the variation of the aspect ratio of the individual module greatly affects the power, performance, and area of the module (all addressed in Sections VI X). What is key here is that an optimum aspect ratio can be used for individual memory nodes to reduce the power dissipation and increase the speed, while, at the same time, achieving the given aspect ratio of the chip by varying the aspect ratio of the H-Tree layout, as long as it is tolerable. Test Structure: Fig. 8 shows the test structure to be used for the proposed low-power LPRAM during testing. All these modules have been divided into quadrants (shown as the dotted boundary in Fig. 8), each quadrant holding (a power of two, assumed to be four in the figure) modules. So, we have. In each quadrant, comparators are placed between

6 642 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 23, NO. 5, MAY 2004 Fig. 9. Address mapping of LPRAM. adjacent modules, shown in Fig. 8,,,. The output of all those comparators is fed to a input OR gate, centrally located in that quadrant. The output of all these OR gates is tagged and sent as a single FAIL line, to generate an error during testing. So, in each quadrant, all the modulo adjacent nodes are compared simultaneously, eventually leading to a speed-up of fold during testing. V. MODES OF OPERATION The low-power testable RAM (LPRAM) has three modes of operation: 1) normal mode; 2) test mode; and 3) standby mode. There are two additional inputs, of which one (TEST) is used to activate the RAM in test mode and the other in low-power mode. In test mode, test data is fed into the RAM and any discrepancy in testing raised by an additional output pin FAIL. In low-power mode, the modular design allows switching off portions of the RAM. Dynamic reconfiguration based on workload can be easily accomplished in standby mode. This mode of operation can be useful when the system is not very active. Switching off portions of RAM can be done easily in our architecture by the memory management unit, to save additional power. We propose to overlay an additional switch structure for routing the. This will allow switching out half or three fourths of the RAM by disabling appropriately. Test Mode Operation: The LPRAM can be put into test mode by activating the TEST pin. Test data is fed into LPRAM, as usual, through the external tester by addressing as,,. These bits are ignored during testing, and data is written parallel into all nodes simultaneously, in the th quadrant. By Test Write, the writing up of data to all locations identified by the address is conveyed. Similarly, by Test Read, the parallel reading of all locations addressed, routing the data internally to the OR gate and finally to the FAIL line, is conveyed. Testing proceeds by activating each one of these quadrants, one at a time. The extra pins provided for the low-power reconfiguration, as described above, are used here in test mode, switching out all other quadrants. Identical data is simply written into all the modules in the quadrant; the data is then read back and compared against each other internally for test. Thus, all modules in a quadrant can be tested simultaneously. The testing time to test all the modules in any quadrant is the same as testing any single module providing considerable speed-up. Low-Power Structure and Operation: RAMs full capacity is often not fully utilized, a small fraction active most of the time. So, a technique which enables switching off of the portions of RAM not in use, but dissipating power due to leakage and refreshing (at chip and board levels) is enormously helpful. We provide here a chip-level technique. However, how much of the LPRAM can be switched off depends on the number of additional input pins (called DIV pins) allowed: with one DIV pin, either half or three-fourths of the RAM can be switched off. Switching off larger portions of the RAM can be done using additional pins. Because LPRAM is so modular, it can be accommodated by FULL and DIV lines controlling the tristate switches planted inside the memory nodes. This mode is very useful for handheld devices, particularly when battery power is below certain thresholds. 
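A minimal sketch of one way the FULL/DIV gating could be interpreted. The exact DIV-pin encoding is not specified here, so the halving scheme below (step 0 keeps half of the modules powered, step 1 keeps a quarter, and so on) is an assumption used only to illustrate switching off whole subtrees.

```python
def powered_modules(q_modules: int, full: bool, div_step: int = 0):
    """Illustrative FULL/DIV power gating over the module tree.

    full=True keeps every module powered; otherwise div_step selects how much
    of the tree stays on: step 0 keeps one half, step 1 keeps one quarter, etc.
    (assumed encoding -- the text states only that half or three-fourths of the
    RAM can be switched off with a single DIV pin).
    """
    if full:
        return list(range(q_modules))
    keep = max(1, q_modules >> (div_step + 1))    # halve the powered portion per step
    return list(range(keep))                      # keep one contiguous subtree powered

print(powered_modules(16, full=False, div_step=0))   # half of a 16-module LPRAM stays on
print(powered_modules(16, full=False, div_step=1))   # only a quarter stays on
```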
The stepped nature of this configurability provides additional flexibility to the operating system to select a range of battery power thresholds to better utilize power, rather than wasting. Address Mapping: Basically, two ways to map the addresses exist: one is to have consecutive addresses within each module, and the other addresses are interleaved across modules. In Fig. 9, we have shown two different mappings of 32 addresses (given in hexadecimal) into modules. Address bits are divided into two parts, and, here represents the module number 0 through 7 and represents the address within the module. In Fig. 9(a), the least significant two bits are changed to produce consecutive addresses within the same module. The mapping shown in Fig. 9(b) has the advantage of being able to access multiple addresses through pipelining. Buffers can be placed on the switch nodes to facilitate this. This will further impact speed.
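The two mappings of Fig. 9 differ only in which address bits select the module. A small sketch using that figure's parameters (32 addresses over 8 modules, hence 3 module bits and 2 offset bits; the bit-field widths are inferred from the example).

```python
def map_consecutive(addr: int, offset_bits: int = 2):
    """Fig. 9(a)-style mapping: high-order bits pick the module, so consecutive
    addresses stay inside the same module."""
    return addr >> offset_bits, addr & ((1 << offset_bits) - 1)   # (module, offset)

def map_interleaved(addr: int, module_bits: int = 3):
    """Fig. 9(b)-style mapping: low-order bits pick the module, so consecutive
    addresses land in different modules -- the property exploited for pipelined
    block access."""
    return addr & ((1 << module_bits) - 1), addr >> module_bits   # (module, offset)

for a in range(5):
    print(a, map_consecutive(a), map_interleaved(a))
```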

7 BHATTACHARJEE AND PRADHAN: LPRAM: NOVEL METHODOLOGY FOR LOW-POWER HIGH-PERFORMANCE RAM DESIGN 643 Fig. 10. Power dissipation within a memory node. VI. POWER-ESTIMATION MODEL AND COMPARISONS The following is the active power equation for CMOS RAM of size cells (i.e., cells arranged in rows, and each containing cells in each quadrant of a four-quadrant memory module), given by [2] Data Retention Power of Conventional DRAM: In the data-retention mode, internal data is retained and refreshed without any access from outside. The refresh operation is performed by reading data of all the cells on a single word line, and restoring them to their original values. The refreshing circuitry selects each of the word lines in order, and during the whole time (called refresh busy time), the RAM is not accessible from the outside. For high-performance RAMs, refresh busy time is expected to be as low as possible. The refresh cycle frequency equals, where is the refresh time interval of cells in the retention mode, and increases with reducing junction temperature. In general, is much smaller than the, which is provided in specification and depends on the cell technology for the trench capacitor. The power consumed for refreshing cells can be derived as (3) where is an external supply voltage, is the active current drawn by the selected cells, and is the data retention current required by any inactive or nonselected cell. is the output node capacitance of each decoder, is the internal supply voltage, is the total capacitance of the CMOS logic and driver circuits in the periphery. Let represent the total static (dc) current of the periphery, and is the operating frequency. When we need to access a cell within a RAM, all the cells along the row, containing specific cell, are selected simultaneously. As mentioned earlier, we are using a simplified model of RAM, as shown in Fig. 3. This model is used for both conventional RAM and the modules used in the proposed architecture. This helps in developing easy-to-understand expressions for power, area, and performance estimates and comparisons. Equation (1) can be simplified for high-frequency DRAM operation (Fig. 10), and by the use of the CMOS NAND decoder, as well as by elimination of very low dc components, yielding the following reasonable approximation. Data Reading Power of Conventional DRAM: The destructive readout of a DRAM cell requires successive operations of amplification and restoration for the selected cell on every data read. Here, each cell is basically a trench capacitor, requiring charging and discharging during each reading. This is accomplished by a latch-type CMOS sense amplifier on each data line. So, during the reading of a data line, the associated trench capacitor is charged and discharged with a large voltage swing of (usually V) and with charging current of, where is the data line capacitance. The active power consumption during read is given by (1) (2) From (2) and (3), it follows that the following factors are crucial to reduce the power during any read/write cycle: 1) reducing charging capacitance (, ); 2) lowering the external and internal voltages (,, ); 3) reducing static current ; and 4) reducing refresh cycle frequency. As mentioned, several techniques have been offered to reduce circuit parameters. These techniques can be used in conjunction with our proposed architectural solution to low-power design. It is interesting to note that reducing design parameters like and can also reduce power consumption. 
Therefore, for instance, if previous researchers have proposed segmenting the word line, the proposed low-power architecture allows a systematic way to reduce. It further allows reduction of the to reduce power, not previously possible by the circuit level techniques. In the LPRAM, data is read out or written into by first choosing a selected module by the tree decoder, and power is being dissipated only by the decoder (switch nodes) on its path (Fig. 11). The address is then decoded in selected modules to locate the exact cell containing data. For example, in Fig. 11, only those switch nodes that are hatched consume power while reading a data from module. No other switch node is activated at all. This observation is used for modeling power for switching nodes. However, the switch nodes consume a very small fraction of overall power. Consider the example of a 16 M DRAM, with and. The same size of RAM implemented using LPRAM architecture will have and, with 16 nodes of 1 M each. In addition, if DWL is used for 16 divisions of the word line, then for traditional RAM, and the corresponding value for LPRAM is 64. The power reduction in the proposed RAM is achieved primarily by reducing these parameters. In the following, we develop various equations for power estimates.
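As a quick numerical check of the 16M example above, the word-line figures can be reproduced by counting the cells per activated word-line segment in a square array with a 16-way divided word line; this reading of the example recovers the quoted LPRAM value of 64 and is offered only as an illustration.

```python
def wordline_segment_cells(total_cells_log2: int, dwl_divisions: int, modules: int = 1):
    """Cells driven per activated word-line segment, assuming square arrays and a
    divided word line (DWL); an illustrative reading of the 16M example."""
    cells_per_array = 2 ** total_cells_log2 // modules
    columns = int(round(cells_per_array ** 0.5))   # width of a square array
    return columns // dwl_divisions

print(wordline_segment_cells(24, 16))              # traditional 16M RAM: 256 cells/segment
print(wordline_segment_cells(24, 16, modules=16))  # one 1M LPRAM module:  64 cells/segment
```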

8 644 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 23, NO. 5, MAY 2004 Fig. 12. Access power reduction in LPRAM during normal mode. Fig. 11. Power dissipation model during normal mode in LPRAM. Data Reading Power of LPRAM in Normal Mode: The data read-out power for LPRAM can be formulated as where could be as small as, and is the effective capacitance seen in the tree decoder of LPRAM. Equation (5) provides the expression for the tree capacitance,. Estimation of : In the tree decoder, each switch node consists of a simple one-out-of-two decoder and buffers. The decoder is a one bit decoder consisting of one level of logic. Additionally, each decoded signal is controlled by the preceding subtree select (chip enable for the first level), and this introduces another level of logic. In each switch node, one bit of the address is decoded, and the rest of the address bits are simply transmitted. At each node, the signal has to drive a load of (two gates each offering a load of ), and the output gate has a drive capability of. The bus width is assumed to be. All bus lengths of the tree are computed with respect to, the length of the vertical side of the LPRAM (Fig. 17). The input buffer drives the bus up to the root node-length,. Let, the number of levels in the tree, be assumed odd. The length of the bus connecting level 1 to level 2 is. Thus, if is the capacitance of metal over field oxide, then the load offered by the bus, between levels 1 and 2, is. Each node is connected to two nodes at the next lower level. Therefore, a buffer at level has to drive two buffers at level, each offering a load of. Thus, this load can be modeled as. The total load that has to be driven at level 1, by the second gate, is and which is parallel to. The total capacitance seen at level 1 can, therefore, be represented as. The capacitance at level 2 is the same as level 1 because the bus lengths are the same. Further, after every two successive levels, the length of the bus to be driven decreases by half of the (4) level before. For example, levels 3 and 4 have to drive buses of lengths of ; and subsequently, levels 5 and 6 have to drive buses of length, and so on. Let. In general, the bus length to be driven by the node at level can be expressed as. A tree of depth will have decoding stages. Therefore, the total capacitance over the entire tree, from level 1 to the leaf nodes, can be modeled in parallel operation, as, which evaluates. So, the capacitance value seen from the root of the tree to the accessed node is given by Data Retention Power of LPRAM: The LPRAM achieves a corresponding reduction in retention power as well because of the reduction in both and architectural parameters. The equation for data retention power is given by Refreshing is done independently within each module. Also, we have. Let us assume is of the form as the ratio between and ; i.e.,, then. So, by an appropriate choice of, both the data read out power and the data retention power can be reduced! We have calculated the power dissipation of the proposed LPRAM for a large range of module sizes, and for four different RAM sizes, 4, 16, 64, and 256 M. The reduction in power dissipation over the traditional RAM is illustrated in Fig. 12 with a range of aspect ratios of individual memory node. These savings are shown as percentages of reduction in power dissipation over the same size of conventional RAM. For the sake of comparison, we have considered the same number of partitions (5) (6)

9 BHATTACHARJEE AND PRADHAN: LPRAM: NOVEL METHODOLOGY FOR LOW-POWER HIGH-PERFORMANCE RAM DESIGN 645 Fig. 13. Retention power reduction in LPRAM during normal mode. (four partitions or four-quadrant) in the traditional RAM, and the individual memory node of the LPRAM. From Fig. 12, we see that for the same size of RAM, we achieve greater access power savings when the aspect ratio of the individual memory node is greater. However, when aspect ratio become too large, the retention power increases. But, from Fig. 13 we also see that there is a savings of the retention power as well. (The test mode power savings is discussed later). It may be noted that the reduction in retention power is the result of potential Q-fold reduction in refresh frequency because each module has fewer words and a potential Q-fold reduction in number of bits per word. However, because the Q modules have to be refreshed in parallel, the net effect is at best linear reduction. One must also observe that while the refresh power can be reduced, the total energy needed to refresh remains the same. VII. PERFORMANCE IMPROVEMENT AND COMPARISONS In this section, we demonstrate potential performance improvements attainable by the proposed architecture. Node Delay: The primary delay in accessing data within a memory node is due to: 1) the selection of a word line (row decoding); 2) enabling the selected word line; and 3) the charge transfer between the selected cell. Column decoding is performed in parallel with the above operation, and does not appear in the critical delay, as long as that delay is less than the added delay of 1) and 2). Note that operation 3) cannot be done before column decoding; the address is buffered before it drives the decoder. Address Decoding Delay: Using the CMOS NAND decoder [7] for low-power operation, only one out of rows is charged for every addressing, and as assumed earlier, at each stage, address lines are decoded by switching a series of CMOS gates at each stage, except the first. Then, the capacitative load seen by the output of one stage to the next will be. For the CMOS NAND decoder, delay at every stage will be proportional:. The delay in decoding the address can be. Word Line Enable Delay: The most common model of word line enable delay is. Bit Line Delay: The general form of bit line delay is, where, and is the resistance of the transistor. So, we model the total delay along the critical path within a memory node of LPRAM as. Delay in Traditional RAM: We compute the delay in accessing data in traditional RAM; i.e.,, using the formula as given for, by setting and keeping all other circuit parameters unchanged. So,. Delay in LPRAM: As each module of the LPRAM follows the traditional architecture, we will compute both the additional cost of the signal propagating up and down the tree, as well as the delay within the node. The delay up the tree is less than the delay down the tree, since only one buffer delay is introduced at each node. In propagating the signals down the tree, the address bits are buffered and decoded. The data signals propagating up the tree after a read are simply buffered. However, conservatively, the delay up the tree is taken to be the same as the delay down the tree. Therefore, the total delay along the critical read access path for the LPRAM architecture can be modeled by. Tree Decoder Delay: In the tree decoder, each switch node consists of a one bit decoder, consisting of one level of logic. 
Additionally, each decoded signal is controlled by the previous subtree select (chip enable for the first level), and this introduces another level of logic. To estimate the worst case delays, we model the delay at each switch node as the sum of the signal propagation delay through two levels of logic, coupled with the delay for driving the bus structure and the gates at the next level. We have already seen in Section V that the total load offered to the bus structure of H-tree is, in parallel with. Therefore, the delay over the entire tree, from level 1 to the leaf nodes, is given by. Assuming inputs to the tree are also buffered, we have delay from the input to the root node, which is in the center of the layout, as. The total tree delay is the sum of the previous expressions, given by,. For simplicity, in the above analysis of the tree decoder delay, we have assumed or 1. In practice, we will keep the aspect ratio of the H-tree within 4:1; i.e.,, and will not produce any noticeable deviation from what is given above. The access time for the LPRAM architecture, as proposed here, is given by. This equation has been evaluated for RAM of sizes 4, 16, 64, and 256 M, with other parameters, as given in Appendix. Fig. 15 shows the percentage reduction in delay; i.e.,, of LPRAM with respect to the traditional architecture. Here, the tradeoff is between the gain in performance due to partitioning, versus the additional delay in traversing the tree. Importantly, we can see a steady improvement in the performance for higher RAMs. Like access power, performance of the LPRAM improves when the aspect ratio of individual memory node is increased. These graphs in Fig. 15 illustrate the performance improvement as the number of nodes increases for the same size of RAM. As expected, finer granularity results in shorter word length within each module resulting in faster RAMs. Furthermore, it may be observed that as the aspect ratio is increased, we obtain improvement in performance as well. This can be also explained by the fact that the higher the aspect ratio, the smaller the word line. However, the key observation to be made is that

there is a point of diminishing return: increasing the aspect ratio beyond a certain point does not improve performance.

Fig. 14. Uniform distances for various cells.

Performance Improvement in On-Chip Memory Designs: The proposed H-tree layout for the cell arrays has the following unique advantage when applied to on-chip memory designs such as caches. In these applications, not only is speed very important, but the ability to provide uniform access to different cells is also of practical significance. The H-tree layout ensures uniform path lengths from the root node to the various cells. In traditional designs, the path lengths to various cells can vary significantly; in our design, all of the cell arrays are the same distance from the root node. This is further illustrated in Fig. 14. This uniform-distance property holds for an H-tree layout of any size.

Further Speed-Up Using Pipelining Techniques: As illustrated in Fig. 9, the proposed architecture admits two different modes of address mapping: noninterleaved, as shown in Fig. 9(a), and interleaved, as shown in Fig. 9(b). The interleaved mode admits access to consecutive addresses in a single cycle. By placing data buffers in the switch nodes, one can further access a block of data simultaneously. Consider the address mapping shown in Fig. 16: any four consecutive addresses reside in four different modules in the interleaved mode. By placing buffers in the switch nodes, one can move data for consecutive addresses in and out simultaneously through pipelining, as shown in Fig. 16. This can be especially effective when data transfers are done in blocks, as in caches.

Fig. 15. Performance enhancement in LPRAM.

Fig. 16. Pipelined block transfer of consecutive addresses.

VIII. AREA ESTIMATES WITH NEW TECHNOLOGY AND COMPARISONS

As before, we use the simplified representation of RAM, shown in Fig. 3, in developing area-estimation formulas. Since these estimates are used for comparisons only, any other estimates should yield similar results.

Area of a Memory Node: As mentioned earlier, within each memory node of LPRAM, the address bus of width is divided into two parts and, and is decoded by the row and column decoders, respectively. This is done in such a way as to obtain the desired aspect ratio of. So, we get and. The row decoder, using CMOS NAND gates, selects one out of rows. Assuming a stage decoder, at each stage address lines are decoded by switching a series of CMOS gates. The area of the row decoder at the first stage [13] is given by, where is the area of a single CMOS inverter needed per output. All other stages require, as there is an extra signal line from the previous stage. The number of decoders in each stage is. The total area for the row decoder is then given by. The decoder area is a function of, the number of address lines, and. The number of address lines decoded at each stage generally varies between two and four in most practical designs. Now, considering the width of the row decoder to be, we approximate the height of the row decoder to be. We similarly define, and. We take, and the column decoder and the sense amplifier are similarly characterized by and, respectively. Because the CMOS NAND decoder takes much more space, we need to put enough space between two rows to accommodate the decoder portion.
Let the area of each cell be. Because the small trench capacitance area required for the individual cell is very small, the estimated height will be dominated by the height of the row and column decoders. We get the width of the node to be the maximum of, denoted as. Similarly, we calculate the height of the node to be. Here, we have added extra space, equivalent to, required for the sense amplifier. The timing and control in DRAM is implemented by a timing chain generated by delay elements. As in [13], this is imple-

11 BHATTACHARJEE AND PRADHAN: LPRAM: NOVEL METHODOLOGY FOR LOW-POWER HIGH-PERFORMANCE RAM DESIGN 647 Fig. 18. Area model of LPRAM. Fig. 17. Area model of a memory node. mented as. The memory node has address bits; therefore, it requires address buffers whose area is given by, and the data buffer is characterized by. Then, the area of each of the memory nodes (Fig. 17) is computed as (7) Area of Traditional RAM: Now, the area of the traditional RAM can be computed using (7), with small changes in the parameters. We set the address bus width equal to, and the aspect ratio of RAM to 1:1 (2:1), when is even (odd). So, here we use and.as will always remain greater than, the number of stages in the address decoder may need to be increased. We compute the area of the traditional RAM, using (7), with, between two and four. Area of a LPRAM: We assume the bus structure of the LPRAM is implemented such that the address bus is multiplexed (usually done with little, if any, performance penalty since the column address is required some time after the row address). The lower address bits can be multiplexed; the upper bits propagate directly for the subtree and the final node select. Therefore, the bus carries address lines, one data line, two lines (TEST & FAIL) for testing, two lines (FULL & DIV) for low-power configuration structures, and one for. The area of the LPRAM architecture (Fig. 18) can be computed by using the following parameters. Let be the width of the bus, let be the length of the horizontal side of the chip, and let be the length of the vertical side of the chip. Therefore:, where is the pitch, the difference between neighboring signals, in, and. Because the aspect ratio may be other than 1, depending on the value of, we explicitly need the height and the width of each node, rather than its area. Therefore, the area of the nodes and the bus structure is. Fig. 19. Area overhead in LPRAM. Finally, certain input buffers would be required to drive the tree, the area for which can be estimated as. The area of the LPRAM can, therefore, be expressed as. The area requirements of this architecture are analyzed and compared to the traditional architecture. Significant area increases for LPRAM architecture are seen for large numbers of nodes. However, the proposed architecture may be best-suited for defect-tolerance techniques. For example, any single defect can be tolerated by switching out half of the tree. By exploring such defect tolerance techniques, one may be able to obtain acceptable yield levels. To compute this increase, four sizes of RAMS are evaluated: 4, 16, 64, and 256 M, over a series of values of. Fig. 19 shows the percentage of increase in the area of the LPRAM over the traditional implementation. As expected, the key factor in area increase is the number of nodes. However, as shown below, the larger the number of nodes, the greater the performance improvement.
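The tradeoffs just described can also be explored mechanically, in the spirit of the design tool mentioned in the abstract. The sketch below sweeps the number of modules Q and the module ARI; the proxies are ours, not the paper's equations — the activated word-line length stands in for access power and node delay, and one duplicated decode/refresh block per module stands in for the area overhead that grows with Q.

```python
def design_sweep(n_log2: int, q_options=(4, 16, 64), ari_options=(0, -1, -2, -3)):
    """Toy design-space sweep over module count and module aspect-ratio index.
    Proxies only: shorter word lines suggest lower access power and delay, while
    more modules mean more duplicated per-module circuitry (area overhead)."""
    for q in q_options:
        for ari in ari_options:
            cells_per_quadrant = 2 ** n_log2 // (q * 4)
            addr_bits = cells_per_quadrant.bit_length() - 1
            cols = 2 ** ((addr_bits + ari) // 2)       # AR = cols/rows = 2**ari
            yield q, ari, cols, q                      # last field: duplicated decoder sets

for q, ari, wordline, dup in design_sweep(24):         # a 16M-cell RAM
    print(f"Q={q:3d}  module ARI={ari:+d}  word line ~{wordline:4d} cells  decoder sets={dup}")
```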

12 648 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 23, NO. 5, MAY 2004 IX. IMPACT OF ASPECT RATIO ON POWER, PERFORMANCE, AND AREA At this point, we are now ready to draw some relation between aspect ratios ( and ) and the power consumption, performance, and area overhead in LPRAM. From all the previous analysis, it is clear that dominates the power consumption and performance. So, reducing will reduce power as well as improve performance, an aspect ratio of the module with being preferable. However, we cannot reduce arbitrarily, as this will increase gradually, which may even cross, thereby significantly increasing the data retention power consumption compared to traditional RAM. We will, therefore, prefer to be 1:2, 1:4, or 1:8, depending on the size of the RAM and the number of memory modules. For example, if we need to design a RAM of size or 16 M, one of the optimum LPRAM implementations could be to divide the RAM into 16 modules (i.e., ), and use. So, we will get and. Additionally, if we need to satisfy a chip-aspect ratio of 2: 1, we have two choices, as shown in Figs. 5 and 7, either able to be used, depending on the requirements. Choosing the organization given in Figs. 5 or 7 will not give any performance penalty, either, for the following reason. While computing the delay of the H-tree layout and, for simplicity, we have assumed a unit aspect ratio of the H-tree layout. However, all of the calculations, as well as the bus length, have been measured with respect to, the length of the vertical side of LPRAM. We had it in mind that the high-performance low-power LPRAM will be one with. So, will always be greater than, and using for our analysis will yield a much more conservative result. The fact can be even verified in Figs. 5 and 7, by observing that all the wire segments in the H-tree layout are less than. It has been also found that the area of the LPRAM is minimum, when both and are equal to one. This is because the area is mainly dominated by the decoder area which changes significantly, even if a single address line moves from row decoder to the column decoder and vice versa. So, for the same RAM of size 16 M, another alternative is to have. Such an implementation, however, will consume more power with low speed than the previous one, but could be laid in a smaller chip area than the previous one. X. TESTING The testability technique used here enables the of nodes to be tested in parallel, as mentioned earlier. Depending upon the size of the RAM and the number of modules in LPRAM, we set the value of. Thus, we get a test time saving of fold, without dissipating much power as well. A test algorithm with steps now definitely requires steps only. Testing the RAM involves three sets of tests: 1) testing the tree decoder; 2) testing the built-in test structure (BITS); and 3) testing the memory nodes. We will discuss the testing of the memory nodes only, the test procedure of the other parts being the same as given in [1]. Testing of Memory Nodes: At this stage, it simply looks like a RAM of size instead of. The tester, instead of performing the usual read and write, performs Test Read and Test Write, and the FAIL line is monitored to see which pattern fails. After going to all addresses of one module (i.e., after exploring address spaces), we jump address spaces, as they have been tested in parallel, previously. We stop when all the address spaces have been tested. Any test algorithm can be modified to do this. 
For example, the MATS algorithm presented in [14] can be modified as follows.
1) Place the RAM in Test Mode.
2) For each quadrant (activating one quadrant at a time), do:
3) For every address of the reduced space, do in parallel: Test Write 0.
4) For every address, do in parallel: a) Test Read (BITS internally verifies that all cells have been set to 0); b) Test Write 1.
5) For every address, do in parallel: Test Read (BITS internally verifies that all cells have been set to 1).
6) Return the RAM to Normal Mode.

Power Reduction During Test: The following elaborates on the potential power savings during test. Consider, for example, the MATS algorithm [4]. A test cycle in MATS comprises four accesses to the RAM. Assuming the tester runs at least as fast as the memory chip, the comparison is drawn in terms of the number of test cycles, the total number of accesses, the energy dissipation per test cycle, the total energy dissipation, and the total testing time for the conventional RAM, against the corresponding quantities (including the number of cells accessed per test cycle) for the LPRAM.
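A sketch of this modified march in code form, which also makes the reduction in test cycles concrete: every Test Write/Test Read acts on all modules of the active quadrant at once, and the built-in comparators raise FAIL on any mismatch. The ram object and its method names are hypothetical; only the march order follows the listing above.

```python
def mats_on_lpram(ram, num_quadrants: int, addresses_per_module: int):
    """MATS-style march adapted to LPRAM Test Write / Test Read operations.

    ram is assumed to expose enter_test_mode(), exit_test_mode(),
    test_write(quadrant, addr, value) and test_read(quadrant, addr) -> fail,
    where each call acts on every module of the quadrant in parallel and fail
    mirrors the FAIL line driven by the built-in comparators (hypothetical API).
    """
    failures = []
    ram.enter_test_mode()
    for g in range(num_quadrants):                 # activate one quadrant at a time
        for a in range(addresses_per_module):
            ram.test_write(g, a, 0)                # background of 0s in all modules at once
        for a in range(addresses_per_module):
            if ram.test_read(g, a):                # BITS checks the modules agree on 0
                failures.append((g, a, 0))
            ram.test_write(g, a, 1)
        for a in range(addresses_per_module):
            if ram.test_read(g, a):                # BITS checks the modules agree on 1
                failures.append((g, a, 1))
    ram.exit_test_mode()
    return failures                                # cycle count scales with the reduced address space
```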

13 BHATTACHARJEE AND PRADHAN: LPRAM: NOVEL METHODOLOGY FOR LOW-POWER HIGH-PERFORMANCE RAM DESIGN 649 TABLE I PROOF PROCEDURE OF LEMMA 1 Fig. 20. Reduction in power during testing in LPRAM (q =4). Speed up in testing. So, this speed-up is irrespective of the testing algorithm being used. As all modules are tested simultaneously, the peak power consumption during testing also grows closer to folds, compared to normal operation. However, LPRAM consumes very low power for accessing, and the test data are written and read locally within the quadrant, with up to five; therefore, we still get a reduction of about 20% power in 256 M RAM, depicted in Fig. 20, compared to the traditional RAM. At the same time, we get a four-time reduction in test time. XI. CONCLUSION A novel architecture for LPRAM is proposed. The LPRAM architecture saves about 35% power during normal operation for a 256 M RAM, compared to the traditional RAM. Also, for a 256 M RAM, LPRAM provides about 20% reduction in power during testing, with a 75% saving in test time. Thus, it reduces power consumption, both during normal operation and testing. Significantly, the proposed architecture achieves a higher speed (about four times higher) than the traditional architecture because of reduced word line length. Also, the performance enhancement is achieved because a much smaller number of cycles is needed for refreshing, with reduced refresh busy time. In addition, the BITS allows significant reduction in test time. It also indicated the strategy to attain further speed-up through address mapping combined with pipelining. There is an increase in the area over traditional RAM. This increase, however, may not impact the yield because the RAM nicely allows defect tolerance through reconfiguration. For example, the LPRAM with certain defects, can be reconfigured to small RAM half the size. Also, the defects within each module can be repaired using spare rows and columns. Again, we highlight the distinctions between the approach presented and the traditional approach of multiple cell arrays. 1. Our approach used the H-tree layout to equalize delays among different cells. This has a major advantage of delay predictability, and fast access. 2. Having independent refresh and decoder circuits, the proposed LPRAM allowed reduction in normal power and test power, as well as refresh power. Also, the independence is crucial to the built-in test strategy. 3. The LPRAM has an additional mode of low power which allows, in a sleep mode, reduction of power further, by switching off modules. 4. The design, unlike the cell-array partition, is both conceptually and in terms of implementation RECURSIVE in nature. This allows for ease in implementation verification and in design, itself. 5. The traditional approach is ad hoc, and is a circuit-based approach; ours, on the other hand, is systematic and architectural. So, any circuit-based approach can be employed to further reduce power. Our approach does not preclude any circuit-based approach. 6. We also have shown different types of address-mapping, which can provide some interesting advantages in achieving multiple bit access. APPENDIX A. Proof of Lemma1 Proof: Consider that is odd, so. Then, either and are both odd or both even. The first and second rows of Table I depict the possible assignment of,, and the corresponding assignments for,, and. The case when is even follows accordingly. B. Proof of Lemma2 Proof: From Lemma 1, it is necessary that either both and have to be even, or both have to be odd. 
Let the address bits be divided into two parts, as was done for the LPRAM: one part determines the number of levels in the tree, and the other gives the number of address bits in each memory node of the LPRAM. The tree-level bits must, in turn, be divided (as a tree of a given depth provides only a fixed number of modules) into two parts for laying the modules out in an H-tree; this fixes how many modules lie along the horizontal side and how many along the vertical side of the H-tree layout. Similarly, the per-module address bits are divided into two parts, which together determine the aspect ratio of each module. The number of ways in which the aspect-ratio criterion can be met therefore depends on the number of solutions of the resulting expression. After substitution, this reduces to counting the positive integer solutions of a linear equation in integer unknowns. It is further noted that, although the chip aspect-ratio index itself is positive, the corresponding term can be either positive or negative (both signs yield the same chip aspect ratio), and both signs are counted among the possible configurations.

Process and Design Parameters: The process parameters used in the technology-dependent computations are based on the example CMOS process given in [12]. The parameters are then scaled, as detailed in [14], to the minimum feature size of the particular technology of interest using the constant-field model, with the scale factor defined relative to this base technology. The gate capacitance is approximated by the gate-oxide capacitance and is a function of the oxide thickness. The source/drain-junction capacitance is an important parameter for estimating the bit-line capacitance. It consists of two parts: the planar (junction-area) capacitance and the sidewall (junction-peripheral) capacitance. For a drain/source region of given dimensions, the resultant junction capacitance is the sum of these two contributions; this is the capacitance at 0-V bias, and the value at any other reverse bias is computed from it using the junction built-in potential. For our computation, we use the minimum drain/source dimension required for a metal contact and the typical precharge voltage as the bias. The remaining process parameters are the capacitance of metal over poly, the capacitance of metal over field, the capacitance of poly, the junction-area capacitance, the junction-sidewall capacitance, the capacitance of the memory cell, and the area of a DRAM cell.

Parameters for Power Estimation: The parameters explicitly required for power estimation in the RAM are the voltage swing in the RAM, the operating frequency of the traditional RAM (200 MHz), the internal supply voltage, and the dc static current.

Bit-Line Capacitance: The source/drain-junction capacitance is important for computing the equivalent capacitance of the bit line and is obtained as explained previously. The bit-line capacitance is then estimated from the junction capacitance contributed by the cells on the line together with the capacitance of the metal line, determined by its width and length, as sketched below.

Constants Characterizing the Other Functional Blocks: These include the row- and column-decoder pitch per bit, the depth of the sense amplifier per bit, the area of the address buffer, the area of the data buffer, the area of the timing and control unit, and the pitch of the metal in the bus structure.
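As an illustration of how these capacitance parameters combine in the bit-line estimate, here is a small Python sketch using the standard zero-bias junction-capacitance and reverse-bias scaling expressions of the kind found in [13]. The function names and every numeric value are assumptions made for the example, not the process parameters used in the paper.

```python
# Illustrative sketch of the bit-line capacitance estimate outlined above, using
# standard junction-capacitance expressions. All numeric values are assumed
# placeholders, not the process parameters used in the paper.

def junction_cap_zero_bias(area_um2, perim_um, cj_pf_per_um2, cjsw_pf_per_um):
    """Planar (area) plus sidewall (peripheral) junction capacitance at 0-V bias, in pF."""
    return cj_pf_per_um2 * area_um2 + cjsw_pf_per_um * perim_um

def junction_cap_at_bias(c_j0_pf, v_bias, phi_b=0.6, m=0.5):
    """Reverse-biased junction capacitance, C(V) = C_j0 / (1 + V/phi_b)**m."""
    return c_j0_pf / (1.0 + v_bias / phi_b) ** m

def bitline_cap(n_cells, c_drain_pf, c_metal_pf_per_um2, width_um, length_um):
    """Bit-line capacitance: one drain junction per cell plus the metal-line capacitance."""
    return n_cells * c_drain_pf + c_metal_pf_per_um2 * width_um * length_um

if __name__ == "__main__":
    # Assumed example: 1 um x 1 um drain region, 256 cells hanging off one bit line.
    c_j0 = junction_cap_zero_bias(area_um2=1.0, perim_um=4.0,
                                  cj_pf_per_um2=2e-4, cjsw_pf_per_um=4e-4)
    c_drain = junction_cap_at_bias(c_j0, v_bias=1.65)        # e.g., half-VDD precharge
    c_bl = bitline_cap(n_cells=256, c_drain_pf=c_drain,
                       c_metal_pf_per_um2=3e-5, width_um=0.5, length_um=300.0)
    print(f"drain-junction capacitance at bias: {c_drain * 1e3:.2f} fF")
    print(f"estimated bit-line capacitance:     {c_bl * 1e3:.1f} fF")
```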
REFERENCES

[1] N. T. Jarwala and D. K. Pradhan, "TRAM: A design methodology for high-performance, easily testable, multimegabit RAMs," IEEE Trans. Comput., vol. 37, Oct. 1988.
[2] K. Itoh, K. Sasaki, and Y. Nakagome, "Trends in low-power RAM circuit technologies," Proc. IEEE, vol. 83, Apr. 1995.
[3] K. Itoh, "Trends in megabit DRAM circuit design," IEEE J. Solid-State Circuits, vol. 25, June 1990.
[4] P. Mazumder and K. Chakraborty, Testing and Testable Design of Random-Access Memories. Norwell, MA: Kluwer.
[5] S. Rai and V. P. Kirpalani, "A modified TRAM architecture," IEEE Trans. Comput., vol. 45, Aug. 1996.
[6] K. Itoh et al., "An experimental 1 Mb DRAM with on-chip voltage limiter," in Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1984.
[7] K. Kimura et al., "Power reduction in megabit DRAMs," IEEE J. Solid-State Circuits, vol. SC-21, June 1986.
[8] M. Margala and N. G. Durdle, "Noncomplementary BiCMOS logic and CMOS logic styles for low-voltage operation: A comprehensive study," IEEE J. Solid-State Circuits, vol. 33, Oct. 1998.
[9] A. Bellaouar and M. I. Elmasry, Low-Power Digital VLSI Design: Circuits and Systems. Norwell, MA: Kluwer.
[10] J. S. Caravella, "A low-voltage SRAM for embedded applications," IEEE J. Solid-State Circuits, vol. 32, Oct. 1997.
[11] A. K. Sharma, Semiconductor Memories: Technology, Testing and Reliability. Piscataway, NJ: IEEE Press.
[12] D. C. Choi et al., "Battery operated 16 M DRAM with post package programmable and variable self refresh," in Symp. VLSI Circuits Dig. Tech. Papers, May 1994.
[13] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective. Reading, MA: Addison-Wesley.
[14] R. Nair, "Comments on 'An optimal algorithm for testing stuck-at faults in random access memories'," IEEE Trans. Comput., vol. C-28, Mar. 1979.
[15] S. Ravi, G. Lakshminarayana, and N. K. Jha, "Testing of core-based systems-on-a-chip," IEEE Trans. Computer-Aided Design, vol. 20, Mar. 2001.
[16] N. C. C. Lu and H. Chao, "Half-VDD bit-line sensing scheme in CMOS DRAM," IEEE J. Solid-State Circuits, vol. SC-19, Aug. 1984.
[17] A. Chandra and K. Chakrabarty, "Low-power scan testing and test data compression for system-on-chip," IEEE Trans. Computer-Aided Design, vol. 21, May 2002.
[18] R. P. Dick, G. Lakshminarayana, A. Raghunathan, and N. K. Jha, "Power analysis of embedded operating systems," in Proc. IEEE Design Automation Conf., June 2000.
[19] S. Bhattacharjee and D. K. Pradhan, "A Low Power RAM Design," U.K. Patent filed, June 2003.

Subhasis Bhattacharjee received the B.E. degree in computer engineering from S. V. Regional College of Engineering and Technology, India, in 1996 and the M.Tech. degree in computer science from the Indian Statistical Institute (ISI), Calcutta, India.
He was a Senior Software Engineer with Wipro Ltd., Bangalore, India, and a Research Project Engineer with ISI, Calcutta, where he is now a Research Fellow. His research interests include very large scale integrated design, logic synthesis, and distributed systems.

Dhiraj K. Pradhan (S'70-M'72-SM'80-F'88) received the M.S. degree from Brown University, Providence, RI, and the Ph.D. degree from the University of Iowa, Iowa City.
He currently holds a Chair in computer science at the University of Bristol, Bristol, U.K. Recently, he was a Professor in the Electrical and Computer Engineering Department, Oregon State University, Corvallis. Previous to this, he held the COE Endowed Chair Professorship in Computer Science at Texas A&M University, College Station, while also serving as a Visiting Professor at Stanford University, Stanford, CA. Additionally, he held a Professorship at the University of Massachusetts, Amherst, where he also served as Coordinator of Computer Engineering. He has been with the University of California, Berkeley, Oakland University, Rochester, MI, and the University of Regina, Saskatchewan, Canada. He has contributed to very large scale integrated computer-aided design and test, as well as to fault-tolerant computing, computer architecture, and parallel processing research, with major publications in journals and conferences spanning 30 years. He holds two U.S. patents. He has served as coauthor and editor of various books, including Fault-Tolerant Computing: Theory and Techniques, Vols. I & II (New York: Prentice-Hall, 1986), Fault-Tolerant Computer Systems Design (New York: Prentice-Hall, 1996), and IC Manufacturability: The Art of Process and Design Integration (Piscataway, NJ: IEEE Press, 2000).
Professor Pradhan is a Fellow of the ACM. He has served as Guest Editor of special issues of prestigious journals, such as the IEEE TRANSACTIONS ON COMPUTERS. He has also served as an editor for several journals, including IEEE Transactions and JETTA. He has served as General Chair and Program Chair for various major conferences. He has received several awards, including the 1996 IEEE Transactions on Computer-Aided Design Best Paper Award, with W. Kunz, for "Recursive Learning: A New Implication Technique for Efficient Solutions to CAD Problems - Test, Verification and Optimization," and the Humboldt Prize, Germany. In 1997, he was awarded the Fulbright-Flad Chair in Computer Science.
