VHDL Implementation of High-Performance and Dynamically Configured Multi-Port Cache Memory

2010 Seventh International Conference on Information Technology

Hassan Bajwa, Isaac Macwan, Vignesh Veerapandian and Xinghao Chen
Department of Electrical Engineering, University of Bridgeport, 221 University Ave, Bridgeport, CT, USA
hbajwa@bridgeport.edu

Abstract

This paper presents the implementation of a 64x64 multi-port Static Random Access Memory (SRAM) and a newly proposed dynamically configured multi-port SRAM in VHDL (VHSIC hardware description language). The design uses a dynamic memory partitioning algorithm, and a VHDL test bench is developed to verify the functionality of the dynamically configured memory. Results demonstrate that critical memory operations such as read miss, write miss and write bypass can be performed using the newly proposed low-power, area-efficient dynamically configured memory.

Index Terms: SRAM, VHDL, Dynamically configured memory, Multi-port Cache Architecture.

I. INTRODUCTION

As on-chip cache size has increased considerably in recent high-performance microprocessor technologies, power dissipation and leakage current in SRAM have become critical [1]. Sub-threshold leakage current, which flows from drain to source even when the transistor is not operating, has been a dominant leakage current in high-performance microprocessors, where L1 and L2 caches occupy the majority of the die area [2-4]. As transistor size scales down, power dissipation becomes a serious problem that limits overall system performance [3, 5]. In high-performance systems employing multi-core technologies, bit line leakage current can contribute as much as 50% of the overall cache memory leakage power [6]. The trend of using multi-port cache in modern microprocessor technologies [7] has exacerbated this further, as leakage current dissipation scales proportionally with the area of the circuit [8].
Since the word and bit lines cover the footprint of the entire cache section, duplicating the word and bit lines for multiple ports results in large silicon area and increases bit line discharge and power dissipation.

This paper presents a VHDL implementation of a newly proposed dynamically configured multi-port SRAM, focusing on the results of the prototype implementation of a 64x64 memory bank. The flexibility of the dynamically configured memory is demonstrated by using the Isolation Control Line (ICL) to divide it into virtual blocks or sub-banks. The proposed algorithm uses transmission gates as isolation nodes on the bitlines to isolate the array into two virtual memory blocks, thereby eliminating the need for the two separate pairs of bitlines required in a conventional dual-port memory. This not only saves silicon area by eliminating a pair of bitlines but also reduces the power consumed in pre-charging the additional bitline pair. The BSIM4 PTM (Predictive Technology Model) is used for the design of the memory cell and for the RC delay calculation. Based on these RC values, the delay for individual transistors is calculated and incorporated into the VHDL code so as to show the performance of the proposed design in the presence of parasitic resistances and capacitances.

The rest of the paper is organized as follows. Section II analyzes present SRAM technologies based on the single- and dual-port memory cell. Section III reviews previously proposed architectural and circuit techniques that reduce leakage current and bit line latency. Section IV presents the VHDL implementation of the prototype, followed by the VHDL simulation results and conclusions in sections V and VI.

II. PRESENT SRAM TECHNOLOGIES

Figure 1 shows the classic 6-transistor SRAM cell in a single-port configuration.
When the word line (WL) is selected, the SRAM cell is connected to the pair of bit lines (true and complement) via access transistors T5 and T6. Figure 2 shows the SRAM cell with dual ports, each hardwired with dedicated word and bit lines, and illustrates how duplicating bitlines causes bit-line leakage current to multiply. The word and bit lines and access transistors T7 and T8 for the second port would almost double the silicon area [9] of the single-port configuration. Sub-threshold leakage current, the drain-source current of a transistor operating in weak inversion, is a major contributor to SRAM leakage current [10]. Pre-charging the bit lines, as well as keeping them high, causes significant power dissipation and contributes heavily to the total power dissipation. When 0 is stored, transistors T1, T5 and T4 dissipate leakage current. When 1 is stored, T2, T3 and T6 dissipate leakage current.

Figure 1. A Single-Port SRAM Cell

In the classic hard-wired dual-port memory architecture of figure 2, each SRAM cell is accessible by two ports with dedicated word and bit lines for each. The dual-port (as well as multi-port) memory architecture has been implemented with instruction and data cache in multi-core processors in recent years. The most important advantage of this architecture is that it can execute multiple cache accesses simultaneously. Therefore, it doubles (in the case of dual-port) or multiplies (in the case of multi-port) the bandwidth of a single-port cache.

978-0-7695-3984-3/10 $26.00 2010 IEEE, DOI 10.1109/ITNG.2010.243

Figure 2. A Dual-Port SRAM Cell

III. TECHNIQUES TO REDUCE LEAKAGE POWER

Most of the recent research in this area is geared towards reducing sub-threshold leakage current in on-chip cache. Among process and circuit level techniques, dynamic Vt, dual-threshold voltage, reduced-gate SRAM (RG_SRAM) and gated Vdd have been discussed in [11-14]. Dynamic Vt SRAM [11] reduces leakage current in cache memories by switching cache lines to high Vt if they have a small probability of being accessed. The dual-threshold voltage technology [12] uses high threshold voltage devices to reduce leakage current; low threshold voltage devices are used where high performance is required. Reduced-gate SRAM uses two additional pass transistors connected between the cross-coupled inverters to decrease gate leakage current [13]. Gated Vdd [3] reduces leakage power by using a high-threshold transistor between a virtual ground and GND to cut off the power supply to the memory cell in a low-power mode.

Among architectural techniques, bit-line segmentation [7, 16-17] and sub-banking have reduced leakage power significantly. Most of the architectural techniques are combined with relevant circuit techniques to suppress unnecessary leakage power. Albonesi [14] reduces power dissipation by enabling only a small portion of the L2 cache at a time. Zhu et al. [15] designed a low-power SRAM by enabling banks to switch between active and standby mode. Bit-line segmentation efficiently reduces leakage current by shortening the bit-line length dynamically. Karandikar et al. [16] divide bit lines into hierarchical segments to reduce bit-line capacitance and add parallel bit lines to access SRAM cells.
Adding parallel bit lines in the above architecture, however, has the drawback of larger memory area. Yang et al. [17] explored this architecture further and proposed hierarchical bit lines with local sense amplifiers. One major source of energy dissipation is charging and discharging the whole bit-line. Rao [7] divides bitlines into smaller segments such that segments higher than the currently accessed cell are isolated from bit-line pre-charge. Although this approach incurs additional delay, it reduces the length of bit-lines for accessing cells near the physical ports, hence reducing latency and power dissipation. The cache memory configurations employed in the above-mentioned approaches use fixed bank sizes and duplicated word and bit lines (without providing dual- and multi-port accesses) and hence incur either moderate performance degradation or large area overhead.

IV. VHDL IMPLEMENTATION FOR DYNAMIC MEMORY PARTITIONING

It is often reported that a large number of ASIC designs meet their specifications the first time but fail to work when plugged into a system. VHDL is chosen as the HDL (Hardware Description Language) for synthesizing the proposed SRAM prototype because of its many benefits. A VHDL specification can be executed in order to achieve a high level of confidence in its correctness before commencing design, and may simulate one to two orders of magnitude faster than a gate-level description. Behavioral simulation can also reduce design time by allowing design problems to be detected early on, avoiding the need to rework designs at gate level, and it permits design optimization by exploring alternative architectures, resulting in better designs. VHDL descriptions of hardware designs and test benches are portable between design tools, and between design centers and project partners. VHDL also permits technology-independent design through support for top-down design and logic synthesis.
There are several approaches to memory synthesis, such as random logic using flip-flops or latches, register files in datapaths, RAM standard components and RAM compilers. The first approach uses large vectors or arrays in the HDL code. The synthesizer maps these elements to arrays of flip-flops or latches depending on how the timing of the assignments is handled. The second approach uses a synthesis directive or hand instantiation to synthesize a memory to a datapath component. Usually the datapath components are constructed from latches in a regular array. The third approach uses standard components supplied by an ASIC vendor. For example, we can instantiate a small RAM using CLBs in a Xilinx FPGA. The last approach uses a custom RAM compiler and is the most area-efficient. It depends on having the capability to call a compiler from within the synthesis tool or to instantiate a component that has already been compiled. VHDL allows multidimensional arrays, so we can synthesize a memory as an array of latches by declaring a two-dimensional array.

The proposed architecture employs a DMP (Dynamic Memory Partitioning) technique that uses isolation nodes to partition a cache memory block into two virtually independent sections based on the real-time access addresses of multiple ports. Figure 3 shows the placement of the isolation control line (ICL) and isolation node on each of the bit lines to divide an SRAM block into upper and lower sections, which are accessed by the upper and lower ports, respectively. A selected ICL turns off isolation nodes based on the real-time access addresses at the upper and lower ports. Compared with the hardwired dual-port SRAM shown in figure 2, DMP can provide dual-port accesses without the need for the second pair of bit lines and effectively reduces leakage current, bit line latency and silicon area.
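The partitioning idea can be illustrated with a small behavioral model, written here in Python rather than VHDL so that it can be run directly. It is only a sketch under stated assumptions: it uses the 64-word prototype with 8 sub-banks of 8 words described later in this section, takes the sub-bank index from the three most significant bits of a 6-bit address, and encodes "all isolation nodes ON" as 0. The rule for choosing which node to turn off when the two addresses fall in different sub-banks (the node adjacent to the higher-numbered sub-bank) and all function names are assumptions made for illustration, not the authors' implementation.

```python
ADDR_BITS = 6                      # 64 words -> 6-bit word address
SUBBANK_BITS = 3                   # 8 sub-banks -> 3 most significant bits
NUM_WORDS = 1 << ADDR_BITS         # 64 words in the prototype array
WORDS_PER_SUBBANK = 1 << (ADDR_BITS - SUBBANK_BITS)  # 8 words per sub-bank


def subbank(addr: int) -> int:
    """Return the sub-bank index, i.e. the top three address bits."""
    if not 0 <= addr < NUM_WORDS:
        raise ValueError("address out of range")
    return addr >> (ADDR_BITS - SUBBANK_BITS)


def icl(addr_a: int, addr_b: int) -> int:
    """Return a 3-bit ICL value for the two port addresses.

    0 means the addresses share a sub-bank and all nodes stay ON.
    k in 1..7 means node k, between sub-banks k-1 and k, is OFF.
    Picking the node next to the higher sub-bank is an assumption.
    """
    sa, sb = subbank(addr_a), subbank(addr_b)
    return 0 if sa == sb else max(sa, sb)


def sections(icl_value: int):
    """Return the word-address ranges on each side of the OFF node."""
    if not 0 <= icl_value < (1 << SUBBANK_BITS):
        raise ValueError("ICL must fit in three bits")
    if icl_value == 0:
        return [range(0, NUM_WORDS)]       # array is not partitioned
    split = icl_value * WORDS_PER_SUBBANK  # first word below the node
    return [range(0, split), range(split, NUM_WORDS)]
```

With these assumptions, `icl(0b001000, 0b010000)` evaluates to 2, and `sections(2)` splits the array into words 0 to 15 and words 16 to 63.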

Isolation nodes placed on bit lines introduce additional latency. Rao et al. [7] showed that bit lines segmented by 8 isolation nodes pose no significant performance degradation. With DMP the placement of isolation nodes is of strategic importance. Placing isolation nodes between adjacent word lines provides the highest degree of flexibility for DMP, but would be overkill if the applications do not need or utilize such fine-grain DMP capability. In principle, isolation nodes are placed every n word lines, where n is determined based on the statistical patterns of access addresses of the targeted applications. In figure 3, the bit lines run across 512 word lines, which are divided into 8 groups, each containing 8 sub-groups. Each group contains a local sense amplifier block and a local pre-charge block. An ICL and two isolation nodes are placed between sub-groups, each of which spans 8 word lines. When an ICL turns its isolation nodes off, the upper and lower ports can access desired cells from different sub-groups. Each group's local sense amplifiers ensure that the group's bit line section accesses cells on 64 word lines separated by 8 isolation nodes with minimal additional delay [7].

Figure 3. ICL and isolation nodes placement

The top-level VHDL module of the proposed design consists of three main sub-modules: an ICL (Isolation Control Line) generator, a controller and the SRAM isolated module. The purpose of the ICL generator is to calculate the ICL signals based on whether the two incoming addresses are the same or different. These ICL signals, when ON, turn OFF the respective transmission gates so as to virtually divide the memory into two different sections. The entire array of 64 words is divided into 8 sub-banks. The transmission gates are placed between these sub-banks and, when OFF, separate the upper sub-banks from the lower ones. The controller uses the Read and Write signals on the main module to generate an RWE (Read Write Enable) signal, which in turn controls the SRAM isolated module to process the incoming data. The SRAM isolated module consists of an array of 64x64 bits, in which a specific 64-bit word is selected for reading or writing based on the decoder addresses.

Since the memory is divided into 8 sub-banks, in order to calculate the ICL signals, the three most significant bits of the two addresses are compared, using a pre-computation based optimization technique for low power, to determine whether they are the same or different. If the two addresses are in the same sub-bank, they are treated as the same address and the ICL signal is disabled, indicating that all the isolation nodes are ON. The memory array then acts as a dual-port memory with a single pair of bitlines. The ICL generator is thus a comparator that compares the three most significant bits of the two incoming addresses and, based on whether they are the same or different, sets the output ICL, a three-bit logic vector, to one of 8 possible combinations. ICL equal to 000 is a unique value showing that all the transmission gates, or isolation nodes, are ON. In the remaining cases, where ICL is between 001 and 111, the value indicates the respective isolation node that is OFF. For instance, ICL being 010 indicates that isolation node number 2, between the sub-bank whose three most significant bits are 001 and the sub-bank whose three most significant bits are 010, is OFF. The ICL signal is used by the controller to determine whether the two addresses are the same or different and to assign the respective states accordingly in order to issue the Read Write Enable signal. It is also used by the SRAM isolated module to determine whether it has to incorporate a delay in the case of two simultaneous Write operations.

The second module is the controller, a finite state machine with eight possible states based on the combination of read and write operations on the two ports, as shown in figure 4.

Figure 4. State diagram for the Controller module

The proposed design supports the legal memory operations Read-Read, Read-Write, Write-Read and Write-Write, where the first operation is for port A and the second for port B. These operations are carried out in different states, in which the RWE signal, a two-bit logic vector, indicates which of the combinations is being handled. For instance, if the current operation is Read-Read, meaning that both ports A and B are being read, then RWE is set to 00; if it is Read-Write, then RWE is 01, and so on. In other words, a 0 indicates a Read operation and a 1 indicates a Write operation, with the two bits indicating the operations on the two ports. All the states except the simultaneous Write state revert to the reset state. Only in the simultaneous Write state is the next state not the reset state, so as to provide an additional clock cycle for the delay of the second port being written. The SRAM isolated module, using the RWE signal, acknowledges this extra clock cycle and hence delays one of the ports. The remaining operations, Read-Write, Write-Read and Read-Read, are accomplished in a single clock cycle and in a single state from the viewpoint of the controller.

The third module is the SRAM isolated module, which uses the RWE signal from the controller and the ICL signal from the ICL generator module in order to read from or write into the memory array. It intentionally delays one of the ports when there is a simultaneous Write operation. It uses a 64x64 bit memory array and three separate processes for reading and writing. An additional process equalizes the two addresses if the ICL signal is 000, which indicates that the two addresses are in the same 8-word sub-bank.

V. VHDL SIMULATION RESULTS

Quartus II design software is used to write and simulate the VHDL code for the proposed design. The modules of section IV are port mapped in a single top-level VHDL description, and tools such as the Timing Analyzer and RTL Viewer are used to demonstrate the performance. The BSIM4 PTM (Predictive Technology Model) has also been used to calculate the size of the 6-T SRAM cell (0.604 µm²) and the related RC delay due to the parasitic resistances and capacitances for 65 nm technology. Based on this initial analysis, the delay for each SRAM cell was found to be 7.2 fs, the delay for each bit-line 6.7 ns and the delay for each word-line 0.312 ps. The resulting values are incorporated into the VHDL description, and the performance of the proposed design is presented in the presence of the calculated parasitics.
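As a cross-check on the waveforms discussed next, the controller behaviour described in section IV can be summarised in a short reference model. The sketch below is in Python, not the authors' VHDL, and models only what the text states: RWE is a two-bit value whose first bit belongs to port A and whose second bit belongs to port B (0 for Read, 1 for Write), and a simultaneous Write takes one additional clock cycle while the other combinations complete in one. The function names and the boolean-flag interface are hypothetical, and the eight states of figure 4 are not modelled.

```python
def rwe(write_a: bool, write_b: bool) -> str:
    """Encode the two port operations as the two-bit RWE value.

    The first bit is port A and the second is port B;
    0 stands for Read and 1 stands for Write.
    """
    return f"{int(write_a)}{int(write_b)}"


def cycles(rwe_value: str) -> int:
    """Return the clock cycles spent on one pair of port operations."""
    if rwe_value not in ("00", "01", "10", "11"):
        raise ValueError("RWE must be a two-bit string")
    # Only Write-Write needs the extra cycle that delays one port.
    return 2 if rwe_value == "11" else 1


if __name__ == "__main__":
    # Walk through Read-Read, Read-Write, Write-Read and Write-Write.
    for a, b in [(False, False), (False, True), (True, False), (True, True)]:
        code = rwe(a, b)
        print(code, cycles(code))
```

A model of this kind gives a quick expectation for each test-bench case, for example that only the Write-Write case should show one port completing a cycle later than the other.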
Figure 5 shows the timing simulation waveforms for simultaneous writes and reads with the two addresses being the same.

Figure 5. Timing simulations for two addresses being same

As seen in figure 5, the two addresses are treated as the same, 101, because their three most significant bits are the same, which indicates that they are in the same sub-bank. Also worth noting is the isolation node signal being 000, indicating that all the nodes are ON. As can be seen from the figure, data on port A is overwritten by data on port B, since port B is delayed due to the write-write operation on both ports.

Figure 6 shows the resulting timing waveforms with the two addresses being different. Here, with only one pair of bit-lines available, separate data on ports A and B is made possible by the isolation node signal, which is now 011, indicating that node number 4 is turned OFF, thereby virtually isolating the memory between the two ports.

Figure 6. Timing Simulations for two addresses being different

Other legal memory operations involving write-read or read-write operations are shown in figure 7.

Figure 7. Read-Write & Write-Read Operations

From figures 5-7, it can be seen that the output at ports A and B is delayed by the cell, bit-line and word-line delays. The RTL view of the proposed design is shown in figure 8, where the three sub-modules and their interconnections, along with their inputs and outputs, can be seen.

Figure 8. RTL view for the proposed SRAM dynamic memory.

VI. CONCLUSION

An area- and energy-efficient multi-port memory architecture is proposed. It employs new DMP techniques, which use isolation nodes and control lines to dynamically partition the bit-lines of a memory block into virtually isolated sections, so that they can be accessed simultaneously and independently. Compared with the classic hardwired multi-port memory architecture, the new DMP facilitates efficient designs with no significant impact on the timing of memory operations. It reduces the use of silicon area, largely due to the elimination of additional bit lines. On average, bit line pre-charge and leakage currents are reduced to half the value of a typical hardwired multi-port memory. Shorter active bit lines also mean less latency. The results from the implemented VHDL architecture showed that the proposed design supports all the legal operations of the conventional memory.

Cache memory often contributes a large part of the total system power dissipation, because the bit lines remain pre-charged even when not accessed. The proposed architecture reduces leakage power by using bit line isolation and selective pre-charging. Dynamically configured memory not only reduces leakage current by eliminating pass transistors in hardwired multi-port memory, but also reduces the bit line leakage power to half by eliminating additional bit lines. For a memory core with N rows and M columns, the leakage current is reduced to less than half the value of hardwired dual-port memories.

REFERENCES

[1] H. Bajwa, X. Chen, "Low-Power High-Performance and Dynamically Configured Multi-Port Cache Memory Architecture," ICEE 07, April 2007.
[2] X. Chen and H. Bajwa, "Energy-efficient dual-port cache architecture with improved performances," in IEE Journal of Electronic Letters, Vol. 43, No. 1, pp. 12-14, Jan. 2007.
[3] N. S. Kim, K. Flautner, D. Blaauw, T. Mudge, "Circuit and Microarchitectural techniques for reducing cache leakage power," IEEE Transaction on VLSI systems, Vol. 12, No. 2, pp. 167-184, Feb. 2004.
[4] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. Vijaykumar, "Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories," in Proc. IEEE/ACM Int. Symposium on Low Power Electronics and Design, pp. 90-95, 2000.
[5] S. Kim, N. Vijaykrishnan, M. Kandemir and M. J.
Irwin, "Optimizing Leakage Energy Consumption in Cache Bitlines," in Journal of Design Automation for Embedded Systems, Vol. 9, No. 1, pp. 5-18, Mar. 2004.
[6] S.-H. Yang and B. Falsafi, "Performance and Energy Trade-offs of Bitline Isolation in Nano-scale CMOS Caches," presented at the Workshop on Complexity-Effective Design (WCED) held in conjunction with the 30th International Symposium on Computer Architecture (ISCA-30), Jun. 2003.
[7] R. Rao, J. Wence, D. Franklin, R. Amirtharajah and V. Akella, "Exploiting Non-Uniform Memory Access Pattern through Bit Line Segmentation," presented at the Workshop on Memory Performance Issues, in conjunction with High Performance Computer Architecture (HPCA), Feb. 2006.
[8] B. Amelifard, F. Fallah, M. Pedram, "Reducing the sub-threshold and Gate-tunneling Leakage of SRAM cells using Dual-Vt and Dual-Tox Assignment," in IEEE Proceedings of Design, Automation and Test, Vol. 1, pp. 1-6, 2006.
[9] R. D. Adams, High Performance Memory Testing: Design Principles, Fault Modeling and Self-Test, Kluwer Academic Publishers, 2003.
[10] M. Mamidipaka, K. Khouri, N. Dutt, and M. Abadir, "Analytical Models for Leakage Power Estimation of Memory Array Structures," in International Conference on Hardware/Software and Co-design and System Synthesis (CODES+ISSS), pp. 146-151, 2004.
[11] C. H. Kim and K. Roy, "A Leakage Tolerant Cache Memory for Low Voltage Microprocessors," in Proc. of the 2002 International Symposium on Low Power Electronics and Design, pp. 251-254, 2002.
[12] J. T. Kao and A. P. Chandrakasan, "Dual threshold voltage techniques for Low-Power digital circuits," in IEEE Journal of Solid State Circuits, Vol. 35, No. 7, pp. 1009-1018, Jul. 2000.
[13] C. Thondapu, P. Elakkumanan, R. Sridhar, "RG-SRAM: a low gate leakage memory design," in Proc. of the IEEE Computer Society Annual Symposium on VLSI, pp. 295-296, 2005.
[14] D. H. Albonesi, "Selective Cache Ways: On-Demand Cache Resource Allocation," in Proc. of the 32nd Annual International Symposium on Microarchitecture, pp. 248-259, Nov. 1999.
[15] Z. Zhu, K. Johguchi, H. J. Mattausch, T. Koide, T. Hironaka, "Low Power Bank-Based Multi-Port SRAM Design due to Bank Standby Mode," in Proc. of the 47th Midwest Symposium on Circuits and Systems, Vol. 1, pp. 569-572, 2004.
[16] A. Karandikar and K. K. Parhi, "Low Power SRAM Design Using Hierarchical Divided Bitline Approach," in Proc. Int. Conf. Computer Design: VLSI in Computers and Processors, pp. 82-88, 1998.
[17] B. D. Yang and L. S. Kim, "A Low Power SRAM Using Hierarchical Bit Line and Local Sense Amplifier," in IEEE Journal of Solid State Circuits, Vol. 40, No. 6, pp. 1366-1376, Jun. 2005.