Available online at ScienceDirect. Procedia Technology 25 (2016 )

Similar documents
[Kalyani*, 4.(9): September, 2015] ISSN: (I2OR), Publication Impact Factor: 3.785

International Journal of Advance Engineering and Research Development SYSTEMATIC ERROR CODES IMPLIMENTATION FOR MATCHED DATA ENCODED

A SURVEY ON ENCODE-COMPARE AND DECODE-COMPARE ARCHITECTURE FOR TAG MATCHING IN CACHE MEMORY USING ERROR CORRECTING CODES

Implementation of Systematic Error Correcting Codes for Matching Of Data Encoded

AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES

Reliability of Memory Storage System Using Decimal Matrix Code and Meta-Cure

DETECTION AND CORRECTION OF CELL UPSETS USING MODIFIED DECIMAL MATRIX

International Journal of Scientific & Engineering Research, Volume 4, Issue 5, May-2013 ISSN

FPGA Implementation of Double Error Correction Orthogonal Latin Squares Codes

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

Error Correction Using Extended Orthogonal Latin Square Codes

HDL IMPLEMENTATION OF SRAM BASED ERROR CORRECTION AND DETECTION USING ORTHOGONAL LATIN SQUARE CODES

Reduced Latency Majority Logic Decoding for Error Detection and Correction

Efficient Majority Logic Fault Detector/Corrector Using Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes

Hamming Codes. s 0 s 1 s 2 Error bit No error has occurred c c d3 [E1] c0. Topics in Computer Mathematics

Error Detecting and Correcting Code Using Orthogonal Latin Square Using Verilog HDL

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Single error correction, double error detection and double adjacent error correction with no mis-correction code

Available online at ScienceDirect. Procedia Technology 24 (2016 )

J. Manikandan Research scholar, St. Peter s University, Chennai, Tamilnadu, India.

Computer Organization & Assembly Language Programming

/$ IEEE

Majority Logic Decoding Of Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes

Enhanced Detection of Double Adjacent Errors in Hamming Codes through Selective Bit Placement

Effective Implementation of LDPC for Memory Applications

AN IMPROVED FUSED FLOATING-POINT THREE-TERM ADDER. Mohyiddin K, Nithin Jose, Mitha Raj, Muhamed Jasim TK, Bijith PS, Mohamed Waseem P

FAULT TOLERANT SYSTEMS

Design of Low Power Digital CMOS Comparator

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

DESIGN OF FAULT SECURE ENCODER FOR MEMORY APPLICATIONS IN SOC TECHNOLOGY

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Available online at ScienceDirect. Procedia Technology 24 (2016 )

Outline of Presentation Field Programmable Gate Arrays (FPGAs(

An Integrated ECC and BISR Scheme for Error Correction in Memory

Low Power Cache Design. Angel Chen Joe Gambino

An Energy Improvement in Cache System by Using Write Through Policy

FPGA Implementation of a High Speed Multiplier Employing Carry Lookahead Adders in Reduction Phase

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology

CS Computer Architecture

Available online at ScienceDirect. Procedia Technology 25 (2016 )

POWER REDUCTION IN CONTENT ADDRESSABLE MEMORY

An Efficient Carry Select Adder with Less Delay and Reduced Area Application

Efficient Implementation of Single Error Correction and Double Error Detection Code with Check Bit Precomputation

A Low-Power ECC Check Bit Generator Implementation in DRAMs

Low Complexity Architecture for Max* Operator of Log-MAP Turbo Decoder

Hardware Implementation of Single Bit Error Correction and Double Bit Error Detection through Selective Bit Placement for Memory

EECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

Design and Implementation of Hamming Code using VHDL & DSCH

Fault Tolerant Parallel Filters Based on ECC Codes

VHDL CODE FOR SINGLE BIT ERROR DETECTION AND CORRECTION WITH EVEN PARITY CHECK METHOD USING XILINX 9.2i

ISSN Vol.05,Issue.09, September-2017, Pages:

Design of Delay Efficient Carry Save Adder

Technique to Mitigate Soft Errors in Caches with CAM-based Tags

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

ARITHMETIC operations based on residue number systems

Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems.

Exploiting Unused Spare Columns to Improve Memory ECC

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

TESTING OF FAULTS IN VLSI CIRCUITS USING ONLINE BIST TECHNIQUE BASED ON WINDOW OF VECTORS

IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC

ECC Protection in Software

Comparative Analysis of DMC and PMC on FPGA

ASIC IMPLEMENTATION OF 16 BIT CARRY SELECT ADDER

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient

DUE to the high computational complexity and real-time

An Area-Efficient BIRA With 1-D Spare Segments

ANADOLU UNIVERSITY DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING. EEM Digital Systems II

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies

the main limitations of the work is that wiring increases with 1. INTRODUCTION

Near Optimal Repair Rate Built-in Redundancy Analysis with Very Small Hardware Overhead

Implementation of Decimal Matrix Code For Multiple Cell Upsets in Memory

Design of Majority Logic Decoder for Error Detection and Correction in Memories

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation

Compact Clock Skew Scheme for FPGA based Wave- Pipelined Circuits

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding

JOURNAL OF INTERNATIONAL ACADEMIC RESEARCH FOR MULTIDISCIPLINARY Impact Factor 1.393, ISSN: , Volume 2, Issue 7, August 2014

IA Digital Electronics - Supervision I

Area Delay Power Efficient Carry-Select Adder

Self-checking combination and sequential networks design

A New CDMA Encoding/Decoding Method for on- Chip Communication Network

Gated-Demultiplexer Tree Buffer for Low Power Using Clock Tree Based Gated Driver

Design and Verification of Area Efficient High-Speed Carry Select Adder

Overlapped Scheduling for Folded LDPC Decoding Based on Matrix Permutation

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume 9 /Issue 3 / OCT 2017

A Novel Carry-look ahead approach to an Unified BCD and Binary Adder/Subtractor

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech)

Low Power Floating-Point Multiplier Based On Vedic Mathematics

Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip

DESIGN OF CACHE SYSTEM USING BLOOM FILTER ARCHITECTURE

Design and Implementation of IEEE-754 Decimal Floating Point Adder, Subtractor and Multiplier

Content Addressable Memory with Efficient Power Consumption and Throughput

DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER

Cache Organizations for Multi-cores

Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor

Analysis of Soft Error Mitigation Techniques for Register Files in IBM Cu-08 90nm Technology

Design of an Efficient 128-Bit Carry Select Adder Using Bec and Variable csla Techniques

VLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier

I. Introduction. India; 2 Assistant Professor, Department of Electronics & Communication Engineering, SRIT, Jabalpur (M.P.

Adaptive ECC for Tailored Protection of Nanoscale Memory

Transcription:

Available online at www.sciencedirect.com ScienceDirect Procedia Technology 25 (2016 ) 544 551 Global Colloquium in Recent Advancement and Effectual Researches in Engineering, Science and Technology (RAEREST 2016) Improved Architecture for Tag Matching in Cache memory Coded with Error Correcting Codes Mary Francy Joseph a, Anith Mohan b* a PG scholar,college of engineering,munnar 685612, India b Assistant professor,college of engineering,munnar 685612,India Abstract Cache memories serve as accelerators to improve the performance of modern microprocessors. Caches are vulnerable to soft errors because of technology scaling. So it is important to provide protection mechanisms against soft errors. Tag comparison is critical in cache memories to keep data integrity and high hit ratio. Error correcting codes (ECC) are used to enhance reliability of memory structures. The previous solution for cache access is to decode each cache way to detect and correct errors. In the proposed architecture ECC delay is moved to the noncritical path of the process by directly comparing the retrieved tag with the incoming new information which is encoded as well, thus reducing circuit complexity. For the efficient computation of hamming distance, butterfly weight accumulator is proposed to reduce latency and complexity further. The proposed architecture checks whether the incoming data matches the stored data. The proposed architecture reduces the latency and hardware complexity compared with the most recent implementation. 2016 2015 The The Authors. Authors. Published Published by Elsevier by Elsevier Ltd. Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of RAEREST 2016. Peer-review under responsibility of the organizing committee of RAEREST 2016 Keywords: Error correcting codes; hamming distance ; cache memory; soft errors 1. Introduction Cache memories are utilized for the low-latency access for data and instruction memory. Caches store instructions and data that are frequently used during operations. Microprocessors will first check whether a piece of information is in a cache. 2212-0173 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of RAEREST 2016 doi:10.1016/j.protcy.2016.08.143

Mary Francy Joseph and Anith Mohan / Procedia Technology 25 ( 2016 ) 544 551 545 For that the address of the information in the cache is compared to all the cache tags that might contain that address. If the information is found there then there is a cache hit otherwise there will be cache miss. In the case of cache miss the piece of information will be found from the main memory, but it will simply take longer [1]. Caches must be designed in such a way to have as low latency as possible, or the components will be disqualified from serving as accelerators and the overall performance of the whole system will be deteriorated. The reliability of cache memory is affected by soft errors. Soft errors can corrupt the instruction and data in the cache memory. Soft errors are not reproducible and can cause malfunctions in the hardware. Errors in cache memory can easily circulate into the processor registers and other memory elements. Soft errors in digital circuits are mitigated using large number of techniques [3]. To combat against soft errors cache memories use error protection mechanisms such as parity codes and SEC-DED (single-bit error correction and double-bit error detection) codes. The caches, in addition to data and instructions store tag field to identify each cache line. Caches have to operate with low latency, the use of error correcting codes (ECC) is challenging because complex error protection schemes for tag bits can increase the latency [2]. Data comparison circuits are widely used in cache memories to perform tag matching. Data comparison is usually in the critical path of the pipeline stage that is devised to increase the performance of the system. The flow of the succeeding operations is determined by the output of the comparison. Protection using ECC increases the latency. In cache, the tag directory is protected with ECC. When an access is made to the cache, the cache tag directory is accessed first. After retrieving the tag the encoded data go through ECC decoders and ECC correction logic before it is compared with tag field of the incoming address [4]. In this case, critical path is too long in cache systems designed for high speed access. Another solution for matching problem is encode and compare method. In this method the incoming data is encoded and compared with the retrieved data that is encoded [6]. Complex decoding from the critical path is eliminated. The method does not look whether the retrieved data is exactly same as the incoming tag. Instead it looks whether the retrieved data is in the error correctable range of the codeword of the incoming data [7]. In this paper a new technique to efficiently implement tag matching in cache memory is presented. Here the comparison of the tag is done in parallel with the encoding process to generate parity. The latency and complexity for tag matching is reduced considerably. 2. Related Work Tag arrays are typically protected with error correcting codes. Soft errors in tag bits cause pseudo hits and pseudo misses. Pseudo hit is a hit that is actually a miss in the absence of soft error is called pseudo hit. Pseudo hit lead to use the incorrectly matched data and is likely to propagate into other elements of the system. Pseudo miss is a miss that is actually a hit when there is no error [5]. In this case data is fetched from main memory producing performance degradation. To protect data and tag bits error correcting codes are used. When the tag array is protected with error correcting codes the latency is increased due to ECC logic. Many techniques have been presented to perform tag matching. 2.1 Decode and compare method The tag array is protected with error correcting codes. Here the encoded data is read from the tag array first. As in Fig. 1 the encoded data go through ECC decoders and ECC correction logic before it is compared with tag of the incoming address.

546 Mary Francy Joseph and Anith Mohan / Procedia Technology 25 ( 2016 ) 544 551 Fig. 1. Decode and compare architecture The k-bit tag is stored in the form of an n-bit code word after being encoded by (n k) code. Decoding and correction before being compared with the incoming tag increases the critical path in a cache designed for high speed access. The total cache access time is the sum of the tag array access time, time to decode and correct, time needed for tag comparison and the time to select the matched way of the tag array. In the case of a cache miss, the incoming tag is encoded by ECC Gen logic and used to replace the way. 2.2. Encode and compare method To resolve the drawbacks of decode and compare method, encode and compare method is presented. In this method the ECC decoding and correction is removed from the critical path. For data comparison the absolute value of the stored information are not important, but rather the relative value to the incoming data is important for comparison result. This method has three steps.1) encode the incoming data 2) calculate the Hamming distance 3) compare the distance with the Tmax and Rmax. Let Tmax and Rmax denote the maximally correctable and detectable errors respectively. Given an input data v, and its encoded form is denoted as V. The retrieved codeword from memory is denoted as U. Thus the hamming distance is d=dist (V, U). The above mentioned step 3 has the following outcomes. d=0, d 0 and d Tmax Tmax < d Rmax d > Rmax U is valid and equals to V U equals to V with errors U has uncorrectable error U is not equal to V The encode and compare architecture is presented in Fig.2. The two tags are matched if d is in either the first or the second ranges. Maintaining the error correcting capability the decoder is removed from the critical path by introducing an encoder. Encoder is much simpler than the encoder. Thus the complexity overhead significantly reduces for this architecture.

Mary Francy Joseph and Anith Mohan / Procedia Technology 25 ( 2016 ) 544 551 547 Fig. 2. Encode and Compare architecture Fig. 3. SA based architecture for encode and compare method Since the above method needs to compute the hamming distance, [6] presented a circuit to compute the hamming distance. The circuit in Fig.3 performs XOR operations for every pair of bits in X and Y so as to generate a vector representing the bitwise distance of two code words. The subsequent half adders (HAs ) are used to count the number of ones in two adjacent bits in the vector. The number of ones is accumulated by passing through the following SA tree. In the saturated adder tree, the accumulated value z is saturated to Rmax + 1 if it exceeds Rmax. The final accumulated value indicates the range of d. The SA can be replaced with the parallel counter circuit. It is a multiple input circuit that counts the number of active input responses in the binary coded form. Using parallel counter to get accurate sum would increase the latency especially for two words that differ a lot. So saturated adder is preferred. But the compulsory saturation needs additional logic circuitry, the complexity of SA is higher than conventional adders. 3. Proposed Architecture In this section a new architecture that can reduce the latency and complexity for tag matching by using the concept of systematic codes is presented. In the encode compare method using SA, the comparison of two code words are done after the incoming tag is encoded. The data part of a systematic code word is immediately available for matching operation but the parity part is available after the encoding operation. By using this concept encoding

548 Mary Francy Joseph and Anith Mohan / Procedia Technology 25 ( 2016 ) 544 551 Fig. 4. Proposed architecture for tag matching process to generate the parity parts and tag comparison can be done in parallel. The architecture for computing hamming distance contains multiple butterfly formed weight accumulators. The proposed architecture is shown in the Fig.4. BWA count the number of ones among input bits.bwa consist of multiple stags of half adders. Each output bit of a HA is associated with a weight. The half adders in a stage are connected in a butterfly form so as to accumulate carry bits and sum bits of the upper stage separately. If an output bit of a half adder is set, the number of ones among the input bits in the paths reaching the HA is equal to the weight of the output bit. In Fig.5 (a) for example if the carry bit of the shaded half adder is set, the number of ones among the input bits i.e., A,B,C and D is 2. At last stage of Fig.6 (a) the number ones among the input bits is calculated as, We need the range of hamming distance, not the precise value so again it is possible to simplify the circuit. For example when Rmax = 1 two or more than two ones among the input bits can be regarded as the same case that is in the fourth range. We can replace several half adders with simple OR gate tree, as in Fig.5 (b). This is an advantage over SA. The overall architecture is explained in detail here. The bitwise difference vector for either data or parity bits is generated by each XOR stage. Hamming distance is computed using the BWA. Each BWA is in the revised form as in Fig.5 (b) and generates an output from the OR gate tree and several weight bits from the HA trees. In the interconnection such outputs are fed into the second level. The output of the OR gate tree is connected to the OR gate tree at the second level. The remaining weight bits are connected to the second level BWA s according to the weights. Each BWA at the next level is associated with a weight of a power of two that is less than or equal to Pmax, Where Pmax is the largest power of two that is not greater than Pmax + 1. As the weight bits associated with the fourth range are all ORed in the revised butterfly weight accumulators, the powers of two that are larger than Pmax will be avoided. For example (8, 4) SEC-DED is considered. The circuit is shown in Fig.6. Here Rmax = 2, Pmax = 2 so there are only two butterfly weight accumulators associated with weights two and one at the second level. Bits of weight 4 fall in the fourth range so they all are ORed. The remaining bits associated with eights 2 or 1 are connected to

Mary Francy Joseph and Anith Mohan / Procedia Technology 25 ( 2016 ) 544 551 549 a b Fig. 5. a) General structure b) new architecture for matching of ECC- protected data. Fig. 6. First and second level circuits for (8, 4) code corresponding BWAs. No hardware complexity is introduced here. Decision unit performs the tag matching by considering the four ranges of hamming distance. The functionality is specified by the truth table by taking the outputs of the previous circuits as inputs. U and V cannot set simultaneously so such terms are included as don t care terms in Table 1. One of the BWA at the first level finishes earlier than that of the other, some components ate second level may start earlier. Similarly some BWA s or the OR gate tree at the next level may provide their output earlier to the decision unit so that unit can begin its operation without waiting for all its inputs. Here the critical path becomes shorter. The proposed BWA can be developed using modified XOR gate. The conventional XOR gate consists of 5 gates. The modified XOR gate has one gate less than that of conventional XOR gate. The modified half adder will have two gates less than that of conventional half adder. The modified XOR gate and half adder are shown in Fig.7.The area of the proposed BWA can be decreased by reducing the number of gates.

550 Mary Francy Joseph and Anith Mohan / Procedia Technology 25 ( 2016 ) 544 551 4. Results Fig. 7. The modified XOR gate and half adder. The proposed architecture is coded in verilog and simulated using Xilinx ISE design suite 14.7. The results obtained are compared with the previous architectures. From the results it is evident that proposed architecture is effective in reducing the latency as well as complexity. The critical path of HA consist of only one gate while that of SA consist of several gates so the proposed architecture achieves lower latency than SA based architecture. The simulation results of the encode and compare architecture with BWA and modified BWA are shown in Fig.8 and 9. Fig. 8. Simulation Result of the Encode and Compare Architecture with BWA

Mary Francy Joseph and Anith Mohan / Procedia Technology 25 ( 2016 ) 544 551 551 Fig. 9. Simulation Result of the Encode and Compare Architecture with Modified BWA Table 1. Performance analysis Architecture No. of slices used Delay (ns) Power(mW) Decode and compare 950 50.72 1941 Encode and compare (SA) 790 43.82 1752 Encode and compare (BWA) Encode and compare (Modified BWA) 620 591 43.22 36.42 1230 970 5. Conclusion and Future work A new architecture for matching the tags protected with ECC is presented. The proposed architecture examines whether the incoming tag matches the stored tag if certain number erroneous bits are corrected. The results shows that proposed architecture can significantly reduce area and power overheads. More importantly, the delay which is a key factor in caches is also reduced. In the proposed architecture only single error is corrected and double error is detected. In future error detection and correction range can be increased and compared its performance with the proposed architecture. References [1] J. D. Warnock, Y.-H. Chan, S. M. Carey, H. Wen, P. J. Meaney, G. Gerwig and W. V. Huott Circuit and physical design implementation of the microprocessor chip for the Enterprise system, IEEEJ. Solid-State Circuits, vol. 47, no. 1, pp. 151 163, Jan. 2012. [2] E. E. Swartzlander, Jr, A review of large parallel counter designs, in Proc. IEEE Comput. Society Ann. Symp. VLSI, 2004, pp. 89 98. [3] S. Kim and A. Somani, Area efficient architectures for information integrity in cache memories, in Proc. Int. Symp. Comput. Archit., 1999, pp. 246 255. [4] S. Mukherjee, J. Emer, T. Fossum, and S. Reinhardt, Cache scrubbing in microprocessors: Myth or necessity, in Proc. Int. Symp. Dependable Comput., 2004, pp. 37 42. [5] W. Wu, D. Somasekhar, and S.-L. Lu, Direct compare of information coded with error-correcting codes, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 11, pp. 2147 2151, Nov. 2012. [6] Y. Lee, H. Yoo, and I.-C. Park, 6.4Gb/s multithreaded BCH encoder and decoder for multi-channel SSD controllers, in ISSCC Dig. Tech. Papers, 2012, pp. 426 427. [7] M. Tremblay and S. Chaudhry, A third generation 65nm 16-core 32-thread plus 32-scout thread CMT SPARC processor, in ISSCC. Dig. Tech. Papers, Feb. 2008, pp. 82 83.