PERFECTORY: A Fault-Tolerant Directory Memory Architecture

Hyunjin Lee, Student Member, IEEE, Sangyeun Cho, Member, IEEE, and Bruce R. Childers, Member, IEEE

Abstract: The number of CPUs in chip multiprocessors is growing at the Moore's Law rate, due to continued technology advances. However, new technologies pose serious reliability challenges, such as more frequent occurrences of degraded or even nonoperational devices, and they threaten the cost-effectiveness and dependability of future computing systems. This work studies how to protect the on-chip coherence directory from fault occurrences. In a chip multiprocessor, cache coherence mechanisms such as directory memory are critical for offering a consistent data view to all CPUs. We propose a novel online fault detection and correction scheme to enhance yield and resilience to runtime errors at a small performance cost. The proposed scheme uses smart encoding and coherence protocol adaptation strategies to salvage faulty directory entries. We also develop an online error recovery scheme that protects the directory memory from soft errors. We call our fault-tolerant directory memory architecture PERFECTORY. Evaluation results show that PERFECTORY achieves very high fault resilience: over 99 percent chip yield at 0.05 percent hard error ratio and 1,934 years MTTF at 1,000 FIT using a 100-processor cluster configuration. PERFECTORY limits performance degradation to less than 1 percent at 0.05 percent hard error ratio and requires significantly smaller area overheads than existing redundancy approaches.

Index Terms: Chip multiprocessor, cache coherence, chip yield, lifetime reliability.

1 INTRODUCTION

Chip multiprocessors (CMPs) integrate multiple CPUs (or "cores") on a single chip. Today, two- to eight-core CMPs are common in general-purpose systems. The number of cores in a chip is expected to grow significantly in the coming years [4], [8], [25], [41]. Another major processor chip design trend is to include large on-chip memory. In certain cases, over 50 percent of the chip's area is dedicated to cache memory [26]. As future large-scale CMPs evolve, they will necessarily have to incorporate unique features found in existing multiprocessor systems, such as a directory-based cache coherence mechanism [17], [19].

This paper proposes PERfectly Fault-tolerant direcTORY memory (PERFECTORY), a new on-chip directory architecture, to deal with the growing probability of fault occurrences in CMPs. There are a number of reasons why fault tolerance in directory memory is of growing importance. First, future large-scale CMPs are expected to use a directory cache coherence protocol [9], [12], [53]. So far, most commercial CMPs have fewer than eight cores and use snooping-based cache coherence protocols [15], [46] due to their simple implementation and good performance for small-scale multiprocessing. However, as in multiprocessors with a large number of processing nodes [17], [28], a directory cache coherence protocol will be necessary to avoid on-chip network contention in larger scale CMPs.

Second, the current processor chip design trend, more cores and larger cache memory, implies an increase in the
on-chip directory memory size. Fig. 1 shows the ratio of the directory area to the shared L2 cache area with a varying number of cores. While a 64-core CMP's full-map directory area is modest (13 percent of the L2 cache's data area), its area overhead grows quickly to nearly 50 percent if we scale the core count to 256. There are more area-efficient directory schemes than the full-map directory [2], [18]. However, as shown in Fig. 1, even the most area-efficient schemes still use a significant chip area at a higher core count.

Fig. 1. Ratio of directory memory area to L2 cache data memory area. Three directory memory schemes are shown: full map, Dir8CV2, and sparse Dir8CV2 [18]. Dir8CV2 and sparse Dir8CV2 are among the most area-efficient directory memory designs (at a performance cost).

Finally, future process technologies and chip operating conditions can make the on-chip directory memory (and memory components in general) more susceptible to faults. Continuous device scaling results in more frequent hard errors at manufacture time and more severe variations in device characteristics [7], [40]. Moreover, thermal variations, soft errors, and aging effects add to the probability of a memory failure after a CMP chip has been deployed [7], [45]. For example, Borkar foresees the possibility of an extremely unreliable chip design environment, where 20 percent of all on-chip devices are unreliable [7]. According to this projection, the hard error ratio (HER) of memory cells could mount to between 0.04 percent and 0.1 percent. An HSPICE simulation study with industry-standard parameters also shows a 0.1 percent SRAM error rate [3]. Furthermore, voltage scaling techniques, which have proved valuable for power management, can cause a large fraction of memory cells to become unreliable. Wilkerson et al. [51] show that the bit failure rate can be higher than 0.1 percent with aggressive voltage scaling.

Unlike existing degradable memory design strategies that discard faulty memory entries or even kill the whole chip [22], [29], [41], PERFECTORY salvages faulty directory entries using smart encoding and cache coherence protocol adaptation strategies. In a 64-core CMP, it can tolerate up to

nine hard errors in a single directory entry. The number of tolerable errors increases as the core count grows. PERFECTORY trades program performance slightly to enforce cache coherence when errors occur in the coherence directory. It performs online testing of directory entries efficiently as they are updated or referenced. Such capabilities enable PERFECTORY to dynamically detect faults and operate correctly in the presence of multibit errors, leading to substantially improved chip yield. For example, at 0.2 percent HER, yield increases to 100 percent with PERFECTORY. In comparison, traditional yield-enhancing schemes, such as redundancy [5], [14], [32], fail to produce meaningful yield under such a severe error condition. Due to its robust online error detection and correction capability, PERFECTORY also significantly improves the directory memory's lifetime reliability against runtime errors caused by variations, aging, and single event upsets. PERFECTORY's mean time to failure (MTTF) is calculated to be 1,934 years at 0.05 percent HER and 1,000 FIT (failures in time), when using a 100-processor configuration. Compared with existing yield improvement strategies, such as row-column redundancy and error correction code (ECC), PERFECTORY has a significantly smaller area cost: less than 2.5 percent of the directory memory area for 1,024 cores. Its performance slowdown is also minor, well less than 0.1 percent; a state-of-the-art disabling scheme has a four percent slowdown at the same HER of 0.05 percent.

Much existing research focuses on protecting the cache memory [3], [24], [27], [29], [30], [38], [43], but no previous work has specifically studied the fault tolerance of the CMP coherence directory memory. Our proposed scheme is drastically different from other existing work that specifically focuses on the cache memory. We make the following contributions in this paper:

- We develop a novel online error detection scheme for hard and intermittent faults in directory memory with very small area overhead. This scheme detects faults by checking a directory entry's integrity on each directory entry read and write event.
- We propose a novel online error correction scheme. We utilize redundant directory entry bits to encode error correction information. Moreover, when data sharing information is found damaged, we use a speculative correction strategy to guarantee correct program execution with negligible performance overhead.
- We develop a robust soft error detection and correction technique that uses a single parity bit per directory entry. The technique detects soft errors in a directory entry having multiple hard errors.
- We describe the cache coherence controller design and implementation issues when the proposed scheme is used.
- We evaluate the proposed fault-tolerant directory memory architecture in terms of yield, resilience to runtime errors, chip area cost, and performance impact, using Monte Carlo simulation methods and a detailed CMP architecture simulator.
The rest of this paper is organized as follows. As background, Section 2 presents an overview of cache coherence protocols, gives a brief survey of applicable conventional fault masking schemes, and discusses why these schemes are not cost-effective for on-chip directory memory. In Section 3, we describe PERFECTORY's algorithms and mechanisms in detail. Section 4 evaluates PERFECTORY and compares it with other existing schemes, and Section 5 concludes the paper.

2 BACKGROUND

2.1 Directory-Based Cache Coherence in CMPs

Many multiprocessor systems implement shared main memory and private cache memory in each processing node to ease programming and enable fast access to memory [17], [28]. They use a coherence protocol to guarantee that processors access up-to-date memory data in the presence of data sharing via private caches. In general, there are two types of coherence protocols, broadcast-based and directory-based. Broadcast-based protocols resort to broadcasting messages for coherence actions using a snooping bus, while directory-based schemes send one-to-one messages using a coherence directory. Broadcast-based snooping protocols are not scalable to large multiprocessors because coherence messages use up much bus bandwidth when there are many processors sharing data [2]. Directory-based protocols are more scalable due to the accurate data sharing information kept in the coherence directory. For instance, when a CPU issues a read request to the shared memory, the coherence directory is consulted first (using a one-to-one message) to identify a CPU that has up-to-date data. On a write request, the directory is consulted and the directory controller, if needed, will forward invalidation messages to the specific CPUs that actually have the corresponding cache block.

As the number of cores in CMPs increases, scalable directory-based coherence protocols will be used in future CMPs [9], [12], [53]. Fig. 2a shows a CMP architecture that uses a directory-based cache coherence protocol. Each core in the figure has a portion of the directory memory and the shared L2 cache. The coherence directory memory consists of a series of entries that record coherence-related information for associated cache blocks.

Fig. 2. (a) A 16-core CMP architecture with shared L2 cache. Both the coherence directory and the shared L2 cache are distributed. Each directory entry, associated with an L2 cache block, has a sharer field and a state field. The directory entry example in the figure shows that the corresponding cache block is shared by two cores, cores 2 and 7. (b) and (c) SGI Origin2000 [28] protocol examples: (b) L2 cache block states and their description, (c) L1 cache block states and their description.

Each entry has two fields, sharer and state. The sharer field shows which cores have copies of a shared data block. The sharer field is typically a bit vector (one bit per core) whose width is the number of cores in the system. Hence, given N cores, the sharer field holds N bits. The state field keeps the coherence state of the block. While different cache protocols may use a different set of states, there are five states that appear commonly in the literature [2], [23], [48]: I (Invalid), S (Shared), E (Exclusive), M (Modified), and O (Owned). The cache block state information is kept in the directory memory and in each core's private cache. As an example, Figs. 2b and 2c list the coherence states an L2 and an L1 cache block can have, respectively, in the SGI Origin2000 multiprocessor system [28]. Because the size of the state field does not increase with the number of nodes, we note that the sharer field will dominate the directory memory area in future large-scale CMPs (i.e., N > 16).

In the directory entry example of Fig. 2a, cores 2 and 7 have a private copy of a shared cache block in the L2 cache of core 15. The corresponding directory entry for the block in core 15 has the Shared state, and its sharer field shows that cores 2 and 7 have cached the block in their L1 cache. The coherence state of the same cache block in cores 2 and 7 is Shared (i.e., read only). When cores 2 and 7 read the block, they do not need to take coherence actions. However, when either core intends to update the block, it must first acquire the exclusive privilege for the block by coordinating with the cache coherence controller. The controller then sends appropriate coherence messages to cores that have the cache block to avoid potential stale data accesses, according to the coherence protocol.

Memory accesses from cores may or may not entail coherence actions. For instance, when a core reads from a valid cache block that is already in its private cache (a cache read hit), no coherence actions are needed. However, when a core intends to update a shared cache block or access a block that is not currently in its private cache, coherence actions are incurred. In the former case, the required coherence actions include invalidating all current sharers of the cache block (using the information in the corresponding directory entry) and updating the directory entry with the new cache block state and owner. In the latter case, the missing cache block will be brought to the core's private cache (the directory is checked for another processor to provide the data) and the corresponding directory entry will be updated.

Depending on a block's sharing status, the sharer field and the state field for that block may be updated together or separately. When a cache block is newly brought to the on-chip memory hierarchy, a corresponding directory entry will be allocated and its content, both the sharer field and the state field, updated.
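To make this structure concrete, here is a small sketch of a full-map directory entry with an N-bit sharer vector and a state field, showing a read adding a sharer bit and a write invalidating the remaining sharers. The class and method names are our own illustration, not anything defined in the paper.

```python
from enum import Enum

class State(Enum):
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2

class DirectoryEntry:
    """Full-map directory entry: one sharer bit per core plus a small state field."""
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.sharers = 0                  # N-bit vector; bit i set => core i holds the block
        self.state = State.INVALID

    def read_miss(self, core):
        # A new reader only sets its sharer bit; the block becomes (or stays) Shared.
        self.sharers |= 1 << core
        self.state = State.SHARED

    def write(self, core):
        # A writer must invalidate every other sharer and becomes the exclusive owner.
        to_invalidate = [i for i in range(self.num_cores)
                         if (self.sharers >> i) & 1 and i != core]
        self.sharers = 1 << core          # one-hot encoding in the Exclusive state
        self.state = State.EXCLUSIVE
        return to_invalidate              # cores that must receive invalidation messages

# Example based on Fig. 2a: cores 2 and 7 share a block; core 2 then writes it.
entry = DirectoryEntry(16)
entry.read_miss(2)
entry.read_miss(7)
assert entry.write(2) == [7]              # only core 7 needs an invalidation message
```

Note how the sharer field degenerates to a one-hot vector in the Exclusive state; this sparsity is what Section 3.2.1 later exploits.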
When a new core reads from an existing shared cache block, the block's sharer field will be updated, but the state field remains unchanged. In yet another example, when a core writes to a shared cache block, both the sharer field and the state field of the directory entry need to be updated.

2.2 Conventional Approaches to Masking Faults

As we showed previously, accurately maintaining the data sharing information in the directory memory is necessary for correct program operation. In this section, we describe three conventional fault covering schemes for memory structures, which could be used for the directory memory as well: circuit-level redundancy, error detection and correction codes, and graceful degradation.

Circuit-level redundancy. Redundant rows, columns, and subarrays have been used to tolerate hard errors at the

time of manufacture. When faulty bits are detected, damaged blocks with faulty bits are replaced by sound redundant blocks, i.e., rows, columns, or subarrays [5], [14], [32]. Moreover, with help from the built-in self-test (BIST) circuitry, mapping of circuit-level redundancy can be done at power-on time, rather than at manufacture time [22]. When there are more faulty blocks than can be covered by the available redundancy, however, two choices remain: discarding the chip, which causes yield loss, or salvaging the remaining memory capacity by removing faulty memory elements from usage. The latter approach, graceful degradation, is discussed below.

Circuit-level redundancy is considered inefficient. It is not scalable because of the large area overhead reserved at design time. Recovering from a single-bit fault requires replacing a whole row or column. Because a chip built with a future technology (e.g., 32 nm) is expected to have many more faults [7], guaranteeing a target yield using circuit-level redundancy will inevitably lead to a large chip area overhead.

Error detection and correction codes (EDC/ECC). Single error detection (SED) codes, such as parity, and single error correction double error detection (SECDED) codes are widely used in memory designs [1], [25], [39]. Although multibit error correction codes are available (e.g., double error correction), they are often not suitable for use in time-critical memory due to their computational complexity and high power consumption. SECDED requires m + 2 code bits for 2^m data bits. A popular configuration of SECDED is an 8-bit code for a 64-bit data word (i.e., m = 6). ECC is mostly employed for soft error protection in recent processors [15], [22], [39]. However, ECC can cover errors only up to its correction limit (e.g., 1 bit for SECDED, 2 bits for DECTED), regardless of the error sources. Given ECC's versatility, there have been attempts to use ECC to cover hard faults [3], [44]. The main trade-off in this case is enhanced yield (the hard fault is covered) versus sacrificed immunity against other types of faults, such as wear-out faults and soft errors. If the single error correction capability of SECDED is used to cover a hard fault in a cache block, for example, the corresponding block becomes vulnerable to a soft error. A two-dimensional error coding was proposed to guarantee soft error immunity when a cache block is damaged by a hard fault [24]. Although such a smart hardware scheme increases yield and lifetime reliability considerably, multibit correction in this scheme usually requires a huge cycle penalty, e.g., a complete test of memory using BIST. Multibit error correction schemes with a small cycle penalty have been proposed [6], [34]. However, these schemes' large area cost makes them inadequate for the directory memory.
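For reference, the SECDED sizing quoted above (m + 2 check bits for 2^m data bits, e.g., the (72,64) configuration) follows from the usual Hamming-plus-parity bound; the helper below is a generic illustration, not the paper's encoder.

```python
def secded_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1, plus one extra parity bit (SECDED)."""
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1

assert secded_check_bits(64) == 8      # the common (72,64) SECDED configuration
assert secded_check_bits(32) == 7      # the (39,32) configuration
```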
Graceful degradation schemes. Due to the deficiencies of the other approaches, researchers have considered graceful degradation strategies as a viable approach to covering faults in various memory structures [3], [10], [22], [27], [29], [30], [38], [43], [51]. The basic idea of graceful degradation, when applied to cache memory, is to delete a faulty cache portion so that the processor will not use the cache blocks that fall into the faulty portion. Because the loss of a few cache blocks will degrade processor performance to only a limited extent, a graceful degradation approach is more flexible and cost-effective than the redundancy approaches [29], [30], [38], [43]. Recently, Intel introduced a graceful degradation mechanism for low-level cache memory, dubbed Cache Safe Technology [10]. It disables a cache block when it detects that an intermittent fault in the cache block transitions to a hard fault, based on observed ECC actions. The maximum number of disabled blocks is limited to 32 to minimize the added hardware resources and performance penalty. A similar technique has been adopted in the POWER5 [41] and POWER6 processors [22]. Programmable steering logic, initiated by BIST circuitry and activated during processor initialization, replaces faulty bits. When the L3 cache memory detects faults that exceed a predetermined threshold, the processor invokes a dynamic L3 cache line delete action.

2.3 Fault-Tolerant Multiprocessor Systems

There have been efforts to tolerate system-level faults in large-scale parallel systems [16], [33], [35], [50]. These works focused on communication node failures and damaged packets so that the whole system can be salvaged in the presence of intermittent or hard faults. However, they only considered failures of nodes (i.e., processing units in a large system), not fault occurrences in a processor's memory elements. The focus of our work is to protect the on-chip directory memory from fault occurrences.

3 PERFECTORY

In this section, we describe the online fault detection and correction algorithms of PERFECTORY. We also discuss its hardware controller design issues.

3.1 Hardware Model

We assume a tile-based CMP architecture that employs an MESI-based coherence protocol as our baseline design. Our coherence protocol is similar to the one used in the SGI Origin2000 [28], whose directory structure and states are depicted in Fig. 2. The SGI Origin2000 was a groundbreaking multiprocessor system that used a directory-based cache coherence protocol. The details of the protocol are well documented in [36]. We assume that the state field for the L2 cache's directory has three bits, two for the three basic states (i.e., the E, S, and I states) and one for the corresponding three busy states. The sharer field uses the full-map encoding method and records which cores share a block using an N-bit vector for an N-core CMP configuration. In this paper, our main focus is to protect the sharer field in the coherence directory because it is much larger than the state field in a large-scale CMP. The sharer field size also grows with the increasing core count and cache capacity. We assume that the state field is protected with other design methods such as redundancy, ECC, or special hardened memory cells [51]. Thus, we do not consider fault occurrences in the state field.

We use a bit-level fault model, where bits in the directory memory are independently struck by a hard error, a transient error, or a soft error. Hard errors are caused by defects and severe process variations. Random process variations are difficult to control in advanced process

technologies, especially in sub-45 nm regimes, and can become a dominant source of memory cell failures. Transient errors occur in weak memory cells when the chip's operating conditions change, such as voltage, frequency, temperature, and lifetime. We classify a hard or transient error as either a stuck-at-1 or a stuck-at-0 error. Once a stuck-at error occurs in a memory cell, its content is fixed to 1 or 0 thereafter. The severity of error occurrences is measured with HER, which is the ratio of the failed memory cell count to the total memory cell count. Equivalently, HER is the probability of having an error in a memory cell. A soft error may flip a memory cell's content from 0 to 1 or 1 to 0, but does not affect the functionality of the cell. We use failures in time (FIT, failures per 10^9 device hours) as the rate of soft error occurrences for a given memory block.

3.2 Hard Error Detection and Correction Strategies

We make two key observations that are useful for tolerating faults in the coherence directory. First, when a block is in the exclusive state, the directory entry's sharer field normally has only a single bit set (to value 1) and all other bits reset (to value 0). Effectively, the encoding of the sharer field in this case is one-hot coding, which is very sparse. Hence, we can extract exploitable redundancy in the sharer field if we employ a denser coding method than one-hot. Second, when a block is in the shared state, updating the sharer field involves one of two possible atomic actions: 1) one bit is set (from 0 to 1) when adding a new sharer, or 2) all bits are reset to 0 when invalidating the corresponding cache block. A latent fault in the sharer field (e.g., failure to set a bit in action 1 because the bit is stuck-at-0) may turn into a program error if the directory controller is unaware of the fault and fails to send an invalidation message to the corresponding core in action 2 later. Fortunately, there is an opportunity to avoid this error situation, if the directory controller is capable of detecting a hardware fault before taking action 2, by speculatively invalidating potential sharers. Such a conservative, speculative invalidation strategy may degrade program performance if the speculation is wrong (i.e., there was no error caused by the hardware fault).

Fig. 3. (a) Original sharer field encoding example (16 cores). (b) The proposed ECC-pointer pair encoding for the same example. (c) The number of tolerable errors in one directory entry using the proposed ECC-pointer pair encoding method grows with the number of cores.

3.2.1 Protecting Exclusive Directory Entries

Based on the first observation above, let's consider how directory entries in the exclusive state can be protected. The key idea is that we binary-encode the core number (i.e., a pointer) instead of using one-hot coding. This frees up bits that can be used to embed ECC. Figs. 3a and 3b depict how the sharer field is encoded using the conventional one-hot coding method and our ECC-pointer method, respectively. With the embedded ECC, when there is a bit error in the sharer field, the correct core number can be restored. The figure shows a 16-core CMP example; hence, a pointer requires four bits and its ECC is four bits. In general, pointer encoding requires ⌈log2 N⌉ bits given N cores. There are (N - ⌈log2 N⌉) remaining bits in the sharer field. Provided that we use SECDED to protect an encoded pointer, we need (⌈log2 ⌈log2 N⌉⌉ + 2) bits for the ECC.
Therefore, an ECC-pointer pair requires (⌈log2 N⌉ + ⌈log2 ⌈log2 N⌉⌉ + 2) bits. To further strengthen the error correction power, multiple ECC-pointer pairs can be embedded using our method, as in Fig. 3b. We have space to record two ECC-pointer pairs in this example. The proposed ECC-pointer encoding is capable of correcting three errors and detecting four errors when the errors are evenly distributed between the upper and lower halves of the directory entry. When more than three errors concentrate in one half, either upper or lower, straightforward error correction is not possible because the SECDED code cannot differentiate the erroneous half from the error-free half. In this case, one may inspect each bit (e.g., to see whether it is stuck-at-1 or stuck-at-0) to perform elaborate errata-based error correction. We do not pursue such a strategy in our current work because the probability of this case is extremely low. One may also choose to disable heavily faulty directory entries (and cache blocks) with a degradable cache mechanism [29], [38].

The proposed encoding strategy has the desirable property of correcting more bit errors as the number of cores increases. Fig. 3c depicts how the number of tolerable errors grows with the number of cores. For example, when there are 256 cores, 19 ECC-pointer pairs can be embedded, which provides more than 37-bit error recovery within a directory entry. Note also that the proposed strategy works equally well to combat hard errors, intermittent errors, and soft errors. Because the proposed encoding strategy is based on ECC, we do not need separate procedures to detect and correct errors for directory entries in the exclusive state. For errors that occur to directory entries in the shared state, however, we need separate error detection and correction steps.
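The sizing arithmetic above is easy to check numerically. This small sketch (our own illustration of the formulas in the text) reproduces the counts quoted in the paper: two pairs for 16 cores, five for 64, and 19 for 256. The tolerable-error line uses 2 x (pairs - 1) + 1, which is one reading consistent with the figures given (three, nine, and 37 errors, respectively), not a formula stated explicitly in the paper.

```python
from math import ceil, log2

def ecc_pointer_layout(num_cores):
    """Bits per ECC-pointer pair and how many pairs fit in an N-bit sharer field."""
    ptr_bits = ceil(log2(num_cores))          # binary-encoded core number
    ecc_bits = ceil(log2(ptr_bits)) + 2       # SECDED over the pointer, as in the text
    pair_bits = ptr_bits + ecc_bits
    pairs = num_cores // pair_bits            # sharer field is num_cores bits wide
    tolerable = 2 * (pairs - 1) + 1           # reading consistent with the quoted figures
    return pair_bits, pairs, tolerable

assert ecc_pointer_layout(16) == (8, 2, 3)    # 4-bit pointer + 4-bit ECC, two pairs
assert ecc_pointer_layout(64) == (11, 5, 9)   # matches "up to nine hard errors" for 64 cores
assert ecc_pointer_layout(256) == (13, 19, 37)  # matches "19 pairs ... 37-bit error recovery"
```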

3.2.2 Protecting Shared Directory Entries

Our second observation forms the basis for protecting directory entries in the shared state: an error may propagate when the directory controller is unaware of the error and fails to deliver an invalidation message. Hence, one would be able to guarantee correct program operation if the directory controller is capable of detecting errors and sends out invalidation messages speculatively and conservatively on detecting errors. This implies that error detection and correction for shared directory entries need to be done at invalidation time. Nevertheless, to offer soft error immunity, PERFECTORY detects faults and updates directory entries whenever there is a read or update event to a shared directory entry. Soft error immunity is discussed separately in Section 3.3.

We propose a proactive online error detection algorithm for sharer bits using a simple bit inspection method. To detect hardware errors in a directory entry, PERFECTORY performs an on-the-fly test of the entry by applying bit patterns to the entry and reading from it. For example, a stuck-at-1 error is detected when 0 was written and 1 is read. Similarly, a stuck-at-0 error is detected when 1 was written and 0 is read. Our detection algorithm can detect any number of hard and intermittent errors and is exercised whenever a directory entry is read ("read-time detection"). Fig. 4 shows how the read-time detection algorithm works.

Fig. 4. Read-time detection algorithm. (a) Hardware architecture. (b) Example of a benign stuck-at-1 error. (c) Example of a stuck-at-1 error causing a speculative invalidation message and potential performance degradation. (d) Example of a stuck-at-0 error that lends itself to correction. (e) Example of a benign stuck-at-0 error causing a speculative invalidation.

Our online read-time detection algorithm works in four steps. First, the directory controller reads the sharer field of a directory entry whenever the entry is referenced, e.g., to invalidate a cache block. The read value is saved in a temporary register, Register 1 in Fig. 4a. Second, the value in Register 1 is negated (i.e., the bits are flipped) and written back to the sharer field. Note that the original value in Register 1 remains intact. Third, the directory controller reads out the same entry once again and stores it in Register 2. Finally, the controller finds hardware errors by comparing the contents of Register 1 and Register 2 using a bit-by-bit comparator (XNOR logic). A bit position with a 1 from the comparator has a hardware error. The type of error (stuck-at-1 or stuck-at-0) can be determined by inspecting the bit in the same position in Register 1.

Figs. 4b, 4c, 4d, and 4e illustrate four examples of error scenarios. We explain how PERFECTORY handles an error based on these examples. The first case we consider is a stuck-at-1 error. In fact, a stuck-at-1 error is benign and does not lead to a coherence problem in our framework: there will potentially be more invalidation messages than needed at invalidation events, but there is no harm to the correctness of the program execution. Fig. 4b shows an example where a bit hit by a stuck-at-1 error is set (to have the value 1). In this case, PERFECTORY does not incur any further action at a later invalidation time than an error-free entry would. Fig. 4c gives another benign error example (failure to store 0 due to the hardware error), where PERFECTORY will conservatively generate one additional invalidation message to core 2. While a processor can continue operating correctly in the presence of a benign error, PERFECTORY may log this error to the operating system for other management purposes.
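A minimal software sketch of the four-step read-time detection just described, with a stuck-at cell model standing in for the real SRAM array; all names here are illustrative, not the paper's hardware.

```python
class FaultySharerField:
    """Sharer field whose cells may be stuck-at-0/1 (dict: bit position -> stuck value)."""
    def __init__(self, width, stuck=None):
        self.width, self.stuck, self.bits = width, stuck or {}, 0

    def write(self, value):
        self.bits = value

    def read(self):
        v = self.bits
        for pos, s in self.stuck.items():          # stuck cells override what was written
            v = (v | (1 << pos)) if s else (v & ~(1 << pos))
        return v

def read_time_detect(field):
    """Four steps: read, write back the complement, read again, XNOR-compare."""
    reg1 = field.read()                             # step 1: Register 1
    field.write(~reg1 & ((1 << field.width) - 1))   # step 2: write the flipped value back
    reg2 = field.read()                             # step 3: Register 2
    faults = {i: f"stuck-at-{(reg1 >> i) & 1}"      # step 4: bits that did not flip are faulty
              for i in range(field.width)
              if ((reg1 >> i) & 1) == ((reg2 >> i) & 1)}
    return reg1, faults

# Example: a 16-bit entry whose bit 2 is stuck at 0.
f = FaultySharerField(16, stuck={2: 0})
f.write(0b0000_0000_0000_0111)     # try to record cores 0, 1, and 2 (bit 2 is silently lost)
print(read_time_detect(f))         # -> (3, {2: 'stuck-at-0'}); core 2 would be invalidated speculatively
```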
Now, let's consider a stuck-at-0 error. A stuck-at-0 error should be recovered because it can lead to inadvertently dropping necessary invalidation messages, and hence can cause incorrect program execution. Fig. 4d shows a stuck-at-0 error that would not allow writing 1 to the bit position. In this case, PERFECTORY detects the error and correctly generates the two necessary invalidation messages (to cores 0 and 1). Fig. 4e depicts the second case of a stuck-at-0 error. This time, the directory controller generates three invalidation messages to guarantee the program's correct execution, to cores 0, 1, and 2 (speculated). PERFECTORY detects that bit number 2 is faulty. However, because it is unable to decide whether the bit was previously set to 1 or 0, it speculatively generates an invalidation message for core 2.

The above examples show that PERFECTORY is capable of catching all hard errors that occur in a directory entry, and their propagation is effectively prevented by conservatively generating invalidation messages. In a CMP that uses PERFECTORY, an unnecessary invalidation message (generated speculatively on a detected error) received by a core is simply dropped (or acknowledged, depending on the coherence protocol) when the specified cache block is not

found locally. PERFECTORY generates at most k unnecessary invalidation messages when the directory entry has k errors. Moreover, not all of these invalidation messages incur a performance penalty, as we see in Fig. 4d. Note also that PERFECTORY does not use any additional storage space other than the two registers (Register 1 and Register 2 in Fig. 4a) in the directory controller.

3.3 Protecting Directory Entries from Soft Errors

The proposed online error detection algorithm of PERFECTORY works well against hard errors and intermittent errors. However, it is not capable of detecting soft errors because they come from one-time events such as a hit by an alpha particle. To offer immunity against soft errors, we propose to introduce a single parity bit to the sharer field and to detect errors at update times ("update-time detection"). Imagine that a soft error occurred in a directory entry. At a later time, when the same directory entry is read, its parity bit will reveal that a soft error has occurred since the last read of the same entry. The situation becomes complex if a hard error (due to aging, for example) has also occurred in the same directory entry. This unexpected hard error may interfere with the parity bit semantics. That is, a hard error that occurs after the last read time may look like a soft error according to the parity bit.

To overcome this complication, we propose the novel idea of updating the parity bit whenever we detect a (new) hard error, further utilizing the sharer bits for the purpose of soft error immunity. We calculate and update the parity bit based on the adjusted sharer field that includes the hard error. If the parity bit itself is damaged by a new hardware error, the sharer field is modified to still guarantee soft error immunity. The goal of this coordinated sharer field and parity bit update algorithm is to tackle multiple hard errors and one soft error in a single directory entry.

Fig. 5. Soft error protection examples using update-time detection. (a) Undetectable intermittent fault; parity is preserved. (b) 1-bit hard/intermittent fault detected at write time; parity is updated. (c) Benign multibit faults including parity. (d) Parity fault is detected, parity update fails, and error compensation is done by updating the data value.

Fig. 5 illustrates how our hard error parity update is done, assuming an odd parity scheme. Fig. 5a first describes the situation when a new hard error is detected. This case, however, does not require correcting the parity bit because the error has not damaged the original data. Hence, in this case, a soft error will be caught by the parity bit normally. Fig. 5b presents a case where the new hard error flips a bit in the original data. Therefore, the parity bit indicates an error condition even though the error did not come from a soft error. In this case, PERFECTORY needs to update the parity bit (from 1 to 0) once it identifies the hard error. When the controller accesses this block later, it can detect a soft error using the parity. Once a soft error is detected and invalidation is needed, the controller sends invalidation messages to all cores. Although some of them are speculative invalidation messages, they do not affect program correctness, and they have negligible performance impact because soft errors are an extremely rare event.

Figs. 5c and 5d show situations when the parity bit itself is damaged. In Fig. 5c, a parity update is unnecessary.
Although the parity bit has a hard error, the original parity value is the same as the error value (1 in this case), and the fault in the parity bit does not affect soft error detectability. However, Fig. 5d is the case where we are unable to update the parity bit because it is damaged. In this example, to make the parity odd, we replace a 0 with a 1 in bit 3 (forcing core 3 to become an artificial sharer). Therefore, two additional invalidation messages can be generated later: one from the hard error and the other from the parity compensation. If there is a soft error later, PERFECTORY can detect it by using the odd parity.
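A rough sketch of the coordinated parity-bit update described above, assuming odd parity as in Fig. 5. The helper names and the choice of which spare bit to force when the parity cell is stuck are our own; the paper's Fig. 5d example forces bit 3.

```python
def popcount(x):
    return bin(x).count("1")

def fix_parity(observed_sharers, parity_cell_stuck, width):
    """Keep overall odd parity (sharer bits + parity bit) after a hard error is found.

    observed_sharers: the sharer field as actually read, hard errors already applied.
    parity_cell_stuck: None if the parity cell is healthy, else its stuck value (0 or 1).
    Returns (sharer_value_to_store, parity_bit). Illustrative only.
    """
    wanted = (popcount(observed_sharers) + 1) % 2       # parity value giving an odd total
    if parity_cell_stuck is None or parity_cell_stuck == wanted:
        return observed_sharers, wanted                 # the parity bit absorbs the change
    # The parity cell cannot hold the needed value: compensate by forcing one extra
    # sharer bit (an "artificial sharer"), which later costs only a speculative invalidation.
    for i in range(width):
        if not (observed_sharers >> i) & 1:
            return observed_sharers | (1 << i), parity_cell_stuck
    return observed_sharers, parity_cell_stuck          # degenerate case: all sharer bits set
```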

Although PERFECTORY has excellent hard, intermittent, and soft error immunity, it has a limitation. If multiple errors occur in a single directory entry after the last access and at least one of them is a soft error, PERFECTORY is unable to detect them all. For example, one harmful hard error and one soft error may occur after the last access of the directory entry. Because PERFECTORY primarily focuses on hard/intermittent errors and this multibit error case is extremely rare, we do not consider this case in this paper. We note that existing ECC schemes are also unable to handle this case. We also note that there are design practices that prevent multiple soft errors by interleaving bit lines [37], [39].

3.4 PERFECTORY Controller Architecture

Fig. 6. PERFECTORY controller architecture.

Fig. 6 depicts the hardware architecture of the PERFECTORY controller. It consists of two parts: an ECC encoder/decoder for the Exclusive state and an online error detector for the Shared state. The output of these two blocks goes to the cache controller with the recovered sharer information. When the state is Shared, the delivered sharer information may include speculation. The ECC-pointer encoder/decoder block has multiple encoders/decoders and an arbiter. In a 64-core directory, a maximum of five ECC-pointer encoder/decoders can be embedded. When the directory entry is updated to assume the Exclusive state, the ECC encoder converts the core number into the ECC-pointer format. When the directory entry is read, the five ECC decoders decode the five ECC-pointer pairs in the sharer field in parallel and produce their results. If any output has an uncorrectable error, that output value is discarded. If more than one ECC-pointer pair has uncorrectable errors, the arbiter compares the five outputs and selects the final output. In the extreme case where all five decoders are unable to recover from errors, that directory entry is disabled. The online detector, whose architecture is shown in Fig. 4a, interfaces between the cache controller and the directory memory.

4 EVALUATION

This section evaluates the effectiveness of PERFECTORY and other memory fault covering schemes in protecting the coherence directory. We use yield and lifetime reliability as the two key metrics. When comparing PERFECTORY with graceful degradation schemes, we also use performance and hardware cost (chip area) as metrics.

4.1 Experimental Setup

We use a 64-core CMP model similar to the one in Fig. 2a for all evaluations. It has 64 tiles, each with a two-issue in-order core, linked with an 8x8 mesh network. Simulation parameters are summarized in Table 1.

TABLE 1. Simulated Systems and Workload Parameters.

We compare four yield enhancement schemes in our experiments: circuit-level redundancy, ECC without redundancy, ECC with redundancy, and PERFECTORY. To evaluate their yield, we first calculate the directory memory area under the four schemes. We then inject hard errors into the directory memory area using a uniformly random distribution. Based on the errors injected, we evaluate whether all the errors can be covered by the yield enhancement scheme under study. Our evaluation methodology uses 100 Monte Carlo simulation runs per error injection level.

With regard to resilience to runtime errors (soft errors), we compare three schemes in this section: ECC, ECC with disabling, and PERFECTORY, using the MTTF metric. ECC with disabling is the combination of ECC and a state-of-the-art memory block disabling scheme similar to the Intel Cache Safe Technology [10]. When ECC is unable to correct an error (e.g., a two-bit error), the ECC-disabling scheme discards the erroneous block and does not use it further. We assume that the maximum number of disabled blocks is eight for the ECC-disabling scheme. In this experiment, our machine model is a 100-processor cluster, where each processor is a 64-core CMP. We assume a runtime error rate of 1,000 FIT per megabit [42] and measure how long the modeled coherence directory can tolerate runtime errors given a hard error injection level.
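To illustrate the yield methodology just described (uniform random error injection, 100 Monte Carlo runs per injection level), here is a simplified sketch. The coverage rule is only a toy stand-in for the real per-scheme analysis: it treats a directory entry as recoverable with up to nine faulty bits, the exclusive-state limit quoted for a 64-core entry, and ignores the shared-state detection path and the other schemes.

```python
import random
from collections import Counter

def monte_carlo_yield(total_bits, her, chip_survives, runs=100):
    """Fraction of simulated chips whose injected hard errors are all covered."""
    good = 0
    for _ in range(runs):
        n_errors = sum(random.random() < her for _ in range(total_bits))
        faulty_bits = random.sample(range(total_bits), n_errors)   # uniform placement
        if chip_survives(faulty_bits):
            good += 1
    return good / runs

def survives_toy(faulty_bits, entry_bits=64, limit=9):
    """Toy rule: every 64-bit entry must have at most nine faulty bits."""
    per_entry = Counter(bit // entry_bits for bit in faulty_bits)
    return all(count <= limit for count in per_entry.values())

# 1,024 directory entries of 64 sharer bits each, at a 0.2 percent HER,
# the point where the paper reports 100 percent yield for PERFECTORY.
print(monte_carlo_yield(64 * 1024, 0.002, survives_toy))   # should print 1.0
```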

When evaluating PERFECTORY and the disabling schemes in terms of performance degradation, we use a cycle-accurate trace-driven CMP architecture simulator [11]. In our simulation, contention at DRAM memory ports and network switches is modeled, including network queuing delay, network link delay, and delays caused by network congestion. We use four SPLASH-2 parallel benchmarks [52] to construct workloads. Each application runs with 16 threads. To fully exercise our 64-core CMP machine model, we form 64-thread workloads by either running four copies of a benchmark program together ("homogeneous") or running four different programs ("heterogeneous"). Hence, we have four homogeneous workloads (from the four benchmark programs) and one heterogeneous workload with all four benchmark programs. Each program runs in one of four geometric groups of 16 cores: NW, NE, SE, and SW. Workload parameters are summarized in Table 1. For presentation, we selected three HER values: light (0.05 percent), medium (one percent), and heavy (10 percent). The HER is simply the ratio between the number of injected errors and the total number of available memory bits.

4.2 Results

4.2.1 Yield

Fig. 7. (a) Yield versus HER. R32 is 32 redundant blocks per way (i.e., 512 redundant blocks for 16 ways). R32+ECC and R16+ECC are ECC with 32 or 16 redundant blocks. (b) Yield/Area versus HER.

Fig. 7a presents yields for four yield enhancing techniques: redundancy, ECC, ECC+redundancy, and PERFECTORY. Note that we use one very large redundancy-only configuration (i.e., R128, which has 128 redundant rows) and two ECC configurations with moderate redundancy (i.e., R16+ECC and R32+ECC, which have 16 and 32 redundant rows, respectively). The result reveals that PERFECTORY achieves higher yield than the other schemes by a large margin, especially when there are many errors. At 0.2 percent HER, all schemes except PERFECTORY have 0 percent yield, whereas PERFECTORY's yield is still 100 percent. It is clear that the yield of circuit-level redundancy and ECC falls quickly as HER is increased. The redundancy schemes tolerate errors only up to the predetermined redundant row count. ECC endures only a one-bit error for each block. As HER is increased and directory entries begin to have multiple errors, ECC fails quickly. Although the combination of ECC and circuit-level redundancy has much improved resilience against errors compared with either of them alone, its effectiveness drops much earlier than PERFECTORY's.

As discussed before, PERFECTORY does not use additional storage space, while circuit-level redundancy and ECC schemes require additional silicon area. Note that both yield and silicon area affect the cost of a chip. To expose the area overhead of each scheme in the yield calculation, Fig. 7b compares the same set of schemes using a slightly different metric, yield per area. The result illustrates the higher cost-effectiveness of PERFECTORY over the other schemes even more clearly.

4.2.2 Resilience to Runtime Errors

To illustrate runtime error resiliency, MTTF for 100 processors is shown in Fig. 8a. The result shows that ECC does not provide high resilience against runtime errors. MTTF for ECC at 0.05 percent HER is roughly one month (36 days).
This is because ECC does not provide soft error immunity for blocks that have already been struck by a hard error. As HER is increased, many directory entries become vulnerable to soft errors. ECC with disabling improves the error resilience, but not by a large margin. With 0.0005 percent HER, ECC with disabling has a relatively high MTTF (89 years). However, at 0.05 percent HER, its MTTF falls quickly to less than one year (288 days). PERFECTORY shows a much longer MTTF than the other schemes: 1,934 years at 0.05 percent HER. The resilience of PERFECTORY against soft errors comes from its capability to detect a soft error in the presence of multiple hard errors with online testing.

The previous result showed that disabling adds to MTTF. Naturally, one may want to implement an unlimited disabling scheme to extend MTTF further. The key issue in an unlimited graceful degradation scheme is the performance loss as we continue to delete more and more resources. Hence, we will address the question of "how many directory entries remain operational as more and more hard errors are injected under an unlimited disabling scheme?" before we actually study the performance and area cost of PERFECTORY and an unlimited disabling scheme in the next section. One can think of the hard errors in this experiment as errors caused by aging [45].
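For orientation on the units behind these MTTF figures: an aggregate failure rate expressed in FIT converts to MTTF as 10^9 hours divided by the total FIT. The sketch below performs only this conversion with hypothetical directory sizing; it does not reproduce the paper's failure model. The gap between such a naive number and the MTTFs reported above reflects that only errors a scheme cannot absorb count as failures.

```python
def mttf_years(total_fit):
    """Convert an aggregate failure rate in FIT (failures per 1e9 hours) to MTTF in years."""
    return (1e9 / total_fit) / (24 * 365)

# Hypothetical sizing: 1,000 FIT per megabit, 16 megabits of directory per chip, 100 chips.
raw_fit = 1000 * 16 * 100
print(mttf_years(raw_fit))   # about 0.07 years if every single soft error were fatal
```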

Fig. 8. (a) MTTF (days) versus HER. ECC alone fails at the first multibit error. ECC+8 can disable up to eight blocks but fails at the ninth multibit error. PERFECTORY fails at the first block that it is unable to correct (Section 3.2.1). (b) Effective capacity versus HER. ECC+Disable combines ECC with unlimited disabling. Disable is an unlimited disabling scheme.

Fig. 8b presents the result: the effective capacity of the coherence directory (the ratio of usable entries to all entries) as HER is varied. We compare three schemes: Disable (unlimited disabling), ECC+Disable (unlimited disabling with ECC), and PERFECTORY+Disable (unlimited disabling with PERFECTORY). The result shows that PERFECTORY+Disable loses the smallest amount of effective directory capacity as HER is increased. At nearly three percent HER, PERFECTORY+Disable can still use more than 99.9 percent of the total capacity. Disable and ECC+Disable retain only 14.7 and 42.6 percent of the total capacity, respectively, in this heavy damage situation. The result indicates that PERFECTORY is capable of significantly improving a chip's lifetime reliability when augmented with a disabling capability.

4.2.3 Chip Area Overhead

PERFECTORY's hardware overhead has two parts: controller logic, shown in Fig. 6, and memory elements (i.e., a parity bit per directory entry). For the controller logic overhead, the state selector, the online detector, and the ECC-pointer encoder/decoder should be considered. Fig. 9 illustrates the area overhead for PERFECTORY, ECC without redundancy, and ECC with redundancy. ECC has a 12.5 percent fixed area overhead from the (72,64) SECDED code. Therefore, nearly 15 percent area overhead is required for ECC even without redundancy. In contrast, the area overhead for PERFECTORY is less than four percent for more than 64 cores and decreases as the number of cores grows.

Fig. 9. Area overhead. For the overhead of ECC and ECC+8, the memory for ECC and the encoder/decoder logic are considered; the (72,64) SECDED code is used for more than 64 cores. For PERFECTORY, the state selector, multiple ECC encoders/decoders, the online detector, and the one-bit parity are considered. To calculate the hardware implementation cost of the ECC encoder/decoder, Hsiao's SECDED code implementation is used [21]. Four PMOS and four NMOS transistors are assumed to calculate one XOR gate area [20]. CACTI [49] is used to estimate access latencies. The L2 cache model follows Table 1. An SRAM-only model is assumed for the directory memory. A 1.0 V supply voltage is assumed.

4.2.4 Performance Overhead

Our performance overhead evaluation looks at two aspects: low-level circuit latency and high-level program performance. In Fig. 6, there are two circuit paths between the directory memory and the cache controller: one for the E state and the other for the S state. The state selector of the PERFECTORY controller in Fig. 6 can be implemented with inverters. Either the E-state or the S-state path sees at most one inverter delay from the state selector. HSPICE simulation shows that it has less than 0.2 ns latency in 65 nm technology with a 1 V supply voltage. Smaller technologies have even less latency. The online detector has a much larger latency than the ECC-pointer encoder/decoder in Fig. 6. Therefore, if the combined latency of the online detector and the state selector circuit is less than the data cache access latency, PERFECTORY can operate without any circuit-level performance penalty. Table 2 compares PERFECTORY's circuit latency with the data cache access latency. A PERFECTORY operation is assumed to access the directory memory three times (i.e., read, write, and read). For all technologies, the PERFECTORY operation takes well less than the L2 cache access latency.

TABLE 2. PERFECTORY's Circuit Performance Overhead.
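As a back-of-the-envelope check on the memory portion of these numbers (the figure also counts controller logic, which this ignores), SECDED over 64-bit words costs a fixed 8/64 = 12.5 percent, while PERFECTORY's single parity bit per N-bit sharer field shrinks as the core count grows; a small sketch:

```python
def storage_overhead(num_cores):
    """Extra memory bits per directory entry as a fraction of the sharer field width."""
    secded = 8 / 64             # (72,64) SECDED: 8 check bits for every 64 data bits
    perfectory = 1 / num_cores  # one parity bit per N-bit full-map sharer field
    return secded, perfectory

for n in (64, 256, 1024):
    s, p = storage_overhead(n)
    print(f"{n:4d} cores: SECDED {s:.1%} vs. PERFECTORY parity {p:.2%}")
```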

Fig. 10. Homogeneous workload: (a) additional L2 misses, (b) additional invalidation messages, and (c) program slowdown. All results are normalized to the no-error configuration.

We compare three schemes in this section: Disable (unlimited disabling), ECC+Disable (ECC with unlimited disabling), and PERFECTORY+Disable (PERFECTORY with unlimited disabling). Figs. 10a, 10b, and 10c present the number of misses, the number of invalidation messages, and the program slowdowns for the homogeneous workload under the three schemes. Similarly, Fig. 11 gives the program slowdowns for the heterogeneous workload. While our heavy error scenario (10 percent HER) is unlikely in today's processor chips, Intel's Borkar has suggested the possibility of an extremely unreliable chip design environment in which 20 percent of all on-chip devices are unreliable [7]. Moreover, when a chip's supply voltage is lowered to save energy, a large fraction of its memory devices may become temporarily unreliable [51]. Therefore, our heavy error scenario serves as a meaningful limit study that evaluates the proposed and existing schemes in terms of their error resiliency under extreme conditions.

The additional L2 cache misses in Fig. 10a come from cache blocks disabled by hard errors. The Disable and ECC+Disable schemes have significantly more cache misses than PERFECTORY. PERFECTORY incurs negligible additional cache misses under the light and medium error conditions (0.05 and one percent HER). Under the heavy error condition (10 percent HER), PERFECTORY limits the additional misses to 62.6 percent, whereas the other two schemes incur more than 800 times the original misses.

The additional invalidation messages in Fig. 10b come from two sources: reduced cache capacity and speculative invalidations. Speculative invalidations cause PERFECTORY+Disable to generate more invalidation messages than the other two schemes. However, PERFECTORY has significantly fewer invalidation messages for OCEAN at one percent HER. In this case, the reduced cache capacity of the Disable and ECC+Disable schemes results in more cache blocks being replaced from the bottom of the LRU stack, incurring invalidation messages.

Fig. 11. Heterogeneous workload. Each program's slowdown is normalized to the no-error configuration.
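The capacity-driven invalidations above follow from ordinary directory bookkeeping. Assuming an inclusive shared L2 whose directory entries track remote sharers (the inclusion policy is our assumption, not stated here), evicting a block whose entry still records sharers forces one invalidation message per sharer, as the one-line sketch below shows:

```c
#include <stdint.h>

/* Invalidation messages caused by evicting one directory-tracked block:
   one message per core whose sharer bit is set in the entry. */
static inline int invalidations_on_eviction(uint64_t sharer_vector)
{
    return __builtin_popcountll(sharer_vector);   /* GCC/Clang builtin */
}
```

With fewer usable blocks, Disable and ECC+Disable evict more often, which is why their message counts rise even though they never invalidate speculatively.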

Interesting results are seen at 10 percent HER. Disable and ECC+Disable have almost no invalidation messages (indicated by 100 percent). When few directory blocks remain available under the heavy error condition, accesses miss in the L2 caches and go to the main memory; in this case, there are no invalidation messages because the directory no longer manages any coherence information. PERFECTORY, by contrast, generates many additional invalidation messages because it recovers faulty blocks and issues speculative invalidation messages.

The program slowdowns in Figs. 10c and 11 result from the additional L2 cache misses and invalidation messages. L2 misses increase not only the average L2 cache access time but also the network traffic, because memory access traffic consumes on-chip network bandwidth. They may also increase memory access time because of heavier contention at the memory ports. Invalidation messages likewise add to the on-chip network traffic and potentially increase network contention. Our results show that at 10 percent HER, PERFECTORY limits the program performance degradation to one percent for the homogeneous configuration and 16.9 percent for the heterogeneous configuration. The Disable and ECC+Disable schemes, however, suffer significant performance degradation: on average more than 10 times and 3.5 times for the homogeneous and heterogeneous configurations, respectively. The large number of additional cache misses and invalidation messages causes this degradation.

Even under the heavy error scenario, PERFECTORY was able to limit the overhead of the additional, speculative invalidations. This is because PERFECTORY generates speculative invalidation messages only when there are errors and the original sharer information is in discrepancy with the error pattern. Note also that invalidation messages are generated only on a state transition from Shared to another state, such as Exclusive or Invalid. Three of the four benchmarks (Barnes, LU, and Ocean) have few such transitions [13], which limits the performance degradation due to speculative invalidation messages. Finally, although the Disable and ECC+Disable schemes do not generate speculative invalidation messages, their reduced cache capacity leads not only to additional cache misses but also to additional invalidation messages; in the end, they suffer larger performance degradation than PERFECTORY+Disable.
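To illustrate why speculating is the safe direction, the fragment below bounds the sharer set of an entry with known faulty cells: a bit stuck at 1 may fake a sharer and a bit stuck at 0 may hide one, so on a Shared-to-Exclusive/Invalid transition a conservative directory invalidates every core that might be a sharer. This is a generic sketch of conservative invalidation under an assumed fault mask, not PERFECTORY's exact logic; as stated above, PERFECTORY's actual trigger is the discrepancy between the stored sharer information and the error pattern, and its recovery datapath is described in Section 3.

```c
#include <stdint.h>

/* stored:     sharer bit-vector as read from the (possibly faulty) entry
 * stuck_mask: bit positions assumed to be known faulty (e.g., flagged by
 *             an online detector)
 * Returns the cores to invalidate: every recorded sharer plus every core
 * whose bit is untrustworthy. Over-invalidating costs extra messages but
 * never correctness. (Illustrative sketch only.) */
static inline uint64_t speculative_invalidation_targets(uint64_t stored,
                                                        uint64_t stuck_mask)
{
    return stored | stuck_mask;
}
```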
5 CONCLUSION

The core count in CMPs is increasing fast, and so is the on-chip cache memory capacity. The on-chip coherence directory grows in size as the core count and the on-chip cache capacity increase. Considering that chips will suffer more frequent fault occurrences with future technologies, it becomes imperative to provide robust mechanisms that protect the on-chip coherence directory memory for reliable computing on next-generation CMPs.

This paper proposed PERFECTORY, a comprehensive, low-overhead microarchitectural scheme to detect and correct hardware errors that occur in the coherence directory memory. Our evaluation results show that PERFECTORY achieves very high chip yield for the directory memory: at 0.2 percent HER, the yield is 100 percent with PERFECTORY, whereas a 32-way row redundancy scheme's yield is virtually zero. Furthermore, PERFECTORY significantly improves the lifetime reliability of the directory memory. The MTTF of a 100-processor cluster configuration is 1,934 years with PERFECTORY at 0.05 percent HER and 1,000 FIT, whereas ECC's MTTF is only 36 days under the same conditions. Compared with existing yield improvement strategies and soft error protection mechanisms, PERFECTORY has a substantially smaller chip area cost and performance slowdown; it requires only one bit per directory entry. Finally, PERFECTORY shows less than 0.1 percent program performance degradation at 0.05 percent HER, whereas a state-of-the-art disabling scheme suffers four percent degradation.

ACKNOWLEDGMENTS

This work was supported in part by the US National Science Foundation (NSF) grants CCF-0811352, CCF-0811295, CNS-0720483, CCF-0702236, and CNS-0551492.

REFERENCES

[1] H. Ando et al., "Accelerated Testing of a 90nm SPARC64V Microprocessor for Neutron SER," Proc. Third Workshop System Effects on Logic Soft Errors, 2007.
[2] A. Agarwal et al., "An Evaluation of Directory Schemes for Cache Coherence," Proc. Int'l Symp. Computer Architecture (ISCA), 1988.
[3] A. Agarwal et al., "A Process-Tolerant Cache Architecture for Improved Yield in Nanoscale Technologies," IEEE Trans. VLSI Systems (TVLSI), vol. 13, no. 1, pp. 27-38, Jan. 2005.
[4] AMD Dual-Core/Quad-Core Processors, http://www.amd.com, 2009.
[5] D.K. Bhavsar, "An Algorithm for Row-Column Self-Repair of RAMs and Its Implementation in the Alpha 21264," Proc. Int'l Test Conf. (ITC), pp. 311-318, Sept. 1999.
[6] C. Argyrides et al., "Matrix Codes: Multiple Bit Upsets Tolerant Method for SRAM Memories," Proc. IEEE Int'l Symp. Defect and Fault-Tolerance in VLSI Systems (DFT), pp. 340-348, Sept. 2005.
[7] S. Borkar, "Microarchitecture and Design Challenges for Gigascale Integration," Proc. Int'l Symp. Microarchitecture (MICRO), keynote speech, Dec. 2004.
[8] S. Borkar et al., "Platform 2015: Intel Processor and Platform Evolution for the Next Decade," Technology@Intel Magazine, Mar. 2005.
[9] J. Chang and G.S. Sohi, "Cooperative Caching for Chip Multiprocessors," Proc. Int'l Symp. Computer Architecture (ISCA), 2006.
[10] J. Chang et al., "The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series," IEEE J. Solid-State Circuits (JSSC), vol. 42, no. 4, pp. 846-852, Apr. 2007.
[11] S. Cho et al., "TPTS: A Novel Framework for Very Fast Manycore Processor Architecture Simulation," Proc. Int'l Conf. Parallel Processing (ICPP), pp. 487-494, Sept. 2008.
[12] S. Cho and L. Jin, "Managing Distributed, Shared L2 Caches through OS-Level Page Allocation," Proc. Int'l Symp. Microarchitecture (MICRO), pp. 455-465, Dec. 2006.
[13] D.E. Culler and J.P. Singh, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
[14] J.R. Day, "A Fault-Driven, Comprehensive Redundancy Algorithm," IEEE Design and Test of Computers, vol. 2, no. 3, pp. 35-44, June 1985.
[15] J. Dorsey et al., "An Integrated Quad-Core Opteron Processor," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC), Feb. 2007.
[16] R. Fernandez-Pascual et al., "A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures," Proc. Int'l Symp. High Performance Computer Architecture (HPCA), Feb. 2007.
[17] K. Gharachorloo et al., "Architecture and Design of AlphaServer GS320," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), Dec. 2000.
[18] A. Gupta et al., "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," Proc. Int'l Conf. Parallel Processing (ICPP), pp. 312-321, Aug. 1990.

[19] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, fourth ed. Morgan Kaufmann, 2007.
[20] D.A. Hodges et al., Analysis and Design of Digital Integrated Circuits in Deep Submicron Technology, third ed. McGraw-Hill, 2003.
[21] M.Y. Hsiao, "A Class of Optimal Minimum Odd-Weight-Column SEC-DED Codes," IBM J. Research and Development, vol. 14, no. 4, pp. 395-401, June 1970.
[22] IBM, "IBM System p570 with New POWER6 Processor Increases Bandwidth and Capacity," IBM US Hardware Announcement 107-288, May 2007.
[23] D.V. James et al., "Distributed-Directory Scheme: Scalable Coherent Interface," Computer, vol. 23, no. 6, pp. 74-77, June 1990.
[24] J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J.C. Hoe, "Multi-Bit Error Tolerant Caches Using Two-Dimensional Error Coding," Proc. Int'l Symp. Microarchitecture (MICRO), Dec. 2007.
[25] P. Kongetira et al., "Niagara: A 32-Way Multithreaded SPARC Processor," IEEE Micro, vol. 25, no. 2, pp. 21-29, Mar./Apr. 2005.
[26] K. Krewell, "Best Servers of 2004: Where Multicore Is the Norm," Microprocessor Report, Jan. 2005.
[27] D. Lamet and J.F. Frenzel, "Defect-Tolerant Cache Memory Design," Proc. IEEE VLSI Test Symp. (VTS), Apr. 1993.
[28] J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," Proc. Int'l Symp. Computer Architecture (ISCA), June 1997.
[29] H. Lee, S. Cho, and B.R. Childers, "Performance of Graceful Degradation for Cache Faults," Proc. IEEE CS Symp. VLSI (ISVLSI), 2007.
[30] H. Lee, S. Cho, and B.R. Childers, "Exploring the Interplay of Yield, Area, and Performance in Processor Caches," Proc. Int'l Conf. Computer Design (ICCD), Oct. 2007.
[31] D. Lenoski and J. Laudon, "The Directory-Based Cache Coherency Protocol for the DASH Multiprocessor," Proc. Int'l Symp. Computer Architecture (ISCA), pp. 203-214, June 1990.
[32] A.S. Leon et al., "The UltraSPARC T1 Processor: CMT Reliability," Proc. IEEE Custom Integrated Circuits Conf. (CICC), Mar. 2006.
[33] Q. Li and S. Vlaovic, "Redundant Linked List Based Cache Coherence Protocol," Proc. IEEE Workshop Fault-Tolerant Parallel and Distributed Systems (FTPDS), pp. 43-50, June 1994.
[34] P. Mazumder, "An On-Chip Double-Bit Error-Correcting Code for Three-Dimensional Dynamic Random-Access Memory," Proc. IEEE Int'l Test Conf. (ITC), pp. 279-288, Sept. 1988.
[35] C. Morin et al., "An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures," IEEE Trans. Computers, vol. 49, no. 5, pp. 414-430, May 2000.
[36] M. Plakal et al., "Lamport Clocks: Verifying a Directory Cache-Coherence Protocol," Proc. Int'l Symp. Parallel Algorithms and Architectures (SPAA), June/July 1998.
[37] D.W. Plass and Y.H. Chan, "IBM POWER6 SRAM Arrays," IBM J. Research and Development, vol. 51, no. 6, pp. 747-756, Nov. 2007.
[38] A.F. Pour et al., "Performance Implications of Tolerating Cache Faults," IEEE Trans. Computers, vol. 42, no. 3, pp. 257-267, Mar. 1993.
[39] N. Quach, "High Availability and Reliability in the Itanium Processor," IEEE Micro, vol. 20, no. 5, pp. 61-69, Sept./Oct. 2000.
[40] SEMATECH, "Critical Reliability Challenges for ITRS," Technology Transfer #03024377A-TR, Mar. 2003.
[41] B. Sinharoy et al., "POWER5 System Microarchitecture," IBM J. Research and Development, vol. 49, nos. 4/5, pp. 505-521, July-Sept. 2005.
[42] C.W. Slayman, "Cache and Memory Error Detection, Correction, and Reduction Techniques for Terrestrial Servers and Workstations," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, pp. 397-404, Sept. 2005.
[43] G.S. Sohi, "Cache Memory Organization to Enhance the Yield of High-Performance VLSI Processors," IEEE Trans. Computers, vol. 38, no. 4, pp. 484-492, Apr. 1989.
[44] M. Spica and T. Mak, "Do We Need Anything More Than Single Bit Error Correction ECC?" Proc. Int'l Workshop Memory Technology, Design and Testing, 2004.
[45] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers, "The Impact of Technology Scaling on Lifetime Reliability," Proc. Int'l Conf. Dependable Systems and Networks (DSN), pp. 177-186, June 2004.
[46] B. Stackhouse et al., "A 65nm 2-Billion-Transistor Quad-Core Itanium Processor," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC), pp. 92-93, Feb. 2008.
[47] C.H. Stapper et al., "Synergistic Fault-Tolerance for Memory Chips," IEEE Trans. Computers, vol. 41, no. 9, pp. 1078-1087, Sept. 1992.
[48] P. Sweazey and A.J. Smith, "A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus," Proc. Int'l Symp. Computer Architecture (ISCA), May 1986.
[49] S. Thoziyoor et al., "CACTI 5.1," Technical Report HPL-2008-20, HP Laboratories, Apr. 2008.
[50] O.E. Theel and B.D. Fleisch, "A Dynamic Coherence Protocol for Distributed Shared Memory Enforcing High Data Availability at Low Costs," IEEE Trans. Parallel and Distributed Systems (TPDS), vol. 7, no. 9, pp. 915-930, Sept. 1996.
[51] C. Wilkerson et al., "Trading off Cache Capacity for Reliability to Enable Low Voltage Operation," Proc. Int'l Symp. Computer Architecture (ISCA), pp. 203-214, June 2008.
[52] S.C. Woo et al., "The SPLASH-2 Programs: Characterization and Methodological Considerations," Proc. Int'l Symp. Computer Architecture (ISCA), pp. 24-36, July 1995.
[53] M. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity While Hiding Wire Delay in Tiled Chip Multiprocessors," Proc. Int'l Symp. Computer Architecture (ISCA), June 2005.

Hyunjin Lee received the BS degree in electrical engineering from Seoul National University in 1999. He has been pursuing the PhD degree in computer science at the University of Pittsburgh since 2005. Prior to joining the University of Pittsburgh, he worked for the System LSI Division of Samsung Electronics Co., Giheung, Korea, and contributed to the development of DVD players. His research interests are in the area of computer architecture, with a particular focus on the performance, power, and reliability aspects of memory hierarchy design for next-generation multicore processors. He is a student member of the IEEE.

Sangyeun Cho received the BS degree in computer engineering from Seoul National University in 1994 and the PhD degree in computer science from the University of Minnesota in 2002. In 1999, he joined the System LSI Division of Samsung Electronics Co., Giheung, Korea, and contributed to the development of Samsung's flagship embedded processor core family, CalmRISC(TM). He was a lead architect of CalmRISC-32, a 32-bit microprocessor core, and designed its memory hierarchy, including caches, DMA, and stream buffers. Since 2004, he has been with the Computer Science Department at the University of Pittsburgh as an assistant professor. His research interests are in the area of computer architecture and embedded systems, with a particular focus on the performance, power, and reliability aspects of memory hierarchy design for next-generation multicore processors. He is a member of the IEEE.
Bruce R. Childers received the BS degree in computer science from the College of William and Mary in 1991 and the PhD degree in computer science from the University of Virginia in 2000. He is an associate professor in the Department of Computer Science at the University of Pittsburgh. His research interests include compilers and software development tools, dynamic translation and virtual machines, computer architecture, and embedded and real-time systems. Currently, he is researching dynamic translation for embedded systems, adaptive chip multiprocessor compilation and hardware architecture, and high-capacity, energy-efficient main memory architecture. He is a member of the IEEE and the IEEE Computer Society.