

1504 IEEE TRANSACTIONS ON COMPUTERS, VOL. 66, NO. 9, SEPTEMBER 2017

DRAM-Based Error Detection Method to Reduce the Post-Silicon Debug Time for Multiple Identical Cores

Hyunggoy Oh, Inhyuk Choi, and Sungho Kang, Senior Member, IEEE

Abstract: In the post-silicon debug of multicore designs, the debug time has increased significantly because the number of cores undergoing debug has increased; however the resources available to debug the design are limited. This paper proposes a new DRAM-based error detection method to overcome this challenge. The proposed method requires only three debug sessions even if multiple cores are present. The first debug session is used to detect the error intervals of each core using golden signatures. The second session is used to detect the error clock cycles in each core using a golden data stream. Instead of storing all of the golden data, the golden data stream is generated by selecting error-free debug data for each interval which are guaranteed by the first session. Finally, the error data in all cores are only captured during the third session. The experimental results on various debug cases show significant reductions in total debug time and the amount of DRAM usage compared to previous methods.

Index Terms: Multiple identical cores, DRAM-based debug method, MISR compaction, golden data stream, debug time

1 INTRODUCTION

RECENT developments in semiconductor technology have allowed for the integration of a large number of cores into a single system-on-chip (SoC) and the prevalence of multicore designs in modern integrated circuits. However, the demand for multicore features increases the difficulty of verifying or validating those components and the number of errors that escape the pre-silicon verification and manufacturing tests has increased. Consequently, the first silicon is rarely error-free and it is imperative to detect errors during the post-silicon debug stage in order to meet stringent time-to-market requirements.
The main goal of the post-silicon debug for multicore systems is to detect the errors rapidly in the first silicon in order to avoid the increased cost caused by a silicon respin. There are two types of errors: logical and electrical [1], [2], [3]. Logical errors are related to designer mistakes caused by the complexity of the design. On the other hand, electrical errors occur in certain electrical environments such as parasitic coupling noise, power supply noise, and crosstalk. Typically, debugging electrical errors is more challenging than debugging logical errors because it is difficult to predict and detect electrical errors during the pre-silicon verification [4], [5]. Because of these electrical errors, the post-silicon debug has become a bottleneck of the design implementation process. According to [1] and [6], engineering costs of the post-silicon debug consume up to 35 percent of total implementation time at 90 nm and more than 50 percent of overall design effort at 65 nm. To support a fast and precise design process, design-for-debug (DfD) architectures and the post-silicon debug methods have been introduced. The scan-based debug method has been a well-known technique for the post-silicon debug. To observe as many of the internal signals as possible, logic probing techniques that leverage the reuse of scan chains have been introduced [7], [8], [9], [10]. Scan chains are commonly used in manufacturing tests.

The authors are with the Department of Electrical and Electronics Engineering, Yonsei University, Seoul, Korea. {kyob508, ihchoi}@soc.yonsei.ac.kr, shkang@yonsei.ac.kr. Manuscript received 21 Nov. 2016; revised 22 Feb. 2017; accepted 28 Feb Date of publication 5 Mar. 2017; date of current version 15 Aug Recommended for acceptance by C. Metra. For information on obtaining reprints of this article, please send to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no /TC
Although scan-based methods can achieve high observability, they require circuit operations to be halted, which implies the internal state data are not acquired in real time but only at a single instant in time. Therefore, this technique is inadequate for analyzing the continuous functional behavior of the circuit. Furthermore, since difficult-to-detect errors may appear in any of the circuit states during thousands of clock cycles [11], it is not a desirable technique for the post-silicon debug. As a result, real-time signal tracing methods have been introduced to complement the scan-based method. The trace buffer-based debug method is commonly used to achieve real-time signal observation. This method requires an embedded logic analyzer (ELA) to manage the post-silicon debug. An ELA consists of a control unit, a trigger unit, a sample unit and an offload unit [1], [12]. The control unit monitors the trigger unit, sample unit and offload unit during the post-silicon debug. The trigger unit determines the start or end point for observing the circuit operation and the trace signals are captured via the sample unit which includes a trace buffer. Finally, the captured debug data are unloaded to the external workstation through the offload unit and the debug data are analyzed to detect the error via

the debug software. Since the real-time at-speed observation of trace signals is captured in the trace buffer, the trace buffer-based method is helpful for the post-silicon debug. However, the major challenge of the trace buffer-based method is the limited observability because the size of the trace buffer results in DfD hardware overhead. Furthermore, the debug time increases significantly. To overcome the limitations of the trace buffer-based method, the DRAM-based debug method has been introduced [17], [18]. For FPGA prototypes, a debug method using external DRAM has been researched in [17], and this method has greatly improved the observability of the FPGA prototype. In [18], a massive signal tracing method has been introduced that uses on-chip DRAM, which can be integrated in a SoC or accessed in through-silicon via (TSV)-based 3D-ICs. During a debug run, this method detects erroneous intervals using a multiple-input signature register (MISR) and stores the corresponding debug data dump in the DRAM through the trace buffer. Although this DRAM-based method overcomes the limitations of the trace buffer size, it requires many debug sessions or a large number of buffers in order to debug multicore designs because it does not consider the multicore debug case. Because a SoC or 3D-IC has a large number of cores, such as in multicore systems, an improved debug method for multicore designs is urgently needed in order to reduce the debug time. In this paper, a new DRAM-based error detection method for multiple identical cores is proposed to reduce the debug time and DRAM usage significantly.
Typically, multicore designs have evolved to include multiple identical cores because of the performance benefits associated with multiprocessing, and the attractive and economical options for redundant cores that guarantee highly reliable systems [19], [20], [21]. In addition, it is noted that the failures of identical cores can differ in the case of electrical errors. Hence, the proposed method focuses on detecting these electrical errors of multiple identical cores by exploiting this characteristic. The main contributions of this paper are as follows:
- A new DRAM-based debug method is proposed that detects error data in only three debug sessions. The first session detects error intervals, the second session detects error clock cycles and the third session captures only the error data.
- In order to support the second session, a golden data stream generation method is proposed that does not require storing all golden data, but instead exploits the fact that all CUDs are identical.
- New architectural features for a DRAM-based DfD using MISRs and comparators are proposed to perform the on-chip error interval/cycle detection process.
- Probability models are introduced to estimate the DRAM usage and debug time. With these models, the DfD designer can determine the size of the trace buffer and the number of CUDs with respect to the various debug cases.

The rest of the paper is organized as follows. Section 2 describes the related works and motivation of the proposed idea. Section 3 discusses the proposed debug framework and Section 4 describes the analysis of the DRAM-based method effectiveness with probability models. Section 5 provides the experimental results for various debug cases and finally, conclusions are presented in Section 6.

2 RELATED WORKS AND MOTIVATION

In order to debug multiple cores of a SoC, some DfD architectures exploiting scan-based or trace buffer-based debug methods have been introduced.
In [10], a low-cost SoC debug platform based on an on-chip test architecture has been proposed. This architecture supports multi-core debugging in a SoC with hardware breakpoint insertion and cycle-based run-stop debug steps. Because test architectures such as the test access mechanism (TAM), test bus, IEEE and/or 1500 test wrapper are reused in this debug platform, it is a cost-effective debug solution. Nonetheless, the run-stop debug approach has the limitation discussed before. In [12], a DfD architecture including distributed ELAs for the post-silicon debug of multicore designs in a SoC has been introduced. This architecture handles the issue of allocating the debug data of the multiple cores with a user-defined priority scheme because the resources of the trace buffers are limited. It is an effective solution when multiple cores are debugged and the priority of CUDs is required. However, the limited trace buffer observation is still a critical challenge to reducing the debug time. Therefore, several techniques for the trace buffer-based debug method have been introduced to improve the capacity of the trace buffer [13], [14], [15], [16]. In [13], an iterative error detection method using MISR signatures has been introduced to reduce the number of debug sessions during repeatable debug experiments. First, the entire target observation window is compacted and captured to the trace buffer with an MISR. After transferring the signatures to the external workstation, the erroneous intervals, i.e., error suspect windows, are detected by comparing the acquired signatures to the golden signatures. In the following debug session, the set of error suspect windows is compacted, and the method zooms into the error suspect windows in this way until the specific error clock cycles are detected.
To improve the quality of [13], an on-chip error detection method has been introduced in [14] that re-uses the empty area of the trace buffer to store pre-calculated golden signatures, and then compacts the debug data with a higher compaction ratio during error-suspect intervals. In [15] and [16], a 2-D compaction technique has been introduced to expand the observation window. This technique requires three debug sessions. The first session estimates the error rate using a parity generator. In the second session, the error suspect clock cycles are determined through a 2-D compaction using an MISR and a cycling register. Finally, in the third session, the erroneous debug data are selectively captured with pre-calculated tag bits. With this 3-pass methodology, the method can expand the observation window significantly. However, it does have some limitations. First, there is some probability that more debug sessions will be required because the first session, which estimates the error rate, is strongly dependent on the error distribution case. Furthermore, the ability of the 2-D

compaction process to detect the error clock cycles is increasingly inaccurate due to misidentification as the target observation window increases. Consequently, these methods are only suitable for short duration debug cases, or as supplements to the other debug methods in long duration debug cases. To overcome the limited capacity of the trace buffer, a DRAM-based debug method has been introduced in [18]. The key principle of this method is to transfer the debug data dump from the trace buffer into a larger on-chip DRAM. First, golden MISR signatures are stored in the DRAM using a trace port such as JTAG. After the debug run starts, the debug data dump during the specific interval is captured in the trace buffer and compacted by an MISR set. At the end of the interval, the interval is analyzed by comparing the MISR signature and the golden signature. If they are the same, this interval is error-free and the debug data for the next interval are captured in the trace buffer. If not, the captured debug data in the trace buffer are shifted to the shadow buffer and stored in the DRAM. With the above debug process, this DRAM-based debug method substantially reduces the debug time compared to the trace buffer-based method [13], [14], [15], [16]. However, this previous method focuses on debugging only one core, although many cores must be debugged in multicore designs such as a SoC or a 3-D IC that includes embedded DRAM (eDRAM) or 3D DRAM [22], [23]. Consequently, the previous method requires multiple debug sessions and a long debug time because the debugging process is performed sequentially. A simple example of debugging n cores is described in Fig. 1a.

Fig. 1. Examples of post-silicon debug process for n cores. (a) The previous method. (b) The proposed method.
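The sequential flow of the previous method can be sketched as follows. This is a minimal illustration, not the implementation of [18]: the function names are hypothetical, signatures are passed in as plain integers, and the assumption that each dumped interval costs M × L bits (buffer depth times observed signals, ignoring timestamps) is a simplification introduced here.

```python
from typing import Dict, List, Sequence, Tuple


def previous_method_session(signatures: Sequence[int],
                            golden: Sequence[int]) -> List[int]:
    """Return the intervals of one core whose signature differs from the
    golden signature, i.e. whose trace-buffer contents would go to DRAM."""
    return [i for i, (s, g) in enumerate(zip(signatures, golden)) if s != g]


def debug_cores_sequentially(per_core_signatures: Sequence[Sequence[int]],
                             golden: Sequence[int],
                             M: int, L: int) -> Tuple[int, Dict[int, List[int]], int]:
    """Debug one core per session, as in the sequential flow of Fig. 1a.

    Returns (number of sessions, dumped intervals per core, DRAM bits used).
    Assumption: every dumped interval occupies M * L bits.
    """
    sessions = 0
    dumped: Dict[int, List[int]] = {}
    bits = 0
    for core, sigs in enumerate(per_core_signatures):
        sessions += 1                      # one debug session per core
        bad = previous_method_session(sigs, golden)
        dumped[core] = bad
        bits += len(bad) * M * L           # whole interval is stored
    return sessions, dumped, bits
```

Under these assumptions the session count grows linearly with the number of cores, which is the cost the proposed method aims to avoid.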
In addition, the overhead associated with the hardware area and memory resources increases as the number of cores increases if multiple cores are to be simultaneously debugged in the previous method. In this paper, on a cycle-accurate deterministic debug environment, a new DRAM-based error detection method for multiple identical cores is proposed to reduce debug time significantly. The concept of the proposed method is described in Fig. 1b. Gray boxes and black lines indicate error intervals and error clock cycles. The main idea of the proposed method is to debug all cores not in sequence but at the same time. In the first session, error intervals for all cores are detected by golden MISR signatures. In the second session, error clock cycles for all cores are detected by the golden data stream. Instead of storing all golden data in DRAM, the golden data stream can be generated by selecting the debug data for each interval which are guaranteed as error-free in the first session. Finally, erroneous data for all cores are captured. Unlike the previous method, which stores an unnecessary debug interval data dump in the DRAM, the proposed DRAM-based method stores only the error clock cycle data, using an error detection method that requires only three debug sessions even if multiple cores are present. As a result, the proposed method significantly reduces the debug time and DRAM usage compared to the previous method when debugging multiple cores. It is important to note that a cycle-accurate deterministic debug phase is not an impractical assumption [13], [24],

[25], [26]. The post-silicon debug generally comprises two different phases: non-deterministic and deterministic. In the non-deterministic debug phase, bug occurrences cannot be reproduced because of asynchronous interfaces or interrupts from peripherals. In this phase, the main goal is to determine how to control the failure, and these techniques are introduced in [25], [26], [27], and [28]. When the failures are controllable, the debug environment can be cycle-accurate deterministic. In this deterministic phase, the main goal is to detect the root cause in terms of space and time information as quickly as possible [12], [13], [14], [15], [16]. Unlike the non-deterministic debug phase, the functional tests for very long debug cycles are performed repetitively in the deterministic debug phase, which results in a tremendous debug time overhead. Hence, the proposed method focuses on cycle-accurate deterministic debug environments in order to exploit the characteristic of identical cores and reduce the total debug time significantly.

3 PROPOSED DRAM-BASED DEBUG SCHEME

As discussed previously, the proposed method consists of three debug sessions. The three debug session flow is described in Fig. 2. In this section, each debug session is explained in detail. Then, considerations of the DRAM-based debug method are demonstrated. Finally, the hardware architecture for the proposed method is introduced.

Fig. 2. The three debug session flow of the proposed method.

TABLE 1
Notations for Debug Experiments

Name        Representation
L           Number of observed signals
M           Buffer depth and interval length in cycles
N           Length of the observation window in cycles
S           Timestamp length
n           Number of CUDs
EI index    Error interval index bit
EI tag      Error interval tag bits
EC index    Error clock cycle index bit
EC tag      Error clock cycle tag bits
GDS tag     Golden data stream tag bits
To aid in understanding, please refer to the notations presented in Table 1, which are similar to those used in [18].

3.1 Debug Session 1: Detecting Error Intervals

In the first debug session, the erroneous intervals of all CUDs are detected using an MISR-based compression technique, which has been widely used to identify failing intervals during the post-silicon debug process [13], [14], [15], [16], [18]. First, the golden MISR signatures are generated by simulating the behavioral model or by using an FPGA prototyping board. To capture the debug data in the third session, the length of the MISR is set to M cycles. Then, the pre-calculated golden signatures are uploaded to the DRAM via a serial interface (e.g., JTAG) or a high speed trace port [12]. When the debug process starts, the debug configuration sets the trigger event conditions and selects the debug data. In addition, a golden signature is loaded into the golden signature register from the DRAM. After the functional operation begins, the debug data from each CUD are compacted by the MISR over the course of M cycles, which is the same length as the golden signature. After M cycles, the signatures are compared to the golden one in order to determine whether the interval is erroneous. If the signature value of a core is the same as the golden signature, then the current interval of the core is error-free. If not, the current interval is an erroneous interval. To check the results of the interval error detection process of all cores, one bit is required per interval. If the bit is set to 1, this indicates that at least one core is erroneous during the corresponding interval. If the bit is set to 0, this indicates that all cores are error-free in the corresponding interval. This one bit is called the EI index in this paper. If the EI index is 1, n bits, which are referred to as EI tags, are captured in the EI tag register in order to check the results for each core in the interval.
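The interval check described above can be sketched in software as follows. This is only an illustrative model under stated assumptions: `misr_signature` is a toy 16-bit MISR with an arbitrarily chosen feedback polynomial, the golden signatures are computed from a golden trace rather than loaded from DRAM, and all function and parameter names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def misr_signature(words: Sequence[int], width: int = 16,
                   taps: int = 0xB400) -> int:
    """Compact a sequence of `width`-bit words into one signature.

    Toy MISR: a Galois LFSR step followed by XOR of the parallel input.
    The width and taps are assumptions chosen for illustration only.
    """
    mask = (1 << width) - 1
    state = 0
    for w in words:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps
        state = (state ^ w) & mask
    return state


def detect_error_intervals(core_traces: Sequence[Sequence[int]],
                           golden_trace: Sequence[int],
                           M: int) -> Tuple[List[int], Dict[int, List[int]]]:
    """Return (EI index per interval, EI tags for intervals with index 1)."""
    ei_index: List[int] = []
    ei_tags: Dict[int, List[int]] = {}
    for i in range(len(golden_trace) // M):
        lo, hi = i * M, (i + 1) * M
        golden_sig = misr_signature(golden_trace[lo:hi])
        # One tag bit per core: 1 means the signature mismatches the golden one.
        tags = [int(misr_signature(t[lo:hi]) != golden_sig) for t in core_traces]
        if any(tags):
            ei_index.append(1)
            ei_tags[i] = tags      # EI tags are captured only when needed
        else:
            ei_index.append(0)
    return ei_index, ei_tags
```

For example, with three cores and M = 5, a core whose trace differs from the golden trace in a single cycle of interval 1 would set EI index to 1 for that interval and set only its own EI tag bit.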
If the EI index is 0, then it is not necessary to capture the EI tag. After the capture process, the EI tags are stored in the DRAM before the next interval detection process ends. After that, the next golden signature is loaded and the debug data are analyzed during the next N cycles in this manner. After the first session has completed, the EI tag in the DRAM and the EI index in the register are transferred to the workstation (off-chip) and analyzed in order to generate tag bits for the second session. This on-chip process is described in Fig. 3.

Fig. 3. The on-chip error interval detection during the first debug session.

3.2 Debug Session 2: Detecting Error Clock Cycles

After the tag bits are transferred to the workstation, the off-chip debug process is performed by the debug software. It should be noted that the process of using the debug


software is not a burden with respect to the debug time because it runs concurrently with the on-chip debug experiments and with the transfer of debug data to and from the debug software. From the EI index and EI tag, an error interval matrix, which indicates the erroneous intervals of each core, can be generated. If n is 4 and the number of EI index bits is 10, the matrix size is 4 × 10 as described in Fig. 4. During the second debug session, golden data are required in order to analyze the debug data corresponding to each clock cycle. However, it is a tremendous burden to store all of the golden data in the DRAM as N or L increases. To solve this problem, a technique for generating the golden data stream during N cycles is introduced in this section. It exploits the fact that, because all cores are identical, the error-free interval data of one core can serve as the golden data against which the erroneous data of other cores are compared. That is, the golden data stream can be generated by selecting the debug data for each interval which are guaranteed as error-free in the first session. The golden data stream (GDS) selector is described in Fig. 5. To select the error-free data for each interval, the GDS tag is required. The algorithm for generating the GDS tag is described in Algorithm 1. The GDS tag consists of a GDS index and a core_sel. The GDS index determines how to select the golden data, and the core_sel is the set of selected cores, which are error-free. First, the cores that are error-free are identified in the core_sel when the EI index is not 0 (lines 3-9). Then, the algorithm determines the GDS index. If the GDS index is 0, this indicates that the golden data can be selected from the core_sel.

Fig. 4. The example of the off-chip process before the second debug session.

Fig. 5. The example of the golden data stream selector.
If the GDS index is 1, this means that all of the cores are erroneous during the interval, and the golden data should be selected from the DRAM. If the sum of the EI tags is the same as n, then the GDS index is 1. If not, the GDS index is 0 (lines 10-13). After determining the GDS index, the core_sel is selected according to the following rules. First, the core_sel only exists when the EI index is 1 and the GDS index is 0. Then, the number of cores in the core_sel is calculated, and the core which has the maximum number is selected. In the example of Fig. 4, intervals 1, 3, 4, 7 and 8 can select a core as the golden data. In this case, core 4 (core_sel = 11) is for intervals 1 and 3, and core 3 (core_sel = 10) is for intervals 4, 7 and 8.

Algorithm 1. Generating GDS tag
Input: EI index, EI tag, n, N, M
Output: GDS index, core_sel
 1  i = 0; k = 0;
 2  for each (EI index(i)) do
 3    for each (EI tag(k)) do
 4      if (EI index == 0) then
 5        ignore(core_sel);
 6      else
 7        if (EI tag(k) == 0) then
 8          store(core_sel, i, k);
 9    end
10    if (sum of EI tag(k) == n) then
11      GDS index = 1;
12    else
13      GDS index = 0;
14  end
15  while (all cases of core_sel) do
16    if ((EI index == 1) && (GDS index == 0)) then
17      calculate_core_number(core_sel, i, k)
18      max_samecore(core_sel, i, k)
19      select_core(core_sel, i, k);
20    else
21      ignore(core_sel);
22  end
23  return GDS tag, core_sel;

After generating the GDS tag, the second debug session is initiated in order to capture the information regarding the error clock cycles of each core. First, the pre-calculated tag bits are uploaded to the DRAM. If cases occur for which the GDS index is 1, then additional golden data are also uploaded, which requires more DRAM. The cases for which the GDS index is 1 are related to logical errors because the cores are identical. However, these logical errors occur infrequently because they can be detected in the pre-silicon verification step as previously discussed.
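One possible reading of the GDS tag generation can be sketched as follows. The sketch assumes that "the core which has the maximum number" means a greedy choice: repeatedly pick the core that is error-free in the largest number of still-unassigned intervals. That interpretation, the 0-based core and interval numbering, and the function name are assumptions made here, and the matrix in the usage note is hypothetical rather than the one in Fig. 4.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple


def generate_gds_tags(ei_matrix: Sequence[Sequence[int]]
                      ) -> Dict[int, Tuple[int, Optional[int]]]:
    """Map each erroneous interval to (GDS index, selected core).

    ei_matrix[c][i] is 1 when core c is erroneous in interval i.
    GDS index 1 means every core failed, so golden data must come from DRAM
    and no core is selected (None).
    """
    n = len(ei_matrix)
    n_intervals = len(ei_matrix[0])
    tags: Dict[int, Tuple[int, Optional[int]]] = {}
    pending: Dict[int, Set[int]] = {}   # interval -> error-free cores

    for i in range(n_intervals):
        column = [ei_matrix[c][i] for c in range(n)]
        if not any(column):
            continue                    # EI index is 0: interval is bypassed
        if sum(column) == n:
            tags[i] = (1, None)         # all cores erroneous
        else:
            pending[i] = {c for c in range(n) if column[c] == 0}

    # Greedy selection (assumed interpretation of the max-count rule).
    while pending:
        counts: Dict[int, int] = {}
        for cores in pending.values():
            for c in cores:
                counts[c] = counts.get(c, 0) + 1
        # Ties are broken in favor of the lowest core number.
        best = max(sorted(counts), key=lambda c: counts[c])
        for i in [i for i, cores in pending.items() if best in cores]:
            tags[i] = (0, best)
            del pending[i]
    return tags
```

With a hypothetical 4 × 8 error interval matrix in which core 3 is error-free in three of the erroneous intervals and core 2 in the remaining one, the sketch assigns core 3 to those three intervals and core 2 to the other, mirroring the kind of grouping shown in the Fig. 4 example.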
As a result, cases in which additional golden data are required occur infrequently in most practical debug cases. The usage of the DRAM is discussed in Sections 4 and 5. When the debug process starts, the debug configuration is performed and the tag bits are uploaded to each tag register. Any additional golden data are uploaded to the trace buffer and shadow buffer alternately because they should be compared to the debug data in a consecutive sequence of clock cycles. In addition, these


buffers are re-used to capture error data in debug session 3. After the trigger point, the EI index is used to detect the erroneous intervals and the EI tag selects the erroneous cores in real time. If the EI index is 0, this interval is bypassed. If not, the erroneous cores are compared to the golden data stream from the GDS selector, and the resulting bits of comparison, which are referred to as EC tags in this paper, are captured in the on-chip EC tag buffer during M cycles. In order to accommodate the worst case scenario in which all of the cores are erroneous, the size of the EC tag buffer is n × M. Since the capture process is also performed in a consecutive sequence of clock cycles, the shadow EC tag buffer is required. If the EC tag buffer is full, the data are shifted to the shadow buffer and stored in the DRAM. After that, the next tag bits are loaded from the DRAM into the registers, and the debug data are analyzed over the course of N cycles in this manner. After the second session is completed, the stored tag bits in the DRAM are transferred to the workstation and analyzed for the third session. This on-chip process is described in Fig. 6.

Fig. 6. The on-chip error clock cycle detection during the second debug session.

3.3 Debug Session 3: Capturing Error Data

After transferring the EC tags to the debug software, they are re-generated to detect the erroneous data of all cores in a sequence of clock cycles during the third session. First, the error cycle matrix is calculated using the EC tags and the error interval matrix. If the length of the MISR is 5, then a 4 × 5 error cycle matrix is generated for each interval. With this matrix, the EC index is generated, which indicates whether or not at least one core is erroneous in the corresponding clock cycle. First, if the EI index is 0, the interval is bypassed. If the EI index is 1, the required number of EC index bits is M.
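The EC index derivation described above can be illustrated with a short sketch. It assumes the error cycle matrix of each erroneous interval is available as an n × M array of 0/1 values (1 meaning the core is erroneous in that cycle); the function name and data layout are hypothetical.

```python
from typing import Dict, List, Sequence


def generate_ec_index(ei_index: Sequence[int],
                      error_cycle_matrices: Dict[int, Sequence[Sequence[int]]]
                      ) -> Dict[int, List[int]]:
    """Return, for every interval with EI index 1, its M EC index bits.

    error_cycle_matrices[i][c][t] is 1 when core c is erroneous at cycle t
    of interval i. Intervals with EI index 0 are bypassed.
    """
    ec_index: Dict[int, List[int]] = {}
    for i, flag in enumerate(ei_index):
        if flag == 0:
            continue                          # error-free interval: no bits
        matrix = error_cycle_matrices[i]
        n, M = len(matrix), len(matrix[0])
        # One bit per cycle: 1 if at least one core is erroneous in that cycle.
        ec_index[i] = [int(any(matrix[c][t] for c in range(n)))
                       for t in range(M)]
    return ec_index
```

For a 4 × 5 matrix in which two cores fail at cycle 2 and one core fails at cycle 4, the resulting EC index bits for that interval would be 0, 0, 1, 0, 1.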
If the EC index is 1, n EC tag bits are required, and they can be re-generated from the error cycle matrix. If the EC index is 0, the cycle is bypassed. In this way, the EC tag is regenerated. The process is illustrated in Fig. 7. After generating these tag bits, the final debug session is performed. The pre-calculated tag bits are uploaded to the DRAM and the debug configuration is performed. After the trigger point, the EI index is used to detect the erroneous intervals, the EC index detects the erroneous cycles, and the EC tag selects the erroneous cores in real time. The error data of all cores are captured in the trace buffer. When the trace buffer is full, the captured data are shifted to the shadow buffer and then stored in the DRAM. After that, the next tag bits are loaded from the DRAM and all error data are stored in the DRAM after the debug run. After the debug session has completed, the stored data are transferred to the workstation and then analyzed to find the root cause of the errors. This on-chip process is described in Fig. 8.

Fig. 7. The tag bit generation example for the third debug session.

Fig. 8. The on-chip error data capture during the third debug session.

3.4 Considerations of the DRAM-Based Debug Method

In the DRAM-based debug method, it is necessary to control the communication with DRAM during functional operation. Hence, some considerations are necessary in order to satisfy the requirements of the proposed method. First, the DRAM should be partitioned in the circuit design process in order to store the debug data. In Sections 4 and 5, the DRAM usage for the previous and proposed methods is analyzed using the probability models and experimental results. With this information, the designer can determine the size of DRAM required for the post-silicon debug

process with respect to the various debug cases. Second, sufficient bandwidth is required to shift the trace data from the shadow buffer to the DRAM at-speed. As discussed in [17], [18], this is a reasonable assumption because the post-silicon debug is performed as a real application, and there are available memory resources. In addition, unlike in the previous work, the advantage of the proposed method is that the requirement for memory resources remains the same even as n increases. Third, the DRAM access operations from the DfD module need to be scheduled in order to avoid interfering with the debug programs. In addition, the memory access latency during each interval should be within M cycles because the debug process is performed periodically every M cycles. As a result, a sufficient trace buffer size is required to perform the DRAM-based debug method. To show how to communicate with the DRAM during the three debug sessions, the operations of the proposed DfD modules are described in Fig. 9. In the first session, each interval is compacted as an MISR signature, compared to the golden signature and then captured in the EI index and EI tag registers. The EI index is captured in the register and then offloaded to the workstation after the debug session because the data volume is small. The captured EI tag (n bits) is stored to the DRAM and the next golden signature (L bits) is loaded from the DRAM. This is described in Fig. 9a. In the second session, the debug data from the erroneous cores during the erroneous interval are compared to the golden data stream and captured in the EC tag buffer. This is described in Fig. 9b. Because the capture process is performed during M cycles, the shadow EC tag is required.

Fig. 9. Operations of the proposed DfD modules during three debug sessions. (a) The first session. (b) The second session. (c) The third session.
The dotted line means the process is performed conditionally. If the EI index is 0, it is not necessary to perform the error clock cycle detection. If the EI index is 1, the EC tag is selectively captured by the EI tag. After the EC tag buffer is full, the tag bits are shifted to the shadow one. Typically, the number of captured tag bits is different every M cycles. However, it is possible to predict when the EC tag will be full based on the EI tag and EI index, which are calculated before the second session. Consequently, the DfD controller can determine the shift timing. After that, the data are stored in the DRAM. If additional golden data are required for the next interval, they are loaded into the trace buffer and shadow buffer alternately. This mechanism handles the case where the additional golden data are required in consecutive intervals. After that, the EI tag is loaded if the EI index is 1. Finally, the GDS tag is loaded for the next interval. In the third session, the error data of all cores are captured in the trace buffer. For the above-mentioned reasons, the shadow buffer is required. As in the second session, the timing when the trace buffer is full is controlled by the EI index, EC index and EC tag. After the trace buffer fills up, the captured data are shifted to the shadow buffer and stored in the DRAM. Finally, for the next interval, the EC index and EC tag are loaded into each register if the EI index is 1. Then the EC tag is compared to the debug data when the EC index is 1. They are also loaded alternately in order to prevent the required data from overlapping. This is described in Fig. 9c. Because the store and load operations are irregular in the second and third sessions, a preserved time area is required and can be determined by the tag bits as discussed above. In addition, there is a limitation for n in order to satisfy the operation in the third session.
In the worst case where all cores are erroneous in every cycle, the minimum preserved time is M/n cycles. Hence, the maximum value of n is bounded by M divided by the write access latency expressed in clock cycles. For example, if the average access latency of the DRAM is 25 ns and the circuit is operating at a 1 GHz clock frequency, the latency is 25 cycles and the maximum value of n is 20 when M is 512. In order to adapt the proposed method to a practical multi-core debug case, this limitation should be considered in the design process.

3.5 Hardware Architecture of the Proposed DfD

The hardware architecture of the proposed method is illustrated in Fig. 10. The debug configuration module is controlled through the trace port, e.g., JTAG, during the configuration step. This module controls the trigger point of the debug process and selects the CUDs and debug data. In addition, the MISR settings and the golden signature register (GS) are configured during the first session, and the tag bit registers are controlled for the second and third sessions. It should be noted that these buffers and tag bit registers are re-used across the three debug sessions in order to reduce the hardware area overhead. After the start of the debug process, a finite state machine (FSM) controls the debug modules and communications with the DRAM. In session 1, the FSM uses the interval counter to control the timing for capturing the EI tag and communicating with the DRAM controller. In session 2, the FSM controls the GDS selector using the GDS tag in order to generate the golden data stream. If additional golden data are required, the FSM selects the trace buffer and shadow buffer in turns in order to load the data from the DRAM. In addition, the FSM determines the timing for capturing the EC tag and shifting it to the shadow EC tag. The time at which the EC tag becomes full can be calculated from the EI tag before the second session, and is configured in the debug configuration step. After shifting the data to the shadow EC tag, the FSM supports the operation for communicating with the DRAM controller as explained in Section 3.4. In session 3, the FSM controls the timing for capturing the erroneous data in the trace buffer, shifting it to the shadow buffer, and communicating with the DRAM. To satisfy the operations of sessions 2 and 3, the preserved time area is controlled by the FSM and the interval counter. In order to store the data in the DRAM, an adapter is added in front of the shadow EC index and EC tag and the shadow buffer. With this adapter, the debug data can be transferred to the memory interface even though the frequency of the interface is different from that of the CUD.

Fig. 10. Hardware architecture of the proposed DfD for three debug sessions.

4 ANALYSIS OF THE DRAM-BASED METHOD EFFECTIVENESS WITH PROBABILITY MODELS

In this section, probability models are introduced to estimate the DRAM usage and debug times for both the previous and proposed methods. These models help the DfD designer to assess various debug strategies and to decide on the DfD module design. The notations used for the models are presented in Table 2. To make the comparison with the previous method easier, some variables that are used in [18] are re-used in this paper. In this case, $P_i$ and the expectation of $X_i$ ($E[X_i]$) are described as follows:

$$P_i = 1 - (1 - p_i)^M \tag{1}$$

$$E[X_i] = P_i \frac{N}{M}. \tag{2}$$

Since the previous method only stores the debug data for the error intervals, the DRAM usage of the ith session in the previous method is calculated as:

$$DU_{\mathrm{prev},i} = E[X_i](S + LM) + \frac{LN}{M}, \tag{3}$$

where S is the size of the time stamp which identifies the corresponding error interval, and LN/M is the data volume of the golden MISR signatures stored in the DRAM.

TABLE 2
Notations for Probability Models

Name               Representation
p_i                Error clock cycle probability of the ith core
P_i                Error interval probability of the ith core
X_i                Random variable: number of erroneous intervals of the ith core
Y                  Random variable: number of cases in which all cores are erroneous during the same interval
Z                  Random variable: number of cases in which at least one error interval of all cores exists during the same interval
Z'                 Random variable: number of cases in which at least one error clock cycle of all cores exists at the same clock cycle
DU_prev(prop),i    DRAM usage of the ith session
T_prev(prop)       Debug time

In the proposed method, each session requires a different DRAM usage. To calculate the DRAM usage for the proposed method, $E[Y]$, $E[Z]$, and $E[Z']$ are required. They are described as

$$E[Y] = \frac{N}{M} \prod_{i=1}^{n} P_i \tag{4}$$

$$E[Z] = \frac{N}{M} \left( 1 - \prod_{i=1}^{n} (1 - P_i) \right) \tag{5}$$

$$E[Z'] = N \left( 1 - \prod_{i=1}^{n} (1 - p_i) \right). \tag{6}$$

In the first session, golden MISR signatures are required and the EI index and EI tag are stored in the DRAM every M cycles. Consequently, the DRAM usage for the first session can be described as

$$DU_{\mathrm{prop},1} = \frac{LN}{M} + \frac{N}{M} + nE[Z], \tag{7}$$

where the EI index is N/M and the EI tag is $nE[Z]$. In the second session, the GDS tag, the additional golden data, the EI index and the EI tag are uploaded to the DRAM in order to detect the erroneous clock cycles. The GDS tag and additional golden data can be described as

$$\text{GDS tag} = \frac{N}{M} + \log_2 n \, \bigl( E[Z] - E[Y] \bigr) \tag{8}$$

$$\text{additional golden data} = LM\,E[Y]. \tag{9}$$

The sum of the GDS tag, additional golden data, EI tag, and EI index is the total uploaded data volume before starting the second session. After the debug process starts, the EC tag is stored in the DRAM every M cycles. Hence, the DRAM usage for the second session is described as

$$DU_{\mathrm{prop},2} = \text{upload data volume} + M\,E[Z]\,(nP_i), \tag{10}$$

where $M\,E[Z]\,(nP_i)$ is the captured EC tag volume during the second session. To simplify the calculations, it is assumed that the error occurrence distribution of all cores follows a binomial distribution during the interval, and each $P_i$ is the same. In the third session, the EI index, EC index and EC tag are uploaded to the DRAM in order to detect the erroneous data of all cores. As discussed in Section 3.3, the EC tag is regenerated. After the debug process starts, the error data of all cores are stored in the DRAM every M cycles. As a result, the DRAM usage for the third session is calculated as

$$DU_{\mathrm{prop},3} = \frac{N}{M} + M\,E[Z] + nE[Z'] + \sum_{i=1}^{n} LNp_i. \tag{11}$$

As discussed in Section 3.4, a preserved area of the DRAM is required to perform the debug process, and the area should be larger than the maximum DRAM usage across all sessions. That is, the DRAM usage for the previous and proposed methods is described as

$$DU_{\mathrm{prev}} = \mathrm{MAX}\bigl( DU_{\mathrm{prev},1}, DU_{\mathrm{prev},2}, \ldots, DU_{\mathrm{prev},n} \bigr) \tag{12}$$

$$DU_{\mathrm{prop}} = \mathrm{MAX}\bigl( DU_{\mathrm{prop},1}, DU_{\mathrm{prop},2}, DU_{\mathrm{prop},3} \bigr). \tag{13}$$

To calculate the debug execution time, both the on-chip sampling time and the communication time are required. The on-chip sampling time is related to the number of clock cycles that elapse from the trigger point until the debug session ends. In the previous method, the total number of on-chip sampling clock cycles is nN. However, the proposed method requires only 3N. In addition, if all cores are error-free, only N cycles are needed because only the first session is required. That is, the proposed method reduces debug time significantly with respect to the on-chip sampling time. The communication time is the time during which the debug data are uploaded and offloaded through the trace port.
In the previous method, the golden MISR signatures are uploaded once, and the erroneous data and corresponding S of each core are offloaded every session. On the other hand, in the proposed method it is necessary to transfer the DRAM contents of all three sessions. Furthermore, it is only necessary to transfer the DRAM contents for session 1 if all cores are error-free. $T_{\mathrm{prev(prop)}}$ can be calculated as follows:

$$T_{\mathrm{prev}} = \frac{nN}{f_{\mathrm{CUD}}} + \frac{\frac{LN}{M} + \sum_{i=1}^{n} E[X_i](S + LM)}{f_{\mathrm{trace\ port}}} \tag{14}$$

$$T_{\mathrm{prop}} = \begin{cases} \dfrac{N}{f_{\mathrm{CUD}}} + \dfrac{DU_{\mathrm{prop},1}}{f_{\mathrm{trace\ port}}} & (E[Z'] = 0) \\[2ex] \dfrac{3N}{f_{\mathrm{CUD}}} + \dfrac{\sum_{i=1}^{3} DU_{\mathrm{prop},i}}{f_{\mathrm{trace\ port}}} & (\text{otherwise}). \end{cases} \tag{15}$$

Fig. 11 shows the expected results from calculating the probability models when N is 2M cycles, L is 32, M is 512 and the ratio of $f_{\mathrm{CUD}}$ to $f_{\mathrm{trace\ port}}$ (α) is 10. To simplify estimation of the expected results, it is assumed that all $p_i$ are the same. First, Fig. 11a shows the progress of the DRAM usage in the proposed method for various p when n = 16. In the very low error rate area, $DU_{\mathrm{prop},1}$ is the largest. As p increases, $DU_{\mathrm{prop},2}$ and $DU_{\mathrm{prop},3}$ also increase, and the $DU_{\mathrm{prop},3}$ volume is the largest because the amount of stored error data increases. As p increases further, $DU_{\mathrm{prop},2}$ exceeds $DU_{\mathrm{prop},3}$. This is because the amount of the additional golden data required increases exponentially as p increases. Even though this increases the DRAM usage overhead, it can be mitigated in practical debug cases. For example, this additional golden data volume is 0 as long as one core is error-free. This is demonstrated through various debug cases in Section 6. The DRAM usage ratio and debug time improvement ratio are described for various n and p in Figs. 11b and 11c. The DRAM usage ratio indicates $DU_{\mathrm{prop}} / DU_{\mathrm{prev}}$ and the time improvement is calculated as $(T_{\mathrm{prev}} - T_{\mathrm{prop}}) / T_{\mathrm{prev}} \times 100$. In most cases of p, the DRAM usage of the proposed method is less than that of the previous one because the erroneous data are selectively captured with small tag bits in the proposed method.
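The per-session DRAM usage of Equations (7), (10) and (11) can be evaluated numerically. The following is a minimal sketch under several assumptions that are not stated in the paper: all cores share the same p, sizes are counted in bits, the log2(n) factor in Equation (8) multiplies the difference E[Z] - E[Y], and "2M cycles" is read as two million cycles. The function name is hypothetical.

```python
import math


def dram_usage_proposed(p, n, N, M, L):
    """Return (DU_prop1, DU_prop2, DU_prop3) in bits for an identical per-cycle error rate p."""
    intervals = N / M
    P = 1.0 - (1.0 - p) ** M                          # Eq. (1): error interval probability
    e_y = intervals * P ** n                          # Eq. (4): all cores erroneous
    e_z = intervals * (1.0 - (1.0 - P) ** n)          # Eq. (5): at least one core erroneous
    e_z_cycle = N * (1.0 - (1.0 - p) ** n)            # Eq. (6): same, per clock cycle
    ei_index = intervals
    ei_tag = n * e_z
    du1 = L * intervals + ei_index + ei_tag           # Eq. (7)
    gds_tag = intervals + math.log2(n) * (e_z - e_y)  # Eq. (8), product reading (assumption)
    extra_golden = L * M * e_y                        # Eq. (9)
    upload = gds_tag + extra_golden + ei_tag + ei_index
    du2 = upload + M * e_z * (n * P)                  # Eq. (10)
    du3 = intervals + M * e_z + n * e_z_cycle + n * L * N * p   # Eq. (11)
    return du1, du2, du3


if __name__ == "__main__":
    N = 2_000_000  # assumption: "2M cycles" read as two million cycles
    for p in (1e-6, 1e-4, 1e-2):
        du = dram_usage_proposed(p, n=16, N=N, M=512, L=32)
        # Eq. (13): the preserved DRAM area must cover the largest session.
        print(p, [round(x) for x in du], "max:", round(max(du)))
```

Sweeping p in this way shows which session dominates the preserved DRAM area for a given n, which is the quantity plotted in Fig. 11a.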
After p becomes higher than 0.1 percent, the DRAM usage ratio for n = 4 exceeds that for n = 8 or 16. This is because the additional golden data volume increases as $E[Y]$ increases when n is small. In the high error rate area (around 1 percent), the DRAM usage ratio increases and reaches about two when n is 32. In addition, the ratio increases further as n increases. For example, it is 2.7 when n is 64 and 4.6 when n is 128. However, this is a reasonable growth in view of the increased n. In addition, it should be noted that this memory is a part of the on-chip DRAM and does not require any additional hardware overhead. Consequently, the trade-off between n and the DRAM usage can be determined by the DfD designer during the design process.

The debug time improvement ratio is strongly related to n. First, the improvement ratio is constant in the error-free zone because the proposed method requires only one session and the required data volume is deterministic. When n is 4 and p is very low, the improvement ratio is approximately 20 to 40 percent. However, the ratio increases as p increases because the debug time overhead in the proposed method increases slowly when compared to the previous method. Then, the ratio decreases when p is high. This is because additional golden data are more often required as p increases. As n increases, the baseline of the improvement ratio increases and the range of fluctuation decreases. For example, the improvement ratio is always more than 90 percent when n is 32, as described in Fig. 11c.

Fig. 11d shows the optimal values of M with respect to the debug time. These are obtained by setting $\partial T_{\mathrm{prop}} / \partial M$ to zero and restricting M to 256, 512, 1,024 or 2,048 in order to simplify the implementation. When p is low, the debug time is the shortest when M is 2,048. This is because M is the interval length of the error detection, the error data are virtually zero and the intervals are almost error-free.
As p increases, a shorter interval can localize the error data more finely within the observation window. That is, a small trace buffer size yields a greater debug time improvement as p increases. However, M is limited in the DRAM-based debug method, as discussed in Section 3.4. Therefore, the designer should consider this trade-off to optimize debug time, hardware area overhead and memory resources.
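The debug time models of Equations (14) and (15) and the choice of M can be explored with a small script. This is a sketch under assumptions rather than the authors' procedure: all cores share the same p, one bit is transferred per trace-port cycle, the time stamp size S is a free parameter, the log2(n) factor in Equation (8) is read as a multiplier, and the best M is found by a grid search over the four candidate values instead of by setting the derivative to zero. All function names are hypothetical.

```python
import math


def t_prev(p, n, N, M, L, S, f_cud, f_trace):
    """Eq. (14): n sequential sessions plus golden signatures and error-interval data."""
    P = 1.0 - (1.0 - p) ** M                           # Eq. (1)
    e_x = P * N / M                                    # Eq. (2)
    return n * N / f_cud + (L * N / M + n * e_x * (S + L * M)) / f_trace


def t_prop(p, n, N, M, L, f_cud, f_trace):
    """Eq. (15): one session if no error cycle is expected, otherwise three."""
    k = N / M
    P = 1.0 - (1.0 - p) ** M
    e_y = k * P ** n                                   # Eq. (4)
    e_z = k * (1.0 - (1.0 - P) ** n)                   # Eq. (5)
    e_zc = N * (1.0 - (1.0 - p) ** n)                  # Eq. (6)
    du1 = L * k + k + n * e_z                          # Eq. (7)
    if e_zc == 0:
        return N / f_cud + du1 / f_trace
    upload = (k + math.log2(n) * (e_z - e_y)) + L * M * e_y + n * e_z + k
    du2 = upload + M * e_z * n * P                     # Eq. (10)
    du3 = k + M * e_z + n * e_zc + n * L * N * p       # Eq. (11)
    return 3 * N / f_cud + (du1 + du2 + du3) / f_trace


def improvement_percent(p, n, N, M, L, S, f_cud, f_trace):
    """(T_prev - T_prop) / T_prev x 100."""
    tp = t_prev(p, n, N, M, L, S, f_cud, f_trace)
    return (tp - t_prop(p, n, N, M, L, f_cud, f_trace)) / tp * 100.0


def best_interval(p, n, N, L, f_cud, f_trace, candidates=(256, 512, 1024, 2048)):
    """Candidate M with the smallest modelled T_prop (grid search, an assumption)."""
    return min(candidates, key=lambda M: t_prop(p, n, N, M, L, f_cud, f_trace))


if __name__ == "__main__":
    # f_cud / f_trace = 10 mirrors the ratio used for Fig. 11; S = 16 bits is an assumed value.
    for p in (0.0, 1e-5, 1e-3):
        print(p, improvement_percent(p, 8, 2_000_000, 512, 32, 16, 10.0, 1.0),
              best_interval(p, 8, 2_000_000, 32, 10.0, 1.0))
```

Because the grid search only compares the four allowed interval lengths, it sidesteps the non-smooth switch between the one-session and three-session cases in Equation (15).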

Fig. 11. Expected results by calculating the probability models with N = 2M cycles, L = 32, M = 512, α = 10. (a) DRAM usage with n = 16. (b) DRAM usage ratio with various n. (c) Debug time improvement ratio with various n. (d) Optimal M with various n.

5 EXPERIMENTAL RESULTS

This section discusses the experimental results with respect to the DRAM usage, debug time and hardware area overhead in order to illustrate the benefits of the proposed method in multicore debug cases. The experimental results are presented for an ARM-based processor design [29] and CPU cores in an OpenSPARC T2 [21]. First, each debug module is designed as a Verilog RTL model and synthesized using the TSMC 130 nm standard cell library [30] to estimate the area. To perform the DRAM-based debug methods, the DRAM is modeled as a Verilog module as in [18]. Faults are randomly injected into the circuits to produce misbehavior according to the various error rates. A 32-bit data bus is used in the ARM-based design and the CPU core in the OpenSPARC T2 uses a 64-bit data bus. The data bus is assumed to be the debug data observed through the DRAM in order to compare the performance of the previous and proposed methods.

5.1 DRAM Usage and Debug Time

Table 3 shows the DRAM usage and debug time of [18] and the proposed method for the debug experiment in which N is 2M cycles, M is 512 and α is 10, with different numbers of CUDs and error rates. The error rates are presented in the second column, which indicates how many errors are injected in the experiments. In addition, the error rates of each core can be different although the cores are identical. This is because electrical errors occur in certain electrical environments. To model this, a Gaussian distribution is used to generate the various error rates in each core.
The standard deviation (σ) is related to the process variation of the multicores, and the error rate of each core changes with σ. In this paper, σ is set to …. The third and fourth columns show the comparisons between the previous method [18] and the proposed method. The notation Seq indicates the previous method [18] in which each core is debugged in sequence. In addition, Equations (12), (13), (14), and (15) are used to calculate the DRAM usage and debug time for the experimental results. As shown in Table 3, the proposed method reduces debug time significantly compared to [18], because it reduces the on-chip sampling time as well as the communication time. When the error rate is low, the debug time is strongly related to the sampling time. In this case, the debug time of the previous and proposed methods is relatively small and simply proportional to the number of debug sessions and CUDs. However, the communication time significantly affects the debug time as the error rate increases. This is because the amount of stored debug data increases and the frequency of the trace port, which transfers the debug data from the CUD to the external workstation, is relatively slow. In this case, the proposed method has an advantage in communication time over the previous method because it only requires three sessions and the amount of the required data is significantly reduced through the tag bits (e.g., EI index, EI tag, EC index, EC tag). Furthermore, the required data of the proposed method are much less than those of the previous one as n increases. Hence, the total debug data volume for the communication between the CUDs and the workstation is much lower than in the previous method, and the debug time is correspondingly reduced compared to the previous work. The DRAM usage in the proposed method is always less than that of the previous method except in the debug case where the error rate is only … percent. As discussed in Section 4, the maximum DRAM usage ratio is just two when the error rate is very high and n is 32, and this is a reasonable increment considering the reduction in the debug time.

TABLE 3
DRAM Usage and Debug Time Comparison for Different Number of CUDs and Error Rate
Columns: Number of CUDs (n); Error rate (%); DRAM usage (MByte): Seq [18], Prop; Debug time (M cycles): Seq [18], Prop
Row groups: ARM-based design [29]; OpenSPARC T2 [21]

Fig. 12. Debug time improvement ratio with different number of CUDs and error rate.

Fig. 12 shows the debug time improvement ratio with different numbers of CUDs and error rates for the debug experiment in which N = 2M cycles and M = 512, with a uniform error rate in the ARM-based design. When n is 4, the improvement ratio decreases as p increases.
This is because the proposed method basically requires three sessions. That is, the time benefit of the proposed method becomes greater as the number of CUDs increases. Fig. 13 shows the experimental results of the debug time with different values of M for the debug cases where N is 2M cycles, α is 10, and the error rate is 0.01 percent in the ARM-based design. As discussed in Section 4, the DRAM usage and debug time are the smallest when M is 256. However, the prerequisite of the DRAM-based debug method is that M should be larger than the memory access latency in order to communicate with the DRAM. In addition, the write access latency should be within M/n to handle the worst case in the three sessions of the proposed method. That is, the number of CUDs can be determined by the trade-off between the memory resources and the trace buffer size. Although the proposed method has a limitation on n when compared to the previous method, the debug time is still significantly reduced, as described in Fig. 13.

Fig. 13. The experimental results of debug time with different trace buffer depth.

Fig. 14. The results with different debug method and error rate. (a) DRAM usage. (b) Debug time.

Fig. 14 illustrates the results of the DRAM usage and debug time with different error rates for the debug cases where N is 2M cycles, M is 512, n is 8, and α is 10. In this figure, the notation Multi indicates the method where multiple cores are debugged at the same time in the previous method [18]. In order to debug n cores at the same time in the previous method, 2n trace buffers are required. Furthermore, the required bandwidth for communicating with the DRAM increases. Nonetheless, it is assumed that these limitations are acceptable in this simulation. As described in Fig. 14a, the DRAM usage increment of Multi is larger than that of both Seq and the proposed method because the error-suspect data of n cores during the intervals are stored in the DRAM at the same time. For example, if all intervals of all cores are erroneous, the stored data volume is nLM. On the other hand, the DRAM usage of the proposed method is less than that of Multi because only the error data are stored in the DRAM over the three sessions. With respect to the debug time, the required debug time of Multi is less than that of Seq and the proposed method when the error rate is very low. This is because the on-chip sampling time is more important than the communication time when the error rate is very low. However, the communication time dominates as the error rate increases. As a result, the proposed method can reduce the debug time much more than Multi as the error rate increases. This is described in Fig. 14b.

TABLE 4
Hardware Area Overhead Comparison
Hardware area in two-input NAND equivalents. Columns: n; CUD: spc(s); DfD modules: Seq [18], Multi [18], Prop

Seq [18]    Multi [18]    Prop
(0.30%)     (1.34%)       (1.49%)
(0.16%)     (1.47%)       (1.53%)
(0.08%)     (1.47%)       (1.64%)
(0.05%)     (1.54%)       (1.75%)
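The on-chip buffer cost of the three approaches can be compared with a short calculation. The sketch assumes the sizes stated in this paper: each trace buffer holds LM bits, Seq needs 2 trace buffers, Multi needs 2n, and the proposed method needs 2(L + n + 1)M bits as given in Section 5.2. The function name is illustrative.

```python
def onchip_buffer_bits(n: int, L: int, M: int) -> dict:
    """On-chip buffer size in bits for each method (hypothetical helper)."""
    trace = L * M                      # one trace buffer: L bits wide, M entries deep
    return {
        "Seq": 2 * trace,              # trace buffer plus its shadow
        "Multi": 2 * n * trace,        # 2n trace buffers
        "Prop": 2 * (L + n + 1) * M,   # trace (LM) + EC tag (nM) + EC index (M), each doubled
    }


print(onchip_buffer_bits(n=8, L=32, M=512))
# {'Seq': 32768, 'Multi': 262144, 'Prop': 41984}
```

For n = 8, L = 32 and M = 512, the proposed method needs about 1.3 times the buffer bits of Seq, whereas Multi needs 8 times as many.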
5.2 Hardware Area Overhead

Table 4 compares the hardware area of the debug modules of Seq, Multi and the proposed method with different n in terms of two-input NAND (NAND2) equivalents. They are designed in RTL code and synthesized using the TSMC 130 nm standard cell library [30]. The results refer only to the logic area and do not account for the on-chip buffers. The spc is the SPARC processor core module and it is used to analyze the hardware overhead in a real multicore system. In the case of Seq, an MISR set, an adapter, a golden signature register, counters and the control logic are required to perform the debug experiment. In addition, the hardware overhead remains almost the same as n increases because the cores are debugged in sequence. On the other hand, Multi and the proposed method require much more hardware overhead than Seq does because multiple cores are debugged at the same time. The proposed debug module consists of n MISR sets, a comparator, an adapter, a golden signature register, the EI tag, EI index, GDS tag and the control logic with the FSM. Because of these additional debug modules and the more complicated control logic, the area overhead is larger than that of Seq. However, it is only slightly larger than that of Multi because Multi also requires more hardware modules than Seq does. Furthermore, the hardware overhead of the proposed method is about 1.75 percent when there are 32 cores. This result indicates that the hardware area overhead of the proposed method is negligible compared to that of a multicore processor system.

Fig. 15 shows the expected number of required on-chip buffers with different numbers of cores, obtained by setting the trace buffer size (LM) to 1. Seq requires only 2 trace buffers. However, if multiple cores are debugged in the previous method (Multi), the number of required trace buffers increases as 2n. In the case of the proposed method, three kinds of on-chip buffers are required: 2 trace buffers, 2 EC tag buffers, and 2 EC index buffers. As discussed in Section 3, the size of the EC tag buffer is nM and the size of the EC index buffer is M. That is, the on-chip buffer size of the proposed method is 2(L + n + 1)M, which is a reasonable overhead compared to Multi.

Fig. 15. Expected results of the number of required on-chip buffers with different number of cores.

6 CONCLUSION

In this paper, a new DRAM-based post-silicon debug method for multiple identical cores is proposed to reduce the total debug time significantly for various debug cases. Unlike the previous methods, which incur time or on-chip buffer overhead when debugging a multicore system, the proposed method detects the error clock cycles of all cores in only three sessions, which consist of error interval detection, error clock cycle detection and error data capture. This method accelerates the identification of errors when the number of CUDs increases. The hardware area and DRAM data overhead of the proposed method are negligible compared to the increment of the previous methods when multiple cores are debugged at the same time. In addition, the proposed method is compatible with other debug techniques, e.g., debugging communication logic. As a result, the proposed method is suitable for practical debug cases of SoCs or 3D-ICs that include large on-chip memories, such as eDRAM or 3D DRAM.

ACKNOWLEDGMENTS

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2015R1A2A1A ). Sungho Kang is the corresponding author.

REFERENCES

[1] M. Abramovici, P. Bradley, J. Dwarakanath, P. Levin, G. Memmi, and D.
Miller, A reconfigurable design-for-debug infrastructure for SoCs, in Proc. ACM/IEEE Des. Autom. Conf., 2006, pp.
[2] X. Liu, X. Liu, and Q. Xu, On signal tracing for debugging speedpath-related electrical errors in post-silicon validation, in Proc. IEEE Asian Test Symp., Dec. 2010, pp.
[3] M. H. Neishaburi and Z. Zilic, On a new mechanism of trigger generation for post-silicon debugging, IEEE Trans. Comput., vol. 63, no. 9, pp., Sep.
[4] S.-B. Park, T. Hong, and S. Mitra, Post-silicon bug localization in processors using instruction footprint recording and analysis (IFRA), IEEE Trans. Comput.-Aided Des., vol. 28, no. 10, pp., Oct.
[5] K. Chang, I. L. Markov, and V. Bertacco, Automating post-silicon debugging and repair, in Proc. Int. Conf. Comput.-Aided Des., Nov. 2007, pp.
[6] A. Nahir, et al., Bridging pre-silicon verification and post-silicon validation, in Proc. ACM/IEEE Des. Autom. Conf., 2010, pp.
[7] X. Gu, W. Wang, K. Ki, H. Kim, and S. Chung, Re-using DFT logic for functional and silicon debugging test, in Proc. IEEE Int. Test Conf., Oct. 2002, pp.
[8] B. Vermeulen, T. Waayers, and S. K. Goel, Core-based scan architecture for silicon debug, in Proc. IEEE Int. Test Conf., Oct. 2002, pp.
[9] R. Datta, A. Sebastine, and J. A. Abraham, Delay fault testing and silicon debug using scan chains, in Proc. IEEE Eur. Test Symp., May 2004, pp.
[10] K. J. Lee, S. Y. Liang, and A. Su, A low-cost SOC debug platform based on on-chip test architectures, in Proc. SOC Conf., Sep. 2009, pp.
[11] D. Josephson, The manic depression of microprocessor debug, in Proc. IEEE Int. Test Conf., Oct. 2002, pp.
[12] H. F. Ko, A. B. Kinsman, and N. Nicolici, Design-for-debug architecture for distributed embedded logic analysis, IEEE Trans. VLSI Syst., vol. 19, no. 8, pp., Aug.
[13] E. A. Daoud and N. Nicolici, On using lossy compression for repeatable experiments during silicon debug, IEEE Trans. Comput., vol. 60, no. 7, pp., Jul.
[14] H. Oh, T. Han, I. Choi, and S. Kang, An on-chip error detection method to reduce the post-silicon debug time, IEEE Trans. Comput., vol. 66, no. 1, pp., Jan.
[15] J.-S. Yang and N. Touba, Improved trace buffer observation via selective data capture using 2-D compaction for post-silicon debug, IEEE Trans. VLSI Syst., vol. 21, no. 2, pp., Feb.
[16] W. Jung, H. Oh, D. Kang, and S. Kang, A 2-D compaction method using macro block for post-silicon validation, in Proc. Int. SoC Des. Conf., pp., Nov.
[17] Feb [Online]. Available: issue1.pdf
[18] S. Deutsch and K. Chakrabarty, Massive signal tracing using on-chip DRAM for in-system silicon debug, in Proc. IEEE Int. Test Conf., 2014, pp.
[19] G. Giles, J. Wang, A. Sehgal, K. J. Balakrishnan, and J. Wingfield, Test access mechanism for multiple identical cores, in Proc. IEEE Int. Test Conf., Oct. 2008, pp.
[20] M. Sharma, A. Dutta, W.-T. Cheng, B. Benware, and M. Kassab, A novel test access mechanism for failure diagnosis of multiple isolated identical cores, in Proc. IEEE Int. Test Conf., Sep. 2011, pp.
[21] T. Han, I. Choi, and S. Kang, Majority-based test access mechanism for parallel testing of multiple identical cores, IEEE Trans. Very Large Scale Integr. Syst., vol. 23, no. 8, pp., Aug.
[22] D. Wendel, et al., The POWER7 processor SoC, in Proc. Int. Conf. IC Des. Technol., 2010, pp.
[23] H. Sun, et al., 3D DRAM design and application to 3D multicore systems, IEEE Des. Test Comput., vol. 26, no. 5, pp., Sep.
[24] H. F. Ko and N. Nicolici, Combining scan and trace buffers for enhancing real-time observability in post-silicon debugging, in Proc. IEEE Eur. Test Symp., Jul. 2010, pp.
[25] S. Sarangi, B. Greskamp, and J. Torrellas, CADRE: Cycle-accurate deterministic replay for hardware debugging, in Proc. IEEE Int. Conf. Dependable Syst. Netw., Jun. 2006, pp.
[26] I. Silas, I. Frumkin, E. Hazan, E. Mor, and G. Zobin, System-level validation of the Intel Pentium M processor, Intel Technol. J., vol. 7, no. 2, pp., May
[27] B. Quinton and S. Wilton, Programmable logic core based post-silicon debug for SoCs, in Proc. 4th IEEE Silicon Debug Diagnosis Workshop, May
[28] M. Fujita and H. Yoshida, Post-silicon patching for verification/debugging with high-level models and programmable logic, in Proc. 17th Asia South Pacific Des. Autom. Conf., 2012, pp.
[29] Dec. 23, [Online]. Available: amber
[30] Apr. 03, [Online]. Available: downloads/tsmc_library_request/sc_brochure_9.pdf

Hyunggoy Oh received the BS degree in electrical and electronics engineering from Yonsei University, Seoul, Korea, in 2014, where he is currently working toward the MS and PhD degrees in electrical and electronics engineering. His current research interests include design for testability/debug, and system-level test and validation.


Inhyuk Choi received the BS degree in electrical and electronics engineering from Yonsei University, Seoul, Korea, in 2009, where he is currently working toward the MS and PhD degrees in the same field. His current research interests include SoC design, design for testability, and system-level test and validation.

Sungho Kang received the BS degree in control and instrumentation engineering from Seoul National University, Seoul, Korea, and the MS and PhD degrees in electrical and computer engineering from the University of Texas at Austin, Austin, Texas, in …. He was a research scientist with the Schlumberger Laboratory for Computer Science, Schlumberger Inc., Austin, Texas, and a senior staff engineer with Semiconductor Systems Design Technology, Motorola Inc., Austin, Texas. Since 1994, he has been a professor in the Department of Electrical and Electronic Engineering, Yonsei University, Seoul. His current research interests include very-large-scale integration/system-on-chip/3D IC design and testing, design-for-testability, built-in self-test, defect diagnosis, and design-for-manufacturability. He is a senior member of the IEEE.
