An On-Chip Error Detection Method to Reduce the Post-Silicon Debug Time


38 IEEE TRANSACTIONS ON COMPUTERS, VOL. 66, NO. 1, JANUARY 2017

An On-Chip Error Detection Method to Reduce the Post-Silicon Debug Time

Hyunggoy Oh, Taewoo Han, Inhyuk Choi, and Sungho Kang, Member, IEEE

Abstract—Debug time has become a major issue in post-silicon debug because of the increasingly complicated nature of circuit design. However, reducing debug time is a major challenge because of the limited size of the trace buffer used to observe internal signals in the circuit. This study proposes an on-chip error detection method to overcome this challenge. The on-chip process detects the error-suspect window using pre-calculated golden data stored in the trace buffer. This allows the debug data to be selectively compacted and captured in the trace buffer during the error-containing interval. As a result, the reduced number of debug sessions significantly reduces the total debug time. Experimental results on various debug cases show significant reductions in total debug time compared to previous work.

Index Terms—Post-silicon debug, MISR compaction, trace buffer, debug time

1 INTRODUCTION

Developments in semiconductor technology have allowed the integration of a larger number of components into a single chip, such as a system-on-chip (SOC). However, the demand for new features increasingly leads to errors in the circuit and can increase the time to market. Therefore, it is imperative to ensure that the circuit is error-free in order to satisfy time-to-market requirements. Previous research has focused on pre-silicon verification techniques, such as formal verification and simulation, to help designers efficiently detect errors in a circuit [1], [2], [3]. Although these techniques are widely applied to hardware modeling during the implementation process, it is nearly impossible to eliminate all errors, such as logical and electrical errors, in the first silicon as process technologies shrink.
Logical errors are related to designer mistakes caused by the complexity of the design. Electrical errors, on the other hand, occur in certain electrical environments during normal operation [4] and are difficult to detect during pre-silicon verification [5]. It is important to eliminate these errors as early as possible after the arrival of the first silicon to avoid the increased cost of a silicon respin [6]. As a result, post-silicon debug has emerged as an important part of the implementation flow [7], [8], [9].

Post-silicon debug generally comprises two phases: non-deterministic and deterministic [10], [11]. In the non-deterministic phase, bug occurrences cannot be reproduced because of non-deterministic input sources such as asynchronous interfaces, interrupts from peripherals, or mixed-signal circuitry. The main objective in this phase is to determine how to control the failure. When the failure is controllable, the debug environment can be cycle-accurate deterministic. In the deterministic phase, the main goal is to detect the root cause, in terms of space (the erroneous logic) and time (the exact clock cycle when the bug occurs), as quickly as possible using golden data calculated via simulation of the behavioral model of the circuit [10], [12], [13].

Real-time signal tracing methods have been researched to support each debug phase [8], [10], [11], [12], [14]. These approaches include an embedded logic analyzer consisting of trigger units, a sample unit, and an offload unit. The trigger unit determines the start or end point for observing the circuit states, and the traced debug data are captured via the sample unit, which includes an on-chip trace buffer such as embedded memory. Finally, the captured data are unloaded from the internal debug module to the external workstation through the offload unit, and the debug data are analyzed via the debug software to detect the error. The trace buffer-based technique allows the acquisition of real-time data. However, its major problem is limited observability, because the size of the trace buffer contributes to the design-for-debug (DfD) hardware overhead. This limitation results in a large debug time requirement. In addition, unlike the non-deterministic debug phase, which requires certain debug cases, the functional tests for very long debug cycles are performed repetitively in the deterministic debug phase. Hence, a technique that acquires considerably more debug information given the limited capacity of the trace buffer is strongly required in the deterministic debug phase.

An on-chip error detection method is proposed in this study to improve the effective capacity of the trace buffer. The main contributions of this paper are the following:
- The empty area of the trace buffer is re-used to reduce the debug time with negligible hardware overhead. By storing the pre-calculated golden data in the empty area, the debug data can be analyzed in real time and additional debug data can be stored in the trace buffer.
- To support the on-chip method efficiently, a new compaction technique is proposed.

H. Oh, I. Choi, and S. Kang are with the Department of Electrical and Electronics Engineering, Yonsei University, Seoul, Korea. E-mail: {kyob508, ihchoi}@soc.yonsei.ac.kr, shkang@yonsei.ac.kr. T. Han is with the SOC Design Team, Samsung Electronics, Gyeonggi-do, Korea. E-mail: twhan@soc.yonsei.ac.kr.
Manuscript received 31 Aug. 2015; revised 14 Apr. 2016; accepted 27 Apr. 2016. Date of publication 2 May 2016; date of current version 19 Dec. 2016. Recommended for acceptance by K. Chakrabarty.
Digital Object Identifier no. 10.1109/TC.2016.2561920
- To exploit the fact that the on-chip analysis can detect the erroneous intervals in real time, the proposed debug module can compact the erroneous intervals selectively. A new architectural feature using two multiple-input signature registers (MISRs) is proposed to perform the on-chip method with this selective compaction.
- The debug scheduling algorithm and the post-debug analysis are described so that the debug process can be performed properly.

Section 2 provides a review of the related work. Section 3 discusses the iterative debug method with on-chip error detection, and Section 4 describes the experimental results for various debug cases. Finally, conclusions are presented in Section 5.

2 RELATED WORK

A debug architecture using content-addressable memory (CAM) has been researched to compress the real-time debug data and thereby improve the capacity of the trace buffer in the non-deterministic debug phase [15]. Because this compaction technique is based on dictionary coding algorithms, the compaction ratio is modest and strongly dependent on how correlated the debug data are. Nevertheless, the technique can still be useful for acquiring debug data when the debug phase is non-deterministic.

A compaction technique using an MISR has been researched for the deterministic debug phase [10]. In this technique, the whole target observation window is compacted into the trace buffer as signatures. After the captured data are transferred to an external workstation, the captured signatures are analyzed to detect the time intervals in which erroneous data were captured. This is carried out by comparing the captured signatures to golden signatures calculated by simulating the behavioral model of the circuit. In the following debug session, the set of error-suspect windows is compacted, and the error-suspect window is investigated until the specific error cycles are detected.
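The iterative MISR flow of [10] can be illustrated with a small behavioral sketch: a window of debug data is compacted into per-segment signatures, mismatching signatures mark error-suspect segments, and each suspect segment is re-examined at a finer granularity in the next session. The 32-bit width, feedback taps, and window sizes below are illustrative assumptions, not the authors' hardware.

```python
# Behavioral sketch of MISR-based error localization in the prior scheme:
# per-segment signatures are compared against golden signatures, and
# mismatching segments become the error-suspect windows for the next,
# finer-grained debug session.

WIDTH = 32
MASK = (1 << WIDTH) - 1
TAPS = 0xA3000000  # illustrative feedback polynomial

def misr_step(state: int, data: int) -> int:
    """One MISR update: shift, conditional feedback, XOR in the data word."""
    feedback = TAPS if (state >> (WIDTH - 1)) & 1 else 0
    return ((state << 1) & MASK) ^ feedback ^ data

def compact(window) -> int:
    """Compact a sequence of per-cycle debug words into one signature."""
    state = 0
    for word in window:
        state = misr_step(state, word)
    return state

def failing_segments(trace, golden, samples_per_sig):
    """Return (start, end) ranges whose signature differs from the golden one."""
    suspects = []
    for start in range(0, len(trace), samples_per_sig):
        end = min(start + samples_per_sig, len(trace))
        if compact(trace[start:end]) != compact(golden[start:end]):
            suspects.append((start, end))
    return suspects

# A 64-cycle window with a single-bit error at cycle 37, 16 samples/signature:
golden = [i * 3 for i in range(64)]
trace = list(golden)
trace[37] ^= 0x1
print(failing_segments(trace, golden, 16))  # -> [(32, 48)]
```

In the next session, only cycles 32–48 would be observed at a finer compaction ratio, which is why the number of sessions grows when the compaction ratio is fixed by the trace buffer size.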
Compacting long debug cycles into a signature and detecting the error cycles within the error-suspect window improve the debug time of this process compared to that of the conventional debug method. However, this method still requires many debug sessions to detect the error cycles, because the compaction ratio is fixed by the size of the trace buffer.

In [12], a selective debug capture method that requires three debug sessions using 2-D compaction is proposed. In the first session, the estimated error rate is calculated with a parity generator. The error-suspect clock cycles are then determined via 2-D compaction in the second session. Finally, the erroneous debug data are captured with pre-calculated tag bits. With this three-pass methodology, the method significantly expands the observation window. However, it has some limitations when employed in various debug experiments. First, more debug sessions may be required, because the first step, which estimates the error rate, is strongly dependent on the error distribution. Furthermore, the method can only be applied to short debug cycles, because the ability of 2-D compaction to detect the error cycles becomes increasingly inaccurate as the size of the target observation window increases. Consequently, this method is only suitable for cases with a short debug time, or as a supplementary technique to other debug methods for cases with long debug times.

This paper proposes an on-chip error detection method to reduce the total debug time in a cycle-accurate deterministic debug environment by improving the MISR compaction technique of [10]. It should be noted that the proposed idea is applicable to the deterministic debug phase, as in the previous study [10]. The main concept of the proposed method is that the process of detecting the error-suspect cycles is performed twice, both on-chip and off-chip.

Fig. 1. The debug flow of the on-chip error detection method.
This contrasts with the existing methods, in which the error detection process is performed outside the chip after the debug experiment ends. Furthermore, by applying on-chip error detection, the debug data for the error-suspect window can be compacted selectively with two MISRs to detect the erroneous cycles more rapidly.

3 PROPOSED ITERATIVE DEBUG METHOD WITH ON-CHIP ERROR DETECTION

In [10], an iterative error detection method that detects the erroneous cycles of a long observation window using MISR compaction is introduced. This study expands on that method. First, the concept of the debug framework using an on-chip error detection method is explained. Then, a selective compaction technique that supplements the on-chip method to significantly reduce the debug time is presented. In addition, the debug scheduling algorithm and post-debug analysis are described. Finally, the hardware architecture of the proposed debug module is introduced.

Fig. 2. Examples of error detection methods using MISR compaction. (a) The previous method. (b) The proposed method.

3.1 The Debug Framework Using an On-Chip Error Detection Method

On-chip error detection is proposed in this paper to reduce the number of debug sessions. The debug flow of the on-chip error detection method is illustrated in Fig. 1. In contrast to the previous work, the proposed method provides two error detection processes, on-chip and off-chip ((4) and (8) in Fig. 1), and thus detects the error cycles more rapidly. First, the golden signatures (GSs) are generated in the debug configuration step. Since the debug phase is deterministic and the debug data are predetermined, the GSs of the debug data can be acquired in advance. When the debug process starts with the configuration, such as the trigger event condition and debug data selection, the GSs are uploaded to the trace buffer through a serial interface (e.g., JTAG).
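The on-chip detection step in this flow can be sketched behaviorally: each signature compacted on-chip is compared with a preloaded GS, the result is recorded as a one-bit flag (the tag bit described below), and every consumed GS slot in the trace buffer becomes overwritable space for additional debug data. The helper names and buffer sizes here are illustrative assumptions.

```python
# Behavioral sketch of the on-chip comparison against preloaded golden
# signatures (GSs): one tag bit per GS (1 = failing interval), plus a count
# of the GS slots that become overwritable once they have been compared.

def on_chip_detect(captured_signatures, golden_signatures):
    """Compare captured vs. golden signatures; return tag bits and freed slots."""
    tag_bits = []
    freed_slots = 0  # GS entries already compared -> overwritable buffer area
    for captured, golden in zip(captured_signatures, golden_signatures):
        tag_bits.append(1 if captured != golden else 0)
        freed_slots += 1
    return tag_bits, freed_slots

# Toy run: intervals 2 and 5 (0-indexed) carry errors.
golden = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
captured = list(golden)
captured[2] ^= 1
captured[5] ^= 1
tags, freed = on_chip_detect(captured, golden)
print(tags)   # -> [0, 0, 1, 0, 0, 1, 0, 0]
print(freed)  # -> 8 (all GS slots reusable for additional debug data)
```

Only the tag bits need to leave the chip to identify the failing intervals, which is what lets the freed GS area hold extra debug data within the same session.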
After the functional operation starts with deterministic input data, the debug process begins when the trigger events fire. During the debug phase, the debug data over certain cycles are compacted via the MISR and compared to the GS in order to detect an erroneous interval in real time. If the signature values are the same, it is not necessary to observe the data during those cycles; if not, the data should be analyzed to detect the error cycles. To record the on-chip analysis result, only one bit is required per GS (1 indicates a failure; 0 indicates no failure). This bit is called a tag bit in this paper. The tag bits are captured in the tag bit register, and additional debug data can be captured in the trace buffer by overwriting the area in which GSs are stored, because those GSs are no longer needed. At the final debug level, the erroneous debug data identified by the on-chip analysis are captured. Therefore, additional debug data can be captured in the trace buffer using the space remaining after the erroneous data have been captured. After the debug session is completed, the captured data in the trace buffer and tag bit register are transferred to the workstation (off-chip) and analyzed to determine the error-suspect cycles in a process called post-debug analysis.

A simple comparison between the previous and proposed methods is provided in Fig. 2. In this example, eight failure signatures are identified at the previous debug level, and it is assumed that four failure signatures can be compacted into the trace buffer, i.e., the segmentation size of the trace buffer is four. If only post-debug analysis is performed, two debug sessions are required to capture the debug data: the debug data for the first four failure signatures (signatures 1–4) are acquired in the trace buffer during debug session 1, and the data for the last four failure signatures (signatures 5–8) are acquired during debug session 2, as described in Fig. 2(a). Conversely, if the on-chip analysis is performed, the debug data for signatures 1–4 can be analyzed via the on-chip method and the results encoded as tag bits, as described in Fig. 2(b). Furthermore, after the on-chip analysis, the debug data for signatures 5–8 can be acquired in the trace buffer without stopping the debug session. As a result, the debug information for signatures 1–8 can be acquired in only one debug session.

TABLE 1
Notation for Debug Experiments

Name    | Representation
N       | The length of the observation window
M       | Trace buffer depth
GS      | Golden signature
RS      | Reference signature
SS      | Segmented signature
SE      | Segmentation variable
TP      | Trigger point
S       | The number of segments in the trace buffer
SNG     | Standard number of golden signatures
SPRS_i  | Samples per reference signature at debug level i
SPSS_i  | Samples per segmented signature at debug level i
DS      | Total number of debug sessions for the CUD
T_CUD   | Time for running debug sessions on the CUD

3.2 A Selective Compaction Technique to Reduce the Error-Suspect Window

As explained in Section 3.1, the on-chip analysis can detect the erroneous interval with the GSs in real time. To exploit this fact, a new compaction technique that selectively compacts the erroneous interval with a high ratio is proposed in this section to further reduce the number of debug sessions. The notation, similar to that used in [10], is given in Table 1. First, GSs are stored in the trace buffer and the debug data are compacted via two MISRs.
The first MISR, termed MISR 1, has the same compaction ratio as the GS, and the second MISR, termed MISR 2, has a higher compaction ratio in order to focus on the erroneous interval. This ratio is defined by the segmentation variable (SE). The signature compacted by MISR 1, i.e., the reference signature (RS), is compared with the GS in real time to determine whether or not the interval is erroneous. While a GS is being analyzed, SE signatures, i.e., segmented signatures (SSs), are generated by MISR 2 and captured in the overwritable area of the trace buffer. As previously explained, the overwritable area is the area occupied by GSs that have already been used to analyze an RS and so are no longer needed. To handle the case in which the first RS indicates a failure, SE GSs are captured in the golden register, whose size is SE, after the debug process starts; each subsequent GS is captured in the golden register after its comparison with the RS ends. If the RS does not indicate a failure, the area of the captured SSs becomes overwritable, because they are also error-free. The next RS is then analyzed through comparison with the next GS. On the other hand, if the RS indicates a failure, the captured SSs are retained, and the overwritable area is reduced by SE. The next GS is then analyzed in the same manner. Hence, the results of the on-chip analysis are captured as tag bits, and the erroneous intervals are compacted with the higher ratio, SE. After the on-chip analysis ends, more debug data can be obtained until the trace buffer is full, as explained in Section 3.1.

Fig. 3 illustrates the selective compaction technique for each debug case. For ease of understanding, a simple debug case is used in which the samples per RS (SPRS) is 100 cycles and the samples per SS (SPSS) is 50 cycles, i.e., SE is 2. If the RS is error-free, the area of the captured SSs becomes overwritable, because they are also error-free, as described in Fig. 3(a); the tag bit is then generated as 0. On the other hand, if the RS is error-suspect, the captured SSs are retained in the trace buffer and the tag bit is stored as 1, as described in Fig. 3(b). In this process, the error-suspect interval can be selectively compacted further, and the error-suspect window can be detected faster than in the previous work. However, in some cases the overwritable area may be insufficient to capture all the SSs. In that case, only as many SSs as fit in the overwritable area are captured, as described in Fig. 3(c): if the RSs for cycles 101–200 and 201–300 failed, the SSs for cycles 101–150, 151–200, 201–250, and 251–300 were stored in the trace buffer, so only the SS for cycles 301–350 can be captured. Such cases do not prevent the proposed method from operating properly; they merely affect the quality of the selective compaction, because the results of the on-chip analysis are still stored as tag bits. In this case, only the interval of cycles 301–350 can be checked for being error-suspect, and the remaining interval (cycles 351–400) must be treated as error-suspect during the post-debug analysis.

Fig. 3. Examples of the selective compaction technique in the debug case where (a) the RS is error-free, (b) the RS is error-suspect, and (c) the RS is error-suspect and the overwritable area is insufficient.

It should be noted that for post-silicon debug, in which pre-silicon verification and manufacturing test have already been passed, errors may occur only in some corner cases and the error rate is very low. As a result, sufficient overwritable area is available in most cases. As described above, the worst cases are strongly related to the error distribution and error rate; therefore, the quality of the selective compaction is demonstrated with various debug cases in Section 4.

3.3 Debug Scheduling Algorithm and Post-Debug Analysis

To perform a debug run, the debug scheduling performed before the debug starts and the post-debug analysis performed after the debug ends are introduced. The scheduling algorithm determines the total number of debug sessions (DS) and the circuit running time (T_CUD); it is described in Algorithm 1. At the first debug level, SPRS_0 = N/M and SPSS_0 = SPRS_0/SE. At this point, DS is set to 1, and T_CUD is equal to N. The standard number of golden signatures (SNG) is set to (M/S)/SE, which means that the size of a trace buffer segment is divided by SE (line 2). As SNG represents the number of SSs that can be captured in the trace buffer, this standard determines SPRS_i and SPSS_i at each debug level. The algorithm repeats until the final debug level is reached (line 3). The computation for each debug level can be explained as follows (lines 4–11):

Algorithm 1.
Scheduling of the Whole Debug Experiment
Input: M, S, SE, SNG, SPRS_0 and SPSS_0
Output: DS and T_CUD
1  SPRS_0 = N/M; SPSS_0 = SPRS_0/SE; DS = 1; T_CUD = N; i = 0;
2  SNG = (M/S)/SE;
3  while (final debug level not reached) do
4    if (SPSS_i > SNG) then
5      SPRS_{i+1} = SPSS_i/SNG;
6      if (selective compaction possible) then
7        SPSS_{i+1} = SPRS_i/SE;
8      else
9        SPSS_{i+1} = 1;
10   else
11     SPRS_{i+1} = 1; SPSS_{i+1} = 1;
12   Run the debug experiment with SPRS_i and SPSS_i;
13   Detect the error interval with the on-chip and post-debug analyses;
14   Update DS and T_CUD for the current debug level;
15   i++;
16 end
17 Run the final debug level;
18 Update DS and T_CUD at the final debug level;
19 return DS, T_CUD;

If SPSS_i > SNG, then SPRS_{i+1} = SPSS_i/SNG. In addition, SPSS_{i+1} = SPRS_i/SE when SPRS_i/SE > 1, which means that selective compaction can be adapted to the debug experiment. Otherwise, SPSS_{i+1} is 1, because selective compaction cannot be adapted further. If SPSS_i < SNG, the debug level moves to the final level with SPRS_{i+1} = 1 and SPSS_{i+1} = 1. After this computation, the debug experiment is performed with the calculated SPRS_i and SPSS_i. After the experiment at the current debug level ends, DS and T_CUD are updated, and the debug experiment for the next level is performed by incrementing the debug level (line 15). At the final debug level, the erroneous debug data are captured with the on-chip analysis, and additional debug data are captured to detect the erroneous data during the post-debug analysis, as explained in Section 3.1. After the debug process is complete, DS and T_CUD are calculated and returned at the end of the algorithm (line 19).

Post-debug analysis is performed on a workstation with the result data of the on-chip analysis; the process is described in Fig. 4. The captured data in the trace buffer are divided into two sets: the SSs identified via the on-chip method, and the additional debug data captured after the on-chip method. If the debug level is non-final, it is important to determine the required number of SSs per GS and the total number of SSs. As described in Section 3.2, the number of SSs captured while a GS is analyzed can vary with the size of the overwritable area, and the overwritable area can be computed from the sequence of tag bits. As a result, the SSs can be identified using the tag bit data, the GS set for the on-chip analysis, and the trace buffer data. To analyze the error-suspect window, the SSs are compared on the workstation to GSs generated with the same compaction ratio, SPSS_i. The error-suspect window is then determined based on the current trigger point and the set of failure signatures with SPSS_i, and the TP for the next debug session can be acquired. After the SSs are analyzed, the additional debug data set can be processed in the same manner: the error-suspect window is also determined by comparing the additional data with the GSs, but with compaction ratio SPRS_i. Consequently, the next TP is computed by dividing the error-suspect window by SE. In this way, the error-suspect window of each debug session at a non-final debug level is determined.

At the final debug level, the number of tag bits indicates the number of erroneous data items. Consequently, the erroneous data set can be recovered from the trace buffer data, the tag bit data, and the current-level TP, which indicates the interval analyzed by the on-chip method. After the erroneous data set has been identified using the on-chip method, the additional debug data are analyzed in the post-debug analysis. Because the additional debug data are the remaining data in the trace buffer, i.e., the difference between the trace buffer data and the number of tag bits, the erroneous data can be determined by comparison with the pre-calculated golden data on the workstation.

Fig. 4. The process of the post-debug analysis with the result data of the on-chip analysis.
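The scheduling recurrence of Algorithm 1 can be sketched in Python. Only the per-level compaction ratios (SPRS_i, SPSS_i) are derived here; the DS and T_CUD updates depend on experiment outcomes, so this sketch assumes a single N-cycle debug session per level. One labeled assumption: the paper's line 7 reads SPSS_{i+1} = SPRS_i/SE, while this sketch uses SPRS_{i+1}/SE so that the segment granularity shrinks at every level.

```python
# Sketch of the scheduling recurrence in Algorithm 1 (assumptions noted in
# the lead-in: one N-cycle session per level, and SPSS_{i+1} derived from
# SPRS_{i+1} rather than SPRS_i).

def schedule(N, M, S, SE):
    sprs = N // M                  # samples per reference signature, level 0
    spss = max(sprs // SE, 1)      # samples per segmented signature, level 0
    sng = max((M // S) // SE, 1)   # SSs that fit in one trace buffer segment
    levels = [(sprs, spss)]
    ds, t_cud = 1, N
    while spss > sng:              # final debug level not yet reached
        sprs = max(spss // sng, 1)
        spss = sprs // SE if sprs // SE > 1 else 1
        levels.append((sprs, spss))
        ds += 1
        t_cud += N                 # each level re-runs the N-cycle window once
    return levels, ds, t_cud

# Toy numbers (illustrative, not the paper's experiments): a 1M-cycle
# observation window, 1K-entry trace buffer, 4 segments, SE = 2.
levels, ds, t = schedule(N=1_000_000, M=1024, S=4, SE=2)
print(levels)  # -> [(976, 488), (3, 1)]
print(ds, t)   # -> 2 2000000
```

With these numbers the schedule converges in two levels: the first level compacts 976 cycles per RS (488 per SS), and the second reaches SPSS = 1, at which point the erroneous cycles themselves can be captured.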
In this way, the erroneous data of each session at the final debug level can be obtained. 3.4 The Hardware Architecture of the Proposed Debug Module The hardware architecture of the proposed on-chip method is illustrated in Fig. 5. The debug configuration module is controlled via a low-bandwidth interface such as JTAG during the configuration step. This configuration module controls the starting points of the

42 IEEE TRANSACTIONS ON COMPUTERS, VOL. 66, NO. 1, JANUARY 2017

TABLE 2 Debug Time Reduction Ratio Comparison for Different Standard Deviations and Error Rates

Fig. 5. Hardware architecture of the proposed debug module of the on-chip method.

debug process, the debug data, and the trace buffer. In addition, the GSs are stored in the trace buffer, the trigger points are established in the trigger point register, and the compaction ratios of the two MISRs are determined in this module. After the start of the debug process, the GSs are captured into the golden register up to the size of SE. Following the comparison between the GS and RS via the comparator, the result is stored in the tag bit register as a tag bit, as discussed previously. The size of the tag bit register is the same as the trace buffer depth. To control the trace buffer during the on-chip process, a finite state machine (FSM) is added to the proposed module. During the on-chip analysis, the FSM receives the result from the comparator and controls the write and read addresses to capture SSs into the trace buffer and the GS into the golden register. In addition, the FSM computes the overwritable area in real time using the write and read addresses, to avoid overwriting GSs that have not yet been analyzed. After the on-chip process ends, the FSM computes the difference between the write and read addresses to determine the remaining area in the trace buffer and to capture the additional debug data. The data in the tag bit register and the trace buffer are transferred to the external workstation via the low-bandwidth interface and analyzed to detect the error-suspect window and the error data.

4 EXPERIMENTAL RESULTS

This section discusses the experimental results in terms of the debug time and the hardware area of the debug module to illustrate how the proposed method improves on the previous work [10].
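Before turning to the results, the FSM pointer bookkeeping described in Section 3.4 can be sketched as follows. This is a hypothetical model under assumptions of our own: the paper does not specify the pointer discipline, so the circular-buffer arithmetic below is an illustration, not the actual design.

```python
# Hypothetical model of the FSM's address bookkeeping (an assumption, not
# taken from the paper): the trace buffer is treated as a circular buffer
# of depth M.  Entries already consumed by the analysis may be overwritten
# with newly captured SSs, while entries still holding unanalyzed GSs
# must be preserved.

def overwritable_area(write_addr, read_addr, depth):
    """Entries already analyzed, hence safe to overwrite with new SSs."""
    return (read_addr - write_addr) % depth

def remaining_area(write_addr, read_addr, depth):
    """Entries still occupied after the on-chip process ends; the FSM
    uses this difference to size the additional debug data capture."""
    return (write_addr - read_addr) % depth

# Example: depth-8 buffer, writer at entry 2, reader at entry 5.
print(overwritable_area(2, 5, 8), remaining_area(2, 5, 8))
```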
The experimental results are presented for an ARM-based processor design [16] and an MP3 audio decoder [10] to facilitate a comparison of the results. Each debug module is designed as a Verilog RTL model and synthesized using a 130 nm ASIC standard cell library to estimate the area. A 32-bit data bus is assumed as the debug data in the ARM-based design, and the output of the decoder is used to collect the debug data of the MP3 decoder. Faults were randomly injected into the circuits to produce misbehavior. The concepts of the on-chip sampling time and the communication time [10] are used to compute the debug execution time. The on-chip sampling time is related to the clock cycles that elapse from the trigger point until the debug session ends, and the communication time is the time during which the debug data are offloaded through the JTAG interface. According to [10], the total debug times of the sequential debug case and of the previous method are calculated as follows; the sequential debug case means that no compaction technique is applied during the debug experiment. Note that L is the width of the trace buffer.

T_seq = (1 + N/M) · (N/2) · (1/f_CUD) + N · L · (1/f_JTAG)    (1)

T_prev = DS_CUD · M · L · (1/f_JTAG) + T_CUD · (1/f_CUD)    (2)

The total debug time of the proposed method can be calculated in the same manner. Unlike the previous work [10], the communication time of the proposed method is DS_prop · M · (2L + 1) because of the additional time required for storing the golden data in the trace buffer and for offloading the tag bit data, whose size is M, from the tag bit register. Therefore, T_prop is calculated as

T_prop = DS_CUD · M · (2L + 1) · (1/f_JTAG) + T_CUD · (1/f_CUD)    (3)

Although the communication time per debug session increases slightly, to M · (2L + 1), the total communication time of the proposed method is reduced because the on-chip error detection method requires fewer debug sessions (DS_CUD) than [10]. Furthermore, the on-chip sampling time can also be reduced because T_CUD depends strongly on DS_CUD. Therefore, the total debug time of the proposed method can be reduced significantly compared to that in [10]. To compare the experimental results of [10] and the proposed method clearly, the debug time reduction ratio (T_seq/T_prop(prev)) from [10] is also used in this paper. Table 2 shows the debug time reduction ratios of [10] and of the proposed method for the experiment in which N = 2M cycles and M = 512, with different error rates and error distributions.

                                Debug time reduction ratio (T_seq/T_prop(prev))
σ      Error rate (%)     [10]      SE = 1    SE = 2    SE = 4
ARM-Based Design [16]
4      0.012             104.26    175.65    229.70    296.14
       0.051              59.61    104.31    132.96    150.72
       0.100              48.01     84.39    102.45    112.80
       0.490              36.47     60.05     66.06     70.60
       1.243              32.53     44.19     46.92     49.13
8      0.012              98.08    164.75    216.95    266.91
       0.051              50.34     89.39    120.65    133.58
       0.100              36.48     65.47     83.14     89.10
       0.490              22.79     39.87     43.96     46.24
       1.243              20.01     32.31     34.38     35.94
16     0.012              89.92    154.23    197.25    245.43
       0.051              44.14     79.10    103.19    132.02
       0.100              29.66     53.93     73.74     80.54
       0.490              14.34     25.98     29.84     31.19
       1.243              11.83     20.64     22.29     23.11
32     0.012              79.68    137.99    178.87    219.28
       0.051              38.94     70.35     89.03    109.28
       0.100              25.44     46.62     61.24     77.62
       0.490               9.79     18.02     22.21     25.89
       1.243               7.13     12.92     14.51     15.07
MP3 Decoder [10]
4      0.051              69.88    114.46    143.83    161.99
       0.16               58.88     94.56    112.61    124.07
       1.59               42.78     54.46     56.19     59.28
8      0.051              61.61    100.66    130.18    143.85
       0.16               46.75     74.74     93.41     99.25
       1.59               31.28     42.58     45.65     46.21
16     0.051              55.41     88.37    114.06    142.29
       0.16               39.93     64.72     85.21     90.81
       1.59               21.98     31.91     31.56     33.38
32     0.051              50.21     80.62    100.33    119.55
       0.16               34.71     56.89     71.51     87.77
       1.59               17.44     23.19     24.78     25.34

The error distribution is a Gaussian distribution with a mean (μ) of N/2; its standard deviation (σ) is shown in the first column. The error rates are computed as the

number of cycles with errors divided by N, and they are shown in the second column. Increasing σ results in a wider, more random error distribution, and an increasing error rate means that the number of error cycles increases. As shown in Table 2, the debug time reduction ratio increases significantly as σ and the error rate decrease. This is because these iterative debug methods detect the errors more quickly when the errors form a burst or when there are few errors. The proposed method achieves a higher debug time reduction ratio than [10] as SE increases: with the selective compaction technique, the error-suspect intervals can be compacted with a higher compaction ratio, and the total number of debug sessions DS_CUD is reduced further than in [10]. The case in which SE = 1 refers to the debug case that uses on-chip error detection without the selective compaction technique described in Section 3.1. In addition, the debug time reduction ratio for a higher SE increases as the error rate and the standard deviation of the error distribution decrease, because the selective compaction technique is influenced by the error distribution when overwriting the golden data in the trace buffer, as described in Section 3.2. If the error data are sparsely distributed, selective compaction better reduces the total debug time by overwriting the golden data. Therefore, the proposed method using the selective compaction technique is more efficient than [10] when the debug case has a lower error rate.

TABLE 3 Hardware Area Overhead Comparison with Different Trace Buffer Depth Sizes

                          Hardware area (NAND2 equivalents)
Trace buffer depth (M)    [10]      SE = 1    SE = 2    SE = 4
64                          -        2489      3100      3521
128                         -        2812      3518      3928
256                       2037       3143      3790      4588
512                         -        3450      4254      4980

Fig. 6. Debug time reduction ratio estimation with different error rates.

Fig.
6 shows the estimated debug time reduction ratio for different error rates, for the experiment in which N = 2M cycles and M = 512 with a uniform error rate in the ARM-based design. As the graph shows, the debug time reduction ratio increases as the error rate decreases, because the selective compaction technique can reduce the error-suspect window further. As the error rate increases, the debug time reduction ratio of the proposed method and the efficiency of the selective compaction technique decrease. However, the proposed method with the on-chip method still improves the ratio more than [10]. Furthermore, because the error rate is typically low in post-silicon debug, as discussed in Section 3.2, the proposed method can be adapted to practical debug cases.

Fig. 7. Debug time reduction ratio estimation with different trace buffer depths.

Fig. 7 shows the estimated debug time reduction ratio for different M, for the debug cases in which the error rate is 0.16 percent and the error distribution is uniform in the ARM-based design. The debug time reduction ratio increases as M decreases. That is, the quality of the selective compaction is higher when M is low, which means that the proposed method can be more efficient when the amount of debug data is small due to the limited capacity of the trace buffer. Since the size of the trace buffer is determined during the design process, a smaller trace buffer can be used in the proposed method while still satisfying the required debug time overhead. Table 3 compares the hardware aspects of the debug modules of [10] and of the method proposed in this study, in terms of two-input NAND (NAND2) gates, for different trace buffer sizes. The modules are designed in RTL code and synthesized; the results in Table 3 refer only to the logic area and do not account for the trace buffer. In the case of the proposed method with SE = 1, the comparator, the tag bit register, and the FSM for analyzing the debug data with the golden data in the trace buffer are included.
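The debug time reduction ratios discussed above follow from the timing model in Eqs. (1)-(3). The sketch below recomputes such a ratio; every parameter value is an illustrative assumption of ours, not one of the paper's experimental settings.

```python
# Sketch of the debug-time model in Eqs. (1)-(3).  All numeric values
# below are assumptions for illustration, not the paper's experiments.

def t_seq(N, M, L, f_cud, f_jtag):
    # Eq. (1): sequential debugging with no compaction.
    return (1 + N / M) * (N / 2) * (1 / f_cud) + N * L * (1 / f_jtag)

def t_prev(ds_cud, M, L, t_cud, f_cud, f_jtag):
    # Eq. (2): previous method [10].
    return ds_cud * M * L * (1 / f_jtag) + t_cud * (1 / f_cud)

def t_prop(ds_cud, M, L, t_cud, f_cud, f_jtag):
    # Eq. (3): proposed method; each session additionally offloads the
    # golden data (L bits per entry) and one tag bit per entry, hence
    # the (2L + 1) factor instead of L.
    return ds_cud * M * (2 * L + 1) * (1 / f_jtag) + t_cud * (1 / f_cud)

# Assumed parameters: 32-bit trace data, M = 512, a debug interval of
# N = 1,024,000 cycles, a 200 MHz circuit clock, and a 10 MHz JTAG clock.
N, M, L = 1_024_000, 512, 32
f_cud, f_jtag = 200e6, 10e6
# Assumed session count and cumulative sampling cycles for the proposed run.
ratio = t_seq(N, M, L, f_cud, f_jtag) / t_prop(
    ds_cud=20, M=M, L=L, t_cud=5 * N, f_cud=f_cud, f_jtag=f_jtag)
print(f"debug time reduction ratio T_seq/T_prop = {ratio:.1f}")
```

Consistent with the discussion above, the ratio grows as the number of sessions (ds_cud) shrinks, since the per-session communication penalty of (2L + 1) is quickly amortized.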
On the other hand, the proposed method with SE = 2 or 4 requires an additional MISR and a golden register to apply the selective compaction technique. Because of these additional debug modules, including the MISR, the comparator, and the tag bit register, the area required by the proposed method is slightly larger than that in [10]. However, the trace buffer accounts for a large portion of the DfD hardware overhead, and the area of the debug module with SE = 4 and M = 512 is still less than 5 percent of the size of a 4-Kbyte embedded trace buffer implemented in the same technology [10]. That is, the hardware overhead of the proposed debug module remains negligible. If there were no limitation on the trace buffer size, the ratio between the time reduction and the hardware overhead would ideally remain constant. However, the on-chip method re-uses the empty area of a trace buffer of limited size. As a result, the increment of the debug time reduction ratio saturates as SE increases, even though the hardware overhead keeps growing. Nevertheless, the debug time reduction remains meaningful even though the increment of the reduction ratio becomes smaller as SE increases, because the total debug time can be reduced drastically as the number of designs or debug cases under post-silicon debug increases.

5 CONCLUSION

In this paper, an on-chip error detection method is proposed to reduce the total debug time for various debug cases. The proposed method performs the on-chip error detection process by re-using the empty space of the trace buffer. In addition, the selective compaction technique enables the on-chip detection method to reduce the number of debug sessions by detecting the error interval information in real time. As a result, the proposed method can significantly reduce the debug time with a negligible hardware overhead compared to the previous study.
ACKNOWLEDGMENTS

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2015R1A2A1A13001751). Sungho Kang is the corresponding author.

REFERENCES

[1] D. Van Campenhout, H. Al-Assad, J. P. Hayes, T. Mudge, and R. B. Brown, "High-level design verification of microprocessors via error modeling," ACM Trans. Des. Autom. Electron. Syst., vol. 3, no. 4, pp. 581–599, 1998.
[2] K. Radecka and Z. Zilic, "Design verification by test vectors and arithmetic transform universal test set," IEEE Trans. Comput., vol. 53, no. 5, pp. 628–640, May 2004.
[3] W. K. Lam, Hardware Design Verification: Simulation and Formal Method-Based Approaches. Englewood Cliffs, NJ, USA: Prentice Hall, 2005.
[4] X. Liu and Q. Xu, "On signal tracing for debugging speedpath-related electrical errors in post-silicon validation," in Proc. IEEE Asian Test Symp., Dec. 2010, pp. 243–248.
[5] S.-B. Park, T. Hong, and S. Mitra, "Post-silicon bug localization in processors using instruction footprint recording and analysis (IFRA)," IEEE Trans. Comput.-Aided Des., vol. 28, no. 10, pp. 1545–1558, Oct. 2009.
[6] B. Vermeulen and S. K. Goel, "Design for debug: Catching design errors in digital chips," IEEE Des. Test Comput., vol. 19, no. 3, pp. 35–43, May 2002.
[7] Y.-C. Hsu, F. Tsai, W. Jong, and Y.-T. Chang, "Visibility enhancement for silicon debug," in Proc. ACM/IEEE Des. Autom. Conf., 2006, pp. 13–18.
[8] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller, "A reconfigurable design-for-debug infrastructure for SoCs," in Proc. ACM/IEEE Des. Autom. Conf., 2006, pp. 7–12.
[9] S. Mitra, S. A. Seshia, and N. Nicolici, "Post-silicon validation opportunities, challenges and recent advances," in Proc. ACM/IEEE Des. Autom. Conf., 2010, pp. 12–17.
[10] E. A. Daoud and N. Nicolici, "On using lossy compression for repeatable experiments during silicon debug," IEEE Trans. Comput., vol. 60, no. 7, pp. 937–950, Jul. 2011.
[11] H. F. Ko and N. Nicolici, "Combining scan and trace buffers for enhancing real-time observability in post-silicon debugging," in Proc. 15th IEEE Eur. Test Symp., 2010, pp. 62–67.
[12] J.-S. Yang and N. Touba, "Improved trace buffer observation via selective data capture using 2-D compaction for post-silicon debug," IEEE Trans. VLSI Syst., vol. 21, no. 2, pp. 320–328, Feb. 2013.
[13] S. Sarangi, B. Greskamp, and J. Torrellas, "CADRE: Cycle-accurate deterministic replay for hardware debugging," in Proc. IEEE Int. Conf. Dependable Syst. Netw., Jun. 2006, pp. 301–312.
[14] H. F. Ko, A. B. Kinsman, and N. Nicolici, "Design-for-debug architecture for distributed embedded logic analysis," IEEE Trans. VLSI Syst., vol. 19, no. 8, pp. 1380–1393, Aug. 2011.
[15] E. A. Daoud and N. Nicolici, "Real-time lossless compression for silicon debug," IEEE Trans. Comput.-Aided Des., vol. 28, no. 9, pp. 1387–1400, Sep. 2009.
[16] (2010, Dec.). [Online]. Available: http://opencores.org/project,amber