Test Resource Reused Debug Scheme to Reduce the Post-Silicon Debug Cost

IEEE TRANSACTIONS ON COMPUTERS, VOL. 67, NO. 12, DECEMBER 2018, p. 1835

Inhyuk Choi, Hyunggoy Oh, Young-Woo Lee, and Sungho Kang, Senior Member, IEEE

Abstract: In this paper, a design for debug (DFD) method that reuses test resources is proposed to reduce the debug cost in post-silicon validation. With the proposed method, the trace buffer is shared among embedded cores to capture the signatures of each core concurrently by reusing a test access mechanism. The depth of the trace buffer allocated to each core is reconfigurable and varies according to the debug scheme. The experimental results indicate that the proposed DFD significantly reduces the debug time when the trace buffer is shared by cores in various debug cases.

Index Terms: Post-silicon validation, test access mechanism, embedded cores, shared trace buffer, debug time

1 INTRODUCTION

As semiconductor manufacturing technologies have advanced significantly, an increasing number of functional features have been integrated onto a single chip, such as a system on chip (SoC). After design and pre-silicon verification, manufacturing tests, such as scan tests, are necessary to detect faults caused by technical problems in the manufacturing process. In addition, boundary scans (IEEE 1149.1) or test access mechanisms (TAMs) can be applied for efficient test access to the embedded cores in an SoC [1]. However, functional or electrical errors that are not detected during the manufacturing tests can still be present in the first silicon. These errors result in incorrect functional operation, which differs from the operational results the engineer intended. If a modification of the design is required, the manufacturing costs increase and the time-to-market requirements cannot be met [2]. Therefore, the engineer may attempt to detect the errors before silicon respinning.
However, the consequent efforts to detect such errors incur additional costs. For this reason, reducing the post-silicon debug cost is important. To observe consecutive debug cycles, a trace buffer is generally used to store consecutive debug data. However, the number of observed debug cycles is constrained by the size of the trace buffer [3]. Therefore, recent debug techniques have improved observability by compacting the trace debug data using a multiple-input signature register (MISR) for deterministic and repeatable errors [4], [5], [6], [7], [8]. Nevertheless, the debug cost in terms of trace buffer size increases because a modern SoC consists of many embedded cores to debug. For this reason, a test resource reused debug scheme is proposed in this paper to reduce the debug costs of embedded cores, in consideration of their operational relation, by sharing the trace buffer. The contributions of the proposed debug scheme are as follows:

- To reduce the debug cost, the proposed debug scheme introduces a methodology for sharing the trace buffer. In addition, the unavoidable increment in debug time caused by the reduction in the trace buffer is reduced.
- The proposed debug scheme reuses the TAM already implemented for the manufacturing test step to share the trace buffer. The TAM transfers the debug data without additional routing efforts.

The authors are with the Department of Electrical and Electronics Engineering, Yonsei University, Seoul 03722, Korea. E-mail: {ihchoi, kyob508, roberto}@soc.yonsei.ac.kr, shkang@yonsei.ac.kr. Manuscript received 15 Oct. 2017; revised 7 May 2018; accepted 7 May 2018. Date of publication 15 May 2018; date of current version 7 Nov. 2018. (Corresponding author: Sungho Kang.) Recommended for acceptance by P. Girard. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TC.2018.2835462
2 PROPOSED DEBUG SCHEME

The following sections describe the detailed operation and hardware architecture of the proposed debug scheme, starting with the motivation of this paper.

2.1 Motivation

If methods using MISR signature generation and a trace buffer, such as those in [7], [8], are applied to debug more than one core, the trace buffer size must be increased to obtain the desired debug quality. However, dedicating a trace buffer to every core causes additional hardware overhead, as illustrated in Fig. 1. For this reason, the trace buffer can be split to reduce its overall size. In this case, less depth is allocated to each trace buffer, and fewer MISR signatures can be stored in it; meanwhile, the compaction ratio of the MISR must be increased to observe the overall trace cycles. Therefore, the debug time increases because more debug sessions are required. Alternatively, several cores can share the trace buffer to reduce the hardware overhead caused by its increased size. However, routing effort is then required to connect the shared trace buffer to cores placed at different locations in the SoC. To overcome these constraints, a dedicated interconnection is required to share the trace buffer.

In this paper, the TAM used in the manufacturing test is reused for debugging. The TAM is no longer needed after the test step of the SoC. Therefore, the TAM can be used for debugging, which reduces the routing effort because the cores are already connected to the TAM before the debug step. By using the TAM, the trace buffer can be shared by the cores. Nevertheless, the debug time unavoidably increases as the trace buffer size is reduced by sharing. For this reason, this paper introduces a debug scheme to share the trace buffer effectively. The proposed debug scheme includes the debug scheduling for the use of a shared trace buffer and the debug architecture to share the trace buffer by using the TAM.
Therefore, the unavoidable increment in debug time can be reduced while decreasing the size of the trace buffer.

2.2 Debug Framework Using Shared Trace Buffer

Fig. 2 shows an example of the debug scheduling of two cores. The conventional repeatable debug method is taken from [7], and the notation is provided in Table 1. In this example, more errors exist in CUD_2 than in CUD_1, and the two cores are debugged concurrently. Assuming that the observation window (O) is 32 and the overall trace buffer depth is reduced from 2M to M = 8 by applying either the split buffer or the proposed shared buffer, two debug cases can be explained.

First, the split buffer is applied as shown in Fig. 2a, and M is split into M_{1,j} = 4 and M_{2,j} = 4. To compact the trace cycles, SPS_{1,1} and SPS_{2,1} are 8 in DS_1. In each DS_j, the debug operations of CUD_1 and CUD_2 are performed simultaneously until M_{1,j} and M_{2,j} are filled with MISR signatures. After the first debug session, the MISR signatures are unloaded to an external debugger via a low-frequency protocol, such as the test access port (TAP) of IEEE 1149.1, and analyzed to detect the error-suspect signatures by comparing them with golden signatures. By doing so, two error-suspect signatures in CUD_1 and four error-suspect signatures in CUD_2 are identified after DS_1. If the MISR signatures are generated completely within the range of error-suspect cycles, the debug level L_{i,j} = l increases and SPS is reconfigured as SPS_{i,l+1} = SPS_{i,l} / M_{i,j}, where M_{i,j} is the depth allocated in the previous debug level. By doing so, SPS_1 and SPS_2 become 2. In DS_2, the signatures are generated within the error-suspect cycles until M_{1,2} and M_{2,2} are repeatedly filled with MISR signatures. In this case, M_{i,j} is fixed until the termination of the debug process because the split buffer is not shared by CUD_1 and CUD_2. In addition, TDS_j is determined by the maximum running cycle between CUD_1 and CUD_2. In the case of DS_2, TDS_2 is 16 because the runtime of CUD_1 is longer than that of CUD_2, which differs from the single-core debug process.

0018-9340 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

TABLE 1
Notation for Proposed Debug Scheme

Name        Representation
CUD_i       Core i under debug
DS_j        Debug session j
EC_{i,j}    Error-suspect cycle of CUD_i in DS_j
L_{i,j}     Debug level of CUD_i in DS_j
M_{i,j}     Allocated buffer depth for CUD_i in DS_j
SPS_{i,l}   Samples per MISR signature of CUD_i in L_{i,j} = l
TDS_j       Cycles for running DS_j

Fig. 1. Split and shared allocations of trace buffer in which the depth is divided into M_1 and M_2 to reduce the overall buffer size.

The key point of concurrent debug is that L_{i,j} = l of each CUD_i increases at a different pace, depending on the length of the error-suspect cycles in the previous DS_j. In general, the length of EC_{i,j} is estimated by analyzing the number of error-suspect signatures of CUD_i in L_{i,j} = l - 1. For example, the debug process reveals that EC_{i,j} of CUD_2 is longer than that of CUD_1. For this reason, the proposed debug scheme uses a variable M_{i,j} according to the length of EC_{i,j} in L_{i,j} = l - 1. In the proposed debug scheme, M_{i,j} is varied in each DS_j. The variation in M_{i,j} is possible because the proposed debug scheme uses the shared buffer.
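The variable allocation can be illustrated with a small simulation of one debug session on a shared buffer. This is only a sketch under stated assumptions: the function name `run_session`, the representation of error-suspect cycles as inclusive ranges, and the rule for the last free entry (the lower-numbered core wins) are hypothetical, and the ranges in the usage example are made up for illustration rather than read from Fig. 2.

```python
from typing import List, Tuple

# Inclusive (start_cycle, end_cycle) of one error-suspect range (assumed encoding).
Range = Tuple[int, int]


def run_session(ranges: List[List[Range]], sps: List[int], depth: int):
    """Simulate one debug session on a shared buffer with `depth` entries.

    ranges[i] holds the error-suspect ranges of core i and sps[i] its
    samples per signature. Returns (entries used per core, cycles run).
    """
    stored = [0] * len(ranges)
    start = min(lo for core in ranges for lo, _ in core)  # minimum trigger point
    end = max(hi for core in ranges for _, hi in core)
    cycle = start
    for cycle in range(start, end + 1):
        for core, core_ranges in enumerate(ranges):
            for lo, hi in core_ranges:
                # A core emits one signature after every `sps` samples in a range.
                emits = lo <= cycle <= hi and (cycle - lo + 1) % sps[core] == 0
                if emits and sum(stored) < depth:
                    stored[core] += 1
        if sum(stored) == depth:  # shared buffer is full: the session ends
            break
    return stored, cycle - start + 1


# Hypothetical ranges: core 0 is suspect in cycles 9..16, core 1 in cycles 1..16.
print(run_session([[(9, 16)], [(1, 16)]], sps=[2, 2], depth=8))  # ([2, 6], 12)
```

With a split buffer, each core in this hypothetical case would be limited to four entries regardless of how many signatures it has to store; with the shared buffer, the core with the longer error-suspect range simply ends up occupying more entries.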
In the proposed debug scheme, a larger M_{i,j} in DS_j is allocated to the CUD_i with more error-suspect signatures identified in L_{i,j} = l - 1. The overall debug process of the example is illustrated in Fig. 2b. In DS_1, the initial SPS_{1,l} and SPS_{2,l} are the same as in the split-buffer case of Fig. 2a. However, M_{i,j} is varied in DS_2, and M_{1,2} and M_{2,2} are 2 and 6, respectively, because the number of signatures of CUD_2 to be stored in the shared buffer is greater than that of CUD_1. In this case, TDS_2 is 12 because the runtimes of CUD_1 and CUD_2 are equal. In contrast, M_{1,3} and M_{2,3} are the same because the error-suspect cycles of CUD_1 and CUD_2 are equal from the trigger point (trace cycle = 13) of DS_3 until the shared buffer is filled with the generated signatures. After DS_6, the debug operation of CUD_1 is terminated, and M is then fully occupied by CUD_2. The overall debug process can reduce the number of DS_j and the length of TDS_j simultaneously. To understand and apply the proposed debug scheme, the algorithm of the debug process scheduling is given as follows:

Algorithm 1. Overall Debug Process Scheduling
Input: O, M, M_{i,j}, SPS_{i,l}
Output: n(DS_j), ΣTDS_j
 1  Set O as EC_{i,1}; SPS_{i,1} = O / M_{i,1}, DS_{j=1}, l = 1, j = 1;
 2  while (OR(L_{i,j} ≠ End)) do            // L_{i,j}
 3      while (M ≠ full) do                 // DS_j
 4          Scan the EC_{i,j} starting with the min(trigger point of CUD_i);
 5          Compact the trace cycle of CUD_i with SPS_{i,l};
 6          Store the signatures into the shared buffer;
 7      end
 8      Unload and analyze the signatures;
 9      Update DS_j, n(DS_j), ΣTDS_j, EC_{i,j+1}, M_{i,j};
10      if (DS_j does not remain in L_{i,j} = l)
11          if (SPS_{i,l} > M_{i,1}) SPS_{i,l+1} = SPS_{i,l} / M_{i,1};
12          else SPS_{i,l+1} = 1;
13          end
14          l++;
15      end
16      j++;
17  end
18  return n(DS_j), ΣTDS_j

The algorithm determines the number of debug sessions, n(DS_j), and the sum of cycles for running DS_j, ΣTDS_j, during the overall debug process. For the initial condition of the debug process, SPS_{i,1} is set to O / M_{i,1} (line 1).
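The way the compaction ratio shrinks from level to level (lines 1, 11, and 12 of Algorithm 1) can be sketched as follows. The function name `sps_schedule` is hypothetical, and the sketch assumes that O and the allocated depth are powers of two so that integer division is exact.

```python
def sps_schedule(window: int, depth: int) -> list:
    """Return the samples-per-signature value for each debug level.

    window is the observation window O and depth is the initially
    allocated depth M_{i,1}. Follows lines 1, 11 and 12 of Algorithm 1.
    """
    if depth < 2:
        raise ValueError("depth must be at least 2 for the level to converge")
    sps = max(1, window // depth)          # line 1: SPS_{i,1} = O / M_{i,1}
    levels = [sps]
    while sps > 1:
        # lines 11-12: divide by the depth until raw cycles are captured
        sps = sps // depth if sps > depth else 1
        levels.append(sps)
    return levels


# Example from the text: O = 32 and M_{i,1} = 4 give SPS = 8, then 2, then 1.
print(sps_schedule(32, 4))  # [8, 2, 1]
```

The final value of 1 corresponds to the last debug level, at which raw trace cycles are stored instead of compacted signatures.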
In the proposed debug scheme, DS_j is performed while scanning EC_{i,j}, starting with the minimum trigger point between the cores, until the shared buffer is filled with M signatures (lines 3-7). For this reason, the initial EC_{i,j} is set to O (line 1). M_{i,j} is determined according to the length of EC_{i,j} scanned by CUD_i. When DS_j is completed, the signatures are unloaded and analyzed by the external debug software via the TAP. Subsequently, DS_j, n(DS_j), ΣTDS_j, EC_{i,j}, and M_{i,j} are updated to prepare the next session, DS_{j+1}. In this case, the trigger points are also updated with EC_{i,j}. When all DS_j in L_{i,j} = l have been performed, SPS_{i,l+1} is reconfigured until the final debug level is reached (lines 10-15). At the final debug level, SPS_i is 1, and the raw trace cycles are captured into the shared buffer. This debug process is performed repeatedly until all CUD_i signal the end of debugging (line 2). If one of the cores signals the end of debugging, the other core can use the overall shared buffer depth (M_{i,j} = M).

Fig. 2. Example of debug scheduling when two cores operate concurrently. (a) A split trace buffer is used for debug. (b) A shared buffer and variable depth M_{i,j} are applied in DS_j.

2.3 TAM Reusing for Trace Buffer Sharing

The proposed debug scheme utilizes the TAM used in the manufacturing test to enable buffer sharing. By using the TAM to transfer the debug data (the signatures), less routing effort is required for debugging. In the proposed architecture, the shared buffer is located at the end of the TAM to collect the debug data from the connected CUD_i. Each core is wrapped by a core wrapper, which comprises a core test wrapper and a debug module. The core wrapper of the proposed debug hardware is compatible with the IEEE standard [9]. The shared buffer is also wrapped by a buffer wrapper. This buffer wrapper is dedicated to the sharing of M by CUD_i.

Fig. 3. Debug architecture reusing TAM for embedded cores.

In Fig. 3, the two abovementioned cores are allocated on the TAM. If the functional output width of each core is assumed to equal the trace buffer width (W), the width of the wrapper parallel output (WPO) is W in the debug step.
In this case, the required TAM width is 2W to transfer the debug data concurrently without conflict, and the width of each TAM channel for CUD_i is W. If the TAM width used in the test step is k < 2W, however, additional TAM wires are required for the debug operation. In general, the TAM width is constrained by the number of test pins. However, additional test pins are not required in the proposed debug scheme. Therefore, only the additional TAM wires for the internal interconnection among the wrappers need to be prepared in the test step. If the TAM wires are insufficient, debugging can be performed over a single TAM channel by transferring the debug data in a pipelined manner.

To set the debug mode in the debug step, specific debug instructions are transferred via the serial test port from the test data input (TDI) of the TAP to the wrapper instruction register (WIR). The debug instructions that configure all CUD_i, such as SPS_{i,l}, the trigger point, the debug mode, and M_{i,j}, are shifted in via TDI. Similarly, the debug data stored in the shared buffer can be shifted out via the test data output (TDO) of the TAP. All wrappers are serially concatenated through the wrapper serial input (WSI) and wrapper serial output (WSO). These shift operations for debug configuration are conducted before DS_j. Shifting out the stored debug data takes considerably longer than shifting the debug instructions. Therefore, the debug instructions and the stored debug data can be transferred simultaneously via TDI and TDO, respectively.

Fig. 4. Core wrapper design for debug support.

Fig. 4 illustrates the proposed wrapper to support the debug process. When the core wrapper operates in debug mode, the debug mode signal selects the transfer of debug data to the TAM by controlling the multiplexers. The mode signal also routes the on-chip debug clock to the TAM. In this case, the mode signal is generated by decoding the debug mode instruction in the WIR.
If the core wrapper is set to debug mode, the debug instructions are shifted in via WSI; the instructions are then stored in the debug configuration module. In the test step, the functional outputs from the primary output (PO) ports are stored in the wrapper boundary register (WBR) and shifted out via the TAM. In the debug step, however, the functional outputs from the PO ports are compacted by the MISR in the debug module, and the MISR then generates the signature with compaction ratio SPS_{i,l}. The event trigger register stores the trigger points of EC_{i,j}. The overall debug operations are controlled by a finite state machine (FSM). The FSM receives SPS_{i,l} from the debug configuration module and triggers the output of the MISR at the given debug cycle. If the width of the WPO in the test step is less than W (solid line), additional wires (dotted line) are also required, as in the case of the TAM in Fig. 3. When the core wrapper is not used for debugging, the WPO is set to bypass mode. Bypass mode is required not only in the test step but also in the debug step, in order to prevent interference from other core wrappers that share the same TAM wires while the debug data are transferred. In addition, interconnection fabric logic can optionally be added if the width of the PO is larger than W.

To receive the debug data of CUD_i transferred via the TAM, a dedicated wrapper for the shared buffer is also designed in the proposed debug scheme, as shown in Fig. 5. Because the buffer wrapper is implemented in the design step, it should be turned off in the manufacturing test step. For this reason, the debug mode is set from the WIR to debug CUD_i; otherwise, the test data are passed via the bypass register in test mode. As with the core wrapper, the debug instructions are transferred via WSI to set the buffer wrapper to debug mode. The debug instructions consist of a capture point and M_{i,j}.
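The MISR compaction performed in the debug module, together with the comparison against golden signatures described in Section 2.2, can be sketched in software. This is a minimal model under stated assumptions: the feedback polynomial `0x04C11DB7` is an arbitrary choice (the paper does not specify one), the register is assumed to be cleared after each signature, and the function name `misr_signatures` is hypothetical.

```python
def misr_signatures(samples, sps, width=32, poly=0x04C11DB7):
    """Compact a stream of `width`-bit words into one signature per `sps` samples.

    Models a MISR as a shift register with XOR feedback whose parallel
    inputs are XORed into the state on every cycle.
    """
    mask = (1 << width) - 1
    state, signatures = 0, []
    for count, word in enumerate(samples, start=1):
        feedback = poly if (state >> (width - 1)) & 1 else 0
        state = ((state << 1) & mask) ^ feedback ^ (word & mask)
        if count % sps == 0:
            signatures.append(state)  # signature is sent toward the trace buffer
            state = 0                 # assumption: register cleared per signature
    return signatures


# Golden run versus a run with one corrupted output word (sample index 10).
golden = misr_signatures(list(range(32)), sps=8)
faulty_trace = list(range(32))
faulty_trace[10] ^= 1
faulty = misr_signatures(faulty_trace, sps=8)
suspects = [i for i, (g, f) in enumerate(zip(golden, faulty)) if g != f]
print(suspects)  # [1]: only the signature covering samples 8..15 mismatches
```

The mismatching signature index identifies the error-suspect cycles that are rescanned with a smaller SPS in the next debug level.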
The debug data transferred from CUD_i do not reach the buffer wrapper in one clock cycle when the TAM is used for debugging. In general, several temporary registers are implemented in the TAM to prevent data-transfer delay from TAM IN to TAM OUT. Therefore, the number of cycles required to transfer the debug data between the core wrapper and the buffer wrapper is stored in the capture trigger register. In addition, M_{i,j} is stored in the depth register to determine the allocated depth of CUD_i. The address generator generates the address of the shared buffer within the range of M_{i,j}. The overall operation of the buffer wrapper is also controlled by the FSM.

Fig. 5. Buffer wrapper design for the shared trace buffer.

TABLE 2
Debug Time Increment Ratio (Buffer Depth Reduction: 2M → M)

M     Error rate (%)   Split buffer   Shared buffer (proposed)
64    0.0057           2.2506         1.7076
      0.0114           1.7757         1.4653
      0.0171           1.4442         1.2731
      0.0229           1.2393         1.1418
      0.0286           1.1171         1.0588
128   0.0057           0.3117         0.0788
      0.0114           0.4531         0.2254
      0.0171           0.4037         0.2209
      0.0229           0.3378         0.1997
      0.0286           0.2154         0.1175
256   0.0057           0.0349         0.2512
      0.0114           0.0232         0.1312
      0.0171           0.2585         0.0673
      0.0229           0.1916         0.0082
      0.0286           0.1350         0.0023

3 EXPERIMENTAL RESULTS

This section discusses the experimental results in terms of debug time and hardware. For the experiments, the overall trace cycle number (N) is set to 2^20. The probability of error occurrence follows a Gaussian distribution (μ = N/2, σ = 2^8). In addition, the error rate is computed as the total number of errors of CUD_i divided by N. This experiment assumes that the error rate is small in electrical environments but may increase in first silicon. The trace cycles are sampled within the range of the observation window O = 2^10 (sampling cycles: N/2 - O/2 + 1 to N/2 + O/2). The experimental results are presented for two cores: an Ethernet controller [10] and an AC97 controller [11], both with a data output width of 32 bits.
To implement the proposed debug architecture, the debug module and buffer wrapper, designed as a Verilog RTL model, are synthesized using a 130 nm application-specific integrated circuit (ASIC) standard cell library.

TABLE 3
Debug Time Comparison in Variations of Proportion of the Number of Errors and Error Rate (M = 128)

                                             Debug time (M cycles)
Buffer configuration       Error rate (%)   b = 2      b = 4
Split buffer               0.0057           2929.50    3257.52
                           0.0114           4241.09    4576.64
                           0.0171           4936.63    5192.02
                           0.0229           5310.70    5437.29
                           0.0286           5498.26    5599.87
Shared buffer (proposed)   0.0057           2409.35    2308.48
                           0.0114           3576.56    3341.67
                           0.0171           4293.87    3951.54
                           0.0229           4762.21    4326.51
                           0.0286           5055.59    4624.08

TABLE 4
Hardware Area Overhead Comparison

       Hardware area (2-input NAND equivalents)
M      Core test wrapper   Proposed debug scheme
64     14,648              5,086
128    14,648              5,251
256    14,648              5,431

According to [7], the communication time to transfer the debug data via the TAP and the on-chip sampling time for running the debug sessions are used to calculate the debug execution time. The overall debug time introduced in [7] is applied to the experimental results as:

$$T_M = n(DS_j) \times M \times W \times \frac{1}{f_{TAP}} + \sum TDS_j \times \frac{1}{f_{CUD}} \qquad (1)$$

The overall debug time is calculated for each value of M and is denoted by T_M. During the overall debug process, n(DS_j) and ΣTDS_j are computed as shown in Algorithm 1. The communication time is n(DS_j) × M × W × 1/f_TAP if the operational frequency of the TAP is f_TAP. Similarly, the on-chip sampling time is ΣTDS_j × 1/f_CUD if the operational frequency of the CUD is f_CUD. In this paper, f_CUD is 10 times f_TAP.

To estimate the unavoidable time increment, this paper assumes that the depth of the trace buffer is reduced from 2M to M. The estimated results are shown in Table 2. The time increment is calculated as (T_M - T_2M) / T_2M, where T_M is the debug time when the trace buffer is split or shared. In contrast, T_2M is estimated from the debug condition in which each core has a dedicated trace buffer of depth M.
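Equation (1) and the increment ratio can be evaluated with a few lines of code. The following is a sketch only: the function names are hypothetical, and the session count, cycle sum, and clock frequencies in the usage example are placeholder values chosen for illustration, not figures from the experiments (only the 10:1 ratio between f_CUD and f_TAP follows the text).

```python
def debug_time(n_sessions, depth, width, sum_tds, f_tap, f_cud):
    """Evaluate Eq. (1): communication time plus on-chip sampling time.

    n_sessions is n(DS_j), depth is M, width is W, and sum_tds is the
    sum of TDS_j over all sessions. Frequencies are in Hz, result in seconds.
    """
    communication = n_sessions * depth * width / f_tap  # unload via the TAP
    sampling = sum_tds / f_cud                           # run the debug sessions
    return communication + sampling


def increment_ratio(t_m, t_2m):
    """Debug time increment when the depth is reduced from 2M to M."""
    return (t_m - t_2m) / t_2m


# Placeholder inputs: 10 sessions, M = 128, W = 32, 50,000 sampled cycles,
# f_TAP = 10 MHz and f_CUD = 100 MHz (ten times f_TAP, as assumed in the text).
t_m = debug_time(10, 128, 32, 50_000, f_tap=10e6, f_cud=100e6)
print(round(t_m, 6))  # 0.004596
```

With these placeholder values the communication term (about 4.1 ms) dominates the sampling term (0.5 ms), which shows why reducing the number of debug sessions matters when the TAP is slow.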
For T_2M, each CUD_i is debugged separately. From the estimated results, the proposed scheme performs better than the split buffer. When M is 64, more DS_j are required to debug CUD_i. If a sufficient M is provided to debug CUD_i, the increment ratio is low, even though the trace buffer size is reduced by half. The debug time is even reduced when the depth is reduced from 512 to 256. As the error rate increases, the performance generally improves because the T_2M debug case also requires more debug time.

If the proportion of the number of errors varies because the error rates of the cores differ due to electrical errors [6], the experimental results are as shown in Table 3, in which the proportion of the number of errors between the CUD_i is set to CUD_1 : CUD_2 = 1 : b. If the error distribution is biased toward one of the cores, the proposed debug scheme shows better results than the split buffer scheme. As mentioned in Section 2.2, if fewer errors exist in a core, the range of EC_{i,j} is small, and the debug process is terminated early. After one CUD_i finishes the debug process, the other CUD_i occupies the entire depth M for debugging. For this reason, if b increases, the debug process can be accelerated in the proposed debug scheme. However, if the split buffer is used, M_{i,j} remains fixed, even after one CUD_i finishes the debug process. Therefore, more DS_j occur in the other CUD_i, which has a higher number of errors. This causes the debug time to increase as b increases.

Table 4 compares the hardware overhead of the proposed debug scheme, in terms of two-input NAND gates, for different M. The results do not account for the trace buffer overhead. The hardware overhead assumes that both cores, each with a 32-bit data width, are debugged simultaneously. The core test wrapper is unaffected by M because W of the core is maintained regardless of M. In contrast, the proposed debug scheme is affected by M.
In the buffer wrapper, the size of the depth register varies depending on M. In addition, the size of the capture trigger register in the buffer wrapper depends on the number of cores connected to the TAM. The hardware overhead of the proposed debug scheme accounts for a large portion compared with the core test wrapper in Table 4. However, the hardware overhead of the proposed debug scheme is still less than the size of the trace buffer; similarly, the core test wrapper occupies a small portion of the hardware area of the embedded core. Additionally, the debug cost will increase if the proposed debug scheme is scaled to an SoC with tens of cores. The silicon area will be enlarged when every core requires additional TAM wires. If the silicon area is insufficient, however, the proposed debug scheme can debug multiple cores by using a single TAM channel. In this case, the debug time will increase because the number of debug sessions increases.

4 CONCLUSION

In this paper, the TAM used in the manufacturing test is reused to share the trace buffer between cores in post-silicon debug. The proposed debug scheme reduces the unavoidable increment in debug time that results from the reduction in the trace buffer size.

ACKNOWLEDGMENTS

This research was supported by the MOTIE (Ministry of Trade, Industry & Energy) (10067813) and KSRC (Korea Semiconductor Research Consortium) support program for the development of the future semiconductor device.

REFERENCES

[1] V. Iyengar, K. Chakrabarty, and E. J. Marinissen, "Test wrapper and test access mechanism co-optimization for system-on-chip," in Proc. IEEE Int. Test Conf., 2001, pp. 1023-1032.
[2] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller, "A reconfigurable design-for-debug infrastructure for SoCs," in Proc. ACM/IEEE Design Autom. Conf., 2006, pp. 7-12.
[3] M. Abramovici, "In-system silicon validation and debug," IEEE Design Test Comput., vol. 25, no. 3, pp. 216-223, May/Jun. 2008.
[4] S. Sarangi, B. Greskamp, and J. Torrellas, "CADRE: Cycle-accurate deterministic replay for hardware debug," in Proc. IEEE Int. Conf. Dependable Syst. Netw., 2006, pp. 301-312.
[5] J.-S. Yang and N. Touba, "Improved trace buffer observation via selective data capture using 2-D compaction for post-silicon debug," IEEE Trans. VLSI Syst., vol. 21, no. 2, pp. 320-328, Feb. 2013.
[6] H. Oh, I. Choi, and S. Kang, "DRAM-based error detection method to reduce the post-silicon debug time for multiple identical cores," IEEE Trans. Comput., vol. 66, no. 9, pp. 1504-1516, Sep. 2017.
[7] E. A. Daoud and N. Nicolici, "On using lossy compression for repeatable experiments during silicon debug," IEEE Trans. Comput., vol. 60, no. 7, pp. 937-950, Jul. 2011.
[8] H. Oh, T. Han, I. Choi, and S. Kang, "An on-chip error detection method to reduce the post-silicon debug time," IEEE Trans. Comput., vol. 66, no. 1, pp. 38-44, Jan. 2017.
[9] E. J. Marinissen and Y. Zorian, "IEEE std. 1500 enables modular SoC testing," IEEE Design Test Comput., vol. 29, no. 1, pp. 8-17, Jan./Feb. 2009.
[10] Ethernet MAC 10/100 Mbps, Jan. 10, 2016. [Online]. Available: http://opencores.org/project,ethmac
[11] AC 97 controller IP core, Jul. 11, 2011. [Online]. Available: http://opencores.org/project,ac97