Parallelized Network-on-Chip-Reused Test Access Mechanism for Multiple Identical Cores

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 35, NO. 7, JULY 2016 1219

Taewoo Han, Inhyuk Choi, Hyunggoy Oh, and Sungho Kang, Senior Member, IEEE

Abstract: This paper proposes a new network-on-chip (NoC)-reused test access mechanism (TAM) for testing multiple identical cores. It can test multiple cores concurrently and identify faulty cores so that the chip can be derated by excluding them. In order to minimize the test time, the TAM utilizes the majority value of the test response data. All of the cores can thereby be tested in parallel, and the test costs (in both test pins and test time) are exactly the same as those for a single core. The hardware overhead is minimized by reusing the NoC infrastructure, and transfer-counters are designed to serve as the majority analyzer. The experimental results in this paper show that the proposed TAM can test multiple cores in the same time as a single core and with negligible hardware overhead.

Index Terms: Multicore, network-on-chip (NoC), parallel test, test access mechanism (TAM).

I. INTRODUCTION

A system-on-chip (SoC) design mainly consists of multiple IP cores, each of which contains an individual design block and its design. Thus, an SoC test implies a highly structured design-for-test (DFT) infrastructure to observe and control individual core test solutions [1]. The design of a communication infrastructure within such complex systems requires high performance and high quality levels while connecting an increasing number of cores. Such a communication architecture can cause severe on-chip synchronization errors, unpredictable delays, and high power consumption. The network-on-chip (NoC) has been proposed as a solution to overcome the limitations of bus-based and point-to-point communication architectures [2]. When the NoC is used as the interconnection fabric, the cores in the SoC can be tested using the NoC as the test access mechanism (TAM).
This NoC-reused TAM allows the use of existing functional interconnects, with reduced area, pin count, and test time costs [3]. The infrastructure of the NoC, which includes routers and interconnections, must be tested before the NoC is reused as a TAM [4]. Amory et al. [5] and Cota et al. [6] proposed DFT schemes to test all identical routers concurrently, and Xiang and Zhang [7] and Xiang [8] proposed schemes to test interconnections at a reduced cost. An NoC-reused TAM facilitates design reuse and localizes the DFT effort to access points and core wrappers such as the IEEE 1500 [9], and therefore reduces the impact of last-minute design changes. For heterogeneous cores, research on the optimization of a dedicated TAM [10], [11] and an NoC-reused TAM [12] demonstrated that pin-count-aware test schedule optimization can reduce test time for a given number of pins.

Manuscript received November 26, 2014; revised June 16, 2015; accepted September 2, 2015. Date of publication September 23, 2015; date of current version June 16, 2016. This work was supported by the National Research Foundation of Korea through the Korea Government (MSIP) under Grant 2015R1A2A1A13001751. This paper was recommended by Associate Editor J. L. Dworak. (Corresponding author: Sungho Kang.)
T. Han is with the SoC Development, System LSI, Samsung Inc., Gyounggi-do 445-701, Korea (e-mail: twhan@soc.yonsei.ac.kr).
I. Choi, H. Oh, and S. Kang are with the Computer Systems and Reliable SOC Laboratory, Department of Electrical and Electronic Engineering, Yonsei University, Seoul 120-749, Korea (e-mail: ihchoi@soc.yonsei.ac.kr; kyob508@soc.yonsei.ac.kr; shkang@yonsei.ac.kr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2015.2481872
Recently, modern microprocessor designs have evolved to include multiple identical cores [13], and an NoC helps to implement reconfigurable systems and topology reconfiguration for defect-tolerant NoC-based homogeneous multicore or many-core systems [14]. Building such a highly reliable system begins with an accurate test that identifies faulty cores so that the chip can be derated by excluding them. A pipeline-based TAM was proposed for parallel testing of multiple identical cores [15]. The NoC-reused parallel TAM (NRP-TAM) [16] adopts this pipeline-based test scheme for use as an NoC-reused TAM. However, the pipeline-based TAM requires additional test time when the primary core has a fault. The worst case is when the primary core fails repeatedly, thereby requiring N tests for N cores. In addition, if one chip needs additional test time, the other chips on the same wafer must also wait for the additional test. As a result, the overall test process is delayed, and a single fault in the primary core can have a large impact. A majority-based TAM [17] was proposed to overcome the limitations of the pipeline-based TAM; it can test all cores using the same test pins and test time as required for testing a single core, but it is designed as a dedicated TAM. The majority analyzer in the dedicated TAM is a combinational module, and it is hard to apply to NoC infrastructure, which contains sequential logic with routing buffers.

In this paper, a completely parallelized NoC-reused TAM for multiple identical cores is proposed. It is implemented using the majority-based TAM scheme, and a dedicated majority analyzer is designed for reusing the NoC infrastructure. With the majority-based TAM scheme, cores that produce test response (TR) data that differ from the majority value (MV) can be considered faulty.
The MV is then checked by the automated test equipment (ATE) to determine whether it matches the expected value; a mismatch indicates a fault. The proposed NoC-reused TAM targets the most common NoC architectures [18] and is flexible in its design, configuration, and application. The proposed TAM can be used to perform a complete core-level diagnosis, and the test process is completely parallelized to minimize the test costs.

0278-0070 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. PREVIOUS WORKS

A. NoC-Reused TAM

An NoC-reused TAM for testing multiple identical cores has been studied with the pipeline-based TAM scheme. If the pipeline-based TAM is applied to an NoC-reused TAM as it is, the bandwidth of the TAM is reduced by half, because the test pattern (TP) data and the TR data of a primary core are transferred in the same direction. Fig. 1(a) shows a simple diagram of the NoC-reused TAM in which the pipeline-based TAM is applied as it is. Cores connected to the routers are omitted in this figure. The width of a flit in the NoC is W. In order to transfer the TP and the primary core's TR in the same direction, the TP uses W/2 of the bandwidth and the primary TR uses the other W/2. In the pipeline-based TAM, the spare output channels can be used to directly observe the response of another primary core. This allows two sets of cores to be compared in parallel, each with its own observable core [15]. Because Primary1 reuses only half of the bandwidth (W/2) of the test output pins, the remaining bandwidth (W/2) of the test output pins can be used to construct the two tracks (for Primary2). This can reduce the test time stochastically (the probability that both primary cores have a fault is expected to be lower than the probability that a single primary core has a fault). However, it reuses only half of the bandwidth of the test input pins and half of the bidirectional links in the NoC. Therefore, it incurs a longer test time than a scheme that reuses the full bandwidth of the links in the NoC. In this paper, a pipeline-based TAM refers to the dedicated TAM, and Pipeline refers to the NoC-reused TAM that adopts the pipeline-based test scheme as it is.

Fig. 1. Simple diagram of the NoC-reused TAMs. (a) Pipeline (two tracks) [15]. (b) NRP-TAM [16].

The limited test bandwidth of Pipeline is overcome by the NRP-TAM [16]. The NRP-TAM is a parallel TAM specialized for multiple identical cores in an NoC-based system. In order to use the full bandwidth of the NoC interconnection as the width of the TAM, a new deterministic routing algorithm for transferring test data was designed. Fig. 1(b) represents this TAM. If a link is in use for transferring the TP, the primary's TR is transferred by bypassing that link. The NRP-TAM can test the homogeneous cores in an NoC efficiently, but it has the same drawback as the pipeline-based TAM: if the primary core has a fault, additional test time is required for testing the other cores. Applying the majority-based TAM to an NoC-reused TAM promises to overcome these limitations.

B. Majority-Based TAM

The majority-based TAM [17] has intuitive and clear fundamentals.
In a multicore system, multiple identical cores can be tested in parallel using broadcast TPs. If there are no faults, the TR data are expected to be the same for all of the cores. On the assumption that most cores will not have faults, the proposed TAM analyzes the TR data and finds the MV. A core that produces TR data different from the MV can then be considered faulty. The MV is in turn checked by the ATE to determine whether it matches the expected value; a mismatch indicates a fault.

When the MV is equal to the expected data, more than half of the cores are not faulty. If the TR data of a core differ from the MV, that core is recorded in the error registers as a faulty core. When the MV differs from the expected data, more than half of the cores are faulty. In this case, the TAM may produce a wrong test result: the nonfaulty cores would be recorded in the error registers instead, but this multicore chip would be discarded. If an exact diagnosis is required even when more than half of the cores have faults, the test process is repeated with the nonfaulty cores.

Consequently, with the majority-based TAM scheme, all of the cores can be tested simultaneously using the same test pins and test time as required for testing a single core. The test methodology with the proposed NoC-reused TAM is described in this paper using the scan test. Because the TAM is concerned only with transferring test data, it can be extended to scan, functional, and other tests for multiple identical cores.

III. PROPOSED NOC-REUSED TAM

A typical structure of an NoC system with the proposed NoC-reused TAM is shown in Fig. 2. Detailed descriptions of the TAM are presented in the following subsections.

Fig. 2. Architecture of the proposed NoC-reused TAM.

A. Architecture

Fig. 2 represents a typical structure of an NoC system with the NoC-reused TAM that adopts the majority-based TAM scheme.
It reuses the buffers at the inputs of the routers, and some multiplexers and comparators (XOR gates) are added; however, the comparators can be shared with the testing of the routers. In addition, a bitwise counter is designed for analyzing the MV. It is assumed that Router0 is both the source and the sink node. The external ATE sends the TP data to Router0, receives the MV of the cores from Router0, and confirms whether the response data are identical to the expected data. Router0 transfers the TP data from the ATE to Core0 and, at the same time, forwards them to Router1. One input buffer in Router1 is reused as the pipelining register for TP data. The red lines indicate the transmission of TP data, which are transferred from Router0 to Router1 and Router4 according to the proposed routing algorithm. The link between Router0 and Router1 is shared by the TP data and the majority counting data through time division. The majority counting data are the number of 1s in the TR data, used for analyzing the MV. Therefore, the green lines represent the transmission of TP and majority counting data. Each router has a majority analyzer, which uses bitwise transfer-counters to maximize the efficiency of its process. The blue lines indicate the transmission of the complete MV, which is transferred from Router5 to Router4 and from Router1 to Router0.
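The data flow described above (each router adds its core's TR bit to the running majority counting data, the last router compares the count with half the number of cores, and every router compares its own TR bit with the returned MV) can be illustrated in software. The following is a minimal behavioral sketch, not the RTL of the proposed TAM; the function names `majority_value` and `diagnose`, the list-based representation of one TR bit per core, and the `expected` argument standing in for the ATE comparison are assumptions made for illustration.

```python
from typing import List, Tuple


def majority_value(tr_bits: List[int]) -> int:
    """Return the MV of one TR bit position across all cores.

    Models the router chain: each router adds its core's TR bit to the
    majority counting data, and the last router compares the final count
    with half of the number of cores.
    """
    count = 0
    for bit in tr_bits:      # one step per router in the chain
        count += bit         # the count increases only when the TR bit is 1
    # The MV is 1 only when the count is larger than half of the cores.
    return 1 if 2 * count > len(tr_bits) else 0


def diagnose(tr_bits: List[int], expected: int) -> Tuple[int, List[int], bool]:
    """Return (MV, indices of cores flagged as faulty, MV matches expected)."""
    mv = majority_value(tr_bits)
    # Each router compares its own TR bit with the MV (error registers).
    flagged = [i for i, bit in enumerate(tr_bits) if bit != mv]
    # The last item models the ATE checking the MV against the expected value.
    return mv, flagged, mv == expected


if __name__ == "__main__":
    # Hypothetical data: 12 cores, core 5 produces a wrong response bit.
    responses = [1] * 12
    responses[5] = 0
    print(diagnose(responses, expected=1))
```

With the hypothetical data shown, the call returns an MV of 1, flags core 5, and reports that the MV matches the expected value. If the last item were False, more than half of the cores would be faulty and the flagged list would name the nonfaulty cores instead, which is the case Section II-B handles by discarding the chip or repeating the test.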

Fig. 3. Routing results of the proposed NoC-reused TAM.

Fig. 4. Architecture of majority analyzer.

All routers and cores transfer and receive the TP data, the majority counting data, and the MV in the same way. Each router then compares the TR data of its core with the MV. As a result, each router requires additional buffering to adjust the timing. This buffering reuses the input buffer between the core and the switch box.

B. Routing Algorithm

The proposed TAM uses shortest-path routing to transmit the TP data, the MV, and the majority counting data. It reuses the bidirectional links in the NoC to provide abundant TAM bandwidth. In order to transmit the TP data and the counting data for analyzing the MV simultaneously, the TAM exploits the difference between the speed of the scan shift cycle and the speed of the transmission cycle. Typically, the transmission clock in an NoC runs faster than 600 MHz, whereas the scan shift clock runs at about 15 MHz [18]. Therefore, the proposed TAM can transmit enough counting data per TR data. Fig. 3 shows a 4 × 3 NoC system and the routing produced by the proposed algorithm. Cores connected to the routers are omitted. TP data are transmitted from Router0 (the first router) to the other routers. One scan shift cycle after a router receives the TP data, its core outputs TR data and the router generates majority counting data. The majority counting data are transmitted to the next router. Router11 (the last router) receives all the majority counting data and generates the MV. The MV is transmitted from Router11 to the other routers. With this routing algorithm, each router receives the MV one scan shift cycle after its core outputs TR data. In order to compare the MV with the TR of the core, the router requires one buffer (detailed timing information is given in Fig. 5).
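The clock-ratio argument above can be checked with simple arithmetic. The sketch below computes how many transmission cycles fit into one scan shift cycle and compares that budget with the cycles needed for the counter bits plus the longest transmission path, following the relation that Section III-D states as (1). It assumes a rows × cols mesh whose longest shortest path is rows + cols − 2 hops, and a counter width equal to the smallest b with 2^b > N/2 (the three bits quoted for 12 cores in Section III-C). The function names are hypothetical.

```python
def cycle_budget(f_transmission_hz: float, f_scan_shift_hz: float) -> int:
    """Transmission cycles available within one scan shift cycle."""
    return int(f_transmission_hz // f_scan_shift_hz)


def counter_bits(num_cores: int) -> int:
    """Smallest b such that 2**b > num_cores / 2 (assumed counter width)."""
    return (num_cores // 2).bit_length()


def required_cycles(rows: int, cols: int) -> int:
    """Counter bits plus the longest shortest path in a rows x cols mesh.

    The path term rows + cols - 2 is an assumption about the mesh
    topology, not a value read from the paper's figures.
    """
    return counter_bits(rows * cols) + (rows + cols - 2)


def fits(rows: int, cols: int, f_tx_hz: float, f_scan_hz: float) -> bool:
    """True when the required cycles fit into one scan shift cycle."""
    return required_cycles(rows, cols) <= cycle_budget(f_tx_hz, f_scan_hz)


if __name__ == "__main__":
    print(cycle_budget(600e6, 15e6))      # 40 cycles per scan shift cycle
    print(required_cycles(4, 3))          # 3 counter bits + 5 hops = 8
    print(required_cycles(16, 16))        # 8 counter bits + 30 hops = 38
    print(fits(16, 16, 600e6, 15e6))      # True
```

Under these assumptions, a 16 × 16 mesh (256 cores) needs 38 of the 40 available cycles, which is consistent with the 256-core figure quoted in Section III-D.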
After all of the routers in an NoC are routed in the same way, the depth of the buffering is always one clock, regardless of the number of cores. Therefore, all cores in the NoC can be tested concurrently by the pipelined test data.

C. Majority Analyzer

Each router in the proposed NoC-reused TAM has a majority analyzer for analyzing the MV. It counts the number of 1s in the TR data of each core; bitwise transfer-counters are designed to maximize the number of testable cores, and the simplified architecture helps to reduce the operation time and hardware overhead. The transfer-counters are a newly designed counter architecture that counts and transfers the data simultaneously. Fig. 4 shows the architecture of the majority analyzer and how it uses transfer-counters. For 4 × 3 cores, three bits are required to detect that the number of 1s in the TR data is larger than half of the total number of cores (2^3 > 12/2). When the least significant bit of the majority counting data is transferred from the previous router, this bit is added to the TR data of the core by the adder. Thus, the majority counting data increase only when the bit of the TR data is 1. The result of the bitwise sum is transferred to the next router. The next bit of the majority counting data is then transferred from the previous router, and this bit is added to the preceding carry bit by the adder. This sum is transferred to the next router as well. After the same process takes place for the most significant bit, all of the majority counting data of the previous routers have been counted and transferred. At the last router, the calculated majority counting data are compared with half of the total number of cores, and the MV is 1 when the majority counting data are larger than this half.

Fig. 5. Timing diagram of the proposed NoC-reused TAM.

D. Timing Analysis

To represent the pipelined test data and the concurrent test process of the proposed NoC-reused TAM, a timing diagram of the test data in the TAM is shown in Fig. 5. TP data are transferred from Core0 to Core1, and from Core1 to Core2 (TP1 is the first TP and TP2 is the second TP). After one scan shift cycle, Core1 produces the TR data, and the majority counting data (mcnt) are calculated and transmitted according to the transmission cycle. The MV of all cores is analyzed in the same way, and Core1 receives the MV from the last core (Core3 in this case) after one more scan shift cycle. In order to compare the TR data and the MV, the TR data must be buffered for one scan shift cycle.

$$\frac{\text{Transmission clock}}{\text{scan shift clock}} \geq \log_2\left(\frac{\#\text{ of cores}}{2}\right) + 1 + \#\text{ of cores in row} - 2 + \#\text{ of cores in column} \tag{1}$$

The proposed TAM exploits the difference between the speed of the scan shift cycle and the speed of the transmission cycle. Equation (1) represents the relation between the speeds of the cycles and the number of testable cores. The left side indicates the usable clocks resulting from the different speeds of the transmission clock and the scan shift clock. The right side is composed of the bits for counting the MV and the longest transmission path. In order to transfer and count the majority counting data simultaneously, transfer-counters are used [if a router counted all bits of the majority counting data and then transferred them to the next router, the calculation on the right side of (1) would be a multiplication instead of an addition]. Given that the speed of the transmission cycle in an NoC is 600 MHz and the speed of the scan shift cycle is 15 MHz, the proposed TAM can test 256 cores concurrently. As a result, a large number of cores can be tested with the same test time as that of a single core.

IV. EXPERIMENTAL RESULTS

Several experiments were performed to verify the effectiveness of the proposed NoC-reused TAM. The experimental results include comparisons of the proposed TAM with the previous NoC-reused TAM that adopted the pipeline-based TAM scheme. These were implemented with a general NoC [19] and synthesized with the Synopsys 90-nm generic library [20] in order to analyze the proposed TAM in real NoC systems.

TABLE I. HARDWARE OVERHEAD OF THE PROPOSED TAM IN 4 × 3 NOC

Fig. 6. Expected test time of NoC-reused TAMs (N = 12 and G = 10).

TABLE II. HARDWARE OVERHEAD OF THE PROPOSED TAM IN 4 × 3 × 2 NOC

A. Test Time

One advantage of the majority-based test scheme is that it can test multiple identical cores in the same amount of time as one core. In Fig. 6, the expected test time of the proposed NoC-reused TAM is always 1T (T: time for testing one core). On the other hand, both the dedicated TAM and the NoC-reused TAM that adopt the pipeline-based test scheme require additional test time depending on the yield of the core. Giles et al. [15] analyzed the expected test time of the pipeline-based TAM.
The expected test time necessary to determine the pass/fail status of each core (up to the decision to exit according to the deration policy) may be calculated as a function of the per-core yield, Y. The proposed TAM and the NRP-TAM can reuse the full bandwidth of the bidirectional NoC links as the TAM, but Pipeline reuses a one-directional link for both the TP data and the TR data of the primary core. As the width of the TAM is halved, the test time tends to nearly double [16]. The rest of the output channels in Pipeline can be reused to form two tracks. Therefore, the experimental result for Pipeline in Fig. 6 is double the expected test time of the pipeline-based TAM with two tracks [9]. Fig. 6 represents the case in which at least ten of the 12 cores are good, but the NoC-reused TAMs have consistent expected test times in various cases [15], [17]. While the expected test time of the majority-based TAM is always 1T, the expected test time of the pipeline-based TAM increases rapidly with decreasing yield. Multiple tracks in the pipeline-based TAM can reduce the expected test time, but they are applicable only under particular conditions (asymmetric channels at the test inputs and test outputs) and, above all, they require more test time than a single track when the yield of the core is good.

B. Hardware Overhead

The NoC-reused TAMs were designed in RTL code and synthesized in order to compare their hardware sizes. Table I shows the hardware overhead of the NRP-TAM and the proposed TAM in a 4 × 3 (2-D) NoC, and Table II shows the case of a 4 × 3 × 2 (3-D) NoC. The hardware of the proposed TAM is attached to each router, so it can be extended to various numbers of routers. Therefore, Tables I and II show similar tendencies regardless of the number of routers or their dimensions. The hardware size of the proposed TAM with an NoC is expressed as the number of NAND gates.
Since the NoC-reused TAMs reuse the input buffers as pipelining registers and for additional buffering, the hardware overhead of the TAMs decreases as the number of buffers in the original NoC increases. Furthermore, the hardware overhead increases with a larger flit size in the NoC, which is the width of the TAM. The hardware overhead of the proposed NoC-reused TAM is less than 5% in the worst case of the experiments. The remaining components are similar, but the proposed TAM has more hardware overhead than the NRP-TAM because of the majority analyzers. However, considering that a modern multicore processor system contains far more than a million gates, the hardware overhead of the proposed TAM in a chip is negligible. Consequently, the proposed TAM can be implemented with minimal hardware overhead by reusing the NoC infrastructure, and it is the only NoC-reused TAM that can test multiple identical cores in an NoC using the same test time as required for testing a single core.

V. CONCLUSION

In this paper, a novel NoC-reused TAM for parallel testing of a multicore system is described. All of the cores can be tested simultaneously with the majority-based test strategy, and the test time is the same as that required for a single core. A majority analyzer with transfer-counters is designed, which uses the MV of the TR data to test multiple identical cores. The hardware overhead is minimized by reusing the infrastructure of an NoC. Experimental results show that the proposed NoC-reused TAM has a minimized test time with an abundant TAM width and negligible hardware overhead. The majority-based NoC-reused TAM is concerned only with the delivery of TR data, so it is compatible with, and can be improved by, existing DFT technologies.

REFERENCES

[1] International Technology Roadmap for Semiconductors (ITRS): 2013 Edition, Semicond. Ind. Assoc., Washington, DC, USA, 2013, pp. 26–30.
[2] R. Marculescu, U. Y. Ogras, L.-S. Peh, N. E. Jerger, and Y. Hoskote, "Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 1, pp. 3–21, Jan. 2009.
[3] E. Cota, A. M. Amory, and M. S. Lubaszewski, Reliability, Availability and Serviceability of Networks-on-Chip. New York, NY, USA: Springer, 2012.
[4] R. Nourmandi-Pour and N. Mousavian, "A fully parallel BIST-based method to test the crosstalk defects on the inter-switch links in NOC," Microelectron. J., vol. 44, pp. 248–257, Mar. 2013.
[5] A. M. Amory, E. Briao, E. Cota, M. Lubaszewski, and F. G. Moraes, "A scalable test strategy for network-on-chip routers," in Proc. IEEE Int. Test Conf., Austin, TX, USA, 2005, Art. ID 25.1.
[6] E. Cota et al., "A high fault coverage approach for the test of data, control, and handshake interconnects in mesh networks-on-chip," IEEE Trans. Comput., vol. 57, no. 9, pp. 1202–1215, Sep. 2008.
[7] D. Xiang and Y. Zhang, "Cost-effective power-aware core testing in NoCs based on a new unicast-based multicast scheme," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 1, pp. 135–147, Jan. 2011.
[8] D. Xiang, "A cost-effective scheme for network-on-chip router and interconnect testing," in Proc. IEEE Asian Test Symp., Jiaoxi Township, Taiwan, 2013, pp. 207–212.
[9] E. J. Marinissen and Y. Zorian, "IEEE Std 1500 enables modular SoC testing," IEEE Des. Test. Comput., vol. 26, no. 1, pp. 8–17, Jan./Feb. 2009.
[10] V. Iyengar, K. Chakrabarty, and E. J. Marinissen, "Test wrapper and test access mechanism co-optimization for system-on-chip," J. Electron. Test., vol. 18, no. 2, pp. 213–230, 2002.
[11] B. Noia, K. Chakrabarty, and E. J. Marinissen, "Optimization methods for post-bond testing of 3D stacked ICs," J. Electron. Test. Theory Appl., vol. 28, pp. 103–120, Feb. 2012.
[12] R. Michael and K. Chakrabarty, "Optimization of test pin-count, test scheduling, and test access for NoC-based multicore SoCs," IEEE Trans. Comput., vol. 63, no. 3, pp. 691–702, Mar. 2014.
[13] I. Parulkar, T. Ziaja, R. Pendurkar, A. D'Souza, and A. Majumdar, "A scalable, low cost design-for-test architecture for UltraSPARC chip multi-processors," in Proc. IEEE Int. Test Conf., Baltimore, MD, USA, 2002, pp. 726–735.
[14] L. Zhang, Y. Han, Q. Xu, X. Li, and H. Li, "On topology reconfiguration for defect-tolerant NoC-based homogeneous manycore systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 9, pp. 1173–1186, Sep. 2009.
[15] G. Giles, J. Wang, A. Sehgal, K. J. Balakrishnan, and J. Wingfield, "Test access mechanism for multiple identical cores," in Proc. IEEE Int. Test Conf., Santa Clara, CA, USA, 2009, pp. 1–10.
[16] T. Han, I. Choi, H. Oh, and S. Kang, "A scalable and parallel test access strategy for NoC-based multicore system," in Proc. IEEE Asian Test Symp., Hangzhou, China, 2014, pp. 81–86.
[17] T. Han, I. Choi, and S. Kang, "Majority-based test access mechanism for parallel testing of multiple identical cores," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 8, pp. 1439–1447, Aug. 2015.
[18] E. Salminen, A. Kulmala, and T. D. Hämäläinen, "Survey of network-on-chip proposals," White Paper, OCP-IP, 2008, pp. 1–13.
[19] (Apr. 8, 2014). Efficient Microarchitecture for Network-on-Chip Routers. [Online]. Available: http://purl.stanford.edu/wr368td5072
[20] (Mar. 26, 2012). 90 nm Generic Library. [Online]. Available: http://www.synopsys.com/community/universityprogram