2010 25th International Symposium on Defect and Fault Tolerance in VLSI Systems

A Strategy for Interconnect Testing in Stacked Mesh Network-on-Chip

Min-Ju Chan and Chun-Lung Hsu
Department of Electrical Engineering, National Dong Hwa University, 1, Sec. 2, Da Hsueh Rd., Shou-Feng, Hualien, 974, Taiwan, R.O.C.
cch@mail.ndhu.edu.tw

Abstract

3D IC processes have become a trend in recent years, but this progress in process technology brings problems of its own. In a 3D NoC architecture, the 3D IC process makes placement and routing more complex, and the more complex architecture in turn increases the number of faults. A methodology to address this problem is therefore needed. At present, test approaches for NoC interconnect faults are based on the 2D architecture, and 3D simulation tools are still immature, so a feasible method for testing the 3D architecture is required. In this paper, we consider how to apply a mature interconnect test approach for the 2D NoC architecture to the 3D NoC architecture. This makes it possible to increase product yield through the replacement of defective chips.

Index Terms: built-in self-test (BIST), interconnect testing, network-on-chip (NoC).

1. Introduction

According to Moore's Law, IC process technology doubles its integration capacity roughly every 18 months. This means that an IC of the same size can embed more blocks with different functions. Network-on-Chip (NoC) architectures have emerged as a revolutionary methodology for integrating a very large number of intellectual property (IP) blocks on a single die and for building systems with intensive parallel communication requirements [1]. A NoC can increase the performance of a System-on-Chip (SoC), and it outperforms mainstream bus architectures. At present, the conventional 2D IC offers limited choices for floorplanning.
Consequently, it restricts the performance improvement achievable with NoC architectures. According to the International Technology Roadmap for Semiconductors (ITRS), new interconnect paradigms are needed in the longer term [2]. Recent work has proposed a revolutionary approach to these problems: the 3D IC. One major advantage of the 3D IC paradigm is that it allows dissimilar technologies, e.g., memory, analog, and MEMS, to be integrated in a single die. 3D ICs improve microprocessor performance by forming a processor and memory stack, and they offer better performance, functionality, and packaging density than traditional 2D ICs.

1550-5774/10 $26.00 © 2010 IEEE. DOI 10.1109/DFT.2010.21

Current NoCs are implemented predominantly following 2D architectures. However, the emergence of 3D ICs will present a fundamental change. Compared with a 2D NoC, a 3D NoC has shorter transmission distances and more transmission channels in its communication infrastructure, which gives it better throughput, latency, energy dissipation, and wiring area overhead. Because the distance between layers in a 3D NoC is very short, more blocks can be embedded without a significant change in die size. However, the dramatic increase in the number of blocks and interconnects makes the whole structure more complex, which raises the fault probability and lowers chip yield. Therefore, a methodology for
detecting faults in the 3D NoC is urgently needed. However, recent studies are almost all based on the 2D NoC [3], [4], [5]. This paper therefore studies how to apply a mature 2D NoC test strategy to the 3D NoC.

2. Test consideration

2.1 NoC testing approach

The test of a NoC-based SoC for manufacturing defects is usually divided into two parts: the test of the cores and the test of the communication infrastructure [3]. The test of the cores is usually based on reusing the NoC as the test access mechanism (TAM), which avoids the burden of adding extra hardware for a dedicated test bus. Recent works have addressed the test of the NoC infrastructure, including routers [6]-[9] and interconnect channels [10], [11].

Interconnect testing in NoC-based chips has typically considered only faults in wires within a single channel connecting two adjacent routers. However, this assumption is not reasonable in large NoC layouts. Considering realistic NoC layouts [12], the placement and routing of routers and channels are actually prone to even simpler faults, such as shorts between wires connecting the core to the network and between wires of distinct network channels. Grecu et al. [10] propose a built-in self-test (BIST) methodology for testing the channels of the communication platform; it targets crosstalk faults.

The problem of detecting short faults in interconnects has been widely studied [3], [4], [5]. Most of these works aim at detecting faults in the interconnects between two adjacent routers. Some studies propose inserting the BIST block in the router. However, with the BIST block in the router, some faults between the core and the router go undetected. Other research therefore embeds the BIST block in the network interface (NI) of the core, which makes it possible to test all interconnects. The test strategy is based on two BIST blocks: the test data generator (TDG) and the test response analyzer (TRA).
The TDG generates the test vectors and transmits them through the NoC. The TRA receives the test vectors from the NoC and checks them for faults.

2.2 Fault model definition

When considering short faults in the NoC, it is important to define the region where the faults may occur, i.e., which links are more likely to be short-circuited. Assuming that any set of wires can be faulty might not be realistic, however: the number of faults grows combinatorially with the number of wires considered, as shown in (1), where n is the number of independent wires and k is the size of each fault group [2].

$$C(n, k) = \frac{n!}{k!\,(n - k)!} \tag{1}$$

In the worst-case scenario, we suppose that a short fault can occur between any two interconnects. Short faults are of two kinds, AND-short and OR-short, as shown in Fig. 1. They cause an information packet to change its path or an information flit to carry erroneous data. In this work, considering test difficulty and structural scalability, we take a 2 × 2 × 2 Stacked Mesh NoC as the minimum search space for the test structure. It has 56 links, and each channel is 8 bits wide.
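The fault model above can be illustrated with a short C sketch. It is not the authors' simulator: the function names `choose`, `and_short`, and `or_short` are hypothetical, and the sketch assumes that a short between two 8-bit channels acts bit for bit (wire i of one channel shorted to wire i of the other), which the paper does not state. `choose` evaluates (1), and the two short functions return the value that both shorted channels would carry.

```c
#include <assert.h>
#include <stdint.h>

/* Equation (1): number of fault groups of size k among n independent wires.
 * Multiplying before dividing keeps every intermediate value an exact
 * binomial coefficient. Overflow is not checked (sketch only). */
uint64_t choose(unsigned n, unsigned k)
{
    assert(k <= n);
    if (k > n - k)
        k = n - k;                      /* C(n, k) == C(n, n - k) */
    uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i)
        result = result * (n - k + i) / i;
    return result;
}

/* AND-short (assumed bitwise model): both shorted channels carry the
 * logical AND of the two driven values. */
uint8_t and_short(uint8_t a, uint8_t b)
{
    return (uint8_t)(a & b);
}

/* OR-short (assumed bitwise model): both shorted channels carry the
 * logical OR of the two driven values. */
uint8_t or_short(uint8_t a, uint8_t b)
{
    return (uint8_t)(a | b);
}
```

Under these assumptions, `choose(8, 2)` gives the 28 possible wire pairs inside a single 8-bit channel, `and_short(0xD9, 0x0F)` yields `0x09`, and `or_short(0xD9, 0x0F)` yields `0xDF`.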
Figure 1. Fault model.

3. Proposed methodology

At present, mature NoC technology refers to the 2D NoC, whereas simulation tools for 3D NoC architectures are still at a preliminary stage and no complete test plan exists. Applying a test method for 2D NoC interconnects to the 3D NoC is therefore the easier and more feasible route. A 3D space is composed of three kinds of 2D plane, as shown in Fig. 2. Viewing the Stacked Mesh NoC in the same way, we find that it can likewise be partitioned into three kinds of 2D structure. Applying this observation to the 2 × 2 × 2 Stacked Mesh NoC gives the result in Fig. 3. In other words, the test objective can be achieved by partitioning the 3D structure into 2D planes. As Fig. 3 shows, testing any two kinds of plane has the same effect as testing the complete 2 × 2 × 2 Stacked Mesh NoC. In this work, we choose the Y-Z plane and the Z-X plane, because these two kinds of plane have the same test structure.

Figure 2. A 3D space is composed of three kinds of 2D plane.

Figure 3. A 2 × 2 × 2 Stacked Mesh NoC partitioned into three kinds of 2D structure.
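The claim that any two kinds of plane suffice can be checked with a small C sketch. It assumes a plain 2 × 2 × 2 grid with one link between each pair of adjacent routers, so it does not reproduce the Stacked Mesh's own link count (the 56 links quoted in Section 2.2). All names (`links_covered`, `PLANE_YZ`, and so on) are hypothetical. A link that runs along one axis lies in the two plane kinds that span that axis, so any two plane kinds together should reach every link.

```c
#include <assert.h>

enum { DIM = 2 };                           /* routers per axis (assumed grid) */
enum { AXIS_X, AXIS_Y, AXIS_Z, NUM_AXES };
enum { PLANE_XY = 1 << 0, PLANE_YZ = 1 << 1, PLANE_ZX = 1 << 2 };

/* Plane kinds that contain a link running along the given axis. */
static unsigned planes_containing(int axis)
{
    switch (axis) {
    case AXIS_X: return PLANE_XY | PLANE_ZX;
    case AXIS_Y: return PLANE_XY | PLANE_YZ;
    default:     return PLANE_YZ | PLANE_ZX;    /* AXIS_Z */
    }
}

/* Count the router-to-router links that lie in at least one of the
 * selected plane kinds, by visiting every router and looking at its
 * neighbour in the +x, +y, and +z directions. */
int links_covered(unsigned plane_mask)
{
    assert(plane_mask <= (unsigned)(PLANE_XY | PLANE_YZ | PLANE_ZX));
    int covered = 0;
    for (int z = 0; z < DIM; ++z)
        for (int y = 0; y < DIM; ++y)
            for (int x = 0; x < DIM; ++x) {
                const int coord[NUM_AXES] = { x, y, z };
                for (int axis = 0; axis < NUM_AXES; ++axis) {
                    if (coord[axis] + 1 >= DIM)
                        continue;           /* no neighbour in + direction */
                    if (planes_containing(axis) & plane_mask)
                        ++covered;
                }
            }
    return covered;
}
```

In this simplified grid, `links_covered(PLANE_YZ | PLANE_ZX)` returns 12, the same as selecting all three plane kinds, while a single plane kind such as `PLANE_YZ` reaches only 8 of the 12 links.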
For this work, the number of transmission paths is 4, as shown in Fig. 4, and the total number of links is 24. Because of the 3D structure, the faults must be counted separately on each floor. According to (1), this test structure can produce at most 6,560 faults.

The TDG in the core transmits a test sequence that consists of a header and data. The header carries the information flit that describes the path, and the path information is modified by shifting. If a short fault occurs on the interconnects, it therefore changes the flit and, with it, the transmission path. The short fault also changes the test vectors in the data, which results in a data error. The TRA detects faults in two situations. In the first, a change in the path information of the header causes the test sequence to arrive at the wrong target core; this fault is defined as time_out. In the second, the test vectors that the TRA receives in the data are erroneous; this fault is defined as data_error. Some faults, however, neither change the flit nor affect the path. The flit is 8 bits wide. Bits 0, 1, and 4 control the direction, and bits 3 and 7 control the number of shifts. Bits 2, 5, and 6 therefore do not directly affect the path; in other words, changes to bits 2, 5, and 6 do not change the path.

Figure 4. Test transmission structure for Z-X plane.

In this work, we use C code to simulate the running state of the test structure. Fig. 5 shows the working flowchart. The test steps are as follows:

1. The TDG generates the 8-bit test sequence.
2. A short fault is injected into the test structure.
3. The short fault affects the test sequence.
4. The path in the header is modified, and the test vectors in the data are changed.
5. The TRA receives the test sequence and detects the fault.

For example, we transmit the test sequence from core 1 to core 7 in Fig. 4. The path information of the flit in the header is set to 11011001.
Bits 0 and 1 select the direction of movement among east (E), west (W), south (S), and north (N). Bit 3 controls the number of shifts for E, W, S, and N. Bit 4 selects the direction of movement between up (U) and down (D). Bit 7 controls the number of shifts for U and D. As the test sequence travels through the test structure, each router reads the path information in the header. The router first checks bit 3. If bit 3 is 1, the router shifts the test sequence in the direction indicated by bits 0 and 1. If bit 3 is 0, the router checks bit 7 to determine whether to shift in the
U or D direction. If bits 3 and 7 are both 0, the router delivers the test sequence to the core attached to it. Under normal circumstances, when the header arrives at target core 7, the flit reads 11001000 (with the leftmost digit read as bit 0, this is the initial header with bits 3 and 7 cleared).

After short faults are injected into the test structure, the TRA receives the test sequence and detects one of two error situations. In the first, the short faults change the flit (e.g., to 01011001) in a way that affects the path, so the test sequence arrives at the wrong core. After the set time has elapsed, the TRA in the original target core considers the test sequence lost and reports the fault as time_out. In the second, the short faults either leave the header flit unchanged or change it (e.g., to 01010011) without affecting the path. The test sequence still reaches target core 7, but the test vectors in its data have been changed by the short faults. When the TRA compares the test vectors, it finds that they are wrong and reports the fault as data_error.

Figure 5. Working flowchart.

4. Experimental results

In the simulation, we divide the short faults into two kinds, AND-short and OR-short. Because the locations of the short faults are selected randomly, the same short fault may be selected more than once. We therefore inject 8,000 faults into each test plane to reduce the effect of repeated selection, which brings the result closer to the actual situation. We then average the test results of the two kinds of plane; the average is equivalent to the test fault coverage of a 2 × 2 × 2 Stacked Mesh NoC. Table 1 and Table 2 present the simulation results for AND-short and OR-short faults, respectively. The test results fall into two kinds, time_out and data_error. However, the test vectors are certain to be changed, because the short faults affect the data. A test sequence that suffers a time_out fault therefore also produces a data_error fault.
In this test, we count such a case only as time_out. As Table 1 and Table 2 show, the incidence of time_out is higher in the OR-short case. The reason lies in the nature of the two fault types: an AND-short has a 75 percent probability of driving the value to 0, and an OR-short has a 75 percent probability of driving the value to 1. Given the header format, a value changed to 1 is more likely to affect the path of the test sequence. In the simulation, we also find that the incidence of time_out increases when the header flit contains a similar number of 0s and 1s. This is why the incidence of time_out on the Z-X plane is higher than on the Y-Z plane. With a different header format, however, the result would be different.
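The detection rule and the 75 percent figure can be sketched in C as follows. This is a simplified model, not the authors' simulator. It treats any change to a path-controlling header bit (0, 1, 3, 4, or 7) as a misroute, and it reads the leftmost digit of a header string as bit 0, a convention inferred from the example in which 11011001 becomes 11001000 once bits 3 and 7 are cleared. The names `parse_flit`, `classify`, `and_short_zero_cases`, and `or_short_one_cases` are hypothetical. Checking the header first mirrors the convention above of reporting a sequence that is both misrouted and corrupted as time_out only.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { NO_FAULT, TIME_OUT, DATA_ERROR } verdict_t;

/* Header bits that steer the packet: 0, 1, 4 (direction) and 3, 7 (shift
 * count). Bits 2, 5, and 6 do not affect the path. */
#define PATH_MASK ((uint8_t)((1u << 0) | (1u << 1) | (1u << 3) | \
                             (1u << 4) | (1u << 7)))

/* Parse an 8-character flit string whose LEFTMOST digit is bit 0
 * (assumed convention, see the text above). */
uint8_t parse_flit(const char *digits)
{
    uint8_t flit = 0;
    for (int bit = 0; bit < 8; ++bit) {
        assert(digits[bit] == '0' || digits[bit] == '1');
        if (digits[bit] == '1')
            flit |= (uint8_t)(1u << bit);
    }
    return flit;
}

/* Simplified TRA decision: a changed path bit means the sequence never
 * reaches the target (time_out); otherwise compare the data. */
verdict_t classify(uint8_t sent_header, uint8_t faulty_header,
                   uint8_t sent_data, uint8_t received_data)
{
    if ((sent_header ^ faulty_header) & PATH_MASK)
        return TIME_OUT;
    if (sent_data != received_data)
        return DATA_ERROR;
    return NO_FAULT;
}

/* Of the four value pairs two shorted wires can drive, count how many
 * leave the shorted net at 0 under an AND-short. */
int and_short_zero_cases(void)
{
    int zeros = 0;
    for (int a = 0; a <= 1; ++a)
        for (int b = 0; b <= 1; ++b)
            if ((a & b) == 0)
                ++zeros;
    return zeros;
}

/* Same count for an OR-short leaving the shorted net at 1. */
int or_short_one_cases(void)
{
    int ones = 0;
    for (int a = 0; a <= 1; ++a)
        for (int b = 0; b <= 1; ++b)
            if ((a | b) == 1)
                ++ones;
    return ones;
}
```

Both counting functions return 3 out of 4 cases, which corresponds to the 75 percent figure if the four value pairs are equally likely. With the example header, `classify(parse_flit("11011001"), parse_flit("01011001"), 0xAA, 0xAA)` returns `TIME_OUT`, because bit 0 has changed.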
Table 1. Testing fault coverage analysis for AND-short (8,000 faults injected in each plane)

Test area                     Time_out         Data_error only    Total detected
Y-Z plane (1)                 538 (6.72%)      7462 (93.28%)      8000 (100%)
Y-Z plane (2)                 563 (7.04%)      7437 (92.96%)      8000 (100%)
Z-X plane (1)                 587 (7.34%)      7413 (92.66%)      8000 (100%)
Z-X plane (2)                 558 (6.98%)      7442 (93.02%)      8000 (100%)
2 × 2 × 2 Stacked Mesh NoC    2246 (7.02%)     29754 (92.98%)     32000 (100%)

Table 2. Testing fault coverage analysis for OR-short (8,000 faults injected in each plane)

Test area                     Time_out         Data_error only    Total detected
Y-Z plane (1)                 1326 (16.58%)    6674 (83.42%)      8000 (100%)
Y-Z plane (2)                 1281 (16.01%)    6719 (83.98%)      8000 (100%)
Z-X plane (1)                 1510 (18.88%)    6490 (81.12%)      8000 (100%)
Z-X plane (2)                 1558 (19.48%)    6442 (80.52%)      8000 (100%)
2 × 2 × 2 Stacked Mesh NoC    5675 (17.73%)    26325 (82.27%)     32000 (100%)

5. Conclusions

In this work, we observe that a 3D architecture is simply a regular arrangement, like the mesh NoC, so the concept of 3D space can be used to partition it. The test approach for 2D can therefore be used to perform the required test, and the test is not limited to short faults. The simulation results show that applying a 2D test approach based on the BIST methodology to the 3D structure is feasible. We note that the probability that the test sequence is affected is higher when the information flit of the header and the test vectors of the data are more complex. In addition, the choice of test plane depends on the test structure used and on the definition of the fault model. Finally, with high fault coverage, the objective of increasing product yield through the replacement of defective chips can be achieved.

6. References

[1] P. Guerrier and A. Greiner, "A Generic Architecture for On-Chip Packet-Switched Interconnections," Proc. Conf. Design, Automation and Test in Europe, pp. 250-256, 2000.
[2] B. S. Feero and P. P.
Pande, "Networks-on-Chip in a Three-Dimensional Environment: A Performance Evaluation," IEEE Transactions on Computers, vol. 58, no. 1, pp. 32-45, Jan. 2009.
[3] E. Cota, F. L. Kastensmidt, L. Fernanda, M. Cassel, M. Hervé, P. Almeida, P. Meirelles, A. Amory, and M. Lubaszewski, "A High-Fault-Coverage Approach for the Test of Data, Control and Handshake Interconnects in Mesh Networks-on-Chip," IEEE Transactions on Computers, vol. 57, no. 9, pp. 1202-1215, Sep. 2008.
[4] E. Cota, F. L. Kastensmidt, M. Cassel, P. Meirelles, A. Amory, and M. Lubaszewski, "Redefining and Testing Interconnect Faults in Mesh NoCs," Proc. IEEE Int'l Test Conf., pp. 1-10, 2007.
[5] M. B. Herve, E. Cota, F. L. Kastensmidt, and M. Lubaszewski, "NoC Interconnection Functional Testing: Using Boundary-Scan to Reduce the Overall Testing Time," Proc. IEEE 10th Latin American Test Workshop (LATW '09), pp. 1-6, 2009.
[6] A. M. Amory, E. Briao, E. Cota, M. Lubaszewski, and F. G. Moraes, "A Scalable Test Strategy for Network-on-Chip Routers," Proc. IEEE Int'l Test Conf., p. 9, 2005.
[7] K. Stewart and S. Tragoudas, "Interconnect Testing for Networks on Chips," Proc. 24th IEEE VLSI Test Symp., p. 6, 2006.
[8] C. Grecu, P. Pande, B. Wang, A. Ivanov, and R. Saleh, "Methodologies and Algorithms for Testing Switch-Based NoC Interconnects," Proc. 20th IEEE Int'l Symp. Defect and Fault Tolerance in VLSI Systems, pp. 238-246, 2005.
[9] J. Raik, V. Govind, and R. Ubar, "An External Test Approach for Network-on-a-Chip Switches," Proc. 15th Asian Test Symp., pp. 437-442, 2006.
[10] C. Grecu, P. Pande, A. Ivanov, and R. Saleh, "BIST for Network-on-Chip Interconnect Infrastructures," Proc. 24th IEEE VLSI Test Symp., p. 6, 2006.
[11] P. P. Pande, A. Ganguly, B. Feero, B. Belzer, and C. Grecu, "Design of Low Power and Reliable Networks on Chip through Joint Crosstalk Avoidance and Forward Error Correction Coding," Proc. 21st IEEE Int'l Symp. Defect and Fault Tolerance in VLSI Systems, pp. 466-476, 2006.
[12] F. Angiolini, P. Meloni, S. Carta, L. Benini, and L. Raffo, "Contrasting a NoC and a Traditional Interconnect Fabric with Layout Awareness," Proc. Int'l Conf. Design, Automation and Test in Europe, pp. 1-6, 2006.