Chapter 1 Introduction


The advent of synthesis systems for Very Large Scale Integrated Circuits (VLSI) and automated design environments for Application Specific Integrated Circuits (ASICs) has allowed digital systems designers to place large numbers of gates on a single IC in record time. Generation of test patterns for these circuits to ensure that they are fault-free, however, still consumes considerable time. Currently, up to one third of the design time for ASICs is spent generating tests [1]. Many algorithms have been developed to automate the test generation process [2],[3],[4], but the test generation problem has been shown to be NP-complete [5]. This thesis deals with the application of parallel processing techniques to Automatic Test Pattern Generation (ATPG) to address this problem.

1.1 Motivation

There are two basic approaches to solving the Automatic Test Pattern Generation (ATPG) problem: algorithmic test pattern generation, and statistical or pseudorandom test pattern generation. In the algorithmic approach, a test is generated for each fault in the circuit using a specific ATPG algorithm. Most of these algorithms can be proven to be complete; that is, they are guaranteed to find a test for a fault if a test exists. However, this process may involve a search of the entire solution space, which is computationally expensive. Statistical or pseudorandom test pattern generation, on the other hand, selects test patterns at random, or using some heuristic, and determines the faults that are detected by these patterns using fault simulation. Test patterns are selected and added to the test set if they detect any previously undetected faults. This process continues until some required fault coverage or computation time limit is reached. This method finds tests for the easy-to-detect faults very quickly, but becomes less and less efficient as the easy-to-detect faults are removed from the fault list and only the hard-to-detect faults are left. In many cases, the required fault coverage cannot be achieved without excessive computation times.

An efficient combined method for solving the ATPG problem uses statistical methods to find tests for the easy-to-detect faults on the fault list and switches to an algorithmic method to find tests for the hard-to-detect faults which remain. Using either this method or the purely algorithmic method, a significant portion of the computation time will be spent generating tests for the hard-to-detect faults algorithmically. Therefore, finding a method to speed up this process should reduce the overall computation time considerably. Much research has been done on increasing the efficiency of algorithms for ATPG through heuristics [6],[7],[8]. However, the overall gains that can be achieved through these improvements are limited and will not be adequate for future needs. This statement can be justified by two facts. First, no system currently presented in the literature has been proven on circuits that contain combinational logic blocks larger than 3 or 4 thousand gates. Second, most sequential ATPG techniques are based on combinational ATPG algorithms [9],[10]. These systems require that multiple passes be made through the ATPG process in order to generate a test for a single fault. Therefore, any excessive runtimes will be multiplied by this process, and achieving fast combinational ATPG becomes even more important.

An alternative to heuristics for reducing computation times is to use parallel processing techniques. Parallel processing machines are becoming available for general use and are being used to solve other problems in Computer Aided Design [11]. Most of these readily available parallel processors are distributed memory machines due to cost and scalability factors.
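The combined statistical-then-algorithmic flow described above can be sketched in a few lines. This is a toy driver under stated assumptions: `fault_simulate` and `deterministic_tpg` are hypothetical callables standing in for a real fault simulator and a real algorithmic test generator, and faults are opaque identifiers.

```python
import random

def combined_atpg(faults, fault_simulate, deterministic_tpg,
                  coverage_goal=0.95, max_random_patterns=1000, n_inputs=8):
    """Toy sketch of the combined statistical/algorithmic ATPG flow."""
    undetected = set(faults)
    test_set = []
    total = len(faults)
    # Phase 1: pseudorandom patterns pick off the easy-to-detect faults.
    for _ in range(max_random_patterns):
        if 1 - len(undetected) / total >= coverage_goal:
            break
        pattern = tuple(random.randint(0, 1) for _ in range(n_inputs))
        newly = fault_simulate(pattern, undetected)
        if newly:                       # keep a pattern only if it detects
            test_set.append(pattern)    # previously undetected faults
            undetected -= newly
    # Phase 2: algorithmic TPG targets the remaining hard-to-detect faults.
    for fault in list(undetected):
        test = deterministic_tpg(fault)
        if test is not None:
            test_set.append(test)
            undetected.discard(fault)
    return test_set, undetected
```

The efficiency argument in the text shows up directly in this loop: as `undetected` shrinks, each random pattern detects fewer new faults, so phase 2 dominates the runtime.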
Operating systems are also being developed that allow simple networks of workstations to be used as distributed memory parallel computing environments [12]. Previous efforts to parallelize the ATPG problem can be placed in one of five categories: fault partitioning, heuristic parallelization, search space partitioning, algorithmic partitioning, and topological partitioning [13],[14]. These techniques, which will be more fully detailed in Chapter 2, usually require each processing node in a distributed memory system to contain the entire circuit description. However, the increasing size of VLSI circuits has caused the amount of memory required to process these circuits to grow rapidly. Topological partitioning techniques can be used to distribute the circuit database across several processors, thereby increasing the size of the largest circuit that can be processed on a given distributed memory configuration. For example, results from the EST system [15] have shown that the memory requirements for processing the ISCAS 85 [16] benchmark circuit C7552 can be over 9 MBytes. This circuit contains only 3512 gates, which is relatively small compared to state-of-the-art VLSI devices. At this rate, a circuit of only 10,000 gates could take as much as 25 MBytes of memory to process. Typical commercially available distributed memory multicomputers must be able to take advantage of their entire memory space across all nodes to process circuits as large as, or larger than, this. Thus, topological partitioning of the database across several processors will be required to perform ATPG on these larger circuits.

Previous research in topological partitioning for ATPG [17],[18] has focused on the D-algorithm [2]. The initial effort contained in [17] was directed toward a shared memory parallel processor; hence, the parallelism exploited was fairly fine-grained. The results of the effort to port this system to a distributed memory multicomputer were mixed [18]. Some speedup was obtained, but the large number of messages required even for simple circuits significantly limited the speedup. The parallelism exploited in these two systems was limited to a parallelized implication procedure and fault partitioning, which was used to keep idle processors busy on other faults. The fact that the D-algorithm, which has been shown to be very inefficient for some classes of circuits, was used in these systems also increased the overall runtimes and limited the speedup possible.
One of the most promising results presented in [18], however, was that topological partitioning resulted in significant reductions in the memory required in each processing node. For these reasons, research into topological partitioning with a more efficient ATPG algorithm such as PODEM [3] was undertaken.

1.2 Goals

This research focused on a system that is based on topological partitioning of the circuit-under-test across several processing nodes. The goals of this research included expanding on previous work [17],[18] and extending it to a more efficient base ATPG algorithm. Analytical models of the topologically partitioned ATPG process were developed to help predict the performance that could be expected. These models, once validated through experimentation, were then used to predict the performance of the ATPG system. The model was then used to determine the communications latency required on a multicomputer to efficiently utilize this technique to achieve speedups. Another goal of this research was to develop parallelizations of the base ATPG algorithm to increase speedup. Investigations of how these parallelization methods could be used in conjunction with other parallelization methods, such as fault or search space partitioning, were also undertaken. Finally, this research outlined the additional work that will be required to make topological partitioning a valid addition to ATPG systems for large scale designs.

1.3 Organization

This dissertation is divided into seven chapters, including this introduction. Chapter 2 contains background material which includes a brief review of serial and parallel ATPG algorithms. A discussion of the ES-KIT distributed memory multicomputer and the ES-TGS parallel ATPG system used in this research is also included in Chapter 2. Chapter 3 describes the implementation details and results of the serial Topological Partitioning Test Generation System (TOP-TGS) developed for this research. Chapter 4 details the analytical model of the serial ATPG process and topologically partitioned ATPG. Predicted results are also developed in this chapter and compared to the actual results presented in Chapter 3. Chapter 5 details the algorithmic parallelizations developed for the TOP-TGS system and presents their results.
Chapter 6 presents the results of using multiple parallelizations in the TOP-TGS system. Finally, Chapter 7 presents conclusions and future work. The future work section includes a discussion of how many of the heuristics presented in the literature could be implemented in a topologically partitioned ATPG system.

Chapter 2 Background

This chapter presents the background material for the thesis. A brief presentation of serial ATPG algorithms is included to familiarize the reader with the ATPG problem. Next, a discussion of the techniques available to parallelize ATPG is presented. Finally, the distributed memory multicomputer, ES-KIT, and the parallel test generation system, ES-TGS, that this work was based upon are presented.

2.1 Serial ATPG

Most parallel ATPG algorithms, including the ones to be presented here, are based upon widely known serial ATPG algorithms. For a detailed discussion of ATPG algorithms, the interested reader is referred to [19] and [20]. For this research, we will consider only algorithms designed to generate tests for single stuck-at faults. These are physical faults that cause a node in the circuit to behave as if it were stuck at a logic 0 or a logic 1 level. The single stuck-at fault model is a simplification of the types of faults found in real circuits, but empirical evidence shows that for most common implementation technologies, it provides very high coverage of physical faults [21].

Automatic Test Pattern Generation can be thought of as the process of searching through the entire space of possible input patterns for a circuit in an attempt to find one which causes the output to differ depending on whether or not the circuit contains a specific fault. The size of the search space is 2^n, where n is the number of inputs to the circuit. Because the search space is so large, many techniques have been developed to guide the search process. Most of the search techniques in popular use today fall into the class of algorithms called path runners. Path runners attempt to detect a fault by sensitizing it, and then sensitizing a path between the faulty node and a primary output. Sensitizing a fault consists of setting the value on the faulty node opposite the stuck-at value, i.e. setting a logic 1 on a node being tested for a stuck-at 0 fault.
Sensitizing a path consists of setting logic

values along a path in the circuit from the faulty node to the primary outputs such that a change of the logic value on the node is observable at the primary output. For example, if an AND gate is in the sensitized path, setting all of its inputs not in the path to a logic 1 will result in its output value following the value of the inputs in the path. In order for an algorithm to be complete, it must search all paths and combinations of paths in the circuit from the faulty node to the primary outputs.

The major difference between the path sensitization algorithms presented in [2],[3],[4] is the method and order by which the actual logic values are assigned to nodes in the circuit. The D-algorithm [2] attempts to set logic values on nodes in the circuit by assigning values to nodes which precede them in the circuit topology. The PODEM [3] algorithm attempts to assign values to nodes in the circuit by assigning values to the circuit's primary inputs only. Because typical circuits have fewer inputs than internal nodes, frequently by orders of magnitude, the search space enumerated by PODEM is much smaller than that of the D-algorithm. Because this reduction of the search space makes PODEM much more efficient than the D-algorithm, PODEM is the basis for most follow-on algorithms that have been developed for ATPG [5],[6],[7],[8],[15]. For this reason, PODEM was also selected as the base algorithm for this work, whereas the D-algorithm served as the basis for previous work in topological partitioning [17],[18]. A brief example of the PODEM algorithm will now be presented to familiarize the reader with this technique. Figure 2.1 contains a diagram of the circuit that will be used for the purposes of this discussion.
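The view of ATPG as a search of all 2^n input patterns can be made concrete with a brute-force sketch. The three-input circuit below is hypothetical (not the Figure 2.1 example): the good circuit computes y = (a AND b) OR c, and the faulty copy has the internal AND output stuck-at 0. A test is any pattern on which the two outputs differ.

```python
from itertools import product

def good_circuit(a, b, c):
    g = a & b          # internal node g
    return g | c

def faulty_circuit(a, b, c):
    g = 0              # same circuit with g stuck-at 0
    return g | c

def exhaustive_atpg():
    """Enumerate all 2^n input patterns; return the first one that
    distinguishes the good circuit from the faulty circuit."""
    for pattern in product((0, 1), repeat=3):
        if good_circuit(*pattern) != faulty_circuit(*pattern):
            return pattern
    return None        # no pattern detects the fault (it is redundant)

print(exhaustive_atpg())   # -> (1, 1, 0)
```

The pattern found sets a = b = 1 (sensitizing g opposite its stuck-at value) and c = 0 (propagating the difference to the output), which is exactly the sensitize-then-propagate structure the path-sensitization algorithms impose on this search.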
The PODEM algorithm consists of three major processes: selection of the next objective; backtracing that objective to an unassigned primary input to determine the value that should be assigned to it; and assigning that value to the input and implying all node values in the circuit affected by that assignment. This latter simulation-like process is called forward implication. For example, in Figure 2.1, consider the fault of line J stuck-at a logical 1. The first objective selected might be to sensitize the fault by setting node J to 0. Since both of the

[Figure 2.1 Circuit under test.]

inputs to the OR gate which drives J need to be set to 0 in order to set J to a 0, setting one of them to this value becomes the next objective. The next step would be to backtrace this objective to a primary input. Figure 2.2 illustrates this process. Assume for the sake of discussion that line G is chosen to be set first. Typically, testability measures such as controllability and observability measures are used to assist in making these types of decisions. In order to set line G to a 0, either of the inputs to the AND gate that drives it must be set to 0. Input A is selected, and it along with its value are pushed on a stack that is used to hold the input search space. The next step is to actually assign a value of 0 to input A and simulate the effect of this assignment on the rest of the circuit. This process is called forward implication. Assigning a 0 to A will cause the AND gate to drive node G to a 0. No other nodes in the circuit will be affected by this assignment. Since the objective of setting node J to a 0 has not been satisfied, it will be backtraced again. This backtrace will determine that node I must be set to 0, and this must be accomplished by setting input E to 0 and node H to 0. Node H may be set to 0 by setting input C to 0. This assignment is pushed on the stack and implication is performed. Next, input E is pushed on the stack with its value and forward implication is done. This implication will result in node J taking on the required 0 value. This 0 value represents the value node J would have in the fault-free circuit, but node J will assume a value of 1

[Figure 2.2 Backtracing objective J = 0.]

in the presence of a stuck-at 1 fault. This set of values is represented using the D notation of [2]. A node with a value of D represents a 1 in the good circuit and a 0 in the faulty circuit. A node with a value of D̄ represents a value of 0 in the good circuit and a 1 in the faulty circuit. Since node J is 0 in the good circuit and 1 in the faulty circuit, its value will be represented by a D̄. Figure 2.3 shows the state of the circuit and the input stack after the assignments A=0, C=0, and E=0.

[Figure 2.3 Circuit state and input stack.]

The final step in generating a test for node J stuck-at 1 is to make the value on node

J visible at the primary output. This process is called propagation, and it involves sensitizing a path from the node to the output. This path is sensitized by setting all inputs to gates on the path to their non-controlling values. The non-controlling value for an AND or NAND gate is 1, and for an OR or NOR gate it is 0. XOR and XNOR gates do not have a controlling value, so either a 0 or a 1 may work. All gates which have a D or D̄ on one of their inputs and an unknown, X, on their outputs are on potential paths. These gates constitute what is known as the D-frontier. In the example circuit, the OR gate that drives node L is on the D-frontier at this point. Node K must be set to the non-controlling 0 value so that the value on the output L follows the value of node J. This task may be accomplished by setting input F to a 0. The final circuit state and input stack are illustrated in Figure 2.4.

[Figure 2.4 Circuit state and input stack.]

A test vector is generated by simply removing all input assignments from the stack. All inputs not in the stack are assigned don't-cares in the test vector. In the example circuit, the vector 0X0X00 would be the test vector for fault J stuck-at 1. If, during the assignments of input values, an assignment is made that makes a test no longer possible, the last input assignment is popped off of the stack and the alternate value is tried. This process is called backtracking, and it continues until a circuit state where

a test is again possible is reached or the stack becomes empty. Situations that may cause a test to be impossible include setting the faulty node to the same value as the stuck-at fault, and the disappearance of the D-frontier. For example, consider the circuit of Figure 2.1 with a stuck-at 0 fault on node K. The first objective would be to set node K to a 1. The first backtrace could lead to the assignment of a logical 0 on input F. Then, in order to set node K to a 1, node I must be assigned a 1. This task may be accomplished by assigning E=1. However, when node I is set to 1, node J becomes 1 and node L is forced to a 1 regardless of the value on node K. This assignment makes a test impossible with this input vector because the fault cannot be observed at the primary output. Figure 2.5 illustrates this situation.

[Figure 2.5 Circuit state and input stack.]

Backtracking will occur at this point and the alternate assignment of E=0 will be tried. This assignment will also cause a test to be impossible because node K will then be set to 0, the same value as the stuck-at fault. Since assignments of both logic values to node E did not result in a test, backtracking to the previous assignment must occur. Thus, node F is now assigned a value of 1. A test can now be found by assigning a 0 value to nodes E, C, and A as shown in Figure 2.6. Backtracking results in an ordered search of the solution space and in implicit pruning of the search tree when inconsistent states are encountered, such as

[Figure 2.6 Circuit state and input stack.]

assigning a value of 0 to input F in the previous example. By checking the two alternate assignments of input E and determining that they are both inconsistent, the entire portion of the search tree below F=0 may be pruned.

2.2 Parallel ATPG

This section provides a brief discussion of the methods that have been used to parallelize the ATPG process. For a more detailed presentation of parallel ATPG techniques, the reader is referred to [14]. These techniques can be divided into five categories [13],[14]:

1) Fault Partitioning
2) Heuristic Parallelization
3) Search Space Partitioning
4) Functional (Algorithmic) Partitioning
5) Topological Partitioning

The simplest way to parallelize the ATPG problem is to divide up the fault list among multiple processors. Each processor then generates tests for each fault on its portion of the fault list until all faults have been detected. This scheme results in each processor having a completely separate task in that it performs the entire test generation procedure on its own. This method of parallelization has been termed fault partitioning. If the fault list is divided up carefully, each processor will have roughly the same amount of work to do and they will all finish in about the same time. In practice, optimal partitioning of the fault list is not easy

to do a priori, so the scheduling can be done dynamically, with each processor requesting a new fault from a master scheduler whenever it is idle. Dynamic scheduling requires increased communications overhead due to the requests from idle processors for new faults to process. The fault partitioning method is very suitable for coarse-grained parallel systems because synchronization is only necessary when a new fault is needed from the remaining fault list. The biggest disadvantage of fault partitioning is that the setup time will be large. The entire ATPG program and circuit database must be loaded into each processor's memory across the message fabric. If the total amount of work that can be divided up among the processors is large (i.e. the fault list is long), then the percentage of time spent on setup can be kept small and this scheme has promise. If the circuit has a small number of faults, or fault classes, then the speedup will be limited. In any case, this method does not scale well because of the large setup time. Also, performance of this method is poor if there are only a few hard-to-detect faults which account for most of the processing time. Because processors cannot cooperate in generating a test for the same fault, one or two processors could take hours to generate a test for these hard-to-detect faults while the others stand idle. Many typical circuits have only a few hard-to-detect faults and fall into this category. Results for systems which use this technique show that linear speedup is possible only for a small number of processors, usually less than ten [22],[23]. Clearly, this method of parallelization is less than optimal, although it has the benefit of being the simplest to implement.

Because ATPG is an NP-complete problem [5], heuristics are used to guide the search process.
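The dynamically scheduled fault partitioning described above, in which idle processors pull the next fault from a master scheduler rather than receiving a static split, can be sketched with a shared work queue. This is a toy shared-memory stand-in for the message-passing scheme (threads instead of processing nodes), and `generate_test` is a hypothetical per-fault ATPG routine.

```python
import queue
import threading

def dynamic_fault_partitioning(fault_list, generate_test, n_workers=4):
    """Sketch of dynamically scheduled fault partitioning: each idle
    worker requests the next fault from a shared queue (the 'master')."""
    work = queue.Queue()
    for fault in fault_list:
        work.put(fault)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                fault = work.get_nowait()    # idle -> request a new fault
            except queue.Empty:
                return                       # fault list exhausted
            test = generate_test(fault)      # full ATPG run for this fault
            with lock:
                results[fault] = test

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Note how the sketch mirrors the trade-off in the text: workers never cooperate on a single fault, so one slow `generate_test` call (a hard-to-detect fault) can leave the other workers idle at the end of the run.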
Research has indicated that many heuristics will produce a test for a given fault within some computation time limit when other heuristics have failed to do so [24]. These complementary heuristics can be used in a multiprocessor system to aid in the ATPG process. There are two basic strategies for heuristic parallelization: a variation of the fault partitioning scheme discussed above, and concurrent parallel heuristics [25]. In the variation of the fault partitioning method, called uniform partitioning, the

fault list is divided up among the processors and each generates tests for the faults on its own portion of the list. In generating the tests, however, multiple heuristics are used in sequential order to attempt to generate a test. If a heuristic fails to generate a test within a time limit, that heuristic is discontinued and the next one in the list is begun. This scheme has the same advantages and disadvantages as the fault partitioning scheme discussed above. However, it will be slightly better in some cases because the multiple heuristics will shorten the test generation time for hard-to-detect faults.

In the concurrent parallel heuristic method, the system is required to have (m × n) processors, where n is the number of different heuristics available. If m is equal to one, each processor computes a test for the same fault using one of the n heuristics. Whenever a processor succeeds in generating a test for the fault, it sends a stop-work message to the other processors in the cluster and they stop processing that fault. A new fault is selected from the fault list and the process begins again. If m is greater than one, the processors are clustered into groups of n and each cluster works on a separate fault. In this case, the system is actually using a combination of the fault partitioning and heuristic parallelization schemes. The concurrent parallel heuristic method has the potential to achieve greater speedups than the uniform partitioning method due to possible anomalies in the ordering of the heuristics for different faults. The main disadvantage of the heuristic techniques discussed previously is that the processors that are working on the same fault with different heuristics are not guaranteed to be searching disjoint portions of the search space. That is, all of the heuristics may lead the ATPG program down the same path towards a non-solution.
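The concurrent parallel heuristic scheme (the m = 1 case) can be sketched as a race: one worker per heuristic attacks the same fault, and the first success cancels the rest, standing in for the stop-work message. This is a toy shared-memory sketch; each entry of `heuristics` is a hypothetical callable that returns a test or None on failure.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def concurrent_heuristics(fault, heuristics):
    """Race n heuristics on one fault; first test found wins."""
    with ThreadPoolExecutor(max_workers=len(heuristics)) as pool:
        pending = {pool.submit(h, fault) for h in heuristics}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                test = fut.result()
                if test is not None:
                    for other in pending:    # 'stop-work' for the losers
                        other.cancel()
                    return test
    return None    # every heuristic failed on this fault
```

The m > 1 case of the text corresponds to running this function inside the fault-partitioning loop, with one such cluster of n workers per fault.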
A better way to parallelize work on a single fault is to divide up the search space into disjoint pieces and evaluate them concurrently. This approach is a parallel implementation of the branch and bound method which involves concurrent evaluation of subproblems [26],[27]. This technique is called OR parallelism and its application to ATPG is presented in detail in [28],[29]. Search space partitioning involves dividing up the search space such that subproblems skipped by one processor are evaluated by another. The search

spaces for the processors are therefore disjoint and are spread across the solution space as far as possible to maximize the area of the current search. This organization increases the chances of finding a valid solution quickly. The process of dividing up a search tree is illustrated in Figure 2.7.

[Figure 2.7 Division of search tree.]

The search space belonging to processor X is divided up into two parts for processors X and Y. Notice that the processors are in fact always working on different problems (i.e. disjoint search spaces) and that the place where each processor will backtrack to is different. If processor X finds a conflict, it will backtrack and try an alternate value for input A. Processor Y will backtrack and try an alternate value for input C in case of a conflict. This approach keeps the current search space as large as possible, which tends to make the search more efficient. A major problem with search space partitioning is that it also requires a long setup time. Each processor must have the entire circuit database and ATPG program loaded into it. On the other hand, processors are dedicated to only one task, which does not change, and the tasks are completely independent. This fact makes the overhead due to communications

very low and results in greater efficiency. Search space division is therefore most appropriate for circuits that contain a small number of hard-to-detect faults which take up a great deal of computation time. It is also ideally suited to message passing systems because of its coarse-grained parallelism.

There is another technique that can be used to allow more than one processor to work simultaneously on finding a test for a single fault. This technique is called functional partitioning. Functional partitioning refers to the process of dividing up an algorithm into independent subtasks. These independent subtasks can then be executed on separate processors in parallel. This method of parallelization is also known as algorithmic or AND parallelism. Most serial ATPG algorithms are difficult to parallelize functionally. The few subtasks that can be identified, such as fault sensitization and path sensitization, are not independent. That is, action taken to perform one of these processes may change the circuit state such that it has a side effect or causes an inconsistency in another process. Justification of two goals cannot, in general, be done simultaneously. One way to allow parallelism in justification is to perform justification for goals in different faults simultaneously. This parallelism is an adaptation of the fault partitioning scheme already discussed.

In all of the parallel algorithms discussed thus far, each processor has to have access to the entire circuit database. This requirement may be a problem for large circuits because each node may not have enough memory to hold the entire circuit database. Also, loading the database into memory in a message passing system takes time. Topological partitioning of the circuit into separate partitions and instantiating each on a different processor would help alleviate this problem. Researchers have been investigating topological partitioning for parallel logic simulation for some time.
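Returning briefly to search space partitioning: the division of a PODEM-style input stack into disjoint subtrees, as in Figure 2.7, can be sketched directly on the stack data structure. This is a simplified model under stated assumptions: each stack entry is a hypothetical (input, value, alternative_untried) triple, oldest choice first, and the split hands the untried alternative of the oldest choice point to a second processor.

```python
def split_search_space(stack):
    """Split a decision stack into two disjoint search spaces.

    The donor keeps its current path but marks the oldest untried
    alternative as taken; the recipient starts from that flipped choice,
    so the two processors explore disjoint subtrees of the input space."""
    for i, (name, value, untried) in enumerate(stack):
        if untried:
            donor = stack[:i] + [(name, value, False)] + stack[i + 1:]
            recipient = stack[:i] + [(name, 1 - value, False)]
            return donor, recipient
    return stack, None    # nothing left to give away

# Example: the input stack A=0, C=0, E=0 with all alternatives untried.
stack = [("A", 0, True), ("C", 0, True), ("E", 0, True)]
donor, recipient = split_search_space(stack)
```

After the split, the donor continues under A=0 while the recipient owns the entire A=1 subtree, matching the figure's property that the two processors backtrack to different places.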
Although logic simulation is a different problem, it has some similarities to algorithmic ATPG. A discussion of circuit partitioning for parallel logic simulation is included in [30]. The objective of the partitioning scheme is to reduce the communications necessary between partitions as much as possible while maximizing the

amount of work that can be done concurrently within the partitions. This paper analyzes several partitioning schemes: random partitioning, natural partitioning, partitioning by gate level, partitioning by element strings, and partitioning by fanin and fanout cones. Fanin cones are an attempt to place all gates connected to a single primary input (even through other gates) in the same group. Fanout cones are constructed the same way using primary outputs. The results presented in [30] indicate that for simulation, random partitioning scores best in maximizing concurrency, but worst in interprocessor communications. This condition would make random partitioning a bad choice for most systems. Partitioning by fanin and fanout cones offers the best trade-off between concurrency and interprocessor communications, with fanout cones being slightly better. This result is most likely due to the fact that fanout cones closely fit the flow of activity in the circuit during logic simulation. An analysis of circuit partitioning techniques for ATPG is the focus of Section 3.3 of this work.

Another issue in circuit partitioning for ATPG is the number of gates in each partition: the so-called block size. As the number of gates assigned to a block decreases, the amount of work that can be done between communications steps becomes smaller. Hence, the parallelism becomes more fine-grained. The minimum block size will also affect how the problem scales with increasing numbers of processors. As more processors are added to the system, the block size will get smaller and efficiency will decrease.

An investigation of the amount of parallelism theoretically available in topologically partitioned parallel ATPG was undertaken in [31]. This work attempted to find an upper bound on the amount of parallelism present in conflict-free test generation. Two phases of the test generation process using the PODEM algorithm, backtracing and forward implication, were parallelized.
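The fanout-cone construction described above, placing every gate that transitively feeds a primary output into one group, amounts to a reverse traversal of the netlist. A minimal sketch, assuming the netlist is given as a `fanin` map from each gate to the gates and inputs driving it (the connectivity below is hypothetical, loosely modeled on the Figure 2.1 node names):

```python
def fanout_cone(output, fanin):
    """Return the set of all nodes that (transitively) drive `output`:
    the fanout cone of that primary output."""
    cone, stack = set(), [output]
    while stack:
        node = stack.pop()
        if node in cone:
            continue
        cone.add(node)
        stack.extend(fanin.get(node, []))   # walk backward toward the inputs
    return cone

# Hypothetical netlist: L driven by J and K, J by G and I, and so on.
fanin = {"L": ["J", "K"], "J": ["G", "I"], "G": ["A", "B"],
         "I": ["E", "H"], "H": ["C", "D"], "K": ["I", "F"]}
print(sorted(fanout_cone("L", fanin)))   # every node here feeds output L
```

A fanin-cone partition is the mirror image: the same traversal run forward from a primary input over a fanout map. Note that cones of different outputs generally overlap (here node I lies under both J and K), which is one reason cone-based schemes still need a policy for assigning shared gates to partitions.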
Each gate was assumed to be placed on its own individual processor. The objective then was to measure the maximum number of operations that could be performed in parallel by the individual gate processors. The methods that were used to accomplish this objective are best illustrated using an

example. Consider the example circuit of Figure 2.1 again. In setting the objective of J=0, there are several paths that must be backtraced, such as the J->G->A path and the J->I->H->C path. If these paths could be backtraced at the same time, then parallelism would be present. If each backtrace operation is assumed to take place in the same amount of time, then the backtraces will propagate through the circuit on a level by level basis. At each gate, backtraces are generated on each input as required. Using this method, conflicts may arise at points of reconvergent fanout. This point is where the authors use the conflict-free assumption. The correct values to be placed on reconvergence points are precomputed offline and the conflict is avoided. Figure 2.8 illustrates the process of parallel backtracing for the objective J=0.

[Figure 2.8 Multiple parallel backtrace.]

Notice that in this case, the maximum number of backtrace operations that occur during the same period of time, or time-step, is 2. This measure would be the maximum amount of parallelism available in this step of the ATPG process. Also note that the objective values required on the individual lines are not assigned to them during the

backtrace procedure. These values must be set through forward implication, as is done in the serial PODEM case. The difference is that in this parallel implementation, the implication procedure is parallelized. Thus in the case shown in Figure 2.8, the values A = 0, C = 0, and E = 0 would be implied at the same time. Implications would then be performed back through the circuit in parallel in a manner similar to backtraces. Analysis of Figure 2.8 shows that the maximum parallelism present during parallel implication of the above input assignments would be 2 as well.

The authors of [31] use this technique to analyze the maximum and average amount of parallelism present in the ISCAS 85 [16] benchmark circuits. The analysis was done on a conventional workstation using a simulation based technique. The average amount of parallelism they found was less than they expected. For forward implication, most circuits had an average parallelism of 4 to 7, although some circuits had higher values. For backtracing, most circuits had average parallelism values of 1.5 to 3.5.

This method of analyzing the parallelism present in topologically partitioned ATPG has several drawbacks which do not allow a valid conclusion to be drawn concerning the performance of topological partitioning. First, the assumption of one gate per processor is unrealistic. Second, as shown in this thesis, there are other methods of parallelism available for use with topological partitioning. Finally, the work in [31] completely ignores the practical aspects, such as synchronization protocols and communications latency, of implementing this type of system on an actual multicomputer. The authors do acknowledge that this technique may be beneficial when used with other parallelization methods and that it has the important characteristic of allowing larger circuits to be processed on a given distributed memory multicomputer. 
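The level-by-level backtrace model of [31] can be sketched as a breadth-first walk from the objective that counts how many backtrace operations share each time-step. The netlist encoding below is an assumption for illustration:

```cpp
#include <algorithm>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Netlist as a map from each gate to its fanins; primary inputs have no entry.
using Netlist = std::map<std::string, std::vector<std::string>>;

// Assign each backtraced line a time-step (its distance from the objective)
// and return the largest number of backtrace operations sharing one step:
// the available backtrace parallelism under the one-level-per-step model.
// Revisited lines are skipped, mirroring the conflict-free assumption that
// values at reconvergence points are precomputed offline.
int maxBacktraceParallelism(const Netlist& net, const std::string& objective) {
    std::map<std::string, int> step{{objective, 0}};
    std::map<int, int> opsPerStep;
    std::queue<std::string> frontier;
    frontier.push(objective);
    while (!frontier.empty()) {
        std::string g = frontier.front();
        frontier.pop();
        auto it = net.find(g);
        if (it == net.end()) continue;  // primary input: backtrace stops here
        for (const std::string& fi : it->second) {
            if (step.count(fi)) continue;     // reconvergence: already scheduled
            step[fi] = step[g] + 1;
            ++opsPerStep[step[fi]];           // one backtrace op per traversed line
            frontier.push(fi);
        }
    }
    int widest = 0;
    for (const auto& kv : opsPerStep) widest = std::max(widest, kv.second);
    return widest;
}
```

For the two paths of Figure 2.8 (J->G->A and J->I->H->C) this model reports a parallelism of 2, matching the figure.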
2.3 Hardware Considerations

This research utilized a distributed memory MIMD machine known as the ES-KIT 88K. It is assumed that the reader is familiar with the typical characteristics, such as message passing and connectivity, of parallel machines of this type. Only a brief discussion of the effect of the characteristics of the machine on the programming of the application

will be undertaken in this section. This section will be followed by a discussion of the actual hardware used in this research. Finally, the software system that formed the basis for this work will be presented.

Distributed memory machines have local memory for each processor but no globally accessible memory. Processors must send messages across some interconnection medium, also called a message fabric, to share data. It may take hundreds or even thousands of instructions to package a message for transmission, so communication costs are much higher than for shared memory. Also, message transfer time depends on the distance between communicating processors. Distance between processors is a measure of the length of the communications channel and the number of other processors which must pass the message along for it to be transferred. There are a number of interconnection strategies used on message passing systems [32]. Each one involves a trade-off between the distance between processors and the number of connections per processor.

Because communication time is distance dependent, data location in message passing systems is at least as critical as in shared memory systems. Determination of which processors perform certain tasks is much more important in distributed memory systems than in shared memory systems. Processes that must communicate frequently must be instantiated on processors that are close to each other. Therefore, algorithms must be designed for the specific communications topology of the target machine. Algorithms designed for one machine may not perform satisfactorily on another [33].

Programs on a message passing system will, in general, use built-in system calls to send and receive messages. Data must be explicitly moved from one processor to another using the send and receive mechanism. Synchronization between processors must also take place using messages and is therefore more time consuming than in shared memory systems. 
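A common first-order model of the distance-dependent cost described above charges a fixed software startup cost, a per-hop routing cost, and a per-byte transfer cost. The constants in the usage below are illustrative, not measured ES-KIT values:

```cpp
// First-order message cost model for a distributed memory machine:
// total time = software startup (packaging the message)
//            + routing delay proportional to the number of hops
//            + transfer time proportional to the message length.
double messageTime(double startup, double perHop, int hops,
                   double perByte, int bytes) {
    return startup + perHop * static_cast<double>(hops)
                   + perByte * static_cast<double>(bytes);
}
```

Under such a model, placing frequently communicating processes on adjacent nodes shrinks the hop term, which is why process placement matters so much on these machines.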
Because of these costs, algorithms for message passing machines must use more coarse-grained parallelism. Coarse-grained parallelism implies that many instructions must be processed between synchronization events. Setup time is much longer on message passing systems because all of the program code and data, such as the circuit topology information, must be loaded across the message fabric. New processes are harder to spawn

for this same reason. Therefore, setting up one processor as a master is more difficult. Proper load balancing among the processors is also harder to achieve. In general, algorithms for message passing systems are more difficult to design well, but the programs themselves are easier to implement and debug because data consistency is more easily maintained [33].

2.3.1 Experimental Systems Kit

The parallel processing machine available for use in this research is the Experimental Systems-KIT (ES-KIT) 88K processor developed by the Microelectronics and Computer Technology Corporation (MCC). The ES-KIT was developed by MCC under a Defense Advanced Research Projects Agency (DARPA) grant to facilitate experimentation with new parallel computer architectures and application specific computing nodes. The ES-KIT system includes the 88K processor, described below, and the ESP runtime system, described in the next section. The description of the ES-KIT system is brief and limited to the characteristics which influence application design. A more detailed description of the ES-KIT system can be found in [34]. Further, the ESP system and ES-KIT applications are implemented in the C++ language [35]. It is assumed that the reader is knowledgeable in this language.

2.3.2 ES-KIT Hardware

The 88K processor is a distributed memory parallel architecture based on a 16 node, 4x4 two-dimensional mesh. A Sun 3/140 running the BSD 4.1 operating system acts as a host for the 88K hardware. The Sun communicates with the 88K processor through a VME bus interface board. The message fabric of the 88K processor is based on Symult Systems' Asynchronous Message Routing Device (AMRD). Each node in the mesh has its own AMRD. The use of AMRDs eliminates the need for store and forward message passing and means that the processor on each node is not involved in passing messages that are not addressed directly to it. 
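On a 4x4 mesh like the ES-KIT's, the hop count between two nodes under minimal (dimension-order) routing is simply the Manhattan distance. This is a property of 2-D meshes in general, sketched here rather than taken from the ES-KIT documentation:

```cpp
#include <cstdlib>

// Number of router hops between two nodes of a 2-D mesh, each identified
// by (row, column): the Manhattan distance, assuming minimal routing.
// Messages between non-adjacent nodes pass through intermediate AMRDs
// without involving the intermediate processors.
int meshHops(int r1, int c1, int r2, int c2) {
    return std::abs(r1 - r2) + std::abs(c1 - c2);
}
```

On a 4x4 mesh the worst-case distance is therefore 6 hops, between opposite corners.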
The message fabric is, in general, capable of passing messages at a rate of 20 MB per second, but the software overhead of packing and

unpacking messages at either end prevents this rate from being achieved. Each node is a general purpose computing system based on Motorola's processor family. The nodes consist of four boards which communicate with each other across an internal bus based on the 88K standard. The four boards are the Message Interface Module, the Processor Board, the Memory Module, and the Bus Terminator Module. The boards are connected through a unique set of stacking connectors which allow the nodes to be built on top of each other. A typical installation consists of two nodes stacked on top of each other. The node stacks are arranged on top of Mother Board modules that provide power and ground connections and contain the AMRDs. A 16 node configuration consists of four Mother Boards mounted in a plane, each one containing two stacks with two nodes per stack.

The processor board consists of one 20 MHz Motorola RISC microprocessor and two cache modules. The two 88200 cache modules provide separate paths for instructions and data. The Memory Module provides 8MB of dynamic RAM. It is possible to have up to four Memory Modules per node for a total of 32MB of memory, but the standard configuration is only 8MB. The Message Interface Module (MIM) consists of one processor, 128KB of data memory, 128KB of instruction memory, and interface logic to provide a path from memory to the node's AMRD. The processor on the MIM takes care of all processing necessary to package a message and send it out through the AMRD. The MIM processor off-loads a significant amount of message processing from the processor in the Processor Module. Finally, the Bus Terminator Module provides the electrical termination for the high speed lines in the 88K bus, general purpose services such as the system clock, and a UART interface to the outside world for debugging and repair of the node hardware.

2.3.3 ESP Runtime System

The ESP (Extensible Software System) run-time environment is as important and complex as the 88K processor. 
The environment is written in C++ and is intended to maximize flexibility in the types of configurations that can be used in a parallel processing

system. The environment consists of four major components: the ISSD, the mail daemon, the shadow process, and the actual ESP kernels.

The Inter Service Support Daemon (ISSD) is the heart of the system in that it is the first process invoked by the user and it constructs the rest of the run-time environment. The ISSD is the major interface between the ESP environment and the outside world. It controls the starting and terminating of application programs and all communication with peripheral devices, including screen and disk IO. The ISSD begins by reading the configuration file to determine what ESP components are to be invoked. The configuration file is created by the user and contains instructions as to how many of the various components are to be constructed, where they are to run, and how they are connected. The minimum configuration file must contain the invocation instructions for a single ISSD, one mail daemon, and one or more ESP kernels. The ISSD also invokes the public service objects (PSOs), such as the application manager and the kernel librarian, which are necessary to run any application. The ISSD always runs on the Sun host machine, but the PSOs can run on any of the 88K nodes.

The mail daemon is responsible for routing messages between individual or groups of ESP kernels. Each mail daemon is connected to all of the kernels in its group, every other mail daemon in the configuration, and the ISSD. This connectivity allows messages to be passed from kernel to kernel with the minimum handling possible. The configuration must contain a minimum of one mail daemon for each type of ESP kernel in the configuration.

The shadow process runs on the host Sun where the ISSD is located. The shadow process is responsible for reading the application source code files and managing the terminal IO for the application. The shadow process is the next ESP component invoked by the user after the ISSD. 
Finally, the ESP kernel is the workhorse of the ESP system in that it actually runs the application. The kernel performs memory management, message packing and unpacking, and task switching for the applications. The kernel runs on top of a rudimentary OS in the 88K processor. The message passing portion of the kernel utilizes an MCC

developed protocol which utilizes the processor in the MIM on the 88K processor.

2.3.4 Object Oriented Programming in ESP

The ESP system uses the C++ object oriented paradigm as its abstraction for parallel processing. Applications written for the 88K processor to run in ESP must be programmed in C++. C++ incorporates the ideas of objects, data encapsulation, and inheritance. C++ objects or classes are instantiated on different nodes and communicate with each other through method invocation and return values. Each object has its own local data contained within the node, and that data can only be manipulated by method calls. There is no global or 'public' data allowed in ESP.

There were five major changes made to the C++ language to implement the distributed processing environment of ESP. These changes consisted of overloading the member access operator ->(), redefining the return values available for methods, overloading the 'new' function, eliminating the 'main' routine, and incorporating the concept of futures.

All objects that are to have methods available for remote invocation must be derived from the object remote_base. This object was developed by MCC and includes several features necessary to implement remote method invocation, the first of which is a handle. A handle is a pointer to an instance of an object and contains all of the information needed to address an object in a distributed system. This information is contained in four parts: a node number where the object actually resides, a class number for the object, an application number, and the actual instance number of the object. Handles can be passed between objects, or the address information can be passed and a new handle constructed to point to that object.

The second feature included in remote_base is the overloaded member access operator. In regular C++, the method invocation object_instance->method(arg1,arg2,...) is implemented as a subroutine call. 
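The four-part handle described above, together with the message a remote invocation would generate, might look like the following sketch. All field names and the byte layout are assumptions for illustration, not MCC's actual remote_base or protocol:

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// The four parts of a handle: enough information to address one object
// instance anywhere in the distributed system (field names are assumed).
struct Handle {
    std::int32_t node;         // node where the object actually resides
    std::int32_t classNumber;  // class number of the object
    std::int32_t application;  // application number
    std::int32_t instance;     // instance number of the object
};

// A remote invocation reduced to a message: the destination handle plus
// the argument list copied byte-for-byte, with its length appended.
struct Message {
    Handle destination;
    std::vector<std::uint8_t> bytes;
};

template <typename Args>
Message packCall(const Handle& dest, const Args& args) {
    static_assert(std::is_trivially_copyable<Args>::value,
                  "only flat argument lists can be copied byte-for-byte");
    Message m;
    m.destination = dest;
    m.bytes.resize(sizeof(Args) + sizeof(std::uint32_t));
    std::memcpy(m.bytes.data(), &args, sizeof(Args));
    std::uint32_t len = sizeof(Args);  // argument-list length in bytes
    std::memcpy(m.bytes.data() + sizeof(Args), &len, sizeof(len));
    return m;
}
```

The destination handle tells the message fabric where to deliver the call; the receiving kernel would unpack the bytes and invoke the named method locally.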
In ESP C++, the object may reside on a remote

node. Therefore, the method call must be invoked through message passing. Overloading of the ->() function for remote methods handles this process. Methods for an object derived from remote_base are defined to be remote by declaring them in the public section of the object specification. When a method on a remote object is invoked, the kernel reads the argument list to determine its length. It then copies the argument list into the message buffer with the length of the argument list in bytes appended to it. Finally, the kernel uses the handle of the object to instruct the MIM processor where to send the message. When the receiving object receives the message, it invokes the proper method with the argument list.

If a value is to be returned by the method, one of the return macros defined in the ESP programming environment must be used. The return macros instruct the kernel that the return is to a remote object and that it must be packaged as a message. Macros are available for returning most of the common data types, such as integers, doubles, characters, and strings. There is also a pointer return macro, but its functionality differs from that in regular C++. This difference is necessary because pointers on remote nodes are meaningless in the ESP environment. If a pointer return is specified, the kernel packages the entire object to which the pointer refers and sends it back to the node that invoked the method. The kernel on the invoking node then copies the returned object into its memory space and returns a pointer to this copy to the invoking object. In this way, any structure or object whose size can be determined at compile time can be returned from a remote method invocation.

In C++, the new operator is used to allocate memory space for instances of objects. In the ESP environment, the new operator is overloaded to allow arguments to be passed to it. These arguments specify which node an object is to be instantiated upon. 
The syntax for a call to the overloaded new function is: object_pointer = (object_type *) new {node, relationship} object_type(); The node, relationship pair is used to specify the location of the object. For example, if the variable homenode is set to (1,1), the call new {homenode, SAMEAS} will create the object on node (1,1). Options for the relationship variable include: SAMEAS, DIFFERENT, NEAR, FAR, and NEXT. The NEXT relationship does not need a node

specifier; it allows the kernel to select the node for the object using its own criteria. At this point, the criterion used is the amount of memory left on each node: the object is created on the node with the most free memory. Other algorithms that take into account load balancing and communications costs are under development by MCC. Until they are available, the user must be careful to take these factors into consideration and specify where each large object is to be created in order to optimize the application.

In ESP C++, there is no 'main' routine. When the shadow program loads the first object in the application, its constructor is invoked after it is loaded. This constructor must do the work necessary to start the application. This may be as simple as calling another routine within the same object to take over control, or as complex as creating all other objects and directly performing the necessary algorithm. The former approach is recommended because it is more 'correct' and it allows the kernel to complete construction of the initial object and alter the stack size for that object if necessary.

When a method on a remote object is invoked, the processing takes place on the remote node. The invoking method is then free to perform some other calculation if it does not need the result of the remote method. If the invoking method does need the result, it must block until that result is returned. Controlling when the object blocks for the return result is done by using futures. Futures were introduced as a part of Multilisp [36] and allow lazy evaluation of return values. Note that the only parallel processing that occurs is between the time that the remote method is invoked and the future is evaluated. This fact demonstrates the value of the future abstraction for methods that return values. Of course, remote methods that do not return values always run in parallel with the invoking method. 
One notable characteristic of objects in ESP is that, to ensure correctness, only one method in a specific object may be invoked at a time. This includes methods that are blocked waiting for a future or return value. Thus, if two objects each invoke a method on the other and wait for the return value, deadlock is possible. This situation is illustrated in Figure 2.9.
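ESP's futures predate the standard library, but std::future captures the same pattern described above: the caller overlaps local work with the remote invocation and blocks only when the value is finally needed. This is an analogy, not ESP's actual mechanism:

```cpp
#include <future>

// Stands in for a remote method that returns a value.
int slowSquare(int x) { return x * x; }

int overlapExample() {
    // "Invoke" the remote method; it may run concurrently with the caller.
    std::future<int> f = std::async(std::launch::async, slowSquare, 7);
    int local = 1 + 2;       // useful work overlapped with the invocation
    return local + f.get();  // evaluating the future blocks until it is ready
}
```

The parallelism lives entirely between the std::async call and the f.get(), which mirrors the window between remote invocation and future evaluation in ESP.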


More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Modified by Rana Forsati for CSE 410 Outline Principle of locality Paging - Effect of page

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

Multiple Processor Systems. Lecture 15 Multiple Processor Systems. Multiprocessor Hardware (1) Multiprocessors. Multiprocessor Hardware (2)

Multiple Processor Systems. Lecture 15 Multiple Processor Systems. Multiprocessor Hardware (1) Multiprocessors. Multiprocessor Hardware (2) Lecture 15 Multiple Processor Systems Multiple Processor Systems Multiprocessors Multicomputers Continuous need for faster computers shared memory model message passing multiprocessor wide area distributed

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture Robert S. French April 5, 1989 Abstract Computational origami is a parallel-processing concept in which

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Page 1. Outline. A Good Reference and a Caveat. Testing. ECE 254 / CPS 225 Fault Tolerant and Testable Computing Systems. Testing and Design for Test

Page 1. Outline. A Good Reference and a Caveat. Testing. ECE 254 / CPS 225 Fault Tolerant and Testable Computing Systems. Testing and Design for Test Page Outline ECE 254 / CPS 225 Fault Tolerant and Testable Computing Systems Testing and Design for Test Copyright 24 Daniel J. Sorin Duke University Introduction and Terminology Test Generation for Single

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

Scheduling with Bus Access Optimization for Distributed Embedded Systems

Scheduling with Bus Access Optimization for Distributed Embedded Systems 472 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 5, OCTOBER 2000 Scheduling with Bus Access Optimization for Distributed Embedded Systems Petru Eles, Member, IEEE, Alex

More information

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358 Memory Management Reading: Silberschatz chapter 9 Reading: Stallings chapter 7 1 Outline Background Issues in Memory Management Logical Vs Physical address, MMU Dynamic Loading Memory Partitioning Placement

More information

Department of Electrical and Computer Engineering University of Wisconsin Madison. Fall Midterm Examination CLOSED BOOK

Department of Electrical and Computer Engineering University of Wisconsin Madison. Fall Midterm Examination CLOSED BOOK Department of Electrical and Computer Engineering University of Wisconsin Madison ECE 553: Testing and Testable Design of Digital Systems Fall 2013-2014 Midterm Examination CLOSED BOOK Kewal K. Saluja

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

Clustering and Reclustering HEP Data in Object Databases

Clustering and Reclustering HEP Data in Object Databases Clustering and Reclustering HEP Data in Object Databases Koen Holtman CERN EP division CH - Geneva 3, Switzerland We formulate principles for the clustering of data, applicable to both sequential HEP applications

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018 CS 31: Intro to Systems Virtual Memory Kevin Webb Swarthmore College November 15, 2018 Reading Quiz Memory Abstraction goal: make every process think it has the same memory layout. MUCH simpler for compiler

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

A High Performance Bus Communication Architecture through Bus Splitting

A High Performance Bus Communication Architecture through Bus Splitting A High Performance Communication Architecture through Splitting Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University,West Lafayette, IN, 797, USA {lur, chengkok}@ecn.purdue.edu

More information

6.895 Final Project: Serial and Parallel execution of Funnel Sort

6.895 Final Project: Serial and Parallel execution of Funnel Sort 6.895 Final Project: Serial and Parallel execution of Funnel Sort Paul Youn December 17, 2003 Abstract The speed of a sorting algorithm is often measured based on the sheer number of calculations required

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes: BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General

More information

Rule partitioning versus task sharing in parallel processing of universal production systems

Rule partitioning versus task sharing in parallel processing of universal production systems Rule partitioning versus task sharing in parallel processing of universal production systems byhee WON SUNY at Buffalo Amherst, New York ABSTRACT Most research efforts in parallel processing of production

More information

Feasibility of Testing to Code. Feasibility of Testing to Code. Feasibility of Testing to Code. Feasibility of Testing to Code (contd)

Feasibility of Testing to Code. Feasibility of Testing to Code. Feasibility of Testing to Code. Feasibility of Testing to Code (contd) Feasibility of Testing to Code (contd) Feasibility of Testing to Code (contd) An incorrect code fragment for determining if three integers are equal, together with two test cases Flowchart has over 10

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information

Operating Systems Unit 6. Memory Management

Operating Systems Unit 6. Memory Management Unit 6 Memory Management Structure 6.1 Introduction Objectives 6.2 Logical versus Physical Address Space 6.3 Swapping 6.4 Contiguous Allocation Single partition Allocation Multiple Partition Allocation

More information

For a long time, programming languages such as FORTRAN, PASCAL, and C Were being used to describe computer programs that were

For a long time, programming languages such as FORTRAN, PASCAL, and C Were being used to describe computer programs that were CHAPTER-2 HARDWARE DESCRIPTION LANGUAGES 2.1 Overview of HDLs : For a long time, programming languages such as FORTRAN, PASCAL, and C Were being used to describe computer programs that were sequential

More information

UMBC. space and introduced backtrace. Fujiwara s FAN efficiently constrained the backtrace to speed up search and further limited the search space.

UMBC. space and introduced backtrace. Fujiwara s FAN efficiently constrained the backtrace to speed up search and further limited the search space. ATPG Algorithms Characteristics of the three main algorithms: Roth s -Algorithm (-ALG) defined the calculus and algorithms for ATPG using -cubes. Goel s POEM used path propagation constraints to limit

More information

Chapter 1: Introduction. Operating System Concepts 9 th Edit9on

Chapter 1: Introduction. Operating System Concepts 9 th Edit9on Chapter 1: Introduction Operating System Concepts 9 th Edit9on Silberschatz, Galvin and Gagne 2013 Objectives To describe the basic organization of computer systems To provide a grand tour of the major

More information

Why Multiprocessors?

Why Multiprocessors? Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software

More information

Part 5. Verification and Validation

Part 5. Verification and Validation Software Engineering Part 5. Verification and Validation - Verification and Validation - Software Testing Ver. 1.7 This lecture note is based on materials from Ian Sommerville 2006. Anyone can use this

More information

Static Compaction Techniques to Control Scan Vector Power Dissipation

Static Compaction Techniques to Control Scan Vector Power Dissipation Static Compaction Techniques to Control Scan Vector Power Dissipation Ranganathan Sankaralingam, Rama Rao Oruganti, and Nur A. Touba Computer Engineering Research Center Department of Electrical and Computer

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

Chapter 11: Implementing File-Systems

Chapter 11: Implementing File-Systems Chapter 11: Implementing File-Systems Chapter 11 File-System Implementation 11.1 File-System Structure 11.2 File-System Implementation 11.3 Directory Implementation 11.4 Allocation Methods 11.5 Free-Space

More information

Top-Level View of Computer Organization

Top-Level View of Computer Organization Top-Level View of Computer Organization Bởi: Hoang Lan Nguyen Computer Component Contemporary computer designs are based on concepts developed by John von Neumann at the Institute for Advanced Studies

More information

Virtual Memory. Reading: Silberschatz chapter 10 Reading: Stallings. chapter 8 EEL 358

Virtual Memory. Reading: Silberschatz chapter 10 Reading: Stallings. chapter 8 EEL 358 Virtual Memory Reading: Silberschatz chapter 10 Reading: Stallings chapter 8 1 Outline Introduction Advantages Thrashing Principal of Locality VM based on Paging/Segmentation Combined Paging and Segmentation

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

On Computing Minimum Size Prime Implicants

On Computing Minimum Size Prime Implicants On Computing Minimum Size Prime Implicants João P. Marques Silva Cadence European Laboratories / IST-INESC Lisbon, Portugal jpms@inesc.pt Abstract In this paper we describe a new model and algorithm for

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

Lecture 23 Database System Architectures

Lecture 23 Database System Architectures CMSC 461, Database Management Systems Spring 2018 Lecture 23 Database System Architectures These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

CS399 New Beginnings. Jonathan Walpole

CS399 New Beginnings. Jonathan Walpole CS399 New Beginnings Jonathan Walpole Memory Management Memory Management Memory a linear array of bytes - Holds O.S. and programs (processes) - Each cell (byte) is named by a unique memory address Recall,

More information

Computer-System Architecture (cont.) Symmetrically Constructed Clusters (cont.) Advantages: 1. Greater computational power by running applications

Computer-System Architecture (cont.) Symmetrically Constructed Clusters (cont.) Advantages: 1. Greater computational power by running applications Computer-System Architecture (cont.) Symmetrically Constructed Clusters (cont.) Advantages: 1. Greater computational power by running applications concurrently on all computers in the cluster. Disadvantages:

More information

Overview of Digital Design with Verilog HDL 1

Overview of Digital Design with Verilog HDL 1 Overview of Digital Design with Verilog HDL 1 1.1 Evolution of Computer-Aided Digital Design Digital circuit design has evolved rapidly over the last 25 years. The earliest digital circuits were designed

More information