TIME WARP PARALLEL LOGIC SIMULATION ON A DISTRIBUTED MEMORY MULTIPROCESSOR

Peter Luksch, Holger Weitlich

Department of Computer Science, Munich University of Technology
P.O. Box, D-W-8-München, Germany
luksch@informatik.tu-muenchen.de

To appear in SCS European Simulation Conference, Lyon, June 7-9, 99

ABSTRACT

In this paper we describe a Time Warp based parallel implementation of an event-driven logic simulator on a distributed memory multiprocessor (iPSC/86). The basic Time Warp mechanism has been complemented with an optimized method for incremental state saving and with a mechanism that optimizes re-simulation of a rolled-back period of simulated time, which is especially worthwhile for complex elements. In addition to static partitioning, where elements are distributed to partitions either randomly or by a min-cut algorithm, dynamic repartitioning is possible in our implementation. For our measurements, we used a set of well-known benchmark circuits. Speedups proved to be strongly dependent on the circuit being simulated, its input stimuli and the way circuits are partitioned. However, one observation has been made with most of the workloads: the simulators' local virtual times (LVTs) tend to diverge extremely throughout the simulation. Even though memory requirements for state saving have been minimized, simulators whose LVT is far ahead of the other processes run out of memory for larger circuits. We therefore had to limit Time Warp's optimism by preventing simulators from getting too far ahead of the global virtual time (GVT).

THE TIME WARP MECHANISM

The Time Warp mechanism [Jefferson, 98] is an optimistic synchronisation protocol which can be used to synchronize parallel discrete event simulation that is based on model partitioning. The protocol, however, is not restricted to this application. Each process simulates a partition of the circuit's elements and has its own simulation time (LVT) and event list.
Whenever it generates an event affecting a signal that connects to remote partitions, it sends an event message to the corresponding processes. Processes simulate their partition based on their current information about signal values, which, however, may be incorrect because events with time stamps in the local past of the receiving simulators may arrive from remote partitions. Such an event message is referred to as a straggler. Stragglers as well as anti-messages (i.e. messages informing a process that an earlier event message was incorrect) cause the simulator to roll back, i.e. to return to the point where simulation began to be incorrect. Rollback involves restoring the local state information for the point in simulated time to which the simulator rolls back and cancelling all messages sent in the rolled-back period. Therefore, the simulation process has to store information about its local state and the messages it has sent.

(This work has been partially funded by the DFG, "Deutsche Forschungsgemeinschaft", German science foundation, under contract No. SFB, TP A.)

One method to undo messages is to send anti-messages immediately upon rollback (aggressive cancellation). The alternative approach, lazy cancellation, is based on the observation that a large portion of events can be expected to be generated again during re-simulation of the rolled-back time interval. Therefore, an anti-message is sent only if the corresponding event is guaranteed not to be generated again. In our implementation, we use lazy cancellation.

Global progress of a Time Warp simulation is measured by the global virtual time (GVT), which is the minimum of the local simulation times and the time stamps of all unprocessed events in the system. A number of GVT algorithms have been proposed in the literature [Samadi, 98; Lin & Lazowska, 99; Bauer & Sporrer, 99].
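
The difference between aggressive and lazy cancellation described above can be illustrated with a small sketch. It is not taken from the simulator itself: the `EventMessage` type and the `lazy_cancel` function are hypothetical names, and the comparison is done in one batch after re-simulation, whereas a real Time Warp process would decide incrementally as its LVT passes each time stamp.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class EventMessage(NamedTuple):
    """Hypothetical event message: time stamp, target signal, new value."""
    timestamp: int
    signal: str
    value: int


def lazy_cancel(
    previously_sent: List[EventMessage],
    regenerated: List[EventMessage],
) -> Tuple[List[EventMessage], List[EventMessage]]:
    """Compare messages sent before a rollback with those produced on re-simulation.

    Returns (anti_messages, new_messages):
      anti_messages -- old messages that were not generated again; only these are cancelled
      new_messages  -- regenerated messages that were never sent before; only these are sent
    Messages present in both lists cause no communication at all.
    """
    old = Counter(previously_sent)
    new = Counter(regenerated)
    anti_messages = sorted((old - new).elements())
    new_messages = sorted((new - old).elements())
    return anti_messages, new_messages


# Example: one of two messages is reproduced unchanged by the re-simulation.
sent = [EventMessage(10, "a", 1), EventMessage(12, "b", 0)]
redo = [EventMessage(10, "a", 1), EventMessage(13, "b", 0)]
anti, fresh = lazy_cancel(sent, redo)
```

Under aggressive cancellation, every message in `sent` would be cancelled immediately at rollback and every message in `redo` would be sent again, so the unchanged message at time 10 would cross the network three times instead of once.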

TIME WARP PARALLELIZATION OF A LOGIC SIMULATOR

The basis for our parallel implementation is a gate-level logic simulator that implements most of today's state-of-the-art techniques for modelling digital circuits [Krodel & Antreich, 99]. It uses a six-valued logic and allows ambiguity delays to be modelled explicitly. The program is written in C for a UNIX environment. The parallel program has been implemented on iPSC/86 and iPSC/ multiprocessors using MMK, a parallel programming library designed within project SFB, TP A.

Communication. The figure on MMK communication performance displays the performance of the communication system as a function of message length. For each message there is a significant startup time of s on the iPSC/86 and ms on the iPSC/ which is independent of message length. The latency is due to the circuit-switched message passing on the iPSCs. Therefore, a given amount of information should be transferred using few long messages instead of many short ones, i.e. several event messages have to be combined into one message that is transmitted by the communication system. On the other hand, the synchronisation protocol requires remote partitions to be informed about events as soon as possible in order to prevent simulators from having to roll back over long periods of time because they were informed too late about the incorrectness of their computation.

In our implementation, communication is controlled by a buffering mechanism that accounts for both of these conflicting requirements. There is one buffer for each remote partition, to which event messages for the corresponding partition are written. After each step (i.e. one iteration of the loop of signal value update followed by the evaluation of fanout elements), buffers are checked to see whether they have reached some minimum length or contain events that were generated more than a maximum number of steps before. The number of steps the simulator executes while an event stays in the buffer is referred to as the event's age. If a buffer is long enough or contains events that have been held in the buffer for too long, its contents are sent. Synchronisation efficiency can be optimized for a given multiprocessor system by adjusting these parameters.

State Saving.
The state of a simulator can be saved either periodically as a whole (checkpointing) or incrementally by storing state changes. Since in logic simulation each event changes only a very small portion of the state, checkpointing would result in inefficient memory usage. Moreover, the target system, like most of today's parallel computers, has only limited physical and no virtual memory on the nodes. As status information is quite large when simulating big circuits, incremental state saving has to be used. Memory requirements are reduced further by saving only the first change of a signal value that occurs in processing an active point in simulated time. Since a rolled-back point in simulated time is always re-simulated completely, this state information is sufficient.

[Figure: MMK communication performance as a function of message length. Axes: message length [kb], communication rate [MB/sec]; curves for the MMK remote send operation on iPSC/86 and iPSC/.]

Global Virtual Time. We have implemented two GVT algorithms: Samadi's GVT algorithm [Samadi, 98] and an algorithm proposed by Lin and Lazowska [Lin & Lazowska, 99]. In contrast to the algorithm by Lin/Lazowska, Samadi's simple GVT algorithm requires all processes to stop simulation during GVT computation. Our implementation of inter-simulator communication permits processes to continue local simulation. However, they must refrain from sending any event messages during GVT computation. Simulation with Samadi's algorithm proved to be faster than with Lin/Lazowska's method because the latter requires more messages to be sent.

Optimized Re-Simulation after Rollback. Lazy cancellation is based on the optimistic assumption that most events will be generated again when re-simulating the rolled-back interval. While lazy cancellation prevents unnecessary re-evaluations in remote partitions, local computation is redone completely. For complex elements like PLAs or even microprocessors it is desirable to avoid unnecessary evaluations in the local partition, too.
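
The incremental state saving scheme described under State Saving above can be sketched roughly as follows. The class `IncrementalStateLog` and its method names are hypothetical and are not taken from the simulator; the sketch assumes that signal values are small integers and that time points are processed in non-decreasing order.

```python
from typing import Dict, List, Tuple


class IncrementalStateLog:
    """Undo log that stores, per simulated time point, only the value each
    signal had before its first change at that time point."""

    def __init__(self, signals: Dict[str, int]) -> None:
        self.signals = dict(signals)
        # One entry per processed time point: (time, {signal: value before that time point}).
        self.log: List[Tuple[int, Dict[str, int]]] = []

    def set_signal(self, time: int, name: str, value: int) -> None:
        # Open a new log entry when simulation reaches a new time point.
        if not self.log or self.log[-1][0] != time:
            self.log.append((time, {}))
        saved = self.log[-1][1]
        # Only the first change within a time point is recorded; later
        # changes at the same time point leave the saved value untouched.
        saved.setdefault(name, self.signals[name])
        self.signals[name] = value

    def rollback(self, to_time: int) -> None:
        """Undo every time point >= to_time, newest first."""
        while self.log and self.log[-1][0] >= to_time:
            _, saved = self.log.pop()
            self.signals.update(saved)


# Example: three changes of "a" at time 5 cost a single log record.
state = IncrementalStateLog({"a": 0, "b": 1})
state.set_signal(5, "a", 1)
state.set_signal(5, "a", 0)
state.set_signal(5, "a", 1)
state.set_signal(8, "b", 0)
state.rollback(5)
```

Saving only the first change is enough because a rolled-back time point is always re-simulated from its beginning, so intermediate values within that time point are never needed.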
In order to skip renewed element evaluation, the simulator must know the events that were generated upon evaluation of the element under consideration in the preceding simulation, i.e. the causality relation between events needs to be stored during "normal" simulation. For each event that is executed, pointers to the events that were generated when evaluating the fanout elements of the signal affected by the event are stored, together with the identity of the fanout element whose evaluation caused them to be created, the element's input signal values and its internal state (if any). During rollback the simulator marks local events instead of deleting them as is done in the basic Time Warp mechanism. If, during re-simulation of a rolled-back period in simulated time, an element is due to be evaluated because of an event that also occurred in the previous simulation, the simulator has to check whether the element's current inputs and its internal state are the same as just before the corresponding evaluation in the previous simulation. If so, the element need not be evaluated. Instead, the events caused by its previous evaluation can be re-scheduled.

Partitioning. Before simulation, circuits are partitioned based on their topology. We use random partitioning and a min-cut algorithm which is a generalization of Fiduccia's and Mattheyses' bipartitioning method [Vijayan, 989]. At runtime, dynamic repartitioning makes it possible to take the activity of elements and signals into account in order to distribute work evenly among the processors. Each simulator reports to the GVT process the time of the earliest unprocessed event in its partition that it knows about. These time stamps reflect the simulator's load. A simulator reporting a low time stamp lags far behind the others in its simulation, i.e. it is heavily loaded. A lightly loaded simulator will advance its LVT quickly and thus report a high time stamp. In principle, elements should be moved from the "slowest" simulator to the "fastest" simulator. Elements are selected according to their complexity and their activity. In addition, the effect of possible element migrations on communication topology must be taken into account.

Time Warp synchronisation introduces an additional problem: in order to be able to roll back simulation, a simulator whose partition has been assigned new elements has to know the state information associated with the signals connected to these elements. If a signal was not in the partition before repartitioning, its "history" must be transferred, too. In our implementation the GVT process tells the "slowest" simulator to move elements out of its partition.
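
The load heuristic described above, in which the reported time stamp of the earliest unprocessed event serves as a load indicator, can be sketched as follows. The function name `pick_migration_pair` and the dictionary layout are hypothetical; the sketch covers only the "slowest to fastest" principle and ignores element complexity, activity and communication topology.

```python
from typing import Dict, Optional, Tuple


def pick_migration_pair(reports: Dict[int, int]) -> Optional[Tuple[int, int]]:
    """Choose (slowest, fastest) simulator ids from the reported time stamps.

    reports maps a simulator id to the time stamp of the earliest unprocessed
    event it knows about. A low time stamp indicates a heavily loaded
    simulator, a high time stamp a lightly loaded one.
    Returns None when no imbalance is visible.
    """
    if len(reports) < 2:
        return None
    slowest = min(reports, key=reports.get)
    fastest = max(reports, key=reports.get)
    if reports[slowest] == reports[fastest]:
        return None  # all simulators report the same time stamp
    return slowest, fastest


# Example: simulator 1 lags behind, simulator 2 is far ahead.
pair = pick_migration_pair({0: 120, 1: 40, 2: 300})
```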
This process will determine the target partition according to the other simulators' load values (provided by the GVT process) and the number of events that it has sent to each of the other simulators in the past. It selects elements whose outputs are already in the receiver's partition whenever this is possible. For each element that is a candidate to be moved, the effect that moving it would have on communication topology is considered. Migrations resulting in minimal communication costs (i.e. the number of interpartition signals weighted by their activities) are preferred. Also, moving few highly active elements is preferred to moving more but less active elements. Element and signal activities are measured by counting the number of evaluations of each element and the number of events for each signal.

EXPERIMENTAL RESULTS

The parallel simulator has been run with several of the ISCAS-89 benchmark circuits. Performance measurements were done by source code instrumentation. Times for different subtasks were measured using the iPSC's hardware clock. Additional statistics were collected by counters. The dynamic behaviour of LVTs on different nodes was observed using the TOPSYS software monitor [Bemmerl et al., 99].

In most of our simulation runs the simulators' LVTs have diverged extremely. During simulation of , units of time LVTs diverge by up to more than , time units, i.e. nearly half the total period being simulated. Even though memory consumption has been minimized by incremental state saving, simulators run out of memory for larger circuits or longer input sequences. We therefore had to limit Time Warp's optimism by preventing simulators from advancing their LVTs too far ahead of GVT: simulation is suspended if a maximum value for memory consumption is exceeded.

Speedup. Speedup does not scale linearly with the number of simulators. Instead, the curves show peaks and valleys (see the figure on the simulation of c). Although it is not a straight line, the curve clearly has a positive slope.
In addition to speedup, the following statistics are displayed: the time spent in rollback, the time for communication and for processing external events, and the time during which the simulation is suspended to prevent processes from running out of memory. For each measurement (i.e. number of simulators), the figure displays the maximum value over all partitions involved. There is a clear correlation between good speedup on the one hand and low rollback costs and rare suspension of the simulation on the other. The correspondence between peaks in speedup and valleys in communication is less distinct. For more than two nodes, memory consumption for state information always reaches the limits set by the nodes' physical memory capacity.

We have also gathered statistics on communication. Having set the parameters for event message buffering to a maximum event age of and a minimum message length of events, we found the average message length to be in the range of to kb for the simulation of c. For this message length, effective bandwidth is still far below its maximum value (see the figure on MMK communication performance). Communication performance can be optimized by increasing the maximum event age parameter. For larger circuits, however, the buffer length can be expected to increase since the number of events generated in each simulation step will increase as partitions get larger.

For all our test runs, only the time for the simulation proper has been measured. Input and output files had to be accessed using Intel's remote hosting software. Therefore, I/O has been extremely slow. Unfortunately it was impossible to use the concurrent file system (CFS) because its use is not supported by MMK. However, since parallelization aims at accelerating computations, not I/O, omitting I/O times seems to be justified for the evaluation of a synchronisation protocol.

[Figure: GVT and LVTs vs. real time (trace generated by the TOPSYS software monitor). Axes: real time [sec], simulated time; curves: GVT and the LVTs of the individual simulators; circuit: c; clock resolution: ms.]

Monitoring LVTs and GVT. LVTs and GVT have been observed with the help of TOPSYS' distributed monitoring system. An inspect task on one node periodically broadcasts display commands for the LVT and GVT variables to the nodes and stores their replies in a buffer that is written to a file after simulation has finished. The monitoring technique provides the best possible approximation to a global time base in the distributed memory multiprocessor. The GVT and LVT trace figure shows a trace from the simulation of c.

Samadi's simple algorithm has been shown to approximate GVT sufficiently well. Its main benefit is the small number of messages per GVT computation. Its disadvantage of having to stop simulation during GVT computation is mitigated by event message buffering, which allows local simulation to proceed as long as no event messages are sent while processes are computing their local minima.

Optimized Simulation after Rollback. Reducing the number of element evaluations during re-simulation of a rolled-back period of simulated time can significantly increase Time Warp's performance if elements are complex to evaluate. Its benefit, however, varies strongly with the number of processes in the parallel simulation. For an element evaluation time of ms, the maximum increase in speedup that we have observed in the simulation of c is a factor of more than two (see the figure on the effect of optimized re-simulation). For some numbers of partitions, however, there was no noticeable benefit from optimized re-simulation.

CONCLUSIONS AND FUTURE WORK

Measurements have shown that Time Warp's efficiency strongly depends on an equal distribution of computation load among processes. Although elements have been evenly distributed among processes in static partitioning, LVTs diverge extremely. This observation emphasizes the need for dynamic repartitioning.

We have not yet been able to analyze Time Warp's behaviour and the effects of our optimizations comprehensively because a detailed study requires a very large number of measurements in which each of the numerous parameters affecting Time Warp's performance is modified in a controlled way. Moreover, program development and performance measurements were impeded by the fact that our iPSCs have been very unreliable for more than a year now (and still are). Hoping for the system's reliability to improve in the future, we intend to carry out more measurements, especially in order to evaluate our optimizations to the basic Time Warp mechanism.

[Figure: simulation of c (multiple delays). Panels: speedup; rollback time [sec]; communication + processing external events [sec]; simulation suspended [sec].]

[Figure: The effect of optimized re-simulation. Speedup of the basic Time Warp method vs. optimized re-simulation; circuit: c (unit delay), element evaluation: ms.]

REFERENCES

[Bauer & Sporrer, 99] Bauer, H. & Sporrer, C. (99). Distributed Logic Simulation and an Approach to Asynchronous GVT-Calculation. In Proceedings of the 99 SCS Western Simulation Multiconference on Parallel and Distributed Simulation (PADS9) (pp. -9). Newport Beach, California.

[Bemmerl et al., 99] Bemmerl, T., Lindhof, R., & Treml, T. (99). The Distributed Monitor System of TOPSYS. In H. Burkhart (Ed.), Proceedings of CONPAR9 VAPP IV, volume 7 of LNCS (pp. 76-76). Zürich, Schweiz: Springer-Verlag.

[Jefferson, 98] Jefferson, D. (98). Virtual Time. ACM Transactions on Programming Languages and Systems, 7(), -.

[Krodel & Antreich, 99] Krodel, T. & Antreich, K. (99). An Accurate Model for Ambiguity Delay Simulation. In 7th ACM/IEEE Design Automation Conference (pp. -7).

[Lin & Lazowska, 99] Lin, Y.-B. & Lazowska, E. (99). Determining the Global Virtual Time in a Distributed Simulation. In Proceedings of the 99 International Conference on Parallel Processing, volume III (pp. -9).

[Luksch, 99] Luksch, P. (99). Parallele Logiksimulation auf Multiprozessoren mit verteiltem Speicher. In H. Fuss & P. Schwarz (Eds.), 8. Workshop Simulationsmethoden und -Sprachen für verteilte Systeme und parallele Prozesse, volume 7 of ASIM-Mitteilungen. Dresden: ASIM.

[Samadi, 98] Samadi, B. (98). Distributed Simulation, Algorithms and Performance Analysis. Technical Report, University of California, Los Angeles (UCLA).

[Vijayan, 989] Vijayan, G. (989). Min-Cost Partitioning on a Tree Structure and Applications. In 6th ACM/IEEE Design Automation Conference (pp. 77-77).

[Weitlich, 99] Weitlich, H. (99). Parallele Logiksimulation nach der Time-Warp-Methode auf einem Multiprozessorsystem mit verteiltem Speicher. Diplomarbeit, Technische Universität München, Institut für Informatik, München.
