A Review on Parallel Logic Simulation


Volume 114 No. 12 2017, 191-199. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu

1 S. Karthik and 2 S. Saravana Kumar
1 Department of CSE, Vels University, Chennai, India. skarthikvit@gmail.com
2 Department of CSE, Karpagam College of Engineering, Coimbatore, India. saravanakumars81@gmail.com

Abstract
Verification is important in reducing the development and production cost of an IC. Parallel processing is considered one of the key techniques for achieving high speed and performance in logic simulation and verification. Logic simulation is recognized as one of the most frequently used design verification approaches. In this paper, various parallel processing techniques used in the verification of ICs are reviewed.

Key Words: VHDL, Verilog, PDES, GPGPU, HDL.

1. Introduction
The design of a VLSI circuit starts with a specification, and the behaviour of the system is then described formally in a hardware description language (HDL). VHDL [1] and Verilog [2] are among the best-known languages, although others such as SystemC and SystemVerilog have also been developed. Verification is a pivotal step in designing an IC because it finds bugs at an early stage. It can be done either by formal verification or by simulation. Complex circuit design depends heavily on simulation to ensure that the design matches its specification and to increase system performance. Simulating VLSI systems that contain millions of transistors or gates is time consuming and has become a bottleneck in the design flow. Engineers rely on parallel simulation of HDL-based systems to increase simulation speed. This paper reviews various parallel simulation techniques and also discusses future trends.

2. Factors Affecting Parallel Simulation
Design partitioning is a significant characteristic of distributed parallel simulation. Partitioning determines which signals flow between the partitions, and it also affects synchronization. It can be based on functionality, number of modules, number of gates and so on. The biggest difficulty is minimizing communication between the partitioned modules while still running them concurrently. Another major challenge is communication overhead, defined as the time spent exchanging values or messages between the partitions. Synchronization overhead, the cost of coordinating all simulations running in parallel, is a further factor. As the number of partitions increases, all of these factors become dominant.

Fig. 1 shows the performance improvement of CPUs, memory latency and interconnect technologies. CPU performance has grown significantly faster than memory access time and interconnect technology have improved. This gap, shown in the figure below, is the reason parallel simulation has not been widely adopted.

Figure 1: Shows the idea behind partitioning the design for accelerating simulation (axis: performance in decades; curves: interconnect, memory latency, CPU)
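The link between partitioning and communication overhead can be made concrete by counting the signals that cross partition boundaries. The following is a minimal sketch, assuming a hypothetical six-gate netlist given as (driver, load) pairs; the names `cut_size`, `EDGES`, `CONTIGUOUS` and `INTERLEAVED` are illustrative and do not come from any of the reviewed papers.

```python
from typing import Dict, List, Tuple


def cut_size(edges: List[Tuple[str, str]], part_of: Dict[str, int]) -> int:
    """Count signals whose driving gate and driven gate sit in different partitions."""
    return sum(1 for src, dst in edges if part_of[src] != part_of[dst])


# Hypothetical netlist: each pair is (driving gate, driven gate).
EDGES = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"),
         ("g4", "g5"), ("g5", "g6"), ("g1", "g6")]

# Two candidate 2-way partitionings of the same netlist (assumed for illustration).
CONTIGUOUS = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1, "g6": 1}
INTERLEAVED = {"g1": 0, "g2": 1, "g3": 0, "g4": 1, "g5": 0, "g6": 1}

if __name__ == "__main__":
    print("contiguous cut:", cut_size(EDGES, CONTIGUOUS))
    print("interleaved cut:", cut_size(EDGES, INTERLEAVED))
```

Both partitionings place three gates on each core, so they are equally balanced, yet the interleaved one turns every signal into an inter-partition message. A partitioner would prefer the assignment with the smaller cut.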

Fig. 2 shows the distributed simulation.

Figure 2: Distributed simulation (Partition 1, Partition 2 and Partition 3 mapped to Core 1, Core 2 and Core 3)

Equation (1) gives the speedup, where t_{p1} is the time taken in simulating partition 1, t_{parallel} is the overall simulation time and t_{communication} is the time taken in communicating intermediate messages between the cores.

$$\text{Speedup} = \frac{t_{p1}}{t_{parallel} + t_{communication}} \qquad (1)$$

Speedup suffers for larger designs because there are many interdependencies between the modules. In spite of this, researchers have come up with many techniques, which are reviewed in the following sections.

3. Parallel Simulation Architecture
Paper [1] presented a simulation architecture for running parallel executable code generated from Verilog, using the MPI library and the ParaMid library, which is based on an efficient parallel algorithm. The partitioning algorithm maps the gates or circuits of a given module to the same logical process (LP), whereas previous algorithms partitioned the modules but mapped gates or circuits of the same module to different LPs. Fig. 3 shows the simulation architecture.

Figure 3: Simulation architecture (blocks: partitioning algorithm, HDL parser, C++ code generator, executable code, ParaMID simulation kernel)

4. Distributed Simulator
Paper [2] designed a distributed simulator based on Time Warp to support productive parallel simulation on distributed computers. The simulator contains an HDL parser that performs syntax and semantic checking, flattens every module and connects it to the top module by renaming the gates, and finally generates a netlist. The simulator uses a global virtual time (GVT) algorithm that relies on short, fixed-length messages rather than variable-length messages containing vectors. It also employs local fossil collection, in which each logical process collects fossils before attempting to process a new event. In addition, it pre-allocates a dynamically sized buffer for every event. The simulation was run on the netlist of a Viterbi decoder design containing 1 million gates and showed a high speedup.
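To illustrate how global virtual time and fossil collection interact in a Time Warp simulator, here is a minimal sketch. It uses the textbook definition of GVT (the minimum over all local virtual times and the timestamps of in-transit messages) and is not the fixed-length-message GVT algorithm of paper [2]; the class `LogicalProcess` and all values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class LogicalProcess:
    lvt: int  # local virtual time of this LP
    # Timestamps of saved state checkpoints, in ascending order.
    saved_states: List[int] = field(default_factory=list)

    def fossil_collect(self, gvt: int) -> int:
        """Drop checkpoints that no rollback can need; return how many were freed.

        No LP can roll back to a time earlier than GVT, so only the newest
        checkpoint at or before GVT (plus everything later) must be kept.
        """
        at_or_before = [t for t in self.saved_states if t <= gvt]
        if not at_or_before:
            return 0
        keep_from = at_or_before[-1]
        before = len(self.saved_states)
        self.saved_states = [t for t in self.saved_states if t >= keep_from]
        return before - len(self.saved_states)


def global_virtual_time(lps: List[LogicalProcess], in_transit: Iterable[int]) -> int:
    """Textbook GVT: minimum of all LVTs and all in-transit message timestamps."""
    return min([lp.lvt for lp in lps] + list(in_transit))


if __name__ == "__main__":
    lps = [LogicalProcess(40, [0, 10, 20, 30, 40]),
           LogicalProcess(25, [0, 25]),
           LogicalProcess(60, [0, 30, 60])]
    gvt = global_virtual_time(lps, in_transit=[30])
    print("GVT:", gvt)
    for lp in lps:
        print("freed", lp.fossil_collect(gvt), "kept", lp.saved_states)
```

Calling `fossil_collect` before processing each new event, as the simulator in [2] is described as doing, keeps the checkpoint memory of every LP bounded by how far it has run ahead of GVT.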

5. Effect of PC Cluster on PDES
Paper [3] demonstrated PDES on a PC cluster and measured the communication latencies and performance. An IP router model was simulated. The experiments were conducted on two separate 8-node clusters, one equipped with Ethernet cards and the other with Myrinet network cards. Each PC has a P3 processor clocked at 600 MHz and 256 MB of RAM. Inter-process communication is handled by the Ethernet and Myrinet NICs. The latency reported for 72-byte messages is 142 μs on the Ethernet system and 21 μs on the Myrinet system. The authors showed that rollbacks eat into the speed gain, that the effect is worst when the CPU is faster, and that the speedup factor declines compared with schemes that do not roll back.

6. Q-Learning Approach
The author [4] added a Verilog parser to XTW and created a new simulator called VXTW. The simulator model is shown in Fig. 4. The first step is to synthesize the Verilog code using the Synopsys DC compiler, with GTECH as the target library. The next step is the Verilog parser, which checks semantics and creates the bench files that are fed to the simulator.

Figure 4: Simulator model (.v file and GTECH library, Synopsys DC compiler, Verilog parser, .bench file, circuit simulator)

The author presented two dynamic load-balancing algorithms for balancing the computational load and the communication, and used two Q-learning agents to tune them. The first agent learns the parameters of the dynamic algorithm and the other optimizes the time window value. The author claimed a 46 percent improvement in run time for many benchmark circuits.

7. Parallel Simulation Using NVIDIA GPU
Paper [5] explored the use of general-purpose computing on graphics processing units (GPGPU) for accelerating logic simulation. An adaptable partitioning approach is used to obtain good load balance and variable coarse-grain partitioning. The author demonstrated the use of NVIDIA CUDA technology, shown in Fig. 5, to achieve speedup, and the results showed a speedup by a factor of 21 compared with single-core simulation. An NVIDIA graphics processing unit consists of a varying number of cores on a single chip. The basic processing unit within a streaming multiprocessor is called a streaming processor (SP). Parallel programming is made possible by hierarchical threads and parallel memory. Programs run in parallel in single instruction, multiple thread (SIMT) mode.
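The data parallelism that a GPU exploits in logic simulation can be sketched on the CPU by levelizing the netlist: all gates in one level depend only on earlier levels, so each could be evaluated by its own thread running the same instruction stream. The following is a plain-Python stand-in written under that assumption, not CUDA code and not the partitioning scheme of paper [5]; the netlist format and the names `levelize` and `simulate` are hypothetical.

```python
from typing import Dict, List, Tuple

Gate = Tuple[str, str, Tuple[str, str]]  # (output net, gate type, two input nets)

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def levelize(gates: List[Gate], primary_inputs: List[str]) -> List[List[Gate]]:
    """Group gates into levels; a gate is placed once all of its inputs are known."""
    known = set(primary_inputs)
    remaining = list(gates)
    levels: List[List[Gate]] = []
    while remaining:
        ready = [g for g in remaining if all(net in known for net in g[2])]
        if not ready:
            raise ValueError("combinational loop or undriven net")
        levels.append(ready)
        known.update(g[0] for g in ready)
        remaining = [g for g in remaining if g not in ready]
    return levels


def simulate(levels: List[List[Gate]], inputs: Dict[str, int]) -> Dict[str, int]:
    """Evaluate level by level; gates inside one level are mutually independent."""
    values = dict(inputs)
    for level in levels:
        # On a GPU, each element of this list would be computed by one thread.
        results = [(out, OPS[kind](values[a], values[b])) for out, kind, (a, b) in level]
        values.update(results)
    return values


# Hypothetical full-adder netlist used only as test data.
FULL_ADDER: List[Gate] = [
    ("x1", "XOR", ("a", "b")),
    ("s", "XOR", ("x1", "cin")),
    ("c1", "AND", ("a", "b")),
    ("c2", "AND", ("x1", "cin")),
    ("cout", "OR", ("c1", "c2")),
]

if __name__ == "__main__":
    lv = levelize(FULL_ADDER, ["a", "b", "cin"])
    print(simulate(lv, {"a": 1, "b": 1, "cin": 1}))
```

The number of gates per level bounds the parallelism available at each step, which is one reason load balance across GPU threads depends on how the design is partitioned.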


Figure 5: Basic CUDA architecture

8. SIMD Parallelization Method
The author [6] proposed a technique in which the netlist is converted into a task graph, shown in Fig. 6. A dependency check identifies logic interdependencies, and multiple tasks are then executed in parallel using a SIMD parallelization method, in which one SIMD instruction computes the outputs of several distinct tasks at once. Task scheduling assigns tasks to machines dynamically.

Figure 6: Task graph

9. Multilevel Temporal Parallel Event Driven Simulation
The author [7] suggested a novel approach to gate-level parallel simulation that targets the simulation of time slices (shown in Fig. 7) rather than design partitions. This avoids synchronization and the communication of messages between partitions. The simulation time is partitioned into slices.

Figure 7: Simulation slices

The MULTES approach is as follows. During an initial, or reference, simulation, the internal state at each slice is stored. Then, during the parallel simulation, each slice is mapped to a processor. The number of slices depends on the number of processors.
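The time-slicing idea can be sketched as follows. A reference run records the state at the start of every slice, and each slice is then simulated independently from its recorded state. This is only an illustration under simplifying assumptions: the toy 4-bit accumulator `step`, the helper names and the use of the same model for the reference run are all hypothetical, and Python threads merely stand in for processors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def step(state: int, stimulus: int) -> int:
    """One clock cycle of a toy 4-bit accumulator (stand-in for a gate-level design)."""
    return (state + stimulus) & 0xF


def reference_run(stimuli: List[int], n_slices: int) -> Tuple[List[Tuple[int, int]], int]:
    """Reference simulation: record (start cycle, state) at every slice boundary."""
    size = max(1, -(-len(stimuli) // n_slices))  # ceiling division
    checkpoints = []
    state = 0
    for cycle, stimulus in enumerate(stimuli):
        if cycle % size == 0:
            checkpoints.append((cycle, state))
        state = step(state, stimulus)
    return checkpoints, size


def simulate_slice(start_cycle: int, start_state: int,
                   stimuli: List[int], size: int) -> List[int]:
    """Simulate one time slice from its stored initial state; return the state trace."""
    state = start_state
    trace = []
    for stimulus in stimuli[start_cycle:start_cycle + size]:
        state = step(state, stimulus)
        trace.append(state)
    return trace


def sliced_run(stimuli: List[int], n_slices: int) -> List[int]:
    """Run all slices concurrently and stitch their traces together in time order."""
    checkpoints, size = reference_run(stimuli, n_slices)
    with ThreadPoolExecutor(max_workers=n_slices) as pool:
        futures = [pool.submit(simulate_slice, c, s, stimuli, size) for c, s in checkpoints]
        return [value for f in futures for value in f.result()]


if __name__ == "__main__":
    stimuli = [3, 5, 7, 1, 2, 9, 4, 6]
    print(sliced_run(stimuli, n_slices=4))
```

Because every slice starts from a stored state, the slices never exchange messages, which is the property the approach relies on to avoid synchronization between partitions.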

10. CMB based Simulator
The author [8] implemented a compiled simulation of Verilog HDL, in which the design is translated into LPs consisting of C++ functions. The Icarus Verilog simulator is used as the Verilog parser for the translation. An LP is the smallest executable block that can be scheduled onto a processor. Fig. 8 shows the scheduling of LPs. A first-in, first-out queue is used as the transmission mechanism for passing messages between the LPs. The author used the Chandy-Misra-Bryant (CMB) algorithm as the backbone of the parallel simulation and obtained a speedup for certain benchmark circuits compared with other simulators such as ModelSim.

Figure 8: Scheduling of LPs (blocks: LP1 to LP4, scheduler, OS1 to OS4)

11. Parallel Simulation Using OpenMP and Verilator
The author [9] used an open-source Verilog simulator called Verilator, which translates Verilog HDL code into executable C/C++. Because it is open source, it is widely used in industry and academia. The generated code can be executed in parallel using the OpenMP API. The author experimented with two partitioning schemes, namely domain partitioning and functional partitioning. Domain partitioning partitions the input data into D1, D2, ..., Dn, whereas functional partitioning partitions the design into different modules. Fig. 9 shows the domain partitioning strategy and Fig. 10 shows the functional partitioning scheme.

Figure 9: Domain partitioning (Data 1 and a copy of the design on Core 1; Data 2 and a copy of the design on Core 2)

Figure 10: Functional partitioning (Module 1 on Core 1; Module 2 on Core 2)
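Domain partitioning can be illustrated with a short sketch in which the input vectors are split into chunks D1, D2, ..., Dn and every worker runs the complete design on its own chunk. This is a plain-Python analogue, not Verilator-generated code and not an OpenMP program; the toy `design` function (a full adder) and the helper names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import List, Tuple

Vector = Tuple[int, int, int]


def design(vector: Vector) -> Tuple[int, int]:
    """Toy combinational design: a full adder mapping (a, b, cin) to (sum, carry)."""
    a, b, cin = vector
    return (a ^ b ^ cin, (a & b) | (cin & (a ^ b)))


def domain_partition(vectors: List[Vector], n_cores: int) -> List[List[Vector]]:
    """Split the input data into at most n_cores contiguous chunks D1..Dn."""
    size = max(1, -(-len(vectors) // n_cores))  # ceiling division
    return [vectors[i:i + size] for i in range(0, len(vectors), size)]


def run_chunk(chunk: List[Vector]) -> List[Tuple[int, int]]:
    """Each worker simulates the whole design on its own share of the data."""
    return [design(v) for v in chunk]


def simulate_domain_parallel(vectors: List[Vector], n_cores: int) -> List[Tuple[int, int]]:
    """Run the chunks concurrently and concatenate the outputs in input order."""
    chunks = domain_partition(vectors, n_cores)
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        results = list(pool.map(run_chunk, chunks))
    return [out for chunk_out in results for out in chunk_out]


if __name__ == "__main__":
    all_vectors = list(product((0, 1), repeat=3))
    print(simulate_domain_parallel(all_vectors, n_cores=2))
```

Functional partitioning would instead give each worker a different module of the design and pass signal values between them, which reintroduces the inter-partition communication that domain partitioning avoids.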

12. ATPG based True Value Simulation
The author [10] proposed calculating a real value at the output of each gate. The value corresponding to the output distance is calculated as the difference between the bad-machine and good-machine values. The threshold value is set to 0.5. The approach is based on Cheng and Agrawal's value calculation at the output of the gate. The author tested the methodology using 720 benchmark circuits containing 30,000 SA0 and SA1 faults in total and obtained a speedup of 5.3 using GPGPU compared with serial processing. Fig. 11 shows the calculation time.

Figure 11: Calculation of output value

13. Conclusion
We have reviewed many implementations of parallel logic simulation in which the authors used different approaches to speed up the simulation. Speedup is difficult to obtain without reducing the communication overhead, adopting better partitioning techniques and using better simulators. If all of these factors are taken into account, speedup can be achieved.

References
[1] Li T., Li S., Ao F., Li G., Parallel verilog simulation: architecture and circuit partition, Proceedings of the Asia and South Pacific Design Automation Conference (2004).
[2] Zhu L., Chen G., Szymanski B.K., Tropper C., Zhang T., Parallel logic simulation of million-gate VLSI circuits, 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2005).
[3] Le T.T., Rejeb J., Performance of parallel logic event simulation on PC-cluster, 7th International Symposium on Parallel Architectures, Algorithms and Networks (2004).
[4] Meraji S., Tropper C., A machine learning approach for optimizing parallel logic simulation, 39th International Conference on Parallel Processing (ICPP) (2010).

[5] Zhang Y., Wei T., Kai Y., Fan X., Zhang M., Zhao L., Logic simulation acceleration based on GPU, 18th International Conference Mixed Design of Integrated Circuits and Systems (2011).
[6] Kai N., Nishinohara R., Koide H., A SIMD parallelization method for an application for LSI logic simulation, 41st International Conference on Parallel Processing Workshops (ICPPW) (2012).
[7] Kim D., Ciesielski M., Yang S., MULTES: multilevel temporal-parallel event-driven simulation, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32(6) (2013).
[8] Lingfeng W., Hong C., Yangdong Steve D., Robust Conservative Parallel HDL Simulation on Multi-Core CPUs, International conference on HPCS (2013).
[9] Tariq B.A., Maciej C., Parallel Multi-core Verilog HDL Simulation based on Domain, IEEE Computer Society Annual Symposium on VLSI (2014).
[10] Goro S., ATGPS Using Real Value Logic Simulation, 12th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT).

