A Critical-Section-Level Timing Synchronization Approach for Deterministic Multi-Core Instruction-Set Simulations

Fan-Wei Yu 1, Bo-Han Zeng 2, Yu-Hung Huang 3, Hsin-I Wu 4, Che-Rung Lee 5, Ren-Song Tsay 6
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
{fwyu 1, bhzeng 2, yhhuang 3, cherung 5, rstsay 6}@cs.nthu.edu.tw, hiwu@gmail.com 4

Abstract - This paper proposes a Critical-Section-Level timing synchronization approach for deterministic Multi-Core Instruction-Set Simulation (MCISS). By synchronizing at each lock access instead of every shared-variable access and using a simple lock usage status managing scheme, our approach significantly improves simulation performance while executing all critical sections in a deterministic order. Experiments show that our approach performs 295% faster than the shared-variable approach on average and can effectively facilitate system-level software/hardware co-simulation.

Keywords - Deterministic, Multi-core instruction-set simulations, Timing Synchronization

I. INTRODUCTION

Multi-core architectures have become mainstream and have gradually replaced single-core architectures due to the power wall issue. As a result, parallel programming models are required for developing software applications that leverage the computing power of multi-core systems. Nevertheless, parallel applications are very difficult to debug. In general, the key to debugging parallel applications is to capture concurrent behaviors precisely. The Multi-Core Instruction-Set Simulation (MCISS) approach is considered the most promising solution for this issue. Therefore, in recent years there have been many research projects focusing on developing fast and accurate MCISSs for accurate concurrent behavior simulations. The most important goal of any simulator is to reproduce the behaviors of an actual physical target system faithfully, but this goal has long been a great challenge for MCISS.
For a multi-core system, each core can execute a task concurrently, and different task-interleaving orders may result in different execution behaviors and lead to different results. Mainly, different interleaving orders imply potentially different shared-data access sequences. We use the example in Fig. 1 to illustrate the importance of having a deterministic interleaving order. We assume that there are two processes and each runs on a separate ISS. If both processes access the same shared data x, then due to the possible execution speed difference between different ISSs, the access orders of x can be different, such as in the cases shown in Fig. 1(b) and 1(c). For this example, we have a non-deterministic final value of x.

Fig. 1: (a) Two processes, each running on an ISS, access the same shared data x; (b), (c) Different interleaving orders result in different outputs.

Therefore, to reproduce shared-data access orders faithfully, an intuitive solution is to simulate system operation cycle-by-cycle [1], [2], [3]. This type of approach can guarantee accurate simulation results, but its excessive simulation time makes it unfeasible for realistic system design use. Instead of synchronizing at each cycle, Wu et al. [4], [5] develop an efficient shared-variable synchronization approach based on timing annotation techniques and have a deterministic MCISS for accurate concurrent system behavior simulations. With proper synchronization, a deterministic interleaving order can be maintained. However, synchronization also introduces certain overhead, which reduces simulation speed, and different synchronization granularities induce different trade-offs of simulation speed versus accuracy. Although Wu's approach synchronizes only at shared-variable locations and is much more efficient than cycle-based approaches, it still incurs significant overhead as it has to consider all possible shared-variable locations to be safe. Wu et al.
[6] introduce three TLM-abstraction layers, which include Instruction-Level, Data-Level, and Shared-Variable-Level, to cover the possible timing synchronization points between concurrent hardware and software components. In practice, however, we find that it is sufficient to observe only the interleaving order of critical sections instead of all shared variables, as most parallel application developers do use critical sections to protect shared data and avoid race conditions. /DATE13/ (c) 2013 EDAA.
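The nondeterminism illustrated in Fig. 1 can be sketched as follows. This is a minimal illustration, not code from the paper: the two operations (an increment and a doubling) are assumed stand-ins for whatever the two processes actually do to x.

```python
# Two "processes" operate on the same shared data x. The final value of x
# depends entirely on the interleaving order of their accesses.

def run(interleaving):
    """Execute a list of (process_id, operation) steps on shared x."""
    x = 1
    for _, op in interleaving:
        x = op(x)
    return x

p0 = ("P0", lambda x: x + 1)   # process 0: increment x (assumed operation)
p1 = ("P1", lambda x: 2 * x)   # process 1: double x (assumed operation)

# Case (b): P0's access happens first, then P1's.
case_b = run([p0, p1])   # (1 + 1) * 2 = 4
# Case (c): P1's access happens first, then P0's.
case_c = run([p1, p0])   # (1 * 2) + 1 = 3

assert case_b != case_c  # same program, different interleavings, different x
```

Without a deterministic interleaving order, either final value is possible on successive simulation runs, which is exactly what makes such bugs hard to reproduce.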

Fig. 2: The correlation between simulation speed and degree of abstraction.

We therefore extend the abstraction layer to a higher modeling abstraction named Critical-Section-Level, as shown in Fig. 2. We hence develop a much more efficient deterministic Critical-Section-Level MCISS approach by leveraging timing annotation techniques and focusing on synchronizing the execution of critical sections as well as the access order of the protected shared data inside critical sections. This approach is good for most practical uses since the number of critical sections is much less than the number of shared variables considered in Wu's approach. In case shared variables outside of critical sections are suspected of inducing bugs, designers can switch to the shared-variable approach for detailed debugging.

Nevertheless, it is a challenge to handle the simulation of atomic instructions properly to reproduce the same atomicity as in an actual physical target system faithfully. An idealistic approach is to translate each target atomic instruction into an equivalent host atomic instruction [7]. The atomicity is then naturally reproduced and protected by the host machine. However, a direct translation may not always be possible as different architectures support different types of atomic instructions. To resolve this issue, we developed a generic and effective solution by recording and checking the usage status of each lock memory address. Additionally, we apply timing annotation [4], [8] techniques to guarantee that all mutually exclusive critical sections are executed in a deterministic order. We have implemented our proposed approach and tested it on a few standard benchmark cases. In general, our proposed simulator performs 295% faster on average than the shared-variable approach. In sum, our approach can greatly help designers with system verification because it is able to reproduce the same deterministic order and atomicity of the executed programs.

The remainder of this paper is organized as follows. Section 2 discusses related work.
The proposed deterministic MCISS technique is elaborated in Section 3. We show the correctness of synchronization at the Critical-Section-Level in Section 4. We then report experimental results in Section 5 and give a brief conclusion in Section 6.

II. RELATED WORK

For multi-core simulations, each core can concurrently execute a task, and different task-interleaving orders may result in different execution results. The issue of determinism is widely discussed as it deeply complicates the debugging of parallel programs and limits the ability to test multithreaded code. Therefore, correctly yet efficiently reproducing the concurrent behaviors of multiple simulated cores has emerged as one of the most critical challenges.

QEMU [9], a widely adopted emulation approach, utilizes a dynamic binary translation technique but synchronizes only at basic block boundaries and performs at up to a hundred MIPS (Million Instructions per Second). QEMU simulates a target multi-core design in a round-robin sequence: each target core is simulated for a time slice of a few basic blocks. QEMU has become mature after years of active development by the open source community, and a wealth of support for different architectures and device models makes it a popular simulation platform. However, the current QEMU cannot guarantee determinism and cannot take advantage of the parallelism available in underlying multi-core host architectures. To further improve the performance of QEMU, COREMU [10] attempts parallelization by leveraging the available multiple host cores. COREMU emulates multiple cores by creating multiple instances of the existing sequential emulator, QEMU, and uses a thin library layer to handle inter-core and device communication. COREMU shows negligible uniprocessor performance overhead and scales much better compared to QEMU. Both QEMU and COREMU honor atomicity and are good for many applications, but they consider no timing information and hence cannot faithfully simulate concurrent system behaviors or guarantee deterministic results.
In other words, even with the same inputs, the results can be different if executed at different times or in different environments. To reproduce consistent task-interleaving orders faithfully in a deterministic simulation, an intuitive solution is adopting CA (cycle-accurate) based simulators [1], [2] that simulate system behaviors cycle-by-cycle. However, the performance of CA-based simulators is inadequate due to excessive synchronization overhead. Instead of synchronizing at each cycle, Gligor et al. [11] perform synchronizations both at basic block boundaries and at predefined time intervals. While this can achieve about 17x speedup compared to the cycle-accurate approach, the simulation performance is only about 3 MIPS and the results contain 10% errors. Similarly, Graphite [12], a distributed parallel multicore functional simulator, also applies a periodical synchronization approach: each ISS randomly chooses another ISS for clock comparison, and the front-running ISS is put to sleep and woken up later at a dynamically determined period of time. Although efficient, periodical approaches cannot guarantee accurate executions.

To guarantee deterministic results for parallel applications, Kendo [13] is a software-based multithreading approach that enforces a deterministic interleaving of lock acquisitions based on a deterministic scheduling algorithm. Nonetheless, the specific scheduling order introduces considerable performance overhead. In contrast, the DMP approach [14] leverages Transactional Memory technology and implements fully deterministic inter-thread communications. It encapsulates each data access in a transaction and atomically executes the transactions with a deterministic commit order. Although this approach has negligible performance overhead, it has to rely on specific transactional-memory-equipped hardware that is not generally available. As discussed in the introduction section, the shared-variable approach developed by Wu et al. [4], [5] is efficient and guarantees deterministic and accurate concurrent system behavior simulations. However, the overhead of the shared-variable approach is still significant.
In practice, software developers most likely use critical sections to protect critical shared accesses. Since the number of critical sections is much less than the number of shared variables, critical-section approaches are more efficient. However, all existing approaches [10], [15] honor only the atomicity constraints but not the execution order of critical sections and hence cannot guarantee deterministic simulation results.

Some approaches attempt to ensure equivalent atomicity between the target and host machines. For instance, PTLsim [15] is a cycle-accurate full-system x86-64 micro-architectural simulator that allows each atomic operation to acquire a specific lock and manages physical lock addresses using a lock controller shared among all simulation threads. Although this approach avoids using a global lock over all memory regions, so that parallel accesses to different memory regions are allowed, the extra simulation overhead for lock-status monitoring is simply impractical. Instead, COREMU [10] adopts the compare-and-swap instruction from specific host machine architectures such as x86 to implement equivalent atomic instructions. An issue is that an equivalent atomic instruction such as compare-and-swap may not be available on ARM or other architectures. Therefore, the approach is limited and is not portable. To resolve the issues discussed above, we propose an efficient critical-section-level deterministic MCISS approach which is not limited by the underlying host architectures. The details are elaborated in the following section.

III. CRITICAL-SECTION-LEVEL TIMING SYNCHRONIZATION

With all shared variables protected in critical sections, a consistent lock-unlock order can guarantee deterministic multi-core instruction-set simulations. We extend Wu's timing synchronization framework [4] and develop a fast and accurate deterministic critical-section-level MCISS. As discussed in the introduction section, most parallel program developers do use critical sections, which are composed of lock-unlock operations, to protect accesses of shared variables. With this fact, synchronizations of shared variables inside critical sections can be omitted as they follow the order of the lock operations. We use the case in Fig. 3 to illustrate how synchronization of critical sections works. Here we have a two-core target system execution with two ISSs entering a critical section each in the indicated chronological order. Using Wu's Shared-Variable-Level approach, it synchronizes, as shown in Fig.
3(a), at every shared-variable access point, which can be either a lock access point marked as a red dot or a regular shared-variable access point marked as a green dot. In contrast, for our proposed Critical-Section-Level approach it is sufficient to synchronize only at lock accesses, as shown in Fig. 3(b), while achieving the same simulation results. In the following, we elaborate the details of the issues and solutions associated with the proposed Critical-Section-Level approach. To simplify the discussion, the sync point before each lock or unlock operation is referred to as a lock sync point. We begin by discussing how to identify lock sync points.

A. Identifying Lock Sync Points

To identify a lock operation for a lock sync point, we usually do so by recognizing certain architecture-specific lock instructions. To identify an unlock operation, we usually check if a move (or memory store) instruction occurred after an identified lock instruction. Note that the unlocking move instruction always writes a certain value (1 in x86) to the lock address to release the lock, so in practice we only need to check if the data address is a target lock address. If the addresses match, then this will be an unlock operation. Sync points are inserted before each lock and unlock operation. To manage the lock status precisely, we simply use a lock address table to record each unique lock address and its associated lock status. The lock status is set when the lock is acquired and is cleared when the lock is released. Depending on the host architectures provided, we detect lock operations by identifying atomic read-and-update instructions such as test-and-set [16], compare-and-swap [16], or a pair like load-linked/store-conditional [17]. These instructions are used to implement lock operations for parallel programming purposes and enforce atomicity across threads. We use the x86 assembly code of spin lock and spin unlock in Fig. 4 to demonstrate how to identify lock sync points through architecture-specific lock and move instructions. Spin lock is a locking technique used in the Linux multiprocessor kernel [18] environment for lock acquisition. If a processor finds that the target lock is open, it can acquire the lock for subsequent executions.
Conversely, if the processor finds that the target lock is closed (or locked) by another processor, it spins around by repeatedly executing an instruction loop until the lock is released (or opened). The spinning process represents a busy wait and consumes actual CPU time. As illustrated in Fig. 4(a), the x86 instruction decb decreases the spin lock value. A test is then performed on the sign flag. If it was cleared, i.e., the spin lock was set to 1 (unlocked), then the execution continues to the line labeled L3. Otherwise, a busy-waiting loop at line L2 is executed until the spin lock assumes a positive value. Then it renews execution from line L1 and checks if it is

Fig. 3: A two-core target system simulation example shows that ISS A locks in a critical section first, and then ISS B tries to acquire the lock to enter a critical section. (a) In the Shared-Variable-Level approach, all lock access points (red dots) and regular shared-variable access points (green dots) are considered sync points; (b) In the preliminary Critical-Section-Level approach, only lock-variable access points (red dots) are considered.

Fig. 4: (a) The x86 assembly code of spin lock and (b) spin unlock.

indeed safe to proceed without the lock having been acquired by another processor. Conversely, the corresponding unlock operation cannot be easily identified, as it is simply just a move instruction, which sets the value 1 into the lock memory address to release the acquired lock, as illustrated in Fig. 4(b). Nevertheless, since an unlock operation has to follow and match with a lock operation to form a critical section, we only need to examine the move instructions executed after a lock instruction is identified. As the number of move instructions inside a critical section is usually small, the extra checking overhead induced by our sync point identification approach is negligible in practice. When an ISS encounters a move instruction inside a critical section, it looks up the lock address table and checks if the data address of the move instruction is in the table. If so, then it is an unlock sync point. The ISS then performs the unlock operation and leaves the critical section to continue its non-critical-section execution until the next sync point. In this way, our approach can systematically enforce a deterministic execution order of lock-unlock operations by maintaining the lock acquisition status. Also, the integrity of atomicity is guaranteed by maintaining the lock acquisition status and the annotated timing of each instruction. We therefore have a generic solution that is applicable to different host architectures and can guarantee true atomicity exactly as in the physical target system.

B. Spin-waiting Optimization

To further improve performance, we find that we can eliminate unnecessary synchronization overhead by optimizing the spin-waiting lock acquisition process in simulation. The two-core target system simulation example in Fig. 5(a) demonstrates a preliminary Critical-Section-Level simulation result without spin-waiting optimization. Suppose that ISS A acquires a lock first and then ISS B tries to acquire the lock later and is in spin-waiting mode before the lock is released by ISS A. With each lock instruction considered as a sync point, the process of spin-waiting introduces excessive sync points, as shown in Fig. 5(a).
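The lock address table and sync-point identification scheme of Section III.A can be sketched as follows. The class and method names are assumptions for illustration; the paper's actual simulator code is not shown:

```python
# A simplified sketch of the lock address table: the lock status is set when
# an atomic read-and-update instruction (e.g. test-and-set) acquires a lock,
# and cleared when a later move/store instruction writes to a recorded lock
# address, which identifies the unlock sync point.

class LockAddressTable:
    def __init__(self):
        self.locked = {}            # lock address -> holder ISS id (or None)

    def on_atomic_rmw(self, addr, iss_id):
        """Atomic read-and-update instruction: always a lock sync point."""
        if self.locked.get(addr) is None:
            self.locked[addr] = iss_id   # lock free: acquire, set lock status
            return "acquired"
        return "busy"                    # held by another ISS: spin-wait

    def on_move(self, addr, iss_id):
        """Move/store after an identified lock instruction: an unlock sync
        point only if the data address matches a recorded lock address."""
        if self.locked.get(addr) == iss_id:
            self.locked[addr] = None     # release: clear lock status
            return "unlock"
        return "ordinary-store"          # not a lock address: no sync point

table = LockAddressTable()
assert table.on_atomic_rmw(0x1000, "ISS-A") == "acquired"
assert table.on_atomic_rmw(0x1000, "ISS-B") == "busy"
assert table.on_move(0x2000, "ISS-A") == "ordinary-store"  # regular store
assert table.on_move(0x1000, "ISS-A") == "unlock"          # address matches
```

Because only move instructions executed between a recognized lock instruction and its matching unlock need the table lookup, the per-instruction overhead of this check stays small, as the section argues.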
In fact, when ISS B is attempting lock acquisition, we can easily identify whether the lock has been acquired by others by checking the status of the corresponding entry in the lock address table. If the corresponding status indicates that the lock is held by another ISS, then we simply put ISS B into a spin-jumping mode without doing actual synchronization until the lock is released, as shown in Fig. 5(b). The name spin-jumping indicates that it does not require synchronization at each lock acquisition attempt but only at the first attempt. Therefore, the synchronization overhead is reduced. Most importantly, the execution order of critical sections is still correctly maintained with spin-jumping. The accurate lock-activating time for ISS B can be easily calculated based on its initial lock acquisition time and the time span of each spinning period. In the next section, we elaborate the details of the timing calculation for spin-jumping activation.

C. Lock Activation

An interesting and non-trivial issue is that in the case of multiple lock-competing ISSs, the winner is determined by its lock-activating time, not by its initial lock acquisition time. To ensure correct timing behavior, we have to compute the exact lock-activating time of each lock-waiting ISS through the following spin-jumping time calculation mechanism. The spin-jumping activating time can be calculated as

    activating_time = first_attempt_time + ceil((release_time - first_attempt_time) / spin_period) * spin_period

where spin_period is the time period of a repeating lock acquisition attempt, or (t1 - t0) as in Fig. 6; release_time is the time when the lock-holding ISS releases the lock, or t2 in Fig. 6; and first_attempt_time is the time when the lock-competing ISS first attempts its acquisition of the lock, or t0 in Fig. 6. Since the lock-releasing time may not align with the spinning period, we take the ceiling of (release_time - first_attempt_time) / spin_period to compute the precise number of spin-jumping periods, and add them to the initial acquisition time t0 to derive the actual lock-activating time t3. If there is more than one competitor, we calculate each possible lock-activating time according to the above method and consider the winner to be the one whose lock-activating time is closest to the release time. In Fig. 6, ISS C is the winner since its lock-activating time, t3, is closest to the release time, t2, compared to the other lock-activating time of ISS B, t4. In case of a tie, we simply follow the priority rule of the target system to determine the winner. Afterward, the new lock-activating time is re-computed based on the timing of the new winner.

D.
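The spin-jumping calculation of Section III.C can be written out as a small function. The variable names are assumed, and the numbers in the example are illustrative rather than those of Fig. 6:

```python
import math

# Spin-jumping activating time: the waiting ISS's first attempt time plus a
# whole number of spin periods, rounded up to reach or pass the release time.

def lock_activating_time(first_attempt, spin_period, release):
    periods = math.ceil((release - first_attempt) / spin_period)
    return first_attempt + periods * spin_period

# Example (assumed numbers): ISS B first tries at t0 = 10 with spin period 4;
# the holder releases at t2 = 17. The first attempt at or after the release
# is t = 10 + ceil(7/4) * 4 = 18.
assert lock_activating_time(10, 4, 17) == 18

# With several competitors, the winner is the one whose activating time is
# closest to the release time.
def pick_winner(competitors, release):
    return min(competitors,
               key=lambda c: lock_activating_time(c[1], c[2], release))

# Each competitor is (name, first_attempt, spin_period).
iss_b = ("ISS-B", 10, 4)   # activates at 18
iss_c = ("ISS-C", 11, 3)   # activates at 11 + ceil(6/3) * 3 = 17
assert pick_winner([iss_b, iss_c], release=17)[0] == "ISS-C"
```

The ceiling is what keeps the simulated timing faithful: a spin-waiting core in the real system only observes the released lock at its next polling attempt, not at the exact release instant.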
The Critical-Section-Level Simulations

We now use the flow charts in Fig. 7 to summarize our proposed Critical-Section-Level approach. For the executions on each ISS, once a lock operation is identified, a lock sync function is invoked. As shown in Fig. 7(b), the lock sync function first confirms that the local time of the invoking ISS is the minimum one among all

Fig. 5: A Critical-Section-Level two-core target system simulation example illustrating the spin-waiting optimization. (a) A preliminary approach synchronizing at every lock operation, including spin-waiting instructions. (b) Spin-jumping for reduced synchronization overhead.

Fig. 6: An example illustrating the timing calculation of spin-jumping.

Fig. 7: The flow charts of (a) the proposed Critical-Section-Level Synchronization approach, (b) the lock sync function, and (c) the unlock sync function.

sync points (including lock sync and unlock sync points). After being confirmed to be of the minimum local time, the lock sync function then checks whether the target lock is free. If it is free, then the ISS acquires the lock and enters critical-section execution. If the lock is not available, then the ISS waits in a lock-specific queue until the lock is released. Note that in this case it is important to add the spin-jumping time to the waiting ISS for accurate timing. As depicted in Fig. 7(c), if an unlock operation is identified, the unlock sync function compares the local time of the invoking ISS with the other sync points to confirm that it has the minimum local time before releasing the lock. It then computes the lock-activating times of all lock-waiting ISSs and selects the one whose lock-activating time is closest to the release time as the winning ISS; the other waiting ISSs simply stay in the queue for the next lock release point.

Unlock operations are usually done by a move instruction, and each unlock operation has to match and occur after a lock instruction. Based on this fact, we simply check the data address of every move instruction after an identified lock instruction, i.e., in a critical section, until we recognize that the data address is a lock address. Only then does the unlock operation release the lock. For cases with nested locks or recursive locks, i.e., having lock sync points inside a critical section, we have to follow the same lock acquisition flow to determine when these nested locks can be acquired or released. When encountering an unlock operation, we have to check whether we are still in a nested critical section before switching to a non-critical-section mode of execution. If so, we free only the current critical section and continue monitoring the higher-level critical sections until all critical sections are freed. We repeat the above process until the user application terminates.

IV. CORRECTNESS PROOF

In this section we show the correctness of our proposed Critical-Section-Level approach. The Machine Correctness Theorem in [19] has proved the correctness of the shared-memory synchronization approach, which guarantees the same results as those from the original program.
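The lock and unlock sync functions of Fig. 7 can be sketched roughly as follows, with assumed data structures (a holder map and a per-lock waiting queue). This is an interpretation of the flow charts, not the authors' code:

```python
import math

# Sketch of the Fig. 7 flow: an ISS acts only when its local time is the
# minimum among pending sync points; blocked ISSs wait in a per-lock queue,
# and on unlock the waiter whose spin-jumping activating time lands closest
# to the release time wins the lock.

def activating_time(first_attempt, period, release):
    return first_attempt + math.ceil((release - first_attempt) / period) * period

class CriticalSectionSync:
    def __init__(self):
        self.holder = {}    # lock address -> holding ISS id
        self.queue = {}     # lock address -> [(iss, first_attempt, period)]

    def lock_sync(self, lock, iss, local_time, period, other_times):
        if other_times and local_time > min(other_times):
            return "defer"                     # not the minimum local time yet
        if self.holder.get(lock) is None:
            self.holder[lock] = iss            # lock free: acquire, enter CS
            return "enter"
        # Lock held by another ISS: spin-jump instead of syncing per attempt.
        self.queue.setdefault(lock, []).append((iss, local_time, period))
        return "spin-jump"

    def unlock_sync(self, lock, iss, release_time):
        assert self.holder.get(lock) == iss
        self.holder[lock] = None               # release the lock
        waiters = self.queue.get(lock, [])
        if not waiters:
            return None
        # Winner: activating time closest to (at or after) the release time.
        winner = min(waiters,
                     key=lambda w: activating_time(w[1], w[2], release_time))
        waiters.remove(winner)
        self.holder[lock] = winner[0]
        return winner[0], activating_time(winner[1], winner[2], release_time)

sync = CriticalSectionSync()
assert sync.lock_sync(0x40, "A", local_time=5, period=2, other_times=[7, 9]) == "enter"
assert sync.lock_sync(0x40, "B", local_time=7, period=4, other_times=[9]) == "spin-jump"
assert sync.unlock_sync(0x40, "A", release_time=17) == ("B", 19)
```

Because the winner is selected by activating time rather than arrival order in the simulator, every run hands the lock to the same ISS at the same simulated time, which is what makes the simulation deterministic.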
In our case, when all shared variables are protected within critical sections, we can prove that our approach of synchronizing at lock-unlock operations does guarantee the same simulation results as those of the shared-variable approach, and hence as those of the original program. A shared-variable execution, E, and its critical-section-level execution, E', are equivalent if and only if the execution order of all critical sections in every program of E is the same as in the corresponding program of E', and moreover, all shared variables are protected inside critical sections.

The following notations are used for clarity. Let s stand for a shared variable, l stand for a lock operation, and u stand for an unlock operation. T and T' are the access times in the shared-variable execution, E, and the critical-section-level execution, E', respectively. Assume the shared variables s1, s2, ..., sn are protected inside a critical section. The access order of the critical section in the shared-variable execution E is:

    T(l) < T(s1) < T(s2) < ... < T(sn) < T(u).

The access order of the critical section obtained by synchronizing at the lock-unlock operations in the critical-section-level execution E' is:

    T'(l) < T'(u).

Since the critical section is always executed atomically, the accesses of the shared variables inside the critical section fall after the access of the lock operation and before the access of the unlock operation:

    T'(l) < T'(s1) < T'(s2) < ... < T'(sn) < T'(u).

Therefore, the access order of the shared variables inside the critical section in the critical-section-level execution is equivalent to that in the shared-variable execution. According to the Machine Correctness Theorem, the critical-section-level execution is equivalent to the shared-variable execution, which is proved to be the same as the original program. Consequently, by maintaining the execution order of critical sections among the simulated programs, the correctness of the multi-core simulation is guaranteed.

V. EXPERIMENTAL RESULTS

We have implemented and tested our proposed deterministic Critical-Section-Level MCISS simulation approach with excellent results. Our experimental setup uses a host machine equipped with an Intel Xeon quad-core processor and 4GB RAM.
The target architecture for simulation is the Andes 16/32-bit mixed-length RISC ISA [20]. We tested our implementation with 4 target cores using the multi-core benchmark test cases FMM, FFT, Ocean, LU, and Barnes from SPLASH-2 [21] and a deterministic execution test case from RACEY [22]. The simulation performance comparisons are summarized in Fig. 8. In general, we can achieve a 295% speedup on average against the shared-variable approach. The observed top simulation

speed is improved to 596 MIPS. The average sync number ratio of lock sync points over shared-variable sync points for SPLASH-2 is 14.5%, which shows that our proposed approach has much less sync overhead compared to the shared-variable approach. The accuracy of our approach is verified by comparing the computational results and the trace of the critical section execution order of our critical-section-level approach to those of the shared-variable approach; we confirm that our approach is 100% accurate. Additionally, the test results, particularly those of RACEY, show that our proposed Critical-Section-Level approach is deterministic. In contrast, when running both the SPLASH-2 and RACEY benchmarks with no synchronization, the results are non-deterministic and the variance of the total simulated instruction count is about 50%.

We also confirm that the total number of sync points determines the sync overhead. From the test results of RACEY, we observe that the total sync overhead is proportional to the number of sync points. As shown in Table 1, the sync number ratio is 43%, which is almost equal to the sync overhead ratio of 42%. The sync number ratio of RACEY is higher because it is a lock-intensive program for determinism checking. Note that the sync number ratio is the number of lock sync points over all shared-variable sync points, and the sync overhead ratio is the sync overhead of our approach over that of the shared-variable approach.

Fig. 8: A comparison of the simulation speeds (MIPS) of the shared-variable approach and the critical-section-level approach on fft, lu, radix, fmm, ocean, and barnes.

VI. CONCLUSIONS

In this paper, we have proposed a new Critical-Section-Level timing synchronization approach for deterministic MCISS. Our proposed approach synchronizes only at lock-variable access points instead of every shared-variable access point. The experimental results show that our Critical-Section-Level approach is 100% accurate compared with the shared-variable approach while performing 295% faster.
Our future work is to include other sync points from other possible source-level synchronization operations such as barrier, signal, and wait. Additionally, it is also important to consider the scalability of the simulation approach.

TABLE 1: Comparison of the sync number and sync overhead of the Critical-Section-Level and Shared-Variable-Level approaches (sync number ratio: 43%; sync overhead ratio: 42%).

VII. REFERENCES

[1] D. Burger and T. M. Austin, "The SimpleScalar tool set, version 2.0," SIGARCH Comput. Archit. News.
[2] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer.
[3] J. Schnerr, O. Bringmann, and W. Rosenstiel, "Cycle accurate binary translation for simulation acceleration in rapid prototyping of SoCs," in DATE '05.
[4] M.-H. Wu, C.-Y. Fu, P.-C. Wang, and R.-S. Tsay, "An effective synchronization approach for fast and accurate multi-core instruction-set simulation," in EMSOFT '09.
[5] M.-H. Wu, P.-C. Wang, C.-Y. Fu, and R.-S. Tsay, "A high-parallelism distributed scheduling mechanism for multi-core instruction-set simulation," in DAC '11.
[6] M.-H. Wu, W.-C. Lee, C.-Y. Chuang, and R.-S. Tsay, "Automatic generation of software TLM in multiple abstraction layers for efficient HW/SW co-simulation," in DATE '10.
[7] R. Lantz, "Parallel SimOS: Performance and Scalability for Large System Simulation," PhD thesis, Stanford University.
[8] K.-L. Lin, C.-K. Lo, and R.-S. Tsay, "Source-level timing annotation for fast and accurate TLM computation model generation," in ASPDAC '10.
[9] F. Bellard, "QEMU, a fast and portable dynamic translator," in Proceedings of the USENIX Annual Technical Conference.
[10] Z. Wang, R. Liu, Y. Chen, X. Wu, H. Chen, W. Zhang, and B. Zang, "COREMU: a scalable and portable parallel full-system emulator," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, 2011.
[11] M. Gligor, N. Fournel, and F. Pétrot, "Using binary translation in event driven simulation for fast and flexible MPSoC simulation," in Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis.
[12] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, "Graphite: A distributed parallel simulator for multicores," in HPCA, 2010.
[13] M. Olszewski, J. Ansel, and S. Amarasinghe, "Kendo: efficient deterministic multithreading in software," in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems.
[14] J. Devietti, B. Lucia, L. Ceze, and M. Oskin, "DMP: deterministic shared memory multiprocessing," in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems.
[15] M. T. Yourst, "PTLsim: A cycle accurate full system x86-64 microarchitectural simulator," in ISPASS.
[16] Intel Corp., Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M.
[17] E. Jensen, G. Hagensen, and J. Broughton, "A New Approach to Exclusive Data Access in Shared Memory Multiprocessors," Lawrence Livermore National Laboratory, Technical Report, 1987.
[18] D. Bovet and M. Cesati, Understanding the Linux Kernel, 2nd ed. O'Reilly & Associates, Inc.
[19] M.-H. Wu, C.-Y. Fu, P.-C. Wang, and R.-S. Tsay, "A Distributed Timing Synchronization Technique for Parallel Multi-Core Instruction-Set Simulation," accepted by ACM Trans. Embed. Comput. Syst.
[20] Andes Technology Corp., AndeStar Instruction Set Architecture Manual / Andes Programming Guide.
[21] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," SIGARCH Comput. Archit. News.
[22] M. Xu, R. Bodik, and M. D. Hill, "A 'flight data recorder' for enabling full-system multiprocessor deterministic replay," in Computer Architecture (ISCA), 2003.


COREMU: a Portable and Scalable Parallel Full-system Emulator

COREMU: a Portable and Scalable Parallel Full-system Emulator COREMU: a Portable and Scalable Parallel Full-system Emulator Haibo Chen Parallel Processing Institute Fudan University http://ppi.fudan.edu.cn/haibo_chen Full-System Emulator Useful tool for multicore

More information

Do you have to reproduce the bug on the first replay attempt?

Do you have to reproduce the bug on the first replay attempt? Do you have to reproduce the bug on the first replay attempt? PRES: Probabilistic Replay with Execution Sketching on Multiprocessors Soyeon Park, Yuanyuan Zhou University of California, San Diego Weiwei

More information

Profiling High Level Abstraction Simulators of Multiprocessor Systems

Profiling High Level Abstraction Simulators of Multiprocessor Systems Profiling High Level Abstraction Simulators of Multiprocessor Systems Liana Duenha, Rodolfo Azevedo Computer Systems Laboratory (LSC) Institute of Computing, University of Campinas - UNICAMP Campinas -

More information

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Cristiano Pereira, Harish Patil, Brad Calder $ Computer Science and Engineering, University of California, San Diego

More information

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory:

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory: 1 1 1 1 1,a) 1 HTM 2 2 LogTM 72.2% 28.4% 1. Transactional Memory: TM [1] TM Hardware Transactional Memory: 1 Nagoya Institute of Technology, Nagoya, Aichi, 466-8555, Japan a) tsumura@nitech.ac.jp HTM HTM

More information

Explicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic!

Explicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic! Explicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin University of Washington Parallel Programming is Hard

More information

Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs

Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs , pp.45-49 http://dx.doi.org/10.14257/astl.2014.76.12 Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs Hyun-Ji Kim 1, Byoung-Kwi Lee 2, Ok-Kyoon Ha 3, and Yong-Kee Jun 1 1 Department

More information

MELD: Merging Execution- and Language-level Determinism. Joe Devietti, Dan Grossman, Luis Ceze University of Washington

MELD: Merging Execution- and Language-level Determinism. Joe Devietti, Dan Grossman, Luis Ceze University of Washington MELD: Merging Execution- and Language-level Determinism Joe Devietti, Dan Grossman, Luis Ceze University of Washington CoreDet results overhead normalized to nondet 800% 600% 400% 200% 0% 2 4 8 2 4 8 2

More information

Implementing Atomic Section by Using Hybrid Concurrent Control

Implementing Atomic Section by Using Hybrid Concurrent Control 2007 IFIP International Conference on Network and Parallel Computing - Workshops Implementing Atomic Section by Using Hybrid Concurrent Control Lei Zhao, Yu Zhang Department of Computer Science & Technology

More information

Deterministic Shared Memory Multiprocessing

Deterministic Shared Memory Multiprocessing Deterministic Shared Memory Multiprocessing Luis Ceze, University of Washington joint work with Owen Anderson, Tom Bergan, Joe Devietti, Brandon Lucia, Karin Strauss, Dan Grossman, Mark Oskin. Safe MultiProcessing

More information

Deterministic Scaling

Deterministic Scaling Deterministic Scaling Gabriel Southern Madan Das Jose Renau Dept. of Computer Engineering, University of California Santa Cruz {gsouther, madandas, renau}@soe.ucsc.edu http://masc.soe.ucsc.edu ABSTRACT

More information

Integrating Real-Time Synchronization Schemes into Preemption Threshold Scheduling

Integrating Real-Time Synchronization Schemes into Preemption Threshold Scheduling Integrating Real-Time Synchronization Schemes into Preemption Threshold Scheduling Saehwa Kim, Seongsoo Hong and Tae-Hyung Kim {ksaehwa, sshong}@redwood.snu.ac.kr, tkim@cse.hanyang.ac.kr Abstract Preemption

More information

Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic!

Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic! Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin Department of Computer Science and Engineering University

More information

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5 Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5 Multiple Processes OS design is concerned with the management of processes and threads: Multiprogramming Multiprocessing Distributed processing

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Frequent Value Compression in Packet-based NoC Architectures

Frequent Value Compression in Packet-based NoC Architectures Frequent Value Compression in Packet-based NoC Architectures Ping Zhou, BoZhao, YuDu, YiXu, Youtao Zhang, Jun Yang, Li Zhao ECE Department CS Department University of Pittsburgh University of Pittsburgh

More information

XXXX Leveraging Transactional Execution for Memory Consistency Model Emulation

XXXX Leveraging Transactional Execution for Memory Consistency Model Emulation XXXX Leveraging Transactional Execution for Memory Consistency Model Emulation Ragavendra Natarajan, University of Minnesota Antonia Zhai, University of Minnesota System emulation is widely used in today

More information

Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors

Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors Dan Wallin and Erik Hagersten Uppsala University Department of Information Technology P.O. Box 337, SE-751 05 Uppsala, Sweden

More information

AN EFFICIENT DEADLOCK DETECTION AND RESOLUTION ALGORITHM FOR GENERALIZED DEADLOCKS. Wei Lu, Chengkai Yu, Weiwei Xing, Xiaoping Che and Yong Yang

AN EFFICIENT DEADLOCK DETECTION AND RESOLUTION ALGORITHM FOR GENERALIZED DEADLOCKS. Wei Lu, Chengkai Yu, Weiwei Xing, Xiaoping Che and Yong Yang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 2, April 2017 pp. 703 710 AN EFFICIENT DEADLOCK DETECTION AND RESOLUTION

More information

Chimera: Hybrid Program Analysis for Determinism

Chimera: Hybrid Program Analysis for Determinism Chimera: Hybrid Program Analysis for Determinism Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan, Ann Arbor - 1 - * Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

More information

Deterministic Process Groups in

Deterministic Process Groups in Deterministic Process Groups in Tom Bergan Nicholas Hunt, Luis Ceze, Steven D. Gribble University of Washington A Nondeterministic Program global x=0 Thread 1 Thread 2 t := x x := t + 1 t := x x := t +

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

A Fast Timing-Accurate MPSoC HW/SW Co-Simulation Platform based on a Novel Synchronization Scheme

A Fast Timing-Accurate MPSoC HW/SW Co-Simulation Platform based on a Novel Synchronization Scheme A Fast Timing-Accurate MPSoC HW/SW Co-Simulation Platform based on a Novel Synchronization Scheme Mingyan Yu, Junjie Song, Fangfa Fu, Siyue Sun, and Bo Liu Abstract Fast and accurate full-system simulation

More information

Full-System Timing-First Simulation

Full-System Timing-First Simulation Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin Madison The Problem Design of future computer systems uses simulation

More information

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Abdullah Muzahid, Shanxiang Qi, and University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu MICRO December

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University CS 571 Operating Systems Midterm Review Angelos Stavrou, George Mason University Class Midterm: Grading 2 Grading Midterm: 25% Theory Part 60% (1h 30m) Programming Part 40% (1h) Theory Part (Closed Books):

More information

PQEMU: A Parallel System Emulator Based on QEMU

PQEMU: A Parallel System Emulator Based on QEMU 2011 IEEE 17th International Conference on Parallel and Distributed Systems PQEMU: A Parallel System Emulator Based on QEMU Jiun-Hung Ding MediaTek-NTHU Joint Lab Department of Computer Science National

More information

The University of Texas at Arlington

The University of Texas at Arlington The University of Texas at Arlington Lecture 10: Threading and Parallel Programming Constraints CSE 5343/4342 Embedded d Systems II Objectives: Lab 3: Windows Threads (win32 threading API) Convert serial

More information

Operating System Performance and Large Servers 1

Operating System Performance and Large Servers 1 Operating System Performance and Large Servers 1 Hyuck Yoo and Keng-Tai Ko Sun Microsystems, Inc. Mountain View, CA 94043 Abstract Servers are an essential part of today's computing environments. High

More information

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs Authors: Jos e L. Abell an, Juan Fern andez and Manuel E. Acacio Presenter: Guoliang Liu Outline Introduction Motivation Background

More information

Modeling and Simulating Discrete Event Systems in Metropolis

Modeling and Simulating Discrete Event Systems in Metropolis Modeling and Simulating Discrete Event Systems in Metropolis Guang Yang EECS 290N Report December 15, 2004 University of California at Berkeley Berkeley, CA, 94720, USA guyang@eecs.berkeley.edu Abstract

More information

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University 740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess

More information

The University of Texas at Arlington

The University of Texas at Arlington The University of Texas at Arlington Lecture 6: Threading and Parallel Programming Constraints CSE 5343/4342 Embedded Systems II Based heavily on slides by Dr. Roger Walker More Task Decomposition: Dependence

More information

Adaptive Prefetching Technique for Shared Virtual Memory

Adaptive Prefetching Technique for Shared Virtual Memory Adaptive Prefetching Technique for Shared Virtual Memory Sang-Kwon Lee Hee-Chul Yun Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology 373-1

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

Major Project, CSD411

Major Project, CSD411 Major Project, CSD411 Deterministic Execution In Multithreaded Applications Supervisor Prof. Sorav Bansal Ashwin Kumar 2010CS10211 Himanshu Gupta 2010CS10220 1 Deterministic Execution In Multithreaded

More information

Efficient and highly portable deterministic multithreading (DetLock)

Efficient and highly portable deterministic multithreading (DetLock) Computing (2014) 96:1131 1147 DOI 10.1007/s00607-013-0370-9 Efficient and highly portable deterministic multithreading (DetLock) Hamid Mushtaq Zaid Al-Ars Koen Bertels Received: 22 March 2013 / Accepted:

More information

CATS: Cycle Accurate Transaction-driven Simulation with Multiple Processor Simulators

CATS: Cycle Accurate Transaction-driven Simulation with Multiple Processor Simulators CATS: Cycle Accurate Transaction-driven Simulation with Multiple Processor Simulators Dohyung Kim Soonhoi Ha Rajesh Gupta Department of Computer Science and Engineering School of Computer Science and Engineering

More information

DMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING

DMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING ... DMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING... SHARED-MEMORY MULTICORE AND MULTIPROCESSOR SYSTEMS ARE NONDETERMINISTIC, WHICH FRUSTRATES DEBUGGING AND COMPLICATES TESTING OF MULTITHREADED CODE,

More information

Implementation of Process Networks in Java

Implementation of Process Networks in Java Implementation of Process Networks in Java Richard S, Stevens 1, Marlene Wan, Peggy Laramie, Thomas M. Parks, Edward A. Lee DRAFT: 10 July 1997 Abstract A process network, as described by G. Kahn, is a

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Priority Inheritance Spin Locks for Multiprocessor Real-Time Systems

Priority Inheritance Spin Locks for Multiprocessor Real-Time Systems Priority Inheritance Spin Locks for Multiprocessor Real-Time Systems Cai-Dong Wang, Hiroaki Takada, and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1 Hongo,

More information

Power-Mode-Aware Buffer Synthesis for Low-Power Clock Skew Minimization

Power-Mode-Aware Buffer Synthesis for Low-Power Clock Skew Minimization This article has been accepted and published on J-STAGE in advance of copyediting. Content is final as presented. IEICE Electronics Express, Vol.* No.*,*-* Power-Mode-Aware Buffer Synthesis for Low-Power

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Real-Time Automatic Detection of Stuck Transactions

Real-Time Automatic Detection of Stuck Transactions Real-Time Automatic Detection of Stuck Transactions Diploma thesis subject of Peter Hofer (0855349) in cooperation with Compuware Austria GmbH September 2012 Abstract The execution of business transactions

More information

A Light-Weight Cooperative Multi-threading with Hardware Supported Thread-Management on an Embedded Multi-Processor System

A Light-Weight Cooperative Multi-threading with Hardware Supported Thread-Management on an Embedded Multi-Processor System A Light-Weight Cooperative Multi-threading with Hardware Supported Thread-Management on an Embedded Multi-Processor System Bo-Cheng Charles Lai EE Department UCLA CA 90095-1594 bclai@ee.ucla.edu Patrick

More information

Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration

Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration Hector Posadas, Sara Real, and Eugenio Villar Abstract Design Space Exploration for complex,

More information

Transformer: A Functional-Driven Cycle-Accurate Multicore Simulator

Transformer: A Functional-Driven Cycle-Accurate Multicore Simulator Transformer: A Functional-Driven Cycle-Accurate Multicore Simulator Zhenman Fang 1,2, Qinghao Min 2, Keyong Zhou 2, Yi Lu 2, Yibin Hu 2, Weihua Zhang 2, Haibo Chen 3, Jian Li 4, Binyu Zang 2 1 The State

More information

Concurrency. Glossary

Concurrency. Glossary Glossary atomic Executing as a single unit or block of computation. An atomic section of code is said to have transactional semantics. No intermediate state for the code unit is visible outside of the

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University

More information

Deterministic Replay and Data Race Detection for Multithreaded Programs

Deterministic Replay and Data Race Detection for Multithreaded Programs Deterministic Replay and Data Race Detection for Multithreaded Programs Dongyoon Lee Computer Science Department - 1 - The Shift to Multicore Systems 100+ cores Desktop/Server 8+ cores Smartphones 2+ cores

More information

COREMU: A Scalable and Portable Parallel Full-system Emulator

COREMU: A Scalable and Portable Parallel Full-system Emulator COREMU: A Scalable and Portable Parallel Full-system Emulator Zhaoguo Wang Ran Liu Yufei Chen Xi Wu Haibo Chen Weihua Zhang Binyu Zang Parallel Processing Institute, Fudan University {zgwang, ranliu, chenyufei,

More information

CFix. Automated Concurrency-Bug Fixing. Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. University of Wisconsin Madison

CFix. Automated Concurrency-Bug Fixing. Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. University of Wisconsin Madison CFix Automated Concurrency-Bug Fixing Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. University of Wisconsin Madison 1 Bugs Need to be Fixed Buggy software is an unfortunate fact; There

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

TERN: Stable Deterministic Multithreading through Schedule Memoization

TERN: Stable Deterministic Multithreading through Schedule Memoization TERN: Stable Deterministic Multithreading through Schedule Memoization Heming Cui, Jingyue Wu, Chia-che Tsai, Junfeng Yang Columbia University Appeared in OSDI 10 Nondeterministic Execution One input many

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.)

Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Out-of-Order Simulation of s using Intel MIC Architecture G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Speaker: Rainer Dömer doemer@uci.edu Center for Embedded Computer

More information

Noise Injection Techniques to Expose Subtle and Unintended Message Races

Noise Injection Techniques to Expose Subtle and Unintended Message Races Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau

More information

Summary: Issues / Open Questions:

Summary: Issues / Open Questions: Summary: The paper introduces Transitional Locking II (TL2), a Software Transactional Memory (STM) algorithm, which tries to overcomes most of the safety and performance issues of former STM implementations.

More information

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Adil Gheewala*, Jih-Kwon Peir*, Yen-Kuang Chen**, Konrad Lai** *Department of CISE, University of Florida,

More information

Operating Systems Overview. Chapter 2

Operating Systems Overview. Chapter 2 1 Operating Systems Overview 2 Chapter 2 3 An operating System: The interface between hardware and the user From the user s perspective: OS is a program that controls the execution of application programs

More information

Locks. Dongkun Shin, SKKU

Locks. Dongkun Shin, SKKU Locks 1 Locks: The Basic Idea To implement a critical section A lock variable must be declared A lock variable holds the state of the lock Available (unlocked, free) Acquired (locked, held) Exactly one

More information

Suggested Solutions (Midterm Exam October 27, 2005)

Suggested Solutions (Midterm Exam October 27, 2005) Suggested Solutions (Midterm Exam October 27, 2005) 1 Short Questions (4 points) Answer the following questions (True or False). Use exactly one sentence to describe why you choose your answer. Without

More information

Utilizing Concurrency: A New Theory for Memory Wall

Utilizing Concurrency: A New Theory for Memory Wall Utilizing Concurrency: A New Theory for Memory Wall Xian-He Sun (&) and Yu-Hang Liu Illinois Institute of Technology, Chicago, USA {sun,yuhang.liu}@iit.edu Abstract. In addition to locality, data access

More information

MS Windows Concurrency Mechanisms Prepared By SUFIAN MUSSQAA AL-MAJMAIE

MS Windows Concurrency Mechanisms Prepared By SUFIAN MUSSQAA AL-MAJMAIE MS Windows Concurrency Mechanisms Prepared By SUFIAN MUSSQAA AL-MAJMAIE 163103058 April 2017 Basic of Concurrency In multiple processor system, it is possible not only to interleave processes/threads but

More information

Remaining Contemplation Questions

Remaining Contemplation Questions Process Synchronisation Remaining Contemplation Questions 1. The first known correct software solution to the critical-section problem for two processes was developed by Dekker. The two processes, P0 and

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

Kernel Synchronization I. Changwoo Min

Kernel Synchronization I. Changwoo Min 1 Kernel Synchronization I Changwoo Min 2 Summary of last lectures Tools: building, exploring, and debugging Linux kernel Core kernel infrastructure syscall, module, kernel data structures Process management

More information

Decoupled Software Pipelining in LLVM

Decoupled Software Pipelining in LLVM Decoupled Software Pipelining in LLVM 15-745 Final Project Fuyao Zhao, Mark Hahnenberg fuyaoz@cs.cmu.edu, mhahnenb@andrew.cmu.edu 1 Introduction 1.1 Problem Decoupled software pipelining [5] presents an

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo
