A Critical-Section-Level Timing Synchronization Approach for Deterministic Multi-Core Instruction-Set Simulations

Fan-Wei Yu 1, Bo-Han Zeng 2, Yu-Hung Huang 3, Hsin-I Wu 4, Che-Rung Lee 5, Ren-Song Tsay 6
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
{fwyu 1, bhzeng 2, yhhuang 3, cherung 5, rstsay 6}@cs.nthu.edu.tw, hiwu@gmail.com 4

Abstract - This paper proposes a Critical-Section-Level timing synchronization approach for deterministic Multi-Core Instruction-Set Simulation (MCISS). By synchronizing at each lock access instead of every shared-variable access and using a simple lock usage status managing scheme, our approach significantly improves simulation performance while executing all critical sections in a deterministic order. Experiments show that our approach performs 295% faster than the shared-variable approach on average and can effectively facilitate system-level software/hardware co-simulation.

Keywords - Deterministic, Multi-core instruction-set simulations, Timing Synchronization

I. INTRODUCTION

Multi-core architectures have become mainstream and have gradually replaced single-core architectures due to the power wall issue. As a result, parallel programming models are required for developing software applications that leverage the computing power of multi-core systems. Nevertheless, parallel applications are very difficult to debug. In general, the key to debugging parallel applications is to capture concurrent behaviors precisely. The Multi-Core Instruction-Set Simulation (MCISS) approach is considered the most promising solution for this issue. Therefore, in recent years there have been many research projects focusing on developing fast and accurate MCISSs for accurate concurrent behavior simulations. The most important goal of any simulator is to reproduce the behaviors of an actual physical target system faithfully, but this goal has long been a great challenge for MCISS.
For a multi-core system, each core can execute a task concurrently, and different task-interleaving orders may result in different execution behaviors and lead to different results. Mainly, different interleaving orders imply potentially different shared-data access sequences. We use the example in Fig. 1 to illustrate the importance of having a deterministic interleaving order. We assume that there are two processes and each runs on a separate ISS. If both processes access the same shared data x, then due to the possible execution speed difference between different ISSs, the access orders of x can be different, such as in the cases shown in Fig. 1(b) and 1(c). For this example, we have a non-deterministic final value of x.

Fig. 1: (a) Two processes, each running on an ISS, access the same shared data x; (b), (c) Different interleaving orders result in different outputs.

Therefore, to reproduce shared-data access orders faithfully, an intuitive solution is to simulate system operation cycle-by-cycle [1], [2], [3]. This type of approach can guarantee accurate simulation results, but its excessive simulation time makes it unfeasible for realistic system design use. Instead of synchronizing at each cycle, Wu et al. [4], [5] develop an efficient shared-variable synchronization approach based on timing annotation techniques and have a deterministic MCISS for accurate concurrent system behavior simulations. With proper synchronization, a deterministic interleaving order can be maintained. However, synchronization also introduces certain overhead, which reduces simulation speed, and different synchronization granularities induce different trade-offs of simulation speed versus accuracy. Although Wu's approach synchronizes only at shared-variable locations and is much more efficient than cycle-based approaches, it still incurs significant overhead as it has to consider all possible shared-variable locations to be safe. Wu et al.
[6] introduce three TLM-abstraction layers, which include Instruction-Level, Data-Level, and Shared-Variable-Level, to cover the possible timing synchronization points between concurrent hardware and software components. In practice, however, we find that it is sufficient to observe only the interleaving order of critical sections instead of all shared variables, as most parallel application developers do use critical sections to protect shared data and avoid race conditions. /DATE13/ (c) 2013 EDAA.
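The nondeterminism illustrated in Fig. 1 can be sketched as follows. This is a minimal illustration, not code from the paper: the two operations (an increment and a doubling) are assumed stand-ins for whatever the two processes actually do to x.

```python
# Two "processes" operate on the same shared data x. The final value of x
# depends entirely on the interleaving order of their accesses.

def run(interleaving):
    """Execute a list of (process_id, operation) steps on shared x."""
    x = 1
    for _, op in interleaving:
        x = op(x)
    return x

p0 = ("P0", lambda x: x + 1)   # process 0: increment x (assumed operation)
p1 = ("P1", lambda x: 2 * x)   # process 1: double x (assumed operation)

# Case (b): P0's access happens first, then P1's.
case_b = run([p0, p1])   # (1 + 1) * 2 = 4
# Case (c): P1's access happens first, then P0's.
case_c = run([p1, p0])   # (1 * 2) + 1 = 3

assert case_b != case_c  # same program, different interleavings, different x
```

Without a deterministic interleaving order, either final value is possible on successive simulation runs, which is exactly what makes such bugs hard to reproduce.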

Fig. 2: The correlation between simulation speed and degree of abstraction.

We therefore extend the abstraction layer to a higher modeling abstraction named Critical-Section-Level, as shown in Fig. 2. We hence develop a much more efficient deterministic Critical-Section-Level MCISS approach by leveraging timing annotation techniques and focusing on synchronizing the execution of critical sections as well as the access order of the protected shared data inside critical sections. This approach is good for most practical uses since the number of critical sections is much less than the number of shared variables considered in Wu's approach. In case shared variables outside of critical sections are suspected of inducing bugs, designers can switch to the shared-variable approach for detailed debugging.

Nevertheless, it is a challenge to handle the simulation of atomic instructions properly to reproduce the same atomicity as in an actual physical target system faithfully. An idealistic approach is to translate each target atomic instruction into an equivalent host atomic instruction [7]. The atomicity is then naturally reproduced and protected by the host machine. However, a direct translation may not always be possible as different architectures support different types of atomic instructions. To resolve this issue, we developed a generic and effective solution by recording and checking the usage status of each lock memory address. Additionally, we apply timing annotation [4], [8] techniques to guarantee that all mutually exclusive critical sections are executed in a deterministic order. We have implemented our proposed approach and tested it on a few standard benchmark cases. In general, our proposed simulator performs 295% faster on average than the shared-variable approach. In sum, our approach can greatly help designers with system verification because it is able to reproduce the same deterministic order and atomicity of the executed programs.

The remainder of this paper is organized as follows. Section 2 discusses related work.
The proposed deterministic MCISS technique is elaborated in Section 3. We show the correctness of synchronization at the Critical-Section-Level in Section 4. We then report experimental results in Section 5 and give a brief conclusion in Section 6.

II. RELATED WORK

For multi-core simulations, each core can concurrently execute a task, and different task-interleaving orders may result in different execution results. The issue of determinism is widely discussed as it deeply complicates the debugging of parallel programs and limits the ability to test multithreaded code. Therefore, correctly yet efficiently reproducing the concurrent behaviors of multiple simulated cores has emerged as one of the most critical challenges.

QEMU [9], a widely adopted emulation approach, utilizes a dynamic binary translation technique but synchronizes only at basic block boundaries and performs at up to a hundred MIPS (Million Instructions per Second). QEMU simulates a target multi-core design in a round-robin sequence: each target core is simulated for a time slice of a few basic blocks. QEMU has become mature after years of active development by the open source community, and a wealth of support for different architectures and device models makes it a popular simulation platform. However, the current QEMU cannot guarantee determinism and cannot take advantage of the parallelism available in underlying multi-core host architectures. To further improve the performance of QEMU, COREMU [10] attempts parallelization by leveraging the available multiple host cores. COREMU emulates multiple cores by creating multiple instances of the existing sequential emulator, QEMU, and uses a thin library layer to handle inter-core and device communication. COREMU shows negligible uniprocessor performance overhead and scales much better compared to QEMU. Both QEMU and COREMU honor atomicity and are good for many applications, but they consider no timing information and hence cannot faithfully simulate concurrent system behaviors or guarantee deterministic results.
In other words, even with the same inputs, the results can be different if executed at different times or in different environments. To reproduce consistent task-interleaving orders faithfully in a deterministic simulation, an intuitive solution is adopting CA (cycle-accurate) based simulators [1], [2] that simulate system behaviors cycle-by-cycle. However, the performance of CA-based simulators is inadequate due to excessive synchronization overhead. Instead of synchronizing at each cycle, Gligor et al. [11] perform synchronizations both at basic block boundaries and at predefined time intervals. While this can achieve about 17x speedup compared to the cycle-accurate approach, the simulation performance is only about 3 MIPS and the results contain 10% errors. Similarly, Graphite [12], a distributed parallel multicore functional simulator, also applies a periodical synchronization approach: each ISS randomly chooses another ISS for clock comparison, and the front-running ISS is put to sleep and woken up later at a dynamically determined period of time. Although efficient, periodical approaches cannot guarantee accurate executions.

To guarantee deterministic results for parallel applications, Kendo [13] is a software-based multithreading approach that enforces a deterministic interleaving of lock acquisitions based on a deterministic scheduling algorithm. Nonetheless, the specific scheduling order introduces considerable performance overhead. In contrast, the DMP approach [14] leverages Transactional Memory technology and implements fully deterministic inter-thread communications. It encapsulates each data access in a transaction and atomically executes the transactions with a deterministic commit order. Although this approach has negligible performance overhead, it has to rely on specific transactional-memory-equipped hardware that is not generally available. As discussed in the introduction section, the shared-variable approach developed by Wu et al. [4], [5] is efficient and guarantees deterministic and accurate concurrent system behavior simulations. However, the overhead of the shared-variable approach is still significant.
In practice, software developers most likely use critical sections to protect critical shared accesses. Since the number of critical sections is much less than the number of shared variables, critical-section approaches are more efficient. However, all existing approaches [10], [15] honor only the atomicity constraints but not the execution order of critical sections and hence cannot guarantee deterministic simulation results.

Some approaches attempt to ensure equivalent atomicity between the target and host machines. For instance, PTLsim [15] is a cycle-accurate full-system x86-64 micro-architectural simulator that allows each atomic operation to acquire a specific lock and manages physical lock addresses using a lock controller shared among all simulation threads. Although this approach avoids using a global lock over all memory regions, so that parallel accesses to different memory regions are allowed, the extra simulation overhead for lock-status monitoring is simply impractical. Instead, COREMU [10] adopts the compare-and-swap instruction from specific host machine architectures such as x86 to implement equivalent atomic instructions. An issue is that an equivalent atomic instruction such as compare-and-swap may not be available on ARM or other architectures. Therefore, the approach is limited and is not portable. To resolve the issues discussed above, we propose an efficient critical-section-level deterministic MCISS approach which is not limited by the underlying host architectures. The details are elaborated in the following section.

III. CRITICAL-SECTION-LEVEL TIMING SYNCHRONIZATION

With all shared variables protected in critical sections, a consistent lock-unlock order can guarantee deterministic multi-core instruction-set simulations. We extend Wu's timing synchronization framework [4] and develop a fast and accurate deterministic critical-section-level MCISS. As discussed in the introduction section, most parallel program developers do use critical sections, which are composed of lock-unlock operations, to protect accesses of shared variables. With this fact, synchronizations of shared variables inside critical sections can be omitted as they follow the order of the lock operations. We use the case in Fig. 3 to illustrate how synchronization of critical sections works. Here we have a two-core target system execution with two ISSs entering a critical section each in the indicated chronological order. Using Wu's Shared-Variable-Level approach, it synchronizes, as shown in Fig.
3(a), at every shared-variable access point, which can be either a lock access point marked as a red dot or a regular shared-variable access point marked as a green dot. In contrast, for our proposed Critical-Section-Level approach it is sufficient to synchronize only at lock accesses, as shown in Fig. 3(b), while achieving the same simulation results. In the following, we elaborate the details of the issues and solutions associated with the proposed Critical-Section-Level approach. To simplify the discussion, the sync point before each lock or unlock operation is referred to as a lock sync point. We begin by discussing how to identify lock sync points.

A. Identifying Lock Sync Points

To identify a lock operation for a lock sync point, we usually do so by recognizing certain architecture-specific lock instructions. To identify an unlock operation, we usually check if a move (or memory store) instruction occurred after an identified lock instruction. Note that the unlocking move instruction always writes a certain value (1 in x86) to the lock address to release the lock, so in practice we only need to check if the data address is a target lock address. If the addresses match, then this will be an unlock operation. Sync points are inserted before each lock and unlock operation. To manage the lock status precisely, we simply use a lock address table to record each unique lock address and its associated lock status. The lock status is set when the lock is acquired and is cleared when the lock is released. Depending on the host architectures provided, we detect lock operations by identifying atomic read-and-update instructions such as test-and-set [16], compare-and-swap [16], or a pair like load-linked/store-conditional [17]. These instructions are used to implement lock operations for parallel programming purposes and enforce atomicity across threads. We use the x86 assembly code of spin lock and spin unlock in Fig. 4 to demonstrate how to identify lock sync points through architecture-specific lock and move instructions. Spin lock is a locking technique used in the Linux multiprocessor kernel [18] environment for lock acquisition. If a processor finds that the target lock is open, it can acquire the lock for subsequent executions.
Conversely, if the processor finds that the target lock is closed (or locked) by another processor, it spins around by repeatedly executing an instruction loop until the lock is released (or opened). The spinning process represents a busy wait and consumes actual CPU time. As illustrated in Fig. 4(a), the x86 instruction decb decreases the spin lock value. A test is then performed on the sign flag. If it was cleared, i.e., the spin lock was set to 1 (unlocked), then the execution continues to the line labeled L3. Otherwise, a busy-waiting loop at line L2 is executed until the spin lock assumes a positive value. Then it renews execution from line L1 and checks if it is

Fig. 3: A two-core target system simulation example shows that ISS A locks in a critical section first, and then ISS B tries to acquire the lock to enter a critical section. (a) In the Shared-Variable-Level approach, all lock access points (red dots) and regular shared-variable access points (green dots) are considered sync points; (b) In the preliminary Critical-Section-Level approach, only lock-variable access points (red dots) are considered.

Fig. 4: (a) The x86 assembly code of spin lock and (b) spin unlock.

indeed safe to proceed without the lock having been acquired by another processor. Conversely, the corresponding unlock operation cannot be easily identified, as it is simply just a move instruction, which sets the value 1 into the lock memory address to release the acquired lock, as illustrated in Fig. 4(b). Nevertheless, since an unlock operation has to follow and match with a lock operation to form a critical section, we only need to examine the move instructions executed after a lock instruction is identified. As the number of move instructions inside a critical section is usually small, the extra checking overhead induced by our sync point identification approach is negligible in practice. When an ISS encounters a move instruction inside a critical section, it looks up the lock address table and checks if the data address of the move instruction is in the table. If so, then it is an unlock sync point. The ISS then performs the unlock operation and leaves the critical section to continue its non-critical-section execution until the next sync point. In this way, our approach can systematically enforce a deterministic execution order of lock-unlock operations by maintaining the lock acquisition status. Also, the integrity of atomicity is guaranteed by maintaining the lock acquisition status and the annotated timing of each instruction. We therefore have a generic solution that is applicable to different host architectures and can guarantee true atomicity exactly as in the physical target system.

B. Spin-waiting Optimization

To further improve performance, we find that we can eliminate unnecessary synchronization overhead by optimizing the spin-waiting lock acquisition process in simulation. The two-core target system simulation example in Fig. 5(a) demonstrates a preliminary Critical-Section-Level simulation result without spin-waiting optimization. Suppose that ISS A acquires a lock first and then ISS B tries to acquire the lock later and is in spin-waiting mode before the lock is released by ISS A. With each lock instruction considered as a sync point, the process of spin-waiting introduces excessive sync points, as shown in Fig. 5(a).
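The lock address table and sync-point identification scheme of Section III.A can be sketched as follows. The class and method names are assumptions for illustration; the paper's actual simulator code is not shown:

```python
# A simplified sketch of the lock address table: the lock status is set when
# an atomic read-and-update instruction (e.g. test-and-set) acquires a lock,
# and cleared when a later move/store instruction writes to a recorded lock
# address, which identifies the unlock sync point.

class LockAddressTable:
    def __init__(self):
        self.locked = {}            # lock address -> holder ISS id (or None)

    def on_atomic_rmw(self, addr, iss_id):
        """Atomic read-and-update instruction: always a lock sync point."""
        if self.locked.get(addr) is None:
            self.locked[addr] = iss_id   # lock free: acquire, set lock status
            return "acquired"
        return "busy"                    # held by another ISS: spin-wait

    def on_move(self, addr, iss_id):
        """Move/store after an identified lock instruction: an unlock sync
        point only if the data address matches a recorded lock address."""
        if self.locked.get(addr) == iss_id:
            self.locked[addr] = None     # release: clear lock status
            return "unlock"
        return "ordinary-store"          # not a lock address: no sync point

table = LockAddressTable()
assert table.on_atomic_rmw(0x1000, "ISS-A") == "acquired"
assert table.on_atomic_rmw(0x1000, "ISS-B") == "busy"
assert table.on_move(0x2000, "ISS-A") == "ordinary-store"  # regular store
assert table.on_move(0x1000, "ISS-A") == "unlock"          # address matches
```

Because only move instructions executed between a recognized lock instruction and its matching unlock need the table lookup, the per-instruction overhead of this check stays small, as the section argues.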
In fact, when ISS B is attempting lock acquisition, we can easily identify whether the lock has been acquired by others by checking the status of the corresponding entry in the lock address table. If the corresponding status indicates that the lock is held by another ISS, then we simply put ISS B into a spin-jumping mode without doing actual synchronization until the lock is released, as shown in Fig. 5(b). The name spin-jumping indicates that it does not require synchronization at each lock acquisition attempt but only at the first attempt. Therefore, the synchronization overhead is reduced. Most importantly, the execution order of critical sections is still correctly maintained with spin-jumping. The accurate lock-activating time for ISS B can be easily calculated based on its initial lock acquisition time and the time span of each spinning period. In the next section, we elaborate the details of the timing calculation for spin-jumping activation.

C. Lock Activation

An interesting and non-trivial issue is that in the case of multiple lock-competing ISSs, the winner is determined by its lock-activating time, not by its initial lock acquisition time. To ensure correct timing behavior, we have to compute the exact lock-activating time of each lock-waiting ISS through the following spin-jumping time calculation mechanism. The spin-jumping activating time can be calculated as

    activating_time = first_attempt_time + ceil((release_time - first_attempt_time) / spin_period) * spin_period

where spin_period is the time period of a repeating lock acquisition attempt, or (t1 - t0) as in Fig. 6; release_time is the time when the lock-holding ISS releases the lock, or t2 in Fig. 6; and first_attempt_time is the time when the lock-competing ISS first attempts its acquisition of the lock, or t0 in Fig. 6. Since the lock-releasing time may not align with the spinning period, we take the ceiling of (release_time - first_attempt_time) / spin_period to compute the precise number of spin-jumping periods, and add them to the initial acquisition time t0 to derive the actual lock-activating time t3. If there is more than one competitor, we calculate each possible lock-activating time according to the above method and consider the winner to be the one whose lock-activating time is closest to the release time. In Fig. 6, ISS C is the winner since its lock-activating time, t3, is closest to the release time, t2, compared to the other lock-activating time of ISS B, t4. In case of a tie, we simply follow the priority rule of the target system to determine the winner. Afterward, the new lock-activating time is re-computed based on the timing of the new winner.

D.
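The spin-jumping calculation of Section III.C can be written out as a small function. The variable names are assumed, and the numbers in the example are illustrative rather than those of Fig. 6:

```python
import math

# Spin-jumping activating time: the waiting ISS's first attempt time plus a
# whole number of spin periods, rounded up to reach or pass the release time.

def lock_activating_time(first_attempt, spin_period, release):
    periods = math.ceil((release - first_attempt) / spin_period)
    return first_attempt + periods * spin_period

# Example (assumed numbers): ISS B first tries at t0 = 10 with spin period 4;
# the holder releases at t2 = 17. The first attempt at or after the release
# is t = 10 + ceil(7/4) * 4 = 18.
assert lock_activating_time(10, 4, 17) == 18

# With several competitors, the winner is the one whose activating time is
# closest to the release time.
def pick_winner(competitors, release):
    return min(competitors,
               key=lambda c: lock_activating_time(c[1], c[2], release))

# Each competitor is (name, first_attempt, spin_period).
iss_b = ("ISS-B", 10, 4)   # activates at 18
iss_c = ("ISS-C", 11, 3)   # activates at 11 + ceil(6/3) * 3 = 17
assert pick_winner([iss_b, iss_c], release=17)[0] == "ISS-C"
```

The ceiling is what keeps the simulated timing faithful: a spin-waiting core in the real system only observes the released lock at its next polling attempt, not at the exact release instant.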
The Critical-Section-Level Simulations

We now use the flow charts in Fig. 7 to summarize our proposed Critical-Section-Level approach. For the executions on each ISS, once a lock operation is identified, a lock sync function is invoked. As shown in Fig. 7(b), the lock sync function first confirms that the local time of the invoking ISS is the minimum one among all

Fig. 5: A Critical-Section-Level two-core target system simulation example illustrating the spin-waiting optimization. (a) A preliminary approach synchronizing at every lock operation, including spin-waiting instructions. (b) Spin-jumping for reduced synchronization overhead.

Fig. 6: An example illustrating the timing calculation of spin-jumping.

Fig. 7: The flow charts of (a) the proposed Critical-Section-Level Synchronization approach, (b) the lock sync function, and (c) the unlock sync function.

sync points (including lock sync and unlock sync points). After being confirmed to be of the minimum local time, the lock sync function then checks whether the target lock is free. If it is free, then the ISS acquires the lock and enters critical-section execution. If the lock is not available, then the ISS waits in a lock-specific queue until the lock is released. Note that in this case it is important to add the spin-jumping time to the waiting ISS for accurate timing. As depicted in Fig. 7(c), if an unlock operation is identified, the unlock sync function compares the local time of the invoking ISS with the other sync points to confirm that it has the minimum local time before releasing the lock. It then computes the lock-activating times of all lock-waiting ISSs and selects the one whose lock-activating time is closest to the release time as the winning ISS; the other waiting ISSs simply stay in the queue for the next lock release point.

Unlock operations are usually done by a move instruction, and each unlock operation has to match and occur after a lock instruction. Based on this fact, we simply check the data address of every move instruction after an identified lock instruction, i.e., in a critical section, until we recognize that the data address is a lock address. Only then does the unlock operation release the lock. For cases with nested locks or recursive locks, i.e., having lock sync points inside a critical section, we have to follow the same lock acquisition flow to determine when these nested locks can be acquired or released. When encountering an unlock operation, we have to check whether we are still in a nested critical section before switching to a non-critical-section mode of execution. If so, we free only the current critical section and continue monitoring the higher-level critical sections until all critical sections are freed. We repeat the above process until the user application terminates.

IV. CORRECTNESS PROOF

In this section we show the correctness of our proposed Critical-Section-Level approach. The Machine Correctness Theorem in [19] has proved the correctness of the shared-memory synchronization approach, which guarantees the same results as those from the original program.
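The lock and unlock sync functions of Fig. 7 can be sketched roughly as follows, with assumed data structures (a holder map and a per-lock waiting queue). This is an interpretation of the flow charts, not the authors' code:

```python
import math

# Sketch of the Fig. 7 flow: an ISS acts only when its local time is the
# minimum among pending sync points; blocked ISSs wait in a per-lock queue,
# and on unlock the waiter whose spin-jumping activating time lands closest
# to the release time wins the lock.

def activating_time(first_attempt, period, release):
    return first_attempt + math.ceil((release - first_attempt) / period) * period

class CriticalSectionSync:
    def __init__(self):
        self.holder = {}    # lock address -> holding ISS id
        self.queue = {}     # lock address -> [(iss, first_attempt, period)]

    def lock_sync(self, lock, iss, local_time, period, other_times):
        if other_times and local_time > min(other_times):
            return "defer"                     # not the minimum local time yet
        if self.holder.get(lock) is None:
            self.holder[lock] = iss            # lock free: acquire, enter CS
            return "enter"
        # Lock held by another ISS: spin-jump instead of syncing per attempt.
        self.queue.setdefault(lock, []).append((iss, local_time, period))
        return "spin-jump"

    def unlock_sync(self, lock, iss, release_time):
        assert self.holder.get(lock) == iss
        self.holder[lock] = None               # release the lock
        waiters = self.queue.get(lock, [])
        if not waiters:
            return None
        # Winner: activating time closest to (at or after) the release time.
        winner = min(waiters,
                     key=lambda w: activating_time(w[1], w[2], release_time))
        waiters.remove(winner)
        self.holder[lock] = winner[0]
        return winner[0], activating_time(winner[1], winner[2], release_time)

sync = CriticalSectionSync()
assert sync.lock_sync(0x40, "A", local_time=5, period=2, other_times=[7, 9]) == "enter"
assert sync.lock_sync(0x40, "B", local_time=7, period=4, other_times=[9]) == "spin-jump"
assert sync.unlock_sync(0x40, "A", release_time=17) == ("B", 19)
```

Because the winner is selected by activating time rather than arrival order in the simulator, every run hands the lock to the same ISS at the same simulated time, which is what makes the simulation deterministic.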
In our case, when all shared variables are protected within critical sections, we can prove that our approach of synchronizing at lock-unlock operations does guarantee the same simulation results as those of the shared-variable approach, and hence as those of the original program. A shared-variable execution, E, and its critical-section-level execution, E', are equivalent if and only if the execution order of all critical sections in every program of E is the same as in the corresponding program of E', and moreover, all shared variables are protected inside critical sections.

The following notations are used for clarity. Let s stand for a shared variable, l stand for a lock operation, and u stand for an unlock operation. T and T' are the access times in the shared-variable execution, E, and the critical-section-level execution, E', respectively. Assume the shared variables s1, s2, ..., sn are protected inside a critical section. The access order of the critical section in the shared-variable execution E is:

    T(l) < T(s1) < T(s2) < ... < T(sn) < T(u).

The access order of the critical section obtained by synchronizing at the lock-unlock operations in the critical-section-level execution E' is:

    T'(l) < T'(u).

Since the critical section is always executed atomically, the accesses of the shared variables inside the critical section fall after the access of the lock operation and before the access of the unlock operation:

    T'(l) < T'(s1) < T'(s2) < ... < T'(sn) < T'(u).

Therefore, the access order of the shared variables inside the critical section in the critical-section-level execution is equivalent to that in the shared-variable execution. According to the Machine Correctness Theorem, the critical-section-level execution is equivalent to the shared-variable execution, which is proved to be the same as the original program. Consequently, by maintaining the execution order of critical sections among the simulated programs, the correctness of the multi-core simulation is guaranteed.

V. EXPERIMENTAL RESULTS

We have implemented and tested our proposed deterministic Critical-Section-Level MCISS simulation approach with excellent results. Our experimental setup uses a host machine equipped with an Intel Xeon quad-core processor and 4GB RAM.
The target architecture for simulation is the Andes 16/32-bit mixed-length RISC ISA [20]. We tested our implementation with 4 target cores using the multi-core benchmark test cases FMM, FFT, Ocean, LU, and Barnes from SPLASH-2 [21] and a deterministic execution test case from RACEY [22]. The simulation performance comparisons are summarized in Fig. 8. In general, we can achieve a 295% speedup on average against the shared-variable approach. The observed top simulation

speed is improved to 596 MIPS. The average sync number ratio of lock sync points over shared-variable sync points for SPLASH-2 is 14.5%, which shows that our proposed approach has much less sync overhead compared to the shared-variable approach. The accuracy of our approach is verified by comparing the computational results and the trace of the critical section execution order of our critical-section-level approach to those of the shared-variable approach; we confirm that our approach is 100% accurate. Additionally, the test results, particularly those of RACEY, show that our proposed Critical-Section-Level approach is deterministic. In contrast, when running both the SPLASH-2 and RACEY benchmarks with no synchronization, the results are non-deterministic and the variance of the total simulated instruction count is about 50%.

We also confirm that the total number of sync points determines the sync overhead. From the test results of RACEY, we observe that the total sync overhead is proportional to the number of sync points. As shown in Table 1, the sync number ratio is 43%, which is almost equal to the sync overhead ratio of 42%. The sync number ratio of RACEY is higher because it is a lock-intensive program for determinism checking. Note that the sync number ratio is the number of lock sync points over all shared-variable sync points, and the sync overhead ratio is the sync overhead of our approach over that of the shared-variable approach.

Fig. 8: A comparison of the simulation speeds (MIPS) of the shared-variable approach and the critical-section-level approach on fft, lu, radix, fmm, ocean, and barnes.

VI. CONCLUSIONS

In this paper, we have proposed a new Critical-Section-Level timing synchronization approach for deterministic MCISS. Our proposed approach synchronizes only at lock-variable access points instead of every shared-variable access point. The experimental results show that our Critical-Section-Level approach is 100% accurate compared with the shared-variable approach while performing 295% faster.
Our future work is to include other sync points from other possible source-level synchronization operations such as barrier, signal, and wait. Additionally, it is also important to consider the scalability of the simulation approach.

TABLE 1: Comparison of the sync number and sync overhead of the Critical-Section-Level and Shared-Variable-Level approaches (sync number ratio: 43%; sync overhead ratio: 42%).

VII. REFERENCES

[1] D. Burger and T. M. Austin, "The SimpleScalar tool set, version 2.0," SIGARCH Comput. Archit. News.
[2] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer.
[3] J. Schnerr, O. Bringmann, and W. Rosenstiel, "Cycle accurate binary translation for simulation acceleration in rapid prototyping of SoCs," in DATE '05.
[4] M.-H. Wu, C.-Y. Fu, P.-C. Wang, and R.-S. Tsay, "An effective synchronization approach for fast and accurate multi-core instruction-set simulation," in EMSOFT '09.
[5] M.-H. Wu, P.-C. Wang, C.-Y. Fu, and R.-S. Tsay, "A high-parallelism distributed scheduling mechanism for multi-core instruction-set simulation," in DAC '11.
[6] M.-H. Wu, W.-C. Lee, C.-Y. Chuang, and R.-S. Tsay, "Automatic generation of software TLM in multiple abstraction layers for efficient HW/SW co-simulation," in DATE '10.
[7] R. Lantz, "Parallel SimOS: Performance and Scalability for Large System Simulation," PhD thesis, Stanford University.
[8] K.-L. Lin, C.-K. Lo, and R.-S. Tsay, "Source-level timing annotation for fast and accurate TLM computation model generation," in ASPDAC '10.
[9] F. Bellard, "QEMU, a fast and portable dynamic translator," in Proceedings of the USENIX Annual Technical Conference.
[10] Z. Wang, R. Liu, Y. Chen, X. Wu, H. Chen, W. Zhang, and B. Zang, "COREMU: a scalable and portable parallel full-system emulator," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, 2011.
[11] M. Gligor, N. Fournel, and F. Pétrot, "Using binary translation in event driven simulation for fast and flexible MPSoC simulation," in Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis.
[12] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, "Graphite: A distributed parallel simulator for multicores," in HPCA, 2010.
[13] M. Olszewski, J. Ansel, and S. Amarasinghe, "Kendo: efficient deterministic multithreading in software," in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems.
[14] J. Devietti, B. Lucia, L. Ceze, and M. Oskin, "DMP: deterministic shared memory multiprocessing," in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems.
[15] M. T. Yourst, "PTLsim: A cycle accurate full system x86-64 microarchitectural simulator," in ISPASS.
[16] Intel Corp., Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M.
[17] E. Jensen, G. Hagensen, and J. Broughton, "A New Approach to Exclusive Data Access in Shared Memory Multiprocessors," Lawrence Livermore National Laboratory, Technical Report, 1987.
[18] D. Bovet and M. Cesati, Understanding the Linux Kernel, 2nd ed. O'Reilly & Associates, Inc.
[19] M.-H. Wu, C.-Y. Fu, P.-C. Wang, and R.-S. Tsay, "A Distributed Timing Synchronization Technique for Parallel Multi-Core Instruction-Set Simulation," accepted by ACM Trans. Embed. Comput. Syst.
[20] Andes Technology Corp., AndeStar Instruction Set Architecture Manual / Andes Programming Guide.
[21] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," SIGARCH Comput. Archit. News.
[22] M. Xu, R. Bodik, and M. D. Hill, "A 'flight data recorder' for enabling full-system multiprocessor deterministic replay," in Computer Architecture (ISCA), 2003.


COREMU: a Portable and Scalable Parallel Full-system Emulator

COREMU: a Portable and Scalable Parallel Full-system Emulator COREMU: a Portable and Scalable Parallel Full-system Emulator Haibo Chen Parallel Processing Institute Fudan University http://ppi.fudan.edu.cn/haibo_chen Full-System Emulator Useful tool for multicore

More information

Do you have to reproduce the bug on the first replay attempt?

Do you have to reproduce the bug on the first replay attempt? Do you have to reproduce the bug on the first replay attempt? PRES: Probabilistic Replay with Execution Sketching on Multiprocessors Soyeon Park, Yuanyuan Zhou University of California, San Diego Weiwei

More information

Profiling High Level Abstraction Simulators of Multiprocessor Systems

Profiling High Level Abstraction Simulators of Multiprocessor Systems Profiling High Level Abstraction Simulators of Multiprocessor Systems Liana Duenha, Rodolfo Azevedo Computer Systems Laboratory (LSC) Institute of Computing, University of Campinas - UNICAMP Campinas -

More information

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Cristiano Pereira, Harish Patil, Brad Calder $ Computer Science and Engineering, University of California, San Diego

More information

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory:

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory: 1 1 1 1 1,a) 1 HTM 2 2 LogTM 72.2% 28.4% 1. Transactional Memory: TM [1] TM Hardware Transactional Memory: 1 Nagoya Institute of Technology, Nagoya, Aichi, 466-8555, Japan a) tsumura@nitech.ac.jp HTM HTM

More information

Explicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic!

Explicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic! Explicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin University of Washington Parallel Programming is Hard

More information

Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs

Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs , pp.45-49 http://dx.doi.org/10.14257/astl.2014.76.12 Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs Hyun-Ji Kim 1, Byoung-Kwi Lee 2, Ok-Kyoon Ha 3, and Yong-Kee Jun 1 1 Department

More information

MELD: Merging Execution- and Language-level Determinism. Joe Devietti, Dan Grossman, Luis Ceze University of Washington

MELD: Merging Execution- and Language-level Determinism. Joe Devietti, Dan Grossman, Luis Ceze University of Washington MELD: Merging Execution- and Language-level Determinism Joe Devietti, Dan Grossman, Luis Ceze University of Washington CoreDet results overhead normalized to nondet 800% 600% 400% 200% 0% 2 4 8 2 4 8 2

More information

Implementing Atomic Section by Using Hybrid Concurrent Control

Implementing Atomic Section by Using Hybrid Concurrent Control 2007 IFIP International Conference on Network and Parallel Computing - Workshops Implementing Atomic Section by Using Hybrid Concurrent Control Lei Zhao, Yu Zhang Department of Computer Science & Technology

More information

Deterministic Shared Memory Multiprocessing

Deterministic Shared Memory Multiprocessing Deterministic Shared Memory Multiprocessing Luis Ceze, University of Washington joint work with Owen Anderson, Tom Bergan, Joe Devietti, Brandon Lucia, Karin Strauss, Dan Grossman, Mark Oskin. Safe MultiProcessing

More information

Deterministic Scaling

Deterministic Scaling Deterministic Scaling Gabriel Southern Madan Das Jose Renau Dept. of Computer Engineering, University of California Santa Cruz {gsouther, madandas, renau}@soe.ucsc.edu http://masc.soe.ucsc.edu ABSTRACT

More information

Integrating Real-Time Synchronization Schemes into Preemption Threshold Scheduling

Integrating Real-Time Synchronization Schemes into Preemption Threshold Scheduling Integrating Real-Time Synchronization Schemes into Preemption Threshold Scheduling Saehwa Kim, Seongsoo Hong and Tae-Hyung Kim {ksaehwa, sshong}@redwood.snu.ac.kr, tkim@cse.hanyang.ac.kr Abstract Preemption

More information

Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic!

Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic! Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin Department of Computer Science and Engineering University

More information

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5 Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5 Multiple Processes OS design is concerned with the management of processes and threads: Multiprogramming Multiprocessing Distributed processing

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Frequent Value Compression in Packet-based NoC Architectures

Frequent Value Compression in Packet-based NoC Architectures Frequent Value Compression in Packet-based NoC Architectures Ping Zhou, BoZhao, YuDu, YiXu, Youtao Zhang, Jun Yang, Li Zhao ECE Department CS Department University of Pittsburgh University of Pittsburgh

More information

XXXX Leveraging Transactional Execution for Memory Consistency Model Emulation

XXXX Leveraging Transactional Execution for Memory Consistency Model Emulation XXXX Leveraging Transactional Execution for Memory Consistency Model Emulation Ragavendra Natarajan, University of Minnesota Antonia Zhai, University of Minnesota System emulation is widely used in today

More information

Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors

Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors Dan Wallin and Erik Hagersten Uppsala University Department of Information Technology P.O. Box 337, SE-751 05 Uppsala, Sweden

More information

AN EFFICIENT DEADLOCK DETECTION AND RESOLUTION ALGORITHM FOR GENERALIZED DEADLOCKS. Wei Lu, Chengkai Yu, Weiwei Xing, Xiaoping Che and Yong Yang

AN EFFICIENT DEADLOCK DETECTION AND RESOLUTION ALGORITHM FOR GENERALIZED DEADLOCKS. Wei Lu, Chengkai Yu, Weiwei Xing, Xiaoping Che and Yong Yang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 2, April 2017 pp. 703 710 AN EFFICIENT DEADLOCK DETECTION AND RESOLUTION

More information

Chimera: Hybrid Program Analysis for Determinism

Chimera: Hybrid Program Analysis for Determinism Chimera: Hybrid Program Analysis for Determinism Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan, Ann Arbor - 1 - * Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

More information

Deterministic Process Groups in

Deterministic Process Groups in Deterministic Process Groups in Tom Bergan Nicholas Hunt, Luis Ceze, Steven D. Gribble University of Washington A Nondeterministic Program global x=0 Thread 1 Thread 2 t := x x := t + 1 t := x x := t +

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

A Fast Timing-Accurate MPSoC HW/SW Co-Simulation Platform based on a Novel Synchronization Scheme

A Fast Timing-Accurate MPSoC HW/SW Co-Simulation Platform based on a Novel Synchronization Scheme A Fast Timing-Accurate MPSoC HW/SW Co-Simulation Platform based on a Novel Synchronization Scheme Mingyan Yu, Junjie Song, Fangfa Fu, Siyue Sun, and Bo Liu Abstract Fast and accurate full-system simulation

More information

Full-System Timing-First Simulation

Full-System Timing-First Simulation Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin Madison The Problem Design of future computer systems uses simulation

More information

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Abdullah Muzahid, Shanxiang Qi, and University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu MICRO December

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University CS 571 Operating Systems Midterm Review Angelos Stavrou, George Mason University Class Midterm: Grading 2 Grading Midterm: 25% Theory Part 60% (1h 30m) Programming Part 40% (1h) Theory Part (Closed Books):

More information

PQEMU: A Parallel System Emulator Based on QEMU

PQEMU: A Parallel System Emulator Based on QEMU 2011 IEEE 17th International Conference on Parallel and Distributed Systems PQEMU: A Parallel System Emulator Based on QEMU Jiun-Hung Ding MediaTek-NTHU Joint Lab Department of Computer Science National

More information

The University of Texas at Arlington

The University of Texas at Arlington The University of Texas at Arlington Lecture 10: Threading and Parallel Programming Constraints CSE 5343/4342 Embedded d Systems II Objectives: Lab 3: Windows Threads (win32 threading API) Convert serial

More information

Operating System Performance and Large Servers 1

Operating System Performance and Large Servers 1 Operating System Performance and Large Servers 1 Hyuck Yoo and Keng-Tai Ko Sun Microsystems, Inc. Mountain View, CA 94043 Abstract Servers are an essential part of today's computing environments. High

More information

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs Authors: Jos e L. Abell an, Juan Fern andez and Manuel E. Acacio Presenter: Guoliang Liu Outline Introduction Motivation Background

More information

Modeling and Simulating Discrete Event Systems in Metropolis

Modeling and Simulating Discrete Event Systems in Metropolis Modeling and Simulating Discrete Event Systems in Metropolis Guang Yang EECS 290N Report December 15, 2004 University of California at Berkeley Berkeley, CA, 94720, USA guyang@eecs.berkeley.edu Abstract

More information

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University 740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess

More information

The University of Texas at Arlington

The University of Texas at Arlington The University of Texas at Arlington Lecture 6: Threading and Parallel Programming Constraints CSE 5343/4342 Embedded Systems II Based heavily on slides by Dr. Roger Walker More Task Decomposition: Dependence

More information

Adaptive Prefetching Technique for Shared Virtual Memory

Adaptive Prefetching Technique for Shared Virtual Memory Adaptive Prefetching Technique for Shared Virtual Memory Sang-Kwon Lee Hee-Chul Yun Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology 373-1

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

Major Project, CSD411

Major Project, CSD411 Major Project, CSD411 Deterministic Execution In Multithreaded Applications Supervisor Prof. Sorav Bansal Ashwin Kumar 2010CS10211 Himanshu Gupta 2010CS10220 1 Deterministic Execution In Multithreaded

More information

Efficient and highly portable deterministic multithreading (DetLock)

Efficient and highly portable deterministic multithreading (DetLock) Computing (2014) 96:1131 1147 DOI 10.1007/s00607-013-0370-9 Efficient and highly portable deterministic multithreading (DetLock) Hamid Mushtaq Zaid Al-Ars Koen Bertels Received: 22 March 2013 / Accepted:

More information

CATS: Cycle Accurate Transaction-driven Simulation with Multiple Processor Simulators

CATS: Cycle Accurate Transaction-driven Simulation with Multiple Processor Simulators CATS: Cycle Accurate Transaction-driven Simulation with Multiple Processor Simulators Dohyung Kim Soonhoi Ha Rajesh Gupta Department of Computer Science and Engineering School of Computer Science and Engineering

More information

DMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING

DMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING ... DMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING... SHARED-MEMORY MULTICORE AND MULTIPROCESSOR SYSTEMS ARE NONDETERMINISTIC, WHICH FRUSTRATES DEBUGGING AND COMPLICATES TESTING OF MULTITHREADED CODE,

More information

Implementation of Process Networks in Java

Implementation of Process Networks in Java Implementation of Process Networks in Java Richard S, Stevens 1, Marlene Wan, Peggy Laramie, Thomas M. Parks, Edward A. Lee DRAFT: 10 July 1997 Abstract A process network, as described by G. Kahn, is a

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Priority Inheritance Spin Locks for Multiprocessor Real-Time Systems

Priority Inheritance Spin Locks for Multiprocessor Real-Time Systems Priority Inheritance Spin Locks for Multiprocessor Real-Time Systems Cai-Dong Wang, Hiroaki Takada, and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1 Hongo,

More information

Power-Mode-Aware Buffer Synthesis for Low-Power Clock Skew Minimization

Power-Mode-Aware Buffer Synthesis for Low-Power Clock Skew Minimization This article has been accepted and published on J-STAGE in advance of copyediting. Content is final as presented. IEICE Electronics Express, Vol.* No.*,*-* Power-Mode-Aware Buffer Synthesis for Low-Power

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Real-Time Automatic Detection of Stuck Transactions

Real-Time Automatic Detection of Stuck Transactions Real-Time Automatic Detection of Stuck Transactions Diploma thesis subject of Peter Hofer (0855349) in cooperation with Compuware Austria GmbH September 2012 Abstract The execution of business transactions

More information

A Light-Weight Cooperative Multi-threading with Hardware Supported Thread-Management on an Embedded Multi-Processor System

A Light-Weight Cooperative Multi-threading with Hardware Supported Thread-Management on an Embedded Multi-Processor System A Light-Weight Cooperative Multi-threading with Hardware Supported Thread-Management on an Embedded Multi-Processor System Bo-Cheng Charles Lai EE Department UCLA CA 90095-1594 bclai@ee.ucla.edu Patrick

More information

Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration

Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration Hector Posadas, Sara Real, and Eugenio Villar Abstract Design Space Exploration for complex,

More information

Transformer: A Functional-Driven Cycle-Accurate Multicore Simulator

Transformer: A Functional-Driven Cycle-Accurate Multicore Simulator Transformer: A Functional-Driven Cycle-Accurate Multicore Simulator Zhenman Fang 1,2, Qinghao Min 2, Keyong Zhou 2, Yi Lu 2, Yibin Hu 2, Weihua Zhang 2, Haibo Chen 3, Jian Li 4, Binyu Zang 2 1 The State

More information

Concurrency. Glossary

Concurrency. Glossary Glossary atomic Executing as a single unit or block of computation. An atomic section of code is said to have transactional semantics. No intermediate state for the code unit is visible outside of the

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University

More information

Deterministic Replay and Data Race Detection for Multithreaded Programs

Deterministic Replay and Data Race Detection for Multithreaded Programs Deterministic Replay and Data Race Detection for Multithreaded Programs Dongyoon Lee Computer Science Department - 1 - The Shift to Multicore Systems 100+ cores Desktop/Server 8+ cores Smartphones 2+ cores

More information

COREMU: A Scalable and Portable Parallel Full-system Emulator

COREMU: A Scalable and Portable Parallel Full-system Emulator COREMU: A Scalable and Portable Parallel Full-system Emulator Zhaoguo Wang Ran Liu Yufei Chen Xi Wu Haibo Chen Weihua Zhang Binyu Zang Parallel Processing Institute, Fudan University {zgwang, ranliu, chenyufei,

More information

CFix. Automated Concurrency-Bug Fixing. Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. University of Wisconsin Madison

CFix. Automated Concurrency-Bug Fixing. Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. University of Wisconsin Madison CFix Automated Concurrency-Bug Fixing Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. University of Wisconsin Madison 1 Bugs Need to be Fixed Buggy software is an unfortunate fact; There

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

TERN: Stable Deterministic Multithreading through Schedule Memoization

TERN: Stable Deterministic Multithreading through Schedule Memoization TERN: Stable Deterministic Multithreading through Schedule Memoization Heming Cui, Jingyue Wu, Chia-che Tsai, Junfeng Yang Columbia University Appeared in OSDI 10 Nondeterministic Execution One input many

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.)

Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Out-of-Order Simulation of s using Intel MIC Architecture G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Speaker: Rainer Dömer doemer@uci.edu Center for Embedded Computer

More information

Noise Injection Techniques to Expose Subtle and Unintended Message Races

Noise Injection Techniques to Expose Subtle and Unintended Message Races Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau

More information

Summary: Issues / Open Questions:

Summary: Issues / Open Questions: Summary: The paper introduces Transitional Locking II (TL2), a Software Transactional Memory (STM) algorithm, which tries to overcomes most of the safety and performance issues of former STM implementations.

More information

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Adil Gheewala*, Jih-Kwon Peir*, Yen-Kuang Chen**, Konrad Lai** *Department of CISE, University of Florida,

More information

Operating Systems Overview. Chapter 2

Operating Systems Overview. Chapter 2 1 Operating Systems Overview 2 Chapter 2 3 An operating System: The interface between hardware and the user From the user s perspective: OS is a program that controls the execution of application programs

More information

Locks. Dongkun Shin, SKKU

Locks. Dongkun Shin, SKKU Locks 1 Locks: The Basic Idea To implement a critical section A lock variable must be declared A lock variable holds the state of the lock Available (unlocked, free) Acquired (locked, held) Exactly one

More information

Suggested Solutions (Midterm Exam October 27, 2005)

Suggested Solutions (Midterm Exam October 27, 2005) Suggested Solutions (Midterm Exam October 27, 2005) 1 Short Questions (4 points) Answer the following questions (True or False). Use exactly one sentence to describe why you choose your answer. Without

More information

Utilizing Concurrency: A New Theory for Memory Wall

Utilizing Concurrency: A New Theory for Memory Wall Utilizing Concurrency: A New Theory for Memory Wall Xian-He Sun (&) and Yu-Hang Liu Illinois Institute of Technology, Chicago, USA {sun,yuhang.liu}@iit.edu Abstract. In addition to locality, data access

More information

MS Windows Concurrency Mechanisms Prepared By SUFIAN MUSSQAA AL-MAJMAIE

MS Windows Concurrency Mechanisms Prepared By SUFIAN MUSSQAA AL-MAJMAIE MS Windows Concurrency Mechanisms Prepared By SUFIAN MUSSQAA AL-MAJMAIE 163103058 April 2017 Basic of Concurrency In multiple processor system, it is possible not only to interleave processes/threads but

More information

Remaining Contemplation Questions

Remaining Contemplation Questions Process Synchronisation Remaining Contemplation Questions 1. The first known correct software solution to the critical-section problem for two processes was developed by Dekker. The two processes, P0 and

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

Kernel Synchronization I. Changwoo Min

Kernel Synchronization I. Changwoo Min 1 Kernel Synchronization I Changwoo Min 2 Summary of last lectures Tools: building, exploring, and debugging Linux kernel Core kernel infrastructure syscall, module, kernel data structures Process management

More information

Decoupled Software Pipelining in LLVM

Decoupled Software Pipelining in LLVM Decoupled Software Pipelining in LLVM 15-745 Final Project Fuyao Zhao, Mark Hahnenberg fuyaoz@cs.cmu.edu, mhahnenb@andrew.cmu.edu 1 Introduction 1.1 Problem Decoupled software pipelining [5] presents an

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo
