Parallel Discrete Event Simulation of Transaction Level Models
Rainer Dömer, Weiwei Chen, Xu Han
Center for Embedded Computer Systems
University of California, Irvine, USA

Abstract — Describing Multi-Processor Systems-on-Chip (MPSoC) at the abstract Electronic System Level (ESL) is one task; validating them efficiently is another. Here, fast and accurate system-level simulation is critical. Recently, Parallel Discrete Event Simulation (PDES) has regained significant attention, as it promises to utilize the parallelism existing in today's multi-core CPU hosts. This paper discusses the parallel simulation of Transaction-Level Models (TLMs) described in System-Level Description Languages (SLDLs), such as SystemC and SpecC. We review how PDES exploits the explicit parallelism in ESL design models and uses the parallel processing units available on multi-core host PCs to significantly reduce the simulation time. We show experimental results for two highly parallel benchmarks as well as for two actual embedded applications.

I. INTRODUCTION

Describing complex Multi-Processor Systems-on-Chip (MPSoC) at the abstract Electronic System Level (ESL) is one difficult task; getting the models to work correctly is another. The large size and great complexity of modern embedded systems, with their heterogeneous components, complex interconnects, and sophisticated functionality, pose enormous challenges to system validation and debugging. Accurate yet fast ESL simulation is a key to enabling effective and efficient design validation and implementation.

In this paper, we review recent approaches for simulating MPSoC designs described at the system or transaction level, focusing in particular on parallel simulation techniques. The well-known approach of Parallel Discrete Event Simulation (PDES) has recently regained a lot of attention due to the inexpensive availability of parallel processing capabilities in today's multi-core CPU hosts.
PDES holds the promise of mapping the explicit parallelism described in Transaction-Level Models (TLMs) efficiently onto the parallel cores available on the simulation host. As such, it can exploit the available parallelism and significantly reduce the simulation time.

A. Parallel execution in SystemC and SpecC

In this paper, we discuss the parallel simulation of TLMs described in the System-Level Description Languages (SLDLs) SystemC [7] and SpecC [6]. We note that, in contrast to flat and sequential C/C++ programming code, both languages explicitly specify the key ESL concepts in the design model, including behavioral and structural hierarchy, potential for parallelism and pipelining, communication channels, and constraints on timing. Having these intrinsic features of the application explicit in the model enables efficient design space exploration and automatic refinement by computer-aided design (CAD) tools.

While both SystemC and SpecC model parallel execution explicitly, there is a significant difference in the parallel execution semantics between the two languages. Both SLDLs define their execution semantics by use of DE-based scheduling of multiple concurrent threads which are managed and coordinated by a central simulation kernel. However, SystemC requires cooperative multi-threading, whereas SpecC allows preemptive multi-threading [4]. SystemC semantics guarantee the uninterrupted execution of threads (including their accesses to any shared variables), which makes it hard to implement a truly parallel simulator. In short, complex inter-dependency analysis over all variables in the system model is a prerequisite to parallel multi-core simulation in SystemC. In contrast to the cooperative scheduling mandated for SystemC models, multi-threading in SpecC is explicitly defined as preemptive. This makes parallel simulation significantly easier, because only the shared variables in channels (which act as monitors) need to be protected for mutually exclusive access.
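In plain C++ terms, a channel acting as a monitor can be sketched as a bounded FIFO whose methods take the channel's own lock, so that preemptively scheduled simulator threads can only interfere at channel boundaries. This is an illustrative sketch; the class and method names below are ours, not taken from either SLDL.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical monitor-style channel: every access to the shared
// state (the queue) happens under the channel's lock, which is
// exactly the protection a preemptive parallel simulator needs.
template <typename T>
class MonitorFifo {
public:
    explicit MonitorFifo(std::size_t capacity) : capacity_(capacity) {}

    void send(const T& value) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push_back(value);
        not_empty_.notify_one();   // wake a blocked receiver
    }

    T receive() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty(); });
        T value = q_.front();
        q_.pop_front();
        not_full_.notify_one();    // wake a blocked sender
        return value;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> q_;
    std::size_t capacity_;
};
```

With a channel of this shape, two OS threads can safely exchange data even though the scheduler may preempt either of them at any point outside the locked methods.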
Such protection can easily be inserted automatically into the model [2].

B. Related Work

Most ESL design frameworks today still rely on regular Discrete Event (DE) simulators which issue only a single thread at any time to avoid complex synchronization of the concurrent threads. As such, the simulator kernel becomes an obstacle to improving simulation performance on multi-core host machines [8]. A well-studied solution to this problem is Parallel Discrete Event Simulation (PDES) [1, 5, 11]. PDES for the Hardware Description Languages VHDL and Verilog is discussed in [9, 10]. To apply PDES solutions to today's SLDLs and actually allow parallel execution on multi-core processors, the simulator kernel needs to be modified to issue and properly synchronize multiple OS kernel threads in each scheduling step. [3] and [12] have extended the SystemC simulator kernel accordingly. Clusters with single-core nodes are targeted in [3], which uses multiple schedulers on different processing nodes and defines a master node for time synchronization. A parallelized SystemC kernel for fast simulation on SMP machines is presented in [12], which issues multiple runnable OS kernel threads in each simulation cycle. The SpecC-based scheduling approach described in [2] is very similar. However, it features a detailed synchronization protection mechanism which automatically instruments any user-defined and hierarchical channels. As such, this approach does not need to work around the cooperative SystemC execution semantics, nor does it require a specially prepared channel library.

II. SYSTEM-LEVEL DISCRETE EVENT SIMULATION

System design models with explicitly specified parallelism carry the promise of increased simulation performance through parallel execution of the threads on the available hardware resources of a multi-core host. In this section, we first review the DE scheduling scheme used in traditional SLDL simulators, which issues only a single thread at any time. We then discuss the extended scheduling algorithm with true multi-threading capability on symmetric multiprocessing (multi-core) machines. With the exception of the different requirements for protecting communication and synchronization between the concurrent threads, as outlined in Section I.A above, the following sections apply equally to both the SystemC and SpecC SLDLs.

A. Traditional Discrete Event Simulation

In traditional system-level simulation, a regular DE simulator is used. Threads are created for the explicit parallelism described in the models (e.g. par{} and pipe{} statements in SpecC, and SC_THREAD in SystemC). The threads are generally managed by the scheduler by use of queues, such as READY, which contains all threads that are ready to execute, and WAIT, which contains all waiting threads. When running, the threads communicate via events and shared variables or channels, and advance simulation time using wait-for-time constructs. Traditional DE simulation is driven by events and simulation-time advances: whenever events are delivered or time increases, the scheduler is called to move the simulation forward. As shown in Fig. 1, at any time the traditional scheduler runs a single thread which is picked from the READY queue. Within a delta-cycle, the choice of the next thread to run is nondeterministic (by definition).
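The queue mechanics just described, together with the delta- and timed-cycles, can be condensed into a toy kernel. All types and names below are ours, not from either SLDL, and each simulated thread is reduced to a resume() callback that runs until its next wait and then reports what it is waiting on.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// What a thread reports when it returns from resume().
enum class WaitKind { Finished, Event, Time };

// Hypothetical thread record for the sketch below.
struct SimThread {
    std::function<WaitKind()> resume;
    int event = -1;             // event id waited on (WaitKind::Event)
    unsigned long delay = 0;    // requested delay (WaitKind::Time)
    unsigned long wakeup = 0;   // absolute wakeup time while in WAITFOR
};

class Kernel {
public:
    unsigned long now = 0;
    std::vector<int> notified;              // events sent this delta

    void spawn(SimThread* t) { ready_.push_back(t); }
    void notify(int ev) { notified.push_back(ev); }

    void run() {
        for (;;) {
            // Traditional scheduler: issue ONE ready thread at a time.
            while (!ready_.empty()) {
                SimThread* t = ready_.back();
                ready_.pop_back();
                dispatch(t, t->resume());
            }
            if (deliver_events()) continue;  // new delta-cycle
            if (advance_time()) continue;    // timed-cycle
            return;                          // no work left: done
        }
    }

private:
    std::vector<SimThread*> ready_, wait_, waitfor_;

    void dispatch(SimThread* t, WaitKind w) {
        if (w == WaitKind::Event) {
            wait_.push_back(t);
        } else if (w == WaitKind::Time) {
            t->wakeup = now + t->delay;
            waitfor_.push_back(t);
        }                                    // Finished: drop the thread
    }

    // Move threads whose events were delivered from WAIT to READY.
    bool deliver_events() {
        std::vector<SimThread*> still_waiting;
        bool woke = false;
        for (SimThread* t : wait_) {
            bool hit = false;
            for (int ev : notified) if (ev == t->event) hit = true;
            if (hit) { ready_.push_back(t); woke = true; }
            else still_waiting.push_back(t);
        }
        wait_.swap(still_waiting);
        notified.clear();                    // events do not persist
        return woke;
    }

    // Advance to the earliest timestamp in WAITFOR and wake those threads.
    bool advance_time() {
        if (waitfor_.empty()) return false;
        unsigned long earliest = std::numeric_limits<unsigned long>::max();
        for (SimThread* t : waitfor_) earliest = std::min(earliest, t->wakeup);
        now = earliest;
        std::vector<SimThread*> later;
        for (SimThread* t : waitfor_)
            if (t->wakeup == earliest) ready_.push_back(t);
            else later.push_back(t);
        waitfor_.swap(later);
        return true;
    }
};
```

The inner while-loop is the single-thread issue point; replacing it with a loop that issues several threads at once is exactly the parallel extension discussed next.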
If the READY queue is empty, the scheduler fills the queue again by waking the threads that have received the events they were waiting for. These are taken out of the WAIT queue, and a new delta-cycle begins. If the READY queue is still empty after event delivery, the scheduler advances the simulation time, moves all threads with the earliest timestamp from the WAITFOR queue into the READY queue, and resumes execution (timed-cycle). Note that at any time, there is only one thread actively executing in the traditional simulation.

B. Parallel Discrete Event Simulation

A parallel DE scheduler basically works the same way as the traditional scheduler, with one exception: in each cycle, it picks multiple threads from the READY queue and runs them in parallel on the available processor cores. For SLDL simulators in particular, the parallel scheduler picks and runs as many threads as CPU cores are available¹. In other words, it keeps all parallel processing resources as busy as possible.

[Fig. 1: Traditional DE scheduler.]
[Fig. 2: Parallel DE scheduler.]

Fig. 2 shows the extended control flow of the parallel scheduler. Note the added loop at the left, which issues running threads as long as CPU cores are available and the READY queue still has candidates. Note also that for the traditional sequential scheduler any threading model (including user-level or kernel-level threads) is acceptable, since only one thread is running at any time. For the parallel scheduler to be effective on a multi-core platform, in contrast, OS kernel threads are necessary. Here, the underlying operating system needs to be aware of the multiple simulator threads so that it can utilize the available parallel CPU cores.
¹ Scheduling more threads to run than cores are available would be possible, but this would hurt the simulation performance due to unnecessary preemptive scheduling by the underlying operating system.
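The issue loop added in Fig. 2 can be approximated in plain C++ as follows. The work items stand in for runnable simulator threads, and the per-cycle join plays the role of the scheduler's synchronization barrier; all names are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// One scheduling phase of a parallel DE kernel, reduced to its core:
// repeatedly pick up to num_cores entries from the READY queue, run
// them as real OS threads, and wait at a barrier before the next cycle.
void run_ready_parallel(std::vector<std::function<void()>>& ready,
                        unsigned num_cores) {
    while (!ready.empty()) {
        std::vector<std::thread> issued;
        // Issue loop: as long as cores and READY candidates are left.
        while (!ready.empty() && issued.size() < num_cores) {
            issued.emplace_back(ready.back());
            ready.pop_back();
        }
        for (std::thread& t : issued) t.join();  // cycle barrier
    }
}
```

Because the issued workers really are OS kernel threads, the operating system can place them on distinct cores, which is the whole point of the extension.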
[Fig. 3: Simulation results for highly parallel benchmark designs.]

III. PARALLEL SYSTEM-LEVEL BENCHMARKS

To demonstrate the potential of parallel simulation, we have designed two highly parallel benchmark design models, a parallel floating-point multiplication example and a parallel recursive Fibonacci calculator. Both benchmark designs are system-level models specified in the SpecC SLDL.

For all our experiments, we use a symmetric multiprocessing (SMP) capable server running 64-bit Fedora 12 Linux. The SMP hardware specifically consists of 2 Intel(R) Xeon(R) X5650 processors running at 2.67 GHz (to ensure consistent timing measurements, we have disabled the dynamic frequency scaling and turbo mode of the processors). Each CPU contains 6 parallel cores, each of which supports 2 hyperthreads. Thus, in total the server hardware supports up to 24 threads running in parallel.

A. Parallel floating-point multiplications

Our first parallel benchmark, fmul, is a simple stress-test example for parallel floating-point calculations. Specifically, fmul creates 256 parallel instances which each perform 10 million double-precision floating-point multiplications. As an extreme example, the parallel threads are completely independent, i.e., they do not communicate or share any variables.

The chart in Fig. 3 shows the experimental results for our parallel simulator when executing this benchmark. To demonstrate the scalability of parallel execution on our server, we vary the number of parallel threads admitted by the parallel scheduler (the value #CPUs in Fig. 2) between 1 and 30. The elapsed simulator run times are shown in the red line with circle points. Clearly, with more and more CPU cores used, the elapsed simulation time decreases in near-logarithmic fashion and flattens when no more CPU cores become available (> 24).
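The structure of the fmul benchmark can be mimicked with plain OS threads; the sketch below uses scaled-down counts, but the 256-instance, 10-million-multiplication setup above follows the same pattern of fully independent workers.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// fmul-like stress pattern: fully independent workers that neither
// communicate nor share variables, each performing a fixed number of
// double-precision multiplications.
void fmul_worker(std::size_t iterations, double factor, double* out) {
    double acc = 1.0;
    for (std::size_t i = 0; i < iterations; ++i)
        acc *= factor;
    *out = acc;
}

// Launch `instances` workers in parallel and collect their results.
std::vector<double> run_fmul(std::size_t instances, std::size_t iterations,
                             double factor) {
    std::vector<double> results(instances, 0.0);
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < instances; ++i)
        pool.emplace_back(fmul_worker, iterations, factor, &results[i]);
    for (std::thread& t : pool) t.join();
    return results;
}
```

Since the workers touch no shared data, the only sequential parts are thread creation and joining, which is why this benchmark scales almost ideally with the number of cores.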
When plotting the relative speedup, as shown in the green line with triangle points, one can see that, as expected, the simulation speed increases in a nearly linear manner the more parallel cores are used. The maximal speedup is about 17x for this example on our 24-core server.

B. Parallel Fibonacci calculation

Our second parallel benchmark, fibo, calculates the Fibonacci series in parallel and recursive fashion. Recall that a Fibonacci number is defined as the sum of the previous two Fibonacci numbers, fib(n) = fib(n-1) + fib(n-2), with the first two numbers fib(0) = 0 and fib(1) = 1. Our fibo design parallelizes the Fibonacci calculation by letting two parallel units compute the two previous numbers in the series. This parallel decomposition continues up to a user-specified depth limit (in our case 8), from where on the classic recursive calculation method is used. In contrast to the fmul example above, the fibo benchmark uses shared variables to communicate the input and calculated output values between the units, as well as a few counters to keep track of the actual number of parallel threads (for statistical purposes). Thus, the threads are not fully independent of each other. Also, the computational load is not evenly distributed among the instances, because the number of calculations increases by a factor of approximately φ ≈ 1.618 (the golden ratio) for every next number.

The fibo simulation results are also plotted in Fig. 3. The elapsed simulator run times, shown in the dark red line with square points, show the same decreasing curve as the fmul example. Accordingly, the relative simulation speedup, shown in the blue line with diamond points, also shows the same expected behavior: speed increases in nearly linear fashion until it reaches saturation at about a factor of 14x. When comparing the fmul and fibo benchmark results, we notice a more regular behavior of the fmul example due to its even load and zero inter-thread communication.

IV. EXPERIMENTAL RESULTS FOR EMBEDDED APPLICATIONS

To demonstrate the improved simulation speed of the parallel multi-core simulator for actual embedded applications, we use an H.264 video decoder and a JPEG encoder application [2].

A. H.264 AVC Decoder with Parallel Slice Decoding

The H.264 Advanced Video Coding (AVC) standard [13] is widely used in video applications, such as internet streaming, disc storage, and television services. It provides high-quality video at less than half the bit rate compared to its predecessors H.263 and H.262. At the same time, it requires more computing resources for both video encoding and decoding. The H.264 decoder takes as input a stream of encoded video frames. A frame can be further split into one or more slices, as illustrated in the upper right part of Fig. 4. Notably, slices are independent of each other in the sense that decoding one slice will not require any data from the other slices. For this reason, parallelism exists at the slice level, and parallel slice decoders can be used to decode multiple slices in a frame simultaneously.

[Fig. 4: Parallelized H.264 decoder model.]

Fig. 4 shows the block diagram of our H.264 decoder model. The decoding of a frame begins with reading new slices from the input stream. These are then dispatched into four parallel slice decoders. Finally, a synchronizer block completes the decoding by applying a deblocking filter to the decoded frame. All the blocks communicate via FIFO channels. Internally, each slice decoder consists of the regular H.264 decoder functions, such as entropy decoding, inverse quantization and transformation, motion compensation, and intra-prediction.

For our experiment, we use the stream Harbour with 299 video frames, each with 4 slices of equal size. As discussed in [2], 68.4% of the total computation time is spent in the parallel slice decoding. Thus, while the maximum parallelism is 4 in this model, the effective maximum speedup we can gain is only 2.05 (due to the sequential parts).

Table 5 lists the simulator run times for TLMs at different abstraction levels. We compare the elapsed simulation time against the sequential reference simulator (the table also includes the CPU load reported by the Linux OS). We can clearly see that the 24 available cores on the server are under-utilized. The reason is obviously the maximum available parallelism in the model (2.05). The measured speedups are also somewhat lower than the maximum due to the overhead introduced in parallelizing and synchronizing the slice decoders.

Fig. 5: Simulation results for the H.264 Decoder example (reference simulator vs. multi-core parallel simulator with 24 issued threads; CPU load in parentheses).

Model | Reference sim. time | Multi-core sim. time | Speedup
spec  | 37.71s (90%)        | 21.01s (173%)        | 1.79
arch  | 38.26s (90%)        | 21.26s (171%)        | 1.80
sched | 38.10s (91%)        | 22.31s (166%)        | 1.71
net   | 38.18s (91%)        | 22.45s (166%)        | 1.70
tlm   | 39.17s (90%)        | 22.32s (167%)        | 1.75
comm  | 49.92s (90%)        | 33.55s (156%)        | 1.49

B. JPEG Image Encoder

As a second embedded application, Table 6 shows the simulation speedup for a JPEG Encoder example which performs its DCT, Quantization, and Zigzag modules for the 3 color components in parallel, followed by a sequential Huffman encoder at the end.

Fig. 6: Simulation results for the JPEG Encoder example (reference simulator vs. multi-core parallel simulator with 24 issued threads; CPU load in parentheses).

Model | Reference sim. time | Multi-core sim. time | Speedup
spec  | 3.46s (93%)         | 2.82s (146%)         | 1.23
arch  | 3.92s (94%)         | 2.83s (147%)         | 1.39
sched | 4.64s (93%)         | 3.61s (137%)         | 1.29
net   | 13.87s (97%)        | 45.96s (130%)        | 0.30
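The 2.05 bound quoted for the H.264 model is simply Amdahl's law: with a parallel fraction p = 0.684 spread over n = 4 slice decoders, the attainable speedup is 1 / ((1 - p) + p / n). A quick check (the function name is ours):

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup when a fraction p of the
// work is parallelized over n units and the rest stays sequential.
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

Evaluating amdahl_speedup(0.684, 4.0) yields approximately 2.053, matching the effective maximum speedup stated above for the H.264 decoder model.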
Again, the available parallelism in the model is low (maximally 3 parallel threads, followed by a significant sequential part). Some speedup is gained by the multi-core parallel simulator for the higher-level models (spec, arch, sched). However, the simulator performance decreases for the model at the lowest abstraction level (net) due to the high number of bus transactions and arbitrations, which are not parallelized.

V. CONCLUDING REMARKS

Parallel Discrete Event Simulation (PDES) has recently come back into fashion. PDES is highly desirable for ESL design due to the constantly rising complexity of embedded systems, which makes accurate and fast simulation challenging. After a brief review of the essential differences between traditional discrete event simulation and PDES, we have examined two highly parallel benchmark applications, fmul and fibo, and two actual embedded design examples, an H.264 decoder and a JPEG encoder. While the measured simulation
times of the parallel fmul and fibo benchmarks promise a linear increase in simulation speed due to their excellent scalability, the speedup achieved for the two embedded applications is quite limited. Nevertheless, the measured speedup is significant (nearly 2x!) and can make a real difference in shortening the validation time of such systems.

Given the need for higher simulation speeds and the demonstrated potential of parallel simulation, it becomes clear that PDES is and will remain an area of active research. Notably, the basic PDES algorithm discussed in Section II.B is very straightforward, and advanced techniques, including conservative and optimistic approaches, can further improve the simulation speed by optimizing the scheduling around synchronization barriers. However, all efforts are limited by the amount of exposed parallelism in the application. We can conclude that the parallelism available in the application, which in TLMs is explicitly specified by SLDL constructs, can be exploited by parallel simulators and then results in an effective reduction of simulator run time. However, the Grand Challenge remains: the problem of how to efficiently parallelize applications.

ACKNOWLEDGMENTS

This work has been supported in part by funding from the National Science Foundation (NSF) under research grant NSF Award #. The authors thank the NSF for the valuable support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] K. Chandy and J. Misra. Distributed Simulation: A Case Study in Design and Verification of Distributed Programs. IEEE Transactions on Software Engineering, SE-5(5), Sept.
[2] W. Chen, X. Han, and R. Dömer. Multiprocessor SoC Platforms: A Component-Based Design Approach. IEEE Design and Test of Computers, 28(3):20-31, May/June.
[3] B. Chopard, P. Combes, and J. Zory. A Conservative Approach to SystemC Parallelization. In V. N. Alexandrov, G. D. van Albada, P. M. A. Sloot, and J. Dongarra, editors, International Conference on Computational Science (4), volume 3994 of Lecture Notes in Computer Science. Springer.
[4] R. Dömer, W. Chen, X. Han, and A. Gerstlauer. Multi-core parallel simulation of system-level description languages. In Proceedings of the Asia and South Pacific Design Automation Conference (ASPDAC), Yokohama, Japan, January.
[5] R. Fujimoto. Parallel Discrete Event Simulation. Communications of the ACM, 33(10):30-53, Oct.
[6] D. D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao. SpecC: Specification Language and Design Methodology. Kluwer.
[7] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer.
[8] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele. Scalably Distributed SystemC Simulation for Embedded Applications. In International Symposium on Industrial Embedded Systems (SIES 2008), June.
[9] T. Li, Y. Guo, and S.-K. Li. Design and Implementation of a Parallel Verilog Simulator: PVSim. In International Conference on VLSI Design.
[10] E. Naroska. Parallel VHDL simulation. In DATE '98: Proceedings of the Conference on Design, Automation and Test in Europe.
[11] D. Nicol and P. Heidelberger. Parallel Execution for Serial Simulators. ACM Transactions on Modeling and Computer Simulation, 6(3), July.
[12] E. P, P. Chandran, J. Chandra, B. P. Simon, and D. Ravi. Parallelizing SystemC Kernel for Fast Hardware Simulation on SMP Machines. In PADS '09: Proceedings of the 2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation, pages 80-87, Washington, DC, USA. IEEE Computer Society.
[13] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7), July.
More informationCompression of Stereo Images using a Huffman-Zip Scheme
Compression of Stereo Images using a Huffman-Zip Scheme John Hamann, Vickey Yeh Department of Electrical Engineering, Stanford University Stanford, CA 94304 jhamann@stanford.edu, vickey@stanford.edu Abstract
More informationSOFTWARE AND DRIVER SYNTHESIS FROM TRANSACTION LEVEL MODELS
SOFTWARE AND DRIVER SYNTHESIS FROM TRANSACTION LEVEL MODELS Haobo Yu, Rainer Dömer, Daniel D. Gajski Center of Embedded Computer Systems University of California, Irvine Abstract This work presents a method
More informationParallel Computers. c R. Leduc
Parallel Computers Material based on B. Wilkinson et al., PARALLEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers c 2002-2004 R. Leduc Why Parallel Computing?
More informationRISC Compiler and Simulator, Release V0.4.0: Out-of-Order Parallel Simulatable SystemC Subset
Center for Embedded and Cyber-physical Systems University of California, Irvine RISC Compiler and Simulator, Release V0.4.0: Out-of-Order Parallel Simulatable SystemC Subset Guantao Liu, Tim Schmidt, Zhongqi
More informationCOMPILED CODE IN DISTRIBUTED LOGIC SIMULATION. Jun Wang Carl Tropper. School of Computer Science McGill University Montreal, Quebec, CANADA H3A2A6
Proceedings of the 2006 Winter Simulation Conference L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, eds. COMPILED CODE IN DISTRIBUTED LOGIC SIMULATION Jun Wang Carl
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationEE382N.23: Embedded System Design and Modeling
EE382N.23: Embedded System Design and Modeling Lecture 3 Language Semantics Andreas Gerstlauer Electrical and Computer Engineering University of Texas at Austin gerstl@ece.utexas.edu Lecture 3: Outline
More informationA Study of the Effect of Partitioning on Parallel Simulation of Multicore Systems
A Study of the Effect of Partitioning on Parallel Simulation of Multicore Systems Zhenjiang Dong, Jun Wang, George Riley, Sudhakar Yalamanchili School of Electrical and Computer Engineering Georgia Institute
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing This document consists of two parts. The first part introduces basic concepts and issues that apply generally in discussions of parallel computing. The second part consists
More informationAN INTERACTIVE MODEL RE-CODER FOR EFFICIENT SOC SPECIFICATION
AN INTERACTIVE MODEL RE-CODER FOR EFFICIENT SOC SPECIFICATION Center for Embedded Computer Systems University of California Irvine pramodc@uci.edu, doemer@uci.edu Abstract To overcome the complexity in
More informationCosimulation of ITRON-Based Embedded Software with SystemC
Cosimulation of ITRON-Based Embedded Software with SystemC Shin-ichiro Chikada, Shinya Honda, Hiroyuki Tomiyama, Hiroaki Takada Graduate School of Information Science, Nagoya University Information Technology
More informationFast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda
Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE 5359 Gaurav Hansda 1000721849 gaurav.hansda@mavs.uta.edu Outline Introduction to H.264 Current algorithms for
More informationLecture 14. Synthesizing Parallel Programs IAP 2007 MIT
6.189 IAP 2007 Lecture 14 Synthesizing Parallel Programs Prof. Arvind, MIT. 6.189 IAP 2007 MIT Synthesizing parallel programs (or borrowing some ideas from hardware design) Arvind Computer Science & Artificial
More informationParallel Programming Multicore systems
FYS3240 PC-based instrumentation and microcontrollers Parallel Programming Multicore systems Spring 2011 Lecture #9 Bekkeng, 4.4.2011 Introduction Until recently, innovations in processor technology have
More informationIntroducing Preemptive Scheduling in Abstract RTOS Models using Result Oriented Modeling
Introducing Preemptive Scheduling in Abstract RTOS Models using Result Oriented Modeling Gunar Schirner, Rainer Dömer Center of Embedded Computer Systems, University of California Irvine E-mail: {hschirne,
More informationThe Design and Evaluation of Hierarchical Multilevel Parallelisms for H.264 Encoder on Multi-core. Architecture.
UDC 0043126, DOI: 102298/CSIS1001189W The Design and Evaluation of Hierarchical Multilevel Parallelisms for H264 Encoder on Multi-core Architecture Haitao Wei 1, Junqing Yu 1, and Jiang Li 1 1 School of
More informationLars Schor, and Lothar Thiele ETH Zurich, Switzerland
Iuliana Bacivarov, Wolfgang Haid, Kai Huang, Lars Schor, and Lothar Thiele ETH Zurich, Switzerland Efficient i Execution of KPN on MPSoC Efficiency regarding speed-up small memory footprint portability
More information2 TEST: A Tracer for Extracting Speculative Threads
EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath
More informationComp. Org II, Spring
Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel
More informationCommunication Abstractions for System-Level Design and Synthesis
Communication Abstractions for System-Level Design and Synthesis Andreas Gerstlauer Technical Report CECS-03-30 October 16, 2003 Center for Embedded Computer Systems University of California, Irvine Irvine,
More informationNon-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
More informationLecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter
Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)
More informationParallel Processing & Multicore computers
Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster
More informationECE519 Advanced Operating Systems
IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationLECTURE 3:CPU SCHEDULING
LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives
More informationParallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming
Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),
More informationEfficient Usage of Concurrency Models in an Object-Oriented Co-design Framework
Efficient Usage of Concurrency Models in an Object-Oriented Co-design Framework Piyush Garg Center for Embedded Computer Systems, University of California Irvine, CA 92697 pgarg@cecs.uci.edu Sandeep K.
More informationComp. Org II, Spring
Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer
More informationCS370 Operating Systems
CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts
More informationCS377P Programming for Performance Multicore Performance Multithreading
CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX
More informationAll MSEE students are required to take the following two core courses: Linear systems Probability and Random Processes
MSEE Curriculum All MSEE students are required to take the following two core courses: 3531-571 Linear systems 3531-507 Probability and Random Processes The course requirements for students majoring in
More informationModeling and Simulating Discrete Event Systems in Metropolis
Modeling and Simulating Discrete Event Systems in Metropolis Guang Yang EECS 290N Report December 15, 2004 University of California at Berkeley Berkeley, CA, 94720, USA guyang@eecs.berkeley.edu Abstract
More informationCo-synthesis and Accelerator based Embedded System Design
Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to
More informationParallel Simulation Accelerates Embedded Software Development, Debug and Test
Parallel Simulation Accelerates Embedded Software Development, Debug and Test Larry Lapides Imperas Software Ltd. larryl@imperas.com Page 1 Modern SoCs Have Many Concurrent Processing Elements SMP cores
More informationGOP Level Parallelism on H.264 Video Encoder for Multicore Architecture
2011 International Conference on Circuits, System and Simulation IPCSIT vol.7 (2011) (2011) IACSIT Press, Singapore GOP Level on H.264 Video Encoder for Multicore Architecture S.Sankaraiah 1 2, H.S.Lam,
More informationSimulation Of Computer Systems. Prof. S. Shakya
Simulation Of Computer Systems Prof. S. Shakya Purpose & Overview Computer systems are composed from timescales flip (10-11 sec) to time a human interacts (seconds) It is a multi level system Different
More informationOverview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware
Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and
More informationOverview: Shared Memory Hardware
Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationOperating Systems Overview. Chapter 2
Operating Systems Overview Chapter 2 Operating System A program that controls the execution of application programs An interface between the user and hardware Masks the details of the hardware Layers and
More informationWhy Parallel Architecture
Why Parallel Architecture and Programming? Todd C. Mowry 15-418 January 11, 2011 What is Parallel Programming? Software with multiple threads? Multiple threads for: convenience: concurrent programming
More informationRTOS Modeling for System Level Design
RTOS Modeling for System Level Design Design, Automation and Test in Europe Conference and Exhibition (DATE 03) Andreas Gerslauer, Haobo Yu, Daniel D. Gajski 2010. 1. 20 Presented by Jinho Choi c KAIST
More informationSystem-level design refinement using SystemC. Robert Dale Walstrom. A thesis submitted to the graduate faculty
System-level design refinement using SystemC by Robert Dale Walstrom A thesis submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Major: Computer
More informationOptimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased
Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics
More informationHigh Efficiency Video Decoding on Multicore Processor
High Efficiency Video Decoding on Multicore Processor Hyeonggeon Lee 1, Jong Kang Park 2, and Jong Tae Kim 1,2 Department of IT Convergence 1 Sungkyunkwan University Suwon, Korea Department of Electrical
More informationCS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University
CS 571 Operating Systems Midterm Review Angelos Stavrou, George Mason University Class Midterm: Grading 2 Grading Midterm: 25% Theory Part 60% (1h 30m) Programming Part 40% (1h) Theory Part (Closed Books):
More informationMoore s Law. Computer architect goal Software developer assumption
Moore s Law The number of transistors that can be placed inexpensively on an integrated circuit will double approximately every 18 months. Self-fulfilling prophecy Computer architect goal Software developer
More informationA COMPARISON OF CABAC THROUGHPUT FOR HEVC/H.265 VS. AVC/H.264. Massachusetts Institute of Technology Texas Instruments
2013 IEEE Workshop on Signal Processing Systems A COMPARISON OF CABAC THROUGHPUT FOR HEVC/H.265 VS. AVC/H.264 Vivienne Sze, Madhukar Budagavi Massachusetts Institute of Technology Texas Instruments ABSTRACT
More informationQUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose
QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,
More informationEE382V: System-on-a-Chip (SoC) Design
EE382V: System-on-a-Chip (SoC) Design Lecture 10 Task Partitioning Sources: Prof. Margarida Jacome, UT Austin Prof. Lothar Thiele, ETH Zürich Andreas Gerstlauer Electrical and Computer Engineering University
More informationMultimedia Decoder Using the Nios II Processor
Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationDesign, Implementation and Evaluation of a Task-parallel JPEG Decoder for the Libjpeg-turbo Library
Design, Implementation and Evaluation of a Task-parallel JPEG Decoder for the Libjpeg-turbo Library Jingun Hong 1, Wasuwee Sodsong 1, Seongwook Chung 1, Cheong Ghil Kim 2, Yeongkyu Lim 3, Shin-Dug Kim
More informationThe modularity requirement
The modularity requirement The obvious complexity of an OS and the inherent difficulty of its design lead to quite a few problems: an OS is often not completed on time; It often comes with quite a few
More information