Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.)

Out-of-Order Parallel Simulation of SystemC Models using Intel MIC Architecture
G. Liu, T. Schmidt, R. Dömer (CECS)
A. Dingankar, D. Kirkpatrick (Intel Corp.)
Speaker: Rainer Dömer, doemer@uci.edu
Center for Embedded Computer Systems, University of California, Irvine
Out-of-Order Parallel Simulation (c) 2014 R. Doemer, et al.

Outline
- Electronic System Level Design
  - Project context, goals, and overview
- Parallel Simulation
  - Traditional Discrete Event Simulation (DES)
  - Parallel Discrete Event Simulation (PDES)
  - Out-of-Order Parallel Discrete Event Simulation (OoO PDES)
- Project Realization
  - Ongoing Research and Development
- Promising Experimental Results
  - Benchmarks
  - Highly parallel applications
- Concluding Remarks

Electronic System Level (ESL)
- Electronic System Level Models
  - Abstract description of a complete system
  - Hardware + Software
- Key Concepts in System Modeling
  - Explicit Structure
    - Block diagram structure
    - Connectivity through ports
  - Explicit Hierarchy
    - System composed of components
  - Explicit Concurrency
    - Potential for parallel execution
    - Potential for pipelined execution
  - Explicit Communication and Computation
    - Modules
    - Channels and Interfaces
[Figure: block diagram of a System composed of blocks B0, B1, B2, B3]

ESL Simulation
- Evaluation through Simulation
  - Efficient system-level simulation is critical: fast and accurate!
  - Complexity of system models grows constantly: need for speed!
- Parallel Simulation!
  - Parallelism is explicitly specified in the SystemC model: SC_THREAD, SC_METHOD
  - Parallel processing is available in standard PCs
    - Multi-core hosts readily available
    - Many-core technology is arriving
- Target Simulation Platform
  - Intel Many Integrated Core (MIC) Architecture

Project Overview
- Planned Design Flow (flow diagram, shown in two builds; elements listed in flow order)
  - Application
  - CoFluent Studio Modeling Suite: specification and modeling (CoFluent Studio modeling input)
  - C++ model
  - Compiler with OoO analysis: ROSE-based recoding compiler with static analysis and optimization (OoO PDES technology)
  - C++ meta component models
  - C++ Compiler -> Executable -> PC Simulation
  - Compiler (ICC) -> Executable -> Xeon Phi Platform Simulation (Intel MIC Architecture target platform, OoO Xeon Phi simulator platform)
  - Synthesis Tools -> Design Implementation

CoFluent Studio
- Modeling and Simulation Tool Suite
  - Supports model-driven architecture (MDA)
  - Based on Eclipse modeling framework (EMF)
- CoFluent Modeling Concept ensures a well-defined model as input

ESL Simulation
- Traditional Discrete Event Simulation (DES)
  - Reference simulators run sequentially, only one thread at a time (cooperative multi-threading model)
  - Cannot utilize the capabilities of multi- or many-core hosts
- Parallel Discrete Event Simulation (PDES)
  - Threads run in parallel (if at the same delta cycle and time)
  - Simulation cycles are absolute barriers!
- Out-of-Order Parallel DE Simulation (OoO PDES)
  - Best technique known today, developed by CECS [DATE 12]
  - Threads run in parallel and out of order, even in different delta and time cycles, if there are no conflicts!
  - Aggressive: runs the maximum number of threads in parallel, but fully preserves DES semantics and model accuracy!

Discrete Event Simulation (DES)
- Traditional DES
  - Concurrent threads of execution
  - Managed by a central scheduler
  - Driven by events and time advances
    - Delta-cycle
    - Time-cycle
  - Partial temporal order with barriers
- Reference Simulator
  - The SystemC reference simulator uses cooperative multi-threading
  - A single thread is active at any time!
  - Cannot exploit parallelism
  - Cannot utilize multiple cores
[Figure: threads th1 to th4 plotted over simulation cycles T:Δ = 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]

Parallel Discrete Event Simulation
- Parallel DES
  - Threads execute in parallel iff they are in the same delta cycle and in the same time cycle
  - Significant speedup!
  - Synchronous PDES: cycle boundaries are absolute barriers!
- Aggressive Parallel DES
  - Conservative approaches: careful static analysis prevents conflicts
  - Optimistic approaches: conflicts are detected and addressed (rollback)
[Figure: threads th1 to th4 plotted over simulation cycles T:Δ = 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]
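The difference between sequential DES and synchronous PDES can be sketched by counting host-level execution steps. The following is a minimal illustration, not the scheduler of any actual simulator: the `ReadyThread` record and the function names are hypothetical, and it assumes enough host cores to run every thread of one cycle at once.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// A simulation cycle is identified by (time, delta), like the T:Δ axis above.
using Cycle = std::pair<long, int>;

// Hypothetical record of a thread that is ready to run in a given cycle.
struct ReadyThread {
    int id;
    long time;   // simulated time of the activation
    int delta;   // delta-cycle count within that time step
};

// Group thread ids by cycle; std::map keeps cycles in (time, delta) order.
std::map<Cycle, std::vector<int>> group_by_cycle(const std::vector<ReadyThread>& ready) {
    std::map<Cycle, std::vector<int>> cycles;
    for (const ReadyThread& t : ready) {
        assert(t.time >= 0 && t.delta >= 0);
        cycles[Cycle(t.time, t.delta)].push_back(t.id);
    }
    return cycles;
}

// Sequential DES: one thread is active at any time, so one step per activation.
std::size_t des_steps(const std::vector<ReadyThread>& ready) {
    return ready.size();
}

// Synchronous PDES: all threads of one cycle run together, and the cycle
// boundary acts as a barrier, so one step per distinct cycle.
std::size_t sync_pdes_steps(const std::vector<ReadyThread>& ready) {
    return group_by_cycle(ready).size();
}
```

For example, four activations spread over the cycles 0:0 (two threads), 10:0, and 10:1 take four steps sequentially but three steps with synchronous PDES; only the two threads sharing cycle 0:0 gain anything.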

Out-of-Order Parallel DES
- Out-of-Order PDES
  - Threads execute in parallel iff they are in the same delta cycle and in the same time cycle, OR if there are no conflicts!
  - Can utilize an advanced compiler for static data conflict analysis
  - Allows as many threads in parallel as possible
  - Significantly higher speedup! Results at [DATE 12], [ASPDAC 12]
- Fully preserves DES execution semantics
  - Accuracy in results and timing
[Figure: threads th1 to th4 plotted over simulation cycles T:Δ = 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]

Synchronous vs. Out-of-Order PDES
- Simple Example: video and audio decoding with different frame rates
[Figure: input stream -> Stimulus -> DUT (H.264 decoder, MP3 decoder) -> Monitor (H.264 Monitor at 30 fps, MP3 Monitor at 38.28 fps)]

H.264 decoder:

    SC_MODULE(H264dec)
    {
      sc_port<read_if>  r;   // input port
      sc_port<write_if> w;   // output port

      void main() {
        while (1) {
          r->read(input_data);
          decode_h264_frame();
          wait(33.3, SC_MS);   // one frame period at about 30 fps
          w->write(out_data);
        }
      }

      SC_CTOR(H264dec) { SC_THREAD(main); }   // register main() as a thread process
    };

MP3 decoder:

    SC_MODULE(MP3dec)
    {
      sc_port<read_if>  r;   // input port
      sc_port<write_if> w;   // output port

      void main() {
        while (1) {
          r->read(input_data);
          decode_mp3_frame();
          wait(26.12, SC_MS);  // one frame period at about 38.28 fps
          w->write(out_data);
        }
      }

      SC_CTOR(MP3dec) { SC_THREAD(main); }    // register main() as a thread process
    };
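Why the two decoders are a hard case for a synchronous scheduler can be seen by listing the time cycles in which each one wakes up. The sketch below is an illustration only: the function name is hypothetical, the two `wait()` periods from the example are expressed in microseconds (33300 and 26120), and both threads are assumed to start at time 0.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Count how many threads wake up in each time cycle when thread i waits
// periods_us[i] microseconds between iterations and all start at time 0.
// The key is the global simulated time, the value the number of ready threads.
std::map<long, int> wakeups_per_time_cycle(const std::vector<long>& periods_us,
                                           long horizon_us) {
    std::map<long, int> wakeups;
    for (long period : periods_us) {
        assert(period > 0);
        for (long t = 0; t <= horizon_us; t += period) {
            ++wakeups[t];
        }
    }
    return wakeups;
}
```

Calling `wakeups_per_time_cycle({33300, 26120}, 100000)` yields seven time cycles in the first 100 ms, and only the cycle at time 0 has both decoders ready. Under a global time with cycle barriers there is then nothing to run side by side in the other six cycles, whereas per-thread local times would let both decoders advance concurrently as long as no conflict is found.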

Synchronous vs. Out-of-Order PDES
- Simple Example: video and audio decoding with different frame rates
- Synchronous PDES
  - Observes time and delta cycles
  - Global time
- Out-of-Order PDES
  - Breaks cycle barrier
  - Local times (per thread)
[Figure: input stream -> Stimulus -> DUT (H.264 decoder, MP3 decoder) -> Monitor (H.264 Monitor at 30 fps, MP3 Monitor at 38.28 fps), with timelines in [ms] for PDES and OoO PDES]

Many-Core Target Platform
- Intel Many Integrated Core Architecture
  - Intel Xeon Phi Coprocessor
  - Provides 60 processor cores
  - 4 hyper-threads per core
  - 240 parallel hardware threads!
- Hardware Features
  - Vector processing unit (VPU)
  - Extended Math Unit (EMU) for transcendental operations
  - Bidirectional ring interconnect
  - Peak performance over 1 teraflops (double-precision)
- Uses familiar and standard programming models
  - Appears as a regular Linux machine with 240 cores!

Project Realization
- Ongoing Research and Development
  1. Compiler with Out-of-Order PDES Analysis
     - Frontend for ROSE (lexer, parser, internal representation)
     - Segment Graph data structure for advanced conflict analysis
     - Code generator for parallel execution
  2. Simulator with Out-of-Order Scheduler
     - Scheduler with fast conflict table lookup
     - Target platform: Intel MIC architecture
     - Optimal thread-to-core task mapping
     - Kernel extension
       - Protected communication
       - Mutually-exclusive access to shared resources

Project Realization
- Compiler
  - Build abstract syntax tree
  - Build internal representation
  - Build segment graph
  - Build variable access lists
  - Identify potential conflicts
  - Build segment tables
  - Instrument wait() calls
  - Protect user-defined channels
  - Generate parallel C++ model
- Library
  - POSIX multi-threading
  - Reentrant primitives
  - Protected central resources
  - Protected standard channels
  - Out-of-order parallel scheduler
[Figure: Compiler -> C++ -> C++ Compiler (linked with the Library) -> Executable -> Multi-Core Host PC Simulation]
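The compiler steps "build variable access lists", "identify potential conflicts", and "build segment tables" can be pictured with a small data-conflict table. This is a simplified sketch under stated assumptions, not the authors' analysis: the `Segment` struct, the variable names, and the functions are hypothetical, and only data conflicts (write/write and read/write overlaps) are considered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-segment variable access lists.
struct Segment {
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True if the two access lists share at least one variable.
bool overlaps(const std::set<std::string>& a, const std::set<std::string>& b) {
    return std::any_of(a.begin(), a.end(),
                       [&b](const std::string& v) { return b.count(v) > 0; });
}

// Two segments conflict if one writes a variable the other reads or writes.
bool has_data_conflict(const Segment& a, const Segment& b) {
    return overlaps(a.writes, b.writes) ||
           overlaps(a.writes, b.reads) ||
           overlaps(a.reads, b.writes);
}

using ConflictTable = std::vector<std::vector<bool>>;

// Precompute all pairwise conflicts once, at "compile time" in this sketch.
ConflictTable build_conflict_table(const std::vector<Segment>& segs) {
    const std::size_t n = segs.size();
    ConflictTable table(n, std::vector<bool>(n, false));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            table[i][j] = has_data_conflict(segs[i], segs[j]);
        }
    }
    return table;
}

// Scheduler-side check: a constant-time table lookup instead of a set comparison.
bool may_run_in_parallel(const ConflictTable& table, std::size_t i, std::size_t j) {
    assert(i < table.size() && j < table.size());
    return !table[i][j];
}
```

The design point this illustrates is the split of work: the comparatively expensive set comparison happens once up front, so the run-time scheduler only needs an indexed lookup per pair of candidate segments.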

Promising Experimental Results
- What speedup is achievable on today's multi-core and many-core host platforms?
  - Early results using manually coded or SpecC-based examples
- Experimental Setup
  - SMP Host PC
    - 2 Intel Xeon X5650 CPUs at 2.66 GHz
    - 6 cores each, 2 hyper-threads per core
    - 24 parallel hardware threads available
  - Many Integrated Core (MIC) Platform
    - 1 Intel Xeon Phi Coprocessor 5110P at [...] GHz
    - 60 cores on ring-bus, 4 hyper-threads per core
    - 240 parallel hardware threads available
- Highly parallel benchmarks
  - Floating-point multiplications (fmul)
  - Fibonacci calculation (fibo)

Benchmark Results
- Experimental Results (2 Intel Xeon X5650 CPUs, 2x6x2 cores)
[Figure: fibo and fmul elapsed time [sec] and relative speedup over the number of cores of the multi-core host]
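The charts report elapsed time and relative speedup. As a reminder of how the two relate, here is a minimal sketch; the function names are hypothetical, and the baseline is assumed to be the sequential run on the same host.

```cpp
#include <cassert>

// Relative speedup: elapsed time of the sequential run divided by the
// elapsed time of the parallel run on the same host.
double relative_speedup(double sequential_sec, double parallel_sec) {
    assert(sequential_sec > 0.0 && parallel_sec > 0.0);
    return sequential_sec / parallel_sec;
}

// Parallel efficiency: the fraction of the ideal linear speedup achieved
// on the given number of hardware threads.
double parallel_efficiency(double speedup, int hardware_threads) {
    assert(hardware_threads > 0);
    return speedup / hardware_threads;
}
```

With made-up numbers, a run that drops from 120 s to 10 s has a relative speedup of 12, which on the 24 hardware threads of the SMP host would be an efficiency of 0.5.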

Benchmark Results
- Experimental Results (Intel Xeon Phi coprocessor, 60x4 cores)
[Figure: fibo and fmul elapsed time [sec] and relative speedup over the number of cores of the many-core host; one speedup label reads 103x]

GPU Pipeline Example
- Graphics Application: Mandelbrot Set
  - Mathematical set of points
  - Two-dimensional fractal shape
  - Complex computation
  - Recursive function
  - Extreme parallelism
    - Pixel level
- TLM abstraction
  - Slices
  - Configurable
  - Executable

GPU Pipeline Example
- Graphics Application: Mandelbrot Set
  - When synthesized, real-time rendering is no problem
  - When simulated, regular DES is very slow
  - Parallel DES can significantly speed up simulation!

GPU Pipeline Example
- Experimental Results
  - Sequence of 100 Mandelbrot images (640x448, depth 4096)
  - SpecC models with increasing number of parallel blocks
  - Hosts: Intel Core 2 Quad (4 cores), and Dual Xeon (12 cores)
  - 4 Core Host, PDES: up to 3.7x speedup!
  - 2 CPU 6 Core Host, PDES: 5.9x speedup!
  - 2 CPU 6 Core Host, OoO PDES: 6.3x speedup!
[Figure: speedup over the number of cores]
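The pixel-level parallelism of the Mandelbrot example comes from the fact that every pixel is computed independently, so an image can be cut into horizontal slices that share no data. The sketch below shows that partitioning only; it is not the SpecC or SystemC model from the experiments. The function names are hypothetical, and the recurrence z <- z*z + c is written as a loop rather than as a recursive function.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Number of iterations of z <- z*z + c (starting at z = 0) before |z| exceeds 2,
// capped at max_depth. Points that reach max_depth are treated as inside the set.
int mandelbrot_depth(double cr, double ci, int max_depth) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_depth && zr * zr + zi * zi <= 4.0) {
        const double next_zr = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = next_zr;
        ++n;
    }
    return n;
}

// Split image rows into `slices` contiguous half-open ranges [begin, end).
// Leftover rows are spread over the first slices so sizes differ by at most one.
std::vector<std::pair<int, int>> make_slices(int height, int slices) {
    assert(height > 0 && slices > 0 && slices <= height);
    std::vector<std::pair<int, int>> ranges;
    const int base = height / slices;
    const int extra = height % slices;
    int begin = 0;
    for (int s = 0; s < slices; ++s) {
        const int rows = base + (s < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + rows);
        begin += rows;
    }
    return ranges;
}
```

For the 640x448 images above, `make_slices(448, 4)` gives four slices of 112 rows each. Because no slice reads or writes another slice's pixels, each one could be handed to its own simulation thread without data conflicts.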

GPU Pipeline Example
- Experimental Results
  - Sequence of 100 Mandelbrot images (640x448, depth 4096)
  - Simplified PDES model (POSIX based, manually created)
  - Many-core platform: Intel Xeon Phi (60 x 4 cores)
  - Scales well on many-core platforms!
  - 4 Core Host, PDES: up to 3.7x speedup!
  - 2 CPU 6 Core Host, PDES: 5.9x speedup!
  - 2 CPU 6 Core Host, OoO PDES: 6.3x speedup!
  - 60x4 Core Xeon Phi, POSIX PDES: up to 46x speedup!
[Figure: speedup (0x to 50x) over the number of cores]

Concluding Remarks
- ESL design needs fast and accurate simulation
  - Traditional DES and PDES are insufficient
- Out-of-order PDES
  - Novel, aggressive, fast
  - Maximum parallelism
  - Fully semantics compliant and accurate
  - Promise of near-linear speedup on highly parallel platforms
- Compiler and Simulator
  - Compiler with Out-of-Order PDES Analysis
  - Simulator with Out-of-Order Scheduler
- Ongoing and Future Work
  - Completion of implementation, further evaluation
  - Collaboration with Accellera LWG

References (1)
[DATE 12] W. Chen, X. Han, R. Dömer: "Out-of-Order Parallel Simulation for ESL Design", Proceedings of DATE, Dresden, Germany, March 2012.
[ASPDAC 12] R. Dömer, W. Chen, X. Han: "Parallel Discrete Event Simulation of Transaction Level Models", Proceedings of ASPDAC, Sydney, Australia, February 2012.
[ASPDAC 12] W. Chen, R. Dömer: "An Optimizing Compiler for Out-of-Order Parallel ESL Simulation Exploiting Instance Isolation", Proceedings of ASPDAC, Sydney, Australia, February 2012.
[IEEE D&T 11] W. Chen, X. Han, R. Dömer: "Multicore Simulation of Transaction-Level Models Using the SoC Environment", IEEE Design & Test of Computers, vol. 28, no. 3, May-June 2011.
[ASPDAC 11] R. Dömer, W. Chen, X. Han, A. Gerstlauer: "Multi-Core Parallel Simulation of System-Level Description Languages", Proceedings of ASPDAC, Yokohama, Japan, January 2011.
[HLDVT 10] W. Chen, X. Han, R. Dömer: "ESL Design and Multi-Core Validation using the System-on-Chip Environment", Proceedings of HLDVT, Anaheim, California, June 2010.

References (2)
[DATE 14] W. Chen, X. Han, R. Dömer: "May-Happen-in-Parallel Analysis based on Segment Graphs for Safe ESL Models", accepted for publication at DATE, Dresden, Germany, March 2014. Best Paper Award!
[DATE 13] W. Chen, R. Dömer: "Optimized Out-of-Order Parallel Discrete Event Simulation Using Predictions", Proceedings of DATE, Grenoble, France, March 2013.
[IEEE D&T 13] W. Chen, X. Han, C. Chang, R. Dömer: "Advances in Parallel Discrete Event Simulation for Electronic System-Level Design", IEEE Design & Test of Computers, vol. 30, no. 1, Jan.-Feb. 2013.
[HLDVT 12] W. Chen, C. Chang, X. Han, R. Dömer: "Eliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure", Proceedings of HLDVT 2012, Huntington Beach, California, November 2012.
