Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.)
Out-of-Order Parallel Simulation of SystemC Models using Intel MIC Architecture
G. Liu, T. Schmidt, R. Dömer (CECS), A. Dingankar, D. Kirkpatrick (Intel Corp.)
Speaker: Rainer Dömer, doemer@uci.edu
Center for Embedded Computer Systems, University of California, Irvine
(c) 2014 R. Doemer, et al.

Outline
- Electronic System Level Design
  - Project context, goals, and overview
- Simulation
  - Traditional Discrete Event Simulation (DES)
  - Parallel Discrete Event Simulation (PDES)
  - Out-of-Order Parallel Discrete Event Simulation (OoO PDES)
- Project Realization
  - Ongoing Research and Development
- Promising Experimental Results
  - Benchmarks
  - Highly parallel applications
- Concluding Remarks

Electronic System Level (ESL)
- Electronic System Level Models
  - Abstract description of a complete system
  - Hardware + Software
- Key Concepts in System Modeling
  - Explicit Structure
    - Block diagram structure
    - Connectivity through ports
  - Explicit Hierarchy
    - System composed of components
  - Explicit Concurrency
    - Potential for parallel execution
    - Potential for pipelined execution
  - Explicit Communication and Computation
    - Modules
    - Channels and Interfaces
[Figure: system model composed of blocks B0, B1, B2, B3]

ESL Simulation
- Evaluation through simulation
  - Efficient system-level simulation is critical
  - Fast and accurate!
- Complexity of system models grows constantly
  - Need for speed!
- Parallel Simulation!
  - Parallelism is explicitly specified in the model
    - SystemC: SC_THREAD, SC_METHOD
  - Parallel processing is available in standard PCs
    - Multi-core hosts readily available
    - Many-core technology is arriving
- Target Simulation Platform
  - Intel Many Integrated Core (MIC) Architecture

Project Overview: Planned Design Flow
[Figure: planned design flow. Elements shown: Application; CoFluent Studio (specification, modeling); ROSE-based Recoding Compiler (static analysis, optimization); C++ Compiler, executable, PC simulation; Compiler (ICC), executable, Xeon Phi platform simulation; Synthesis Tools, design implementation]
- CoFluent Studio modeling input
- ROSE-based Recoding Compiler
- OoO PDES technology
- Intel MIC Architecture target platform

Project Overview: Planned Design Flow (annotated)
[Figure: the same design flow with three parts highlighted]
- CoFluent Studio Modeling Suite
- Compiler with OoO Analysis
- OoO Xeon Phi Simulator Platform

CoFluent Studio
- Modeling and Simulation Tool Suite
  - Supports model-driven architecture (MDA)
  - Based on the Eclipse Modeling Framework (EMF)
- The CoFluent Modeling Concept ensures a well-defined model as input

ESL Simulation
- Traditional Discrete Event Simulation (DES)
  - Reference simulators run sequentially, only one thread at a time (cooperative multi-threading model)
  - Cannot utilize the capabilities of multi- or many-core hosts
- Parallel Discrete Event Simulation (PDES)
  - Threads run in parallel (if at the same delta cycle and time)
  - Simulation cycles are absolute barriers!
- Out-of-Order Parallel DE Simulation (OoO PDES)
  - Best technique known today, developed by CECS [DATE 12]
  - Threads run in parallel and out of order, even in different delta and time cycles, if there are no conflicts!
  - Aggressive: runs the maximum number of threads in parallel, but fully preserves DES semantics and model accuracy!

Discrete Event Simulation (DES)
- Traditional DES
  - Concurrent threads of execution
  - Managed by a central scheduler
  - Driven by events and time advances
    - Delta-cycle
    - Time-cycle
  - Partial temporal order with barriers
- Reference Simulator
  - The SystemC reference simulator uses cooperative multi-threading
  - A single thread is active at any time!
  - Cannot exploit parallelism
  - Cannot utilize multiple cores
[Figure: threads th1 to th4 scheduled one at a time over time:delta cycles (T:Δ) 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]

Parallel Discrete Event Simulation
- Parallel DES
  - Threads execute in parallel iff they are in the same delta cycle and in the same time cycle
  - Significant speedup!
  - Synchronous PDES: cycle boundaries are absolute barriers!
- Aggressive Parallel DES
  - Conservative approaches
    - Careful static analysis prevents conflicts
  - Optimistic approaches
    - Conflicts are detected and addressed (roll back)
[Figure: threads th1 to th4 running in parallel within each time:delta cycle (T:Δ) 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]
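
The synchronous PDES rule above (threads run together only within the same time and delta cycle) can be sketched as follows. This is a minimal illustration, not the scheduler of any actual simulator; `Timestamp`, `ReadyThread`, and `next_parallel_batch` are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Simulation timestamp: a (time, delta) pair, ordered lexicographically.
struct Timestamp {
    std::uint64_t time;   // simulated time
    std::uint32_t delta;  // delta-cycle count within that time
};

inline bool operator<(const Timestamp& a, const Timestamp& b) {
    return std::tie(a.time, a.delta) < std::tie(b.time, b.delta);
}

inline bool operator==(const Timestamp& a, const Timestamp& b) {
    return a.time == b.time && a.delta == b.delta;
}

// Hypothetical record of a thread that is ready to run at a given timestamp.
struct ReadyThread {
    int id;
    Timestamp at;
};

// Synchronous PDES: only the threads at the earliest (time, delta) may run
// together; every other thread waits behind the cycle barrier.
std::vector<int> next_parallel_batch(const std::vector<ReadyThread>& ready) {
    std::vector<int> batch;
    if (ready.empty()) return batch;
    const Timestamp earliest =
        std::min_element(ready.begin(), ready.end(),
                         [](const ReadyThread& a, const ReadyThread& b) {
                             return a.at < b.at;
                         })->at;
    for (const ReadyThread& t : ready)
        if (t.at == earliest) batch.push_back(t.id);
    assert(!batch.empty());  // the earliest thread itself is always included
    return batch;
}
```

With ready threads at 10:0, 10:1, 10:0, and 20:0, only the two threads at 10:0 form the next batch; the others wait until the scheduler advances the cycle.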

Out-of-Order Parallel DES
- Out-of-Order PDES
  - Threads execute in parallel iff they are in the same delta cycle and in the same time cycle, OR if there are no conflicts!
  - Can utilize an advanced compiler for static data conflict analysis
  - Allows as many threads in parallel as possible
  - Significantly higher speedup!
    - Results at [DATE 12], [ASPDAC 12]
  - Fully preserves DES execution semantics
    - Accuracy in results and timing
[Figure: threads th1 to th4 running in parallel and out of order across time:delta cycles (T:Δ) 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]

Synchronous vs. Out-of-Order PDES
- Simple example: video and audio decoding with different frame rates
[Figure: an input stream feeds the Stimulus; the DUT contains an H.264 decoder and an MP3 decoder; the Monitor contains an H.264 Monitor (30 fps) and an MP3 Monitor (38.28 fps)]

    // read_if, write_if, input_data, out_data, and decode_h264_frame()
    // are assumed to be declared elsewhere in the model.
    SC_MODULE(H264dec)
    {
        sc_port<read_if>  r;
        sc_port<write_if> w;

        void main() {
            while (1) {
                r->read(input_data);
                decode_h264_frame();
                wait(33.3, SC_MS);     // one video frame period (30 fps)
                w->write(out_data);
            }
        }

        SC_CTOR(H264dec) { SC_THREAD(main); }  // thread registration, omitted on the slide
    };

    // decode_mp3_frame() is likewise assumed to be declared elsewhere.
    SC_MODULE(MP3dec)
    {
        sc_port<read_if>  r;
        sc_port<write_if> w;

        void main() {
            while (1) {
                r->read(input_data);
                decode_mp3_frame();
                wait(26.12, SC_MS);    // one audio frame period (38.28 fps)
                w->write(out_data);
            }
        }

        SC_CTOR(MP3dec) { SC_THREAD(main); }   // thread registration, omitted on the slide
    };
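
The out-of-order rule ("in parallel if there are no conflicts") can be illustrated with a small issue check. This is a simplified sketch under stated assumptions: it considers only a precomputed data-conflict table indexed by code segment, whereas a real out-of-order scheduler would have further conditions to respect (for example event notifications and time advances). `LocalTime`, `ThreadState`, `ConflictTable`, and `may_issue` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-thread local timestamp (time in microseconds, plus delta count).
struct LocalTime {
    std::uint64_t time_us;
    std::uint32_t delta;
};

inline bool earlier(const LocalTime& a, const LocalTime& b) {
    return a.time_us < b.time_us ||
           (a.time_us == b.time_us && a.delta < b.delta);
}

// Hypothetical scheduler view of one thread.
struct ThreadState {
    std::size_t segment;  // index of the code segment the thread runs next
    LocalTime at;         // the thread's own local time
};

// conflict[a][b] is true if segment a and segment b may conflict.
using ConflictTable = std::vector<std::vector<bool>>;

// A thread may be issued ahead of global time if no thread with an earlier
// local time is about to run a segment that conflicts with its own segment.
bool may_issue(std::size_t candidate,
               const std::vector<ThreadState>& threads,
               const ConflictTable& conflict) {
    assert(candidate < threads.size());
    const ThreadState& c = threads[candidate];
    for (std::size_t i = 0; i < threads.size(); ++i) {
        if (i == candidate) continue;
        const ThreadState& o = threads[i];
        assert(o.segment < conflict.size() &&
               c.segment < conflict[o.segment].size());
        if (earlier(o.at, c.at) && conflict[o.segment][c.segment])
            return false;  // an earlier thread could still affect this one
    }
    return true;
}
```

In the decoder example, the H.264 thread at 33.3 ms and the MP3 thread at 26.12 ms can both be issued as long as their segments do not conflict; if the table marks them as conflicting, only the earlier MP3 thread may proceed.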

Synchronous vs. Out-of-Order PDES
- Simple example: video and audio decoding with different frame rates
- Synchronous PDES
  - Observes time and delta cycles
  - Global time
- Out-of-Order PDES
  - Breaks the cycle barrier
  - Local times (per thread)
[Figure: the example model (input stream, Stimulus, DUT with H.264 and MP3 decoders, Monitor with H.264 Monitor at 30 fps and MP3 Monitor at 38.28 fps) with execution timelines in ms for PDES and OoO PDES]

Many-Core Target Platform
- Intel Many Integrated Core Architecture
  - Intel Xeon Phi Coprocessor
  - Provides 60 processor cores
  - 4 hyper-threads per core
  - 240 parallel hardware threads!
- Hardware Features
  - Vector processing unit (VPU)
  - Extended Math Unit (EMU) for transcendental operations
  - Bidirectional ring interconnect
  - Peak performance over 1 teraflops (double-precision)
- Uses familiar and standard programming models
  - Appears as a regular Linux machine with 240 cores!
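
For the video and audio decoding example (frame periods of 33.3 ms and 26.12 ms), a short calculation suggests why a global-time barrier limits parallelism. The sketch below assumes both decoders start at time 0 and wake strictly periodically, which simplifies the model on the slides; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Greatest common divisor (Euclid's algorithm).
std::uint64_t gcd_u64(std::uint64_t a, std::uint64_t b) {
    while (b != 0) {
        const std::uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Number of wake-ups of one periodic thread in the interval (0, horizon_us].
std::uint64_t wakeups(std::uint64_t period_us, std::uint64_t horizon_us) {
    assert(period_us > 0);
    return horizon_us / period_us;
}

// Number of wake-ups in (0, horizon_us] at which two periodic threads share
// the same timestamp, i.e. could run together under a global-time barrier.
std::uint64_t shared_wakeups(std::uint64_t period_a_us,
                             std::uint64_t period_b_us,
                             std::uint64_t horizon_us) {
    assert(period_a_us > 0 && period_b_us > 0);
    const std::uint64_t lcm =
        period_a_us / gcd_u64(period_a_us, period_b_us) * period_b_us;
    return horizon_us / lcm;
}
```

Within one simulated second the video decoder wakes 30 times and the audio decoder 38 times, but under these assumptions they never wake at the same timestamp; the first coincidence is at about 43.5 s. With per-thread local times, the two decoders can instead overlap whenever they do not conflict.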

Project Realization: Ongoing Research and Development
1. Compiler with Out-of-Order PDES Analysis
  - SystemC frontend for ROSE (lexer, parser, internal representation)
  - Segment Graph data structure for advanced conflict analysis
  - Code generator for parallel execution
2. Simulator with Out-of-Order Scheduler
  - Scheduler with fast conflict table lookup
  - Target platform: Intel MIC architecture
    - Optimal thread-to-core task mapping
  - Kernel extension
    - Protected communication
    - Mutually exclusive access to shared resources

Project Realization
- Compiler
  - Build abstract syntax tree
  - Build internal representation
  - Build segment graph
  - Build variable access lists
  - Identify potential conflicts
  - Build segment tables
  - Instrument wait() calls
  - Protect user-defined channels
  - Generate parallel C++ model
- Library
  - POSIX multi-threading
  - Reentrant primitives
  - Protected central resources
  - Protected standard channels
  - Out-of-order parallel scheduler
[Figure: tool flow with Library, Compiler, C++, C++ Compiler, Executable, and Multi-Core Host PC Simulation]
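
The compiler steps "build variable access lists", "identify potential conflicts", and "build segment tables" can be sketched roughly as below. This is an illustrative assumption about how such a table might be derived, not the project's actual data structures: two segments are treated as conflicting if one writes a variable that the other reads or writes. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical variable access list of one code segment.
struct SegmentAccess {
    std::set<std::string> reads;
    std::set<std::string> writes;
};

using ConflictTable = std::vector<std::vector<bool>>;

// True if the two variable sets share at least one name.
static bool intersects(const std::set<std::string>& a,
                       const std::set<std::string>& b) {
    for (const std::string& v : a)
        if (b.count(v) != 0) return true;
    return false;
}

// Potential data conflict: write/write, write/read, or read/write overlap.
bool data_conflict(const SegmentAccess& a, const SegmentAccess& b) {
    return intersects(a.writes, b.writes) ||
           intersects(a.writes, b.reads) ||
           intersects(a.reads, b.writes);
}

// Precompute the table once so that the scheduler only needs a lookup.
ConflictTable build_conflict_table(const std::vector<SegmentAccess>& segs) {
    const std::size_t n = segs.size();
    ConflictTable table(n, std::vector<bool>(n, false));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            table[i][j] = data_conflict(segs[i], segs[j]);
    return table;
}

// Bounds-checked constant-time lookup, as a scheduler might use it.
bool in_conflict(const ConflictTable& table, std::size_t a, std::size_t b) {
    assert(a < table.size() && b < table[a].size());
    return table[a][b];
}
```

Doing the set intersections at compile time and keeping only a boolean table is one way to make the run-time conflict check a single indexed lookup.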

Promising Experimental Results
- What speedup is achievable on today's multi-core and many-core host platforms?
  - Early results using manually coded or SpecC-based examples
- Experimental Setup
  - SMP Host PC
    - 2 Intel Xeon X5650 CPUs at 2.66 GHz
    - 6 cores each, 2 hyper-threads per core
    - 24 parallel hardware threads available
  - Many Integrated Core (MIC) Platform
    - 1 Intel Xeon Phi Coprocessor 5110P
    - 60 cores on ring-bus, 4 hyper-threads per core
    - 240 parallel hardware threads available
- Highly parallel benchmarks
  - Floating-point multiplications (fmul)
  - Fibonacci calculation (fibo)

Benchmark Results
- Experimental Results (2 Intel Xeon X5650 CPUs, 2x6x2 cores)
[Figure: fibo and fmul elapsed time [sec] and relative speedup over the number of cores on the multi-core host; the plotted values are not recoverable from the transcript]

Benchmark Results
- Experimental Results (Intel Xeon Phi coprocessor, 60x4 cores)
[Figure: fibo and fmul elapsed time [sec] and relative speedup over the number of cores on the many-core host; one speedup label reads 103x, the other values are not recoverable from the transcript]

GPU Pipeline Example
- Graphics Application: Mandelbrot Set
  - Mathematical set of points
  - Two-dimensional fractal shape
- Complex computation
  - Recursive function
- Extreme parallelism
  - Pixel level
- TLM abstraction
  - Slices
  - Configurable
  - Executable
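
The pixel-level parallelism of the Mandelbrot example, partitioned into slices, can be sketched with plain C++ threads. This is an illustrative stand-in, not the SpecC or SystemC model from the talk: it uses the usual iterative escape-time loop rather than a recursive function, and the image mapping and all names are assumptions.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

// Escape-time iteration count for one point c, capped at max_depth.
int mandelbrot_depth(std::complex<double> c, int max_depth) {
    std::complex<double> z(0.0, 0.0);
    int n = 0;
    while (n < max_depth && std::norm(z) <= 4.0) {  // norm(z) is |z|^2
        z = z * z + c;
        ++n;
    }
    return n;
}

// Render a width x height image, split into horizontal slices with one
// thread per slice. Slices write disjoint rows, so no locking is needed.
std::vector<int> render(int width, int height, int max_depth, int slices) {
    assert(width > 0 && height > 0 && slices > 0);
    std::vector<int> image(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> workers;
    for (int s = 0; s < slices; ++s) {
        const int y0 = height * s / slices;
        const int y1 = height * (s + 1) / slices;
        workers.emplace_back([&image, width, height, max_depth, y0, y1] {
            for (int y = y0; y < y1; ++y) {
                for (int x = 0; x < width; ++x) {
                    // Map the pixel to the region [-2, 1] x [-1, 1].
                    const std::complex<double> c(-2.0 + 3.0 * x / width,
                                                 -1.0 + 2.0 * y / height);
                    image[static_cast<std::size_t>(y) * width + x] =
                        mandelbrot_depth(c, max_depth);
                }
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return image;
}
```

Because every pixel is independent, the result does not depend on the number of slices; only the elapsed time changes.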

GPU Pipeline Example
- Graphics Application: Mandelbrot Set
  - When synthesized, real-time rendering is no problem
  - When simulated, regular DES is very slow
  - Parallel DES can significantly speed up simulation!

GPU Pipeline Example
- Graphics Application: Mandelbrot Set
  - Parallel DES can significantly speed up simulation!
- Experimental Results
  - Sequence of 100 Mandelbrot images (640x448, depth 4096)
  - SpecC models with increasing number of parallel blocks
  - Hosts: Intel Core 2 Quad (4 cores) and Dual Xeon (12 cores)
[Figure: speedup over number of cores]
  - 4 Core Host, PDES: up to 3.7x speedup!
  - 2 CPU 6 Core Host, PDES: 5.9x speedup!
  - 2 CPU 6 Core Host, OoO PDES: 6.3x speedup!

GPU Pipeline Example
- Graphics Application: Mandelbrot Set
  - Parallel DES can significantly speed up simulation!
- Experimental Results
  - Sequence of 100 Mandelbrot images (640x448, depth 4096)
  - Simplified PDES model (POSIX based, manually created)
  - Many-Core Platform: Intel Xeon Phi (60 x 4 cores)
[Figure: speedup (0x to 50x) over number of cores]
  - 4 Core Host, PDES: up to 3.7x speedup!
  - 2 CPU 6 Core Host, PDES: 5.9x speedup!
  - 2 CPU 6 Core Host, OoO PDES: 6.3x speedup!
  - 60x4 Core Xeon Phi, POSIX PDES: up to 46x speedup!
- Scales well on many-core platforms!

Concluding Remarks
- ESL design needs fast and accurate simulation
  - Traditional DES and PDES are insufficient
- Out-of-order PDES
  - Novel, aggressive, fast
  - Maximum parallelism
  - Fully semantics compliant and accurate
  - Promise of near-linear speedup on highly parallel platforms
- Compiler and Simulator
  - Compiler with Out-of-Order PDES Analysis
  - Simulator with Out-of-Order Scheduler
- Ongoing and Future Work
  - Completion of implementation, further evaluation
  - Collaboration with Accellera LWG

References (1)
- [DATE 12] W. Chen, X. Han, R. Dömer: "Out-of-Order Parallel Simulation for ESL Design", Proceedings of DATE, Dresden, Germany, March 2012.
- [ASPDAC 12] R. Dömer, W. Chen, X. Han: "Parallel Discrete Event Simulation of Transaction Level Models", Proceedings of ASPDAC, Sydney, Australia, February 2012.
- [ASPDAC 12] W. Chen, R. Dömer: "An Optimizing Compiler for Out-of-Order Parallel ESL Simulation Exploiting Instance Isolation", Proceedings of ASPDAC, Sydney, Australia, February 2012.
- [IEEE D&T 11] W. Chen, X. Han, R. Dömer: "Multicore Simulation of Transaction-Level Models Using the SoC Environment", IEEE Design & Test of Computers, vol. 28, no. 3, May-June 2011.
- [ASPDAC 11] R. Dömer, W. Chen, X. Han, A. Gerstlauer: "Multi-Core Parallel Simulation of System-Level Description Languages", Proceedings of ASPDAC, Yokohama, Japan, January 2011.
- [HLDVT 10] W. Chen, X. Han, R. Dömer: "ESL Design and Multi-Core Validation using the System-on-Chip Environment", Proceedings of HLDVT, Anaheim, California, June 2010.

References (2)
- [DATE 14] W. Chen, X. Han, R. Dömer: "May-Happen-in-Parallel Analysis based on Segment Graphs for Safe ESL Models", accepted for publication at DATE, Dresden, Germany, March 2014. Best Paper Award!
- [DATE 13] W. Chen, R. Dömer: "Optimized Out-of-Order Parallel Discrete Event Simulation Using Predictions", Proceedings of DATE, Grenoble, France, March 2013.
- [IEEE D&T 13] W. Chen, X. Han, C. Chang, R. Dömer: "Advances in Parallel Discrete Event Simulation for Electronic System-Level Design", IEEE Design & Test of Computers, vol. 30, no. 1, Jan.-Feb. 2013.
- [HLDVT 12] W. Chen, C. Chang, X. Han, R. Dömer: "Eliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure", Proceedings of HLDVT 2012, Huntington Beach, California, November 2012.
Advances in Parallel Discrete Event Simulation EECS Colloquium, May 9, Advances in Parallel Discrete Event Simulation For Embedded System Design
Advances in Parallel Discrete Event Simulation For Embedded System Design Rainer Dömer doemer@uci.edu With contributions by Weiwei Chen and Xu Han Center for Embedded Computer Systems University of California,
More informationParallel Discrete Event Simulation of Transaction Level Models
Parallel Discrete Event Simulation of Transaction Level Models Rainer Dömer, Weiwei Chen, Xu Han Center for Embedded Computer Systems University of California, Irvine, USA doemer@uci.edu, weiwei.chen@uci.edu,
More informationComputer-Aided Recoding for Multi-Core Systems
Computer-Aided Recoding for Multi-Core Systems Rainer Dömer doemer@uci.edu With contributions by P. Chandraiah Center for Embedded Computer Systems University of California, Irvine Outline Embedded System
More informationA Parallel Transaction-Level Model of H.264 Video Decoder
Center for Embedded Computer Systems University of California, Irvine A Parallel Transaction-Level Model of H.264 Video Decoder Xu Han, Weiwei Chen and Rainer Doemer Technical Report CECS-11-03 June 2,
More informationSystemC Coding Guideline for Faster Out-of-order Parallel Discrete Event Simulation
SystemC Coding Guideline for Faster Out-of-order Parallel Discrete Event Simulation Zhongqi Cheng, Tim Schmidt, Rainer Dömer Center for Embedded and Cyber-Physical Systems University of California, Irvine,
More informationEfficient Modeling of Embedded Systems using Designer-controlled Recoding. Rainer Dömer. With contributions by Pramod Chandraiah
Efficient Modeling of Embedded Systems using Rainer Dömer With contributions by Pramod Chandraiah Center for Embedded Computer Systems University of California, Irvine Outline Introduction Designer-controlled
More informationHybrid Analysis of SystemC Models for Fast and Accurate Parallel Simulation
Hybrid Analysis of SystemC Models for Fast and Accurate Parallel Simulation Tim Schmidt, Guantao Liu, and Rainer Dömer Center for Embedded and Cyber-physical Systems University of California, Irvine, USA
More informationEliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure
1 Eliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure Weiwei Chen, Che-Wei Chang, Xu Han, Rainer Dömer Center for Embedded Computer Systems University of California,
More informationSystem-On-Chip Architecture Modeling Style Guide
Center for Embedded Computer Systems University of California, Irvine System-On-Chip Architecture Modeling Style Guide Junyu Peng Andreas Gerstlauer Rainer Dömer Daniel D. Gajski Technical Report CECS-TR-04-22
More informationRISC Compiler and Simulator, Release V0.5.0: Out-of-Order Parallel Simulatable SystemC Subset
Center for Embedded and Cyber-physical Systems University of California, Irvine RISC Compiler and Simulator, Release V0.5.0: Out-of-Order Parallel Simulatable SystemC Subset Guantao Liu, Tim Schmidt, Zhongqi
More informationMay-Happen-in-Parallel Analysis based on Segment Graphs for Safe ESL Models
May-Happen-in-Parallel Analysis based on Segment Graphs for Safe ESL Models Weiwei Chen, Xu Han, Rainer Dömer Center for Embedded Computer Systems University of California, Irvine, USA weiweic@uci.edu,
More informationRISC Compiler and Simulator, Release V0.4.0: Out-of-Order Parallel Simulatable SystemC Subset
Center for Embedded and Cyber-physical Systems University of California, Irvine RISC Compiler and Simulator, Release V0.4.0: Out-of-Order Parallel Simulatable SystemC Subset Guantao Liu, Tim Schmidt, Zhongqi
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationMulticore Simulation of Transaction-Level Models Using the SoC Environment
Transaction-Level Validation of Multicore Architectures Multicore Simulation of Transaction-Level Models Using the SoC Environment Weiwei Chen, Xu Han, and Rainer Dömer University of California, Irvine
More informationEE382N.23: Embedded System Design and Modeling
EE382N.23: Embedded System Design and Modeling Lecture 3 Language Semantics Andreas Gerstlauer Electrical and Computer Engineering University of Texas at Austin gerstl@ece.utexas.edu Lecture 3: Outline
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationCreating Explicit Communication in SoC Models Using Interactive Re-Coding
Creating Explicit Communication in SoC Models Using Interactive Re-Coding Pramod Chandraiah, Junyu Peng, Rainer Dömer Center for Embedded Computer Systems University of California, Irvine California, USA
More informationTHE increasing complexity of embedded systems poses
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 33, NO. 12, DECEMBER 2014 1859 Out-of-Order Parallel Discrete Event Simulation for Transaction Level Models Weiwei Chen,
More informationNative Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization
Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Mian-Muhammad Hamayun, Frédéric Pétrot and Nicolas Fournel System Level Synthesis
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationParallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor
Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel
More informationCS377P Programming for Performance Multicore Performance Multithreading
CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX
More informationFractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures
Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin
More informationSpecC Methodology for High-Level Modeling
EDP 2002 9 th IEEE/DATC Electronic Design Processes Workshop SpecC Methodology for High-Level Modeling Rainer Dömer Daniel D. Gajski Andreas Gerstlauer Center for Embedded Computer Systems Universitiy
More informationCycle-accurate RTL Modeling with Multi-Cycled and Pipelined Components
Cycle-accurate RTL Modeling with Multi-Cycled and Pipelined Components Rainer Dömer, Andreas Gerstlauer, Dongwan Shin Technical Report CECS-04-19 July 22, 2004 Center for Embedded Computer Systems University
More informationQuantitative Analysis of Transaction Level Models for the AMBA Bus
Quantitative Analysis of Transaction Level Models for the AMBA Bus Gunar Schirner and Rainer Dömer Center for Embedded Computer Systems University of California, Irvine Motivation Higher productivity is
More informationhigh performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationDIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING
1 DSP applications DSP platforms The synthesis problem Models of computation OUTLINE 2 DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: Time-discrete representation
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationOffload acceleration of scientific calculations within.net assemblies
Offload acceleration of scientific calculations within.net assemblies Lebedev A. 1, Khachumov V. 2 1 Rybinsk State Aviation Technical University, Rybinsk, Russia 2 Institute for Systems Analysis of Russian
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationConcurrent Programing: Why you should care, deeply
Concurrent Programing: Why you should care, deeply 1 Student Questions 1. it is said that user-level threads are implemented by a library at the user-level. we have POSIX for starting user threads in C++.
More informationRISC Compiler and Simulator, Alpha Release V0.2.1: Out-of-Order Parallel Simulatable SystemC Subset
Center for Embedded and Cyber-physical Systems University of California, Irvine RISC Compiler and Simulator, Alpha Release V0.2.1: Out-of-Order Parallel Simulatable SystemC Subset Guantao Liu, Tim Schmidt,
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationA Parallelizing Compiler for Multicore Systems
A Parallelizing Compiler for Multicore Systems José M. Andión, Manuel Arenaz, Gabriel Rodríguez and Juan Touriño 17th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2014)
More informationGeneral introduction: GPUs and the realm of parallel architectures
General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationAN INTERACTIVE MODEL RE-CODER FOR EFFICIENT SOC SPECIFICATION
AN INTERACTIVE MODEL RE-CODER FOR EFFICIENT SOC SPECIFICATION Center for Embedded Computer Systems University of California Irvine pramodc@uci.edu, doemer@uci.edu Abstract To overcome the complexity in
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationParallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming
Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationGOP Level Parallelism on H.264 Video Encoder for Multicore Architecture
2011 International Conference on Circuits, System and Simulation IPCSIT vol.7 (2011) (2011) IACSIT Press, Singapore GOP Level on H.264 Video Encoder for Multicore Architecture S.Sankaraiah 1 2, H.S.Lam,
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationNative Offload of Haskell Repa Programs to Integrated GPUs
Native Offload of Haskell Repa Programs to Integrated GPUs Hai (Paul) Liu with Laurence Day, Neal Glew, Todd Anderson, Rajkishore Barik Intel Labs. September 28, 2016 General purpose computing on integrated
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationQuantitative Analysis of Transaction Level Models for the AMBA Bus
Quantitative Analysis of Transaction Level Models for the AMBA Bus Gunar Schirner, Rainer Dömer Center of Embedded Computer Systems University of California, Irvine hschirne@uci.edu, doemer@uci.edu Abstract
More informationOn Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy
On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,
More informationCycle accurate transaction-driven simulation with multiple processor simulators