Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.)


Out-of-Order Parallel Simulation of SystemC Models using Intel MIC Architecture
G. Liu, T. Schmidt, R. Dömer (CECS)
A. Dingankar, D. Kirkpatrick (Intel Corp.)
Speaker: Rainer Dömer, doemer@uci.edu
Center for Embedded Computer Systems, University of California, Irvine

Outline
- Electronic System Level Design
  - Project context, goals, and overview
- Simulation
  - Traditional Discrete Event Simulation (DES)
  - Parallel Discrete Event Simulation (PDES)
  - Out-of-Order Parallel Discrete Event Simulation (OoO PDES)
- Project Realization
  - Ongoing Research and Development
- Promising Experimental Results
  - Benchmarks
  - Highly parallel applications
- Concluding Remarks

Out-of-Order Simulation (c) 2014 R. Doemer, et al.

Electronic System Level (ESL)
- Electronic System Level Models
  - Abstract description of a complete system
  - Hardware + Software
- Key Concepts in System Modeling
  - Explicit Structure
    - Block diagram structure
    - Connectivity through ports
  - Explicit Hierarchy
    - System composed of components
  - Explicit Concurrency
    - Potential for parallel execution
    - Potential for pipelined execution
  - Explicit Communication and Computation
    - Modules
    - Channels and Interfaces
[Figure: block diagram of a system composed of blocks B0, B1, B2, B3]

ESL Simulation
- Evaluation through Simulation
  - Efficient system-level simulation is critical: fast and accurate!
  - Complexity of system models grows constantly: need for speed!
- Parallel Simulation!
  - Parallelism is explicitly specified in the model (SystemC: SC_THREAD, SC_METHOD)
  - Parallel processing is available in standard PCs
    - Multi-core hosts readily available
    - Many-core technology is arriving
- Target Simulation Platform
  - Intel Many Integrated Core (MIC) Architecture

Project Overview: Planned Design Flow
[Figure: design flow diagram, shown in two builds. Recoverable labels:]
- Legend
  - CoFluent Studio: modeling input
  - ROSE-based Recoding Compiler: OoO PDES technology
  - Intel MIC Architecture: target platform
- Flow elements
  - Application
  - CoFluent Studio (Specification, Modeling), C++, Meta Component Models
  - ROSE-based Recoding Compiler (Static Analysis, Optimization), C++
  - C++ Compiler, Executable, PC Simulation
  - Compiler (ICC), Executable, Xeon Phi Platform Simulation
  - Synthesis Tools, Design Implementation
- Labels added in the second build
  - CoFluent Studio Modeling Suite
  - Compiler with OoO Analysis
  - OoO Xeon Phi Simulator Platform

with CoFluent Studio
- Modeling and Simulation Tool Suite
  - Supports model-driven architecture (MDA)
  - Based on Eclipse modeling framework (EMF)
- CoFluent Modeling Concept ensures a well-defined model as input

ESL Simulation
- Traditional Discrete Event Simulation (DES)
  - Reference simulators run sequentially, only one thread at a time (cooperative multi-threading model)
  - Cannot utilize the capabilities of multi- or many-core hosts
- Parallel Discrete Event Simulation (PDES)
  - Threads run in parallel (if at the same delta cycle and time)
  - Simulation cycles are absolute barriers!
- Out-of-Order Parallel DE Simulation (OoO PDES)
  - Best technique known today, developed by CECS [DATE 12]
  - Threads run in parallel and out-of-order, even in different delta and time cycles, if there are no conflicts!
  - Aggressive: runs the maximum number of threads in parallel, but fully preserves DES semantics and model accuracy!

Discrete Event Simulation (DES)
- Traditional DES
  - Concurrent threads of execution
  - Managed by a central scheduler
  - Driven by events and time advances
    - Delta-cycle
    - Time-cycle
  - Partial temporal order with barriers
- Reference Simulator
  - The reference simulator uses cooperative multi-threading
  - A single thread is active at any time!
  - Cannot exploit parallelism
  - Cannot utilize multiple cores
[Figure: threads th1 to th4 scheduled over time:delta (T:Δ) steps 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]

Parallel Discrete Event Simulation
- Parallel DES
  - Threads execute in parallel iff they are in the same delta cycle and in the same time cycle
  - Significant speedup!
  - Synchronous PDES: cycle boundaries are absolute barriers!
- Aggressive Parallel DES
  - Conservative Approaches: careful static analysis prevents conflicts
  - Optimistic Approaches: conflicts are detected and addressed (roll back)
[Figure: threads th1 to th4 scheduled over time:delta (T:Δ) steps 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]
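The synchronous PDES rule above (threads run in parallel only within the same time and delta cycle) can be sketched as a batch selection over ready threads. This is a minimal illustration, not the authors' scheduler; `Timestamp`, `Thread`, and `next_parallel_batch` are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Simulation timestamp: time cycle T plus delta cycle within that time (T:Δ).
struct Timestamp {
    std::uint64_t time;
    std::uint32_t delta;
};

inline bool operator<(const Timestamp& a, const Timestamp& b) {
    return std::tie(a.time, a.delta) < std::tie(b.time, b.delta);
}

inline bool operator==(const Timestamp& a, const Timestamp& b) {
    return a.time == b.time && a.delta == b.delta;
}

// Hypothetical ready-thread record: an id and the cycle in which it wakes up.
struct Thread {
    int id;
    Timestamp wake;
};

// Synchronous PDES: only the threads at the earliest (T, Δ) form the next
// parallel batch. Every later cycle stays behind the barrier until the
// whole batch has finished.
std::vector<int> next_parallel_batch(const std::vector<Thread>& threads) {
    std::vector<int> batch;
    if (threads.empty()) return batch;
    const Timestamp earliest =
        std::min_element(threads.begin(), threads.end(),
                         [](const Thread& a, const Thread& b) {
                             return a.wake < b.wake;
                         })->wake;
    for (const Thread& t : threads) {
        if (t.wake == earliest) batch.push_back(t.id);
    }
    return batch;
}
```

Traditional DES would pick a single thread from this batch and run it alone; synchronous PDES hands the whole batch to the available cores.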

Out-of-Order Parallel DES
- Out-of-Order PDES
  - Threads execute in parallel iff they are in the same delta cycle and in the same time cycle, OR if there are no conflicts!
  - Can utilize an advanced compiler for static data conflict analysis
  - Allows as many threads in parallel as possible
  - Significantly higher speedup!
  - Results at [DATE 12], [ASPDAC 12]
- Fully preserves DES execution semantics
  - Accuracy in results and timing
[Figure: threads th1 to th4 scheduled over time:delta (T:Δ) steps 0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0]

Synchronous vs. Out-of-Order PDES
Simple example: video and audio decoding with different frame rates
[Figure: input stream -> Stimulus -> DUT (H.264 decoder, MP3 decoder) -> Monitor (H.264 Monitor at 30 fps, MP3 Monitor at 38.28 fps)]

    SC_MODULE(H264dec)
    {
        sc_port<read_if>  r;
        sc_port<write_if> w;

        void main() {
            while (1) {
                r->read(input_data);
                decode_h264_frame();
                wait(33.3, SC_MS);   // one video frame period (30 fps)
                w->write(out_data);
            }
        }
    };

    SC_MODULE(MP3dec)
    {
        sc_port<read_if>  r;
        sc_port<write_if> w;

        void main() {
            while (1) {
                r->read(input_data);
                decode_mp3_frame();
                wait(26.12, SC_MS);  // one audio frame period (38.28 fps)
                w->write(out_data);
            }
        }
    };
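The out-of-order issue condition (run ahead only if there are no conflicts) can be sketched as a lookup in a precomputed conflict table. This is a simplified illustration that assumes conflicts have already been reduced to one boolean per pair of code segments; the scheduler described in [DATE 12] is not reproduced here, and `LocalTime`, `SimThread`, and `may_run_out_of_order` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Thread-local simulation time (time cycle and delta cycle).
struct LocalTime {
    std::uint64_t time;
    std::uint32_t delta;
};

inline bool earlier(const LocalTime& a, const LocalTime& b) {
    return std::tie(a.time, a.delta) < std::tie(b.time, b.delta);
}

// Hypothetical thread record: the segment it will execute next and its local time.
struct SimThread {
    std::size_t segment;
    LocalTime now;
};

// conflict[i][j] == true means segments i and j may conflict
// (assumed to come from static analysis at compile time).
using ConflictTable = std::vector<std::vector<bool>>;

// A candidate thread may be issued ahead of time only if no thread with an
// earlier local time is about to run a segment that conflicts with it.
bool may_run_out_of_order(std::size_t candidate,
                          const std::vector<SimThread>& threads,
                          const ConflictTable& conflict) {
    const SimThread& c = threads[candidate];
    for (std::size_t i = 0; i < threads.size(); ++i) {
        if (i == candidate) continue;
        const SimThread& other = threads[i];
        if (earlier(other.now, c.now) && conflict[other.segment][c.segment]) {
            return false;  // must wait for the earlier, conflicting thread
        }
    }
    return true;
}
```

Because the check is a plain table lookup, the expensive analysis stays in the compiler and the run-time decision remains cheap.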

Synchronous vs. Out-of-Order PDES
Simple example: video and audio decoding with different frame rates
- Synchronous PDES
  - Observes time and delta cycles
  - Global time
- Out-of-Order PDES
  - Breaks the cycle barrier
  - Local times (per thread)
[Figure: Stimulus -> DUT (H.264 decoder, MP3 decoder) -> Monitor (H.264 Monitor at 30 fps, MP3 Monitor at 38.28 fps), with timelines in ms for PDES and for OoO PDES]

Many-Core Target Platform
- Intel Many Integrated Core Architecture
  - Intel Xeon Phi Coprocessor
  - Provides 60 processor cores
  - 4 hyper-threads per core
  - 240 parallel hardware threads!
- Hardware Features
  - Vector processing unit (VPU)
  - Extended Math Unit (EMU) for transcendental operations
  - Bidirectional ring interconnect
  - Peak performance over 1 teraflops (double-precision)
- Uses familiar and standard programming models
  - Appears as a regular Linux machine with 240 cores!
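To see why global time constrains the video/audio example, the sketch below lists the wake-up times of both decoders and merges them into the single global timeline a synchronous simulator would step through. Times are kept in integer microseconds to avoid floating-point comparisons; `wake_times` and `global_time_cycles` are hypothetical helper names, not part of any simulator API.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Wake-up times (in microseconds) of a thread that waits `period_us` in each
// loop iteration, up to and including `horizon_us`. Requires period_us > 0.
std::vector<std::uint64_t> wake_times(std::uint64_t period_us,
                                      std::uint64_t horizon_us) {
    std::vector<std::uint64_t> times;
    for (std::uint64_t now = period_us; now <= horizon_us; now += period_us) {
        times.push_back(now);
    }
    return times;
}

// Synchronous PDES keeps one global time, so every wake-up time of either
// thread becomes a barrier for both. The global timeline is the sorted union.
std::vector<std::uint64_t> global_time_cycles(const std::vector<std::uint64_t>& a,
                                              const std::vector<std::uint64_t>& b) {
    std::vector<std::uint64_t> merged;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(merged));
    return merged;
}
```

With periods of 33300 µs (video) and 26120 µs (audio), the first 100 ms contain six global time cycles, while each decoder only needs three of them. With per-thread local times, each decoder advances along its own three-entry timeline.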

Project Realization
Ongoing Research and Development
1. Compiler with Out-of-Order PDES Analysis
   - Frontend for ROSE (lexer, parser, internal representation)
   - Segment Graph data structure for advanced conflict analysis
   - Code generator for parallel execution
2. Simulator with Out-of-Order Scheduler
   - Scheduler with fast conflict table lookup
   - Target platform: Intel MIC architecture
     - Optimal thread-to-core task mapping
   - Kernel extension
     - Protected communication
     - Mutually-exclusive access to shared resources

Project Realization
- Compiler
  - Build abstract syntax tree
  - Build internal representation
  - Build segment graph
  - Build variable access lists
  - Identify potential conflicts
  - Build segment tables
  - Instrument wait() calls
  - Protect user-defined channels
  - Generate parallel C++ model
- Library
  - POSIX multi-threading
  - Reentrant primitives
  - Protected central resources
  - Protected standard channels
  - Out-of-order parallel scheduler
[Figure: tool flow with Library, Compiler, C++, C++ Compiler, Executable, and Multi-Core Host PC Simulation]
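The compiler steps "build variable access lists", "identify potential conflicts", and "build segment tables" can be illustrated with a small data-conflict check. This sketch assumes each segment is summarized by the names of the variables it reads and writes; the real segment graph analysis is more involved, and `Segment`, `data_conflict`, and `build_conflict_table` are hypothetical names.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Variable access list of one code segment (the code between two wait() calls).
struct Segment {
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True if the two sets share at least one variable name.
static bool intersects(const std::set<std::string>& a,
                       const std::set<std::string>& b) {
    for (const std::string& name : a) {
        if (b.count(name) != 0) return true;
    }
    return false;
}

// Two segments have a potential data conflict if they touch a common variable
// and at least one of the accesses is a write.
bool data_conflict(const Segment& a, const Segment& b) {
    return intersects(a.writes, b.writes) ||
           intersects(a.writes, b.reads) ||
           intersects(a.reads, b.writes);
}

// Precompute the table once so the scheduler only needs a lookup at run time.
// Note that table[i][i] is true for any segment that writes a variable, since
// two threads executing the same segment would both write it.
std::vector<std::vector<bool>> build_conflict_table(const std::vector<Segment>& segs) {
    const std::size_t n = segs.size();
    std::vector<std::vector<bool>> table(n, std::vector<bool>(n, false));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            table[i][j] = data_conflict(segs[i], segs[j]);
        }
    }
    return table;
}
```

Read-only sharing never produces a conflict here, which is why two decoders reading the same input can still run side by side.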

Promising Experimental Results
- What speedup is achievable on today's multi-core and many-core host platforms?
  - Early results using manually coded or SpecC-based examples
- Experimental Setup
  - SMP Host PC
    - 2 Intel Xeon X5650 CPUs at 2.66 GHz
    - 6 cores each, 2 hyper-threads per core
    - 24 parallel hardware threads available
  - Many Integrated Core (MIC) Platform
    - 1 Intel Xeon Phi Coprocessor 5110P at GHz
    - 60 cores on ring-bus, 4 hyper-threads per core
    - 240 parallel hardware threads available
- Highly parallel benchmarks
  - Floating-point multiplications (fmul)
  - Fibonacci calculation (fibo)

Benchmark Results
Experimental Results (2 Intel Xeon X5650 CPUs, 2x6x2 cores)
[Chart: elapsed time in seconds and relative speedup of fibo and fmul versus number of cores on the multi-core host; the speedup values were lost in extraction]

Benchmark Results
Experimental Results (Intel Xeon Phi coprocessor, 60x4 cores)
[Chart: elapsed time in seconds and relative speedup of fibo and fmul versus number of cores on the many-core host; one speedup label reads 103x, the other value was lost in extraction]

GPU Pipeline Example
Graphics Application: Mandelbrot Set
- Mathematical set of points
- Two-dimensional fractal shape
- Complex computation
- Recursive function
- Extreme parallelism
- Pixel level
- TLM abstraction
- Slices
- Configurable
- Executable
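The pixel-level parallelism of the Mandelbrot example can be sketched by cutting the image into row slices that never write the same pixel. This is an illustrative stand-alone kernel, not the TLM model from the slides; the coordinate window and the names `mandelbrot_iterations`, `Slice`, `make_slices`, and `render_slice` are assumptions made for this example.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Escape-time count for one pixel: iterate z = z*z + c until |z| > 2
// or the maximum depth is reached.
int mandelbrot_iterations(std::complex<double> c, int max_depth) {
    std::complex<double> z(0.0, 0.0);
    int n = 0;
    while (n < max_depth && std::norm(z) <= 4.0) {
        z = z * z + c;
        ++n;
    }
    return n;
}

// Half-open range of image rows [first_row, last_row) handled by one slice.
struct Slice {
    std::size_t first_row;
    std::size_t last_row;
};

// Split `height` rows into `count` contiguous slices of nearly equal size.
std::vector<Slice> make_slices(std::size_t height, std::size_t count) {
    std::vector<Slice> slices;
    for (std::size_t i = 0; i < count; ++i) {
        slices.push_back({height * i / count, height * (i + 1) / count});
    }
    return slices;
}

// Render one slice into the shared image buffer (row-major, width * height).
// Slices write disjoint rows, so they could run concurrently without locking.
void render_slice(const Slice& s, std::size_t width, std::size_t height,
                  int max_depth, std::vector<int>& image) {
    for (std::size_t row = s.first_row; row < s.last_row; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            // Assumed view window: real axis [-2, 1), imaginary axis [-1, 1).
            const std::complex<double> c(-2.0 + 3.0 * col / width,
                                         -1.0 + 2.0 * row / height);
            image[row * width + col] = mandelbrot_iterations(c, max_depth);
        }
    }
}
```

Calling `render_slice` once per entry of `make_slices(height, n)` fills the same image as a single full-height slice; in a parallel model each call would belong to its own thread or module.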

GPU Pipeline Example
Graphics Application: Mandelbrot Set
- When synthesized, real-time rendering is no problem
- When simulated, regular DES is very slow
- Parallel DES can significantly speed up simulation!

GPU Pipeline Example
Graphics Application: Mandelbrot Set
- Parallel DES can significantly speed up simulation!
- Experimental Results
  - Sequence of 100 Mandelbrot images (640x448, depth 4096)
  - SpecC models with increasing number of parallel blocks
  - Hosts: Intel Core 2 Quad (4 cores), and Dual Xeon (12 cores)
[Chart: speedup versus number of cores]
  - 4 Core Host, PDES: up to 3.7x speedup!
  - 2 CPU 6 Core Host, PDES: 5.9x speedup!
  - 2 CPU 6 Core Host, OoO PDES: 6.3x speedup!

GPU Pipeline Example
Graphics Application: Mandelbrot Set
- Parallel DES can significantly speed up simulation!
- Experimental Results
  - Sequence of 100 Mandelbrot images (640x448, depth 4096)
  - Simplified PDES model (POSIX based, manually created)
  - Many-Core Platform: Intel Xeon Phi (60 x 4 cores)
- Scales well on many-core platforms!
[Chart: speedup (0x to 50x) versus number of cores]
  - 4 Core Host, PDES: up to 3.7x speedup!
  - 2 CPU 6 Core Host, PDES: 5.9x speedup!
  - 2 CPU 6 Core Host, OoO PDES: 6.3x speedup!
  - 60x4 Core Xeon Phi, POSIX PDES: up to 46x speedup!

Concluding Remarks
- ESL design needs fast and accurate simulation
  - Traditional DES and PDES are insufficient
- Out-of-order PDES
  - Novel, aggressive, fast
  - Maximum parallelism
  - Fully semantics compliant and accurate
  - Promise of near-linear speedup on highly parallel platforms
- Compiler and Simulator
  - Compiler with Out-of-Order PDES Analysis
  - Simulator with Out-of-Order Scheduler
- Ongoing and Future Work
  - Completion of implementation, further evaluation
  - Collaboration with Accellera LWG

References (1)
[DATE 12] W. Chen, X. Han, R. Dömer: "Out-of-Order Parallel Simulation for ESL Design", Proceedings of DATE, Dresden, Germany, March 2012.
[ASPDAC 12] R. Dömer, W. Chen, X. Han: "Parallel Discrete Event Simulation of Transaction Level Models", Proceedings of ASPDAC, Sydney, Australia, February 2012.
[ASPDAC 12] W. Chen, R. Dömer: "An Optimizing Compiler for Out-of-Order Parallel ESL Simulation Exploiting Instance Isolation", Proceedings of ASPDAC, Sydney, Australia, February 2012.
[IEEE D&T 11] W. Chen, X. Han, R. Dömer: "Multicore Simulation of Transaction-Level Models Using the SoC Environment", IEEE Design & Test of Computers, vol. 28, no. 3, May-June 2011.
[ASPDAC 11] R. Dömer, W. Chen, X. Han, A. Gerstlauer: "Multi-Core Parallel Simulation of System-Level Description Languages", Proceedings of ASPDAC, Yokohama, Japan, January 2011.
[HLDVT 10] W. Chen, X. Han, R. Dömer: "ESL Design and Multi-Core Validation using the System-on-Chip Environment", Proceedings of HLDVT, Anaheim, California, June 2010.

References (2)
[DATE 14] W. Chen, X. Han, R. Dömer: "May-Happen-in-Parallel Analysis based on Segment Graphs for Safe ESL Models", accepted for publication at DATE, Dresden, Germany, March 2014. Best Paper Award!
[DATE 13] W. Chen, R. Dömer: "Optimized Out-of-Order Parallel Discrete Event Simulation Using Predictions", Proceedings of DATE, Grenoble, France, March 2013.
[IEEE D&T 13] W. Chen, X. Han, C. Chang, R. Dömer: "Advances in Parallel Discrete Event Simulation for Electronic System-Level Design", IEEE Design & Test of Computers, vol. 30, no. 1, Jan.-Feb. 2013.
[HLDVT 12] W. Chen, C. Chang, X. Han, R. Dömer: "Eliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure", Proceedings of HLDVT 2012, Huntington Beach, California, November 2012.

Advances in Parallel Discrete Event Simulation EECS Colloquium, May 9, Advances in Parallel Discrete Event Simulation For Embedded System Design

Advances in Parallel Discrete Event Simulation EECS Colloquium, May 9, Advances in Parallel Discrete Event Simulation For Embedded System Design Advances in Parallel Discrete Event Simulation For Embedded System Design Rainer Dömer doemer@uci.edu With contributions by Weiwei Chen and Xu Han Center for Embedded Computer Systems University of California,

More information

Parallel Discrete Event Simulation of Transaction Level Models

Parallel Discrete Event Simulation of Transaction Level Models Parallel Discrete Event Simulation of Transaction Level Models Rainer Dömer, Weiwei Chen, Xu Han Center for Embedded Computer Systems University of California, Irvine, USA doemer@uci.edu, weiwei.chen@uci.edu,

More information

Computer-Aided Recoding for Multi-Core Systems

Computer-Aided Recoding for Multi-Core Systems Computer-Aided Recoding for Multi-Core Systems Rainer Dömer doemer@uci.edu With contributions by P. Chandraiah Center for Embedded Computer Systems University of California, Irvine Outline Embedded System

More information

A Parallel Transaction-Level Model of H.264 Video Decoder

A Parallel Transaction-Level Model of H.264 Video Decoder Center for Embedded Computer Systems University of California, Irvine A Parallel Transaction-Level Model of H.264 Video Decoder Xu Han, Weiwei Chen and Rainer Doemer Technical Report CECS-11-03 June 2,

More information

SystemC Coding Guideline for Faster Out-of-order Parallel Discrete Event Simulation

SystemC Coding Guideline for Faster Out-of-order Parallel Discrete Event Simulation SystemC Coding Guideline for Faster Out-of-order Parallel Discrete Event Simulation Zhongqi Cheng, Tim Schmidt, Rainer Dömer Center for Embedded and Cyber-Physical Systems University of California, Irvine,

More information

Efficient Modeling of Embedded Systems using Designer-controlled Recoding. Rainer Dömer. With contributions by Pramod Chandraiah

Efficient Modeling of Embedded Systems using Designer-controlled Recoding. Rainer Dömer. With contributions by Pramod Chandraiah Efficient Modeling of Embedded Systems using Rainer Dömer With contributions by Pramod Chandraiah Center for Embedded Computer Systems University of California, Irvine Outline Introduction Designer-controlled

More information

Hybrid Analysis of SystemC Models for Fast and Accurate Parallel Simulation

Hybrid Analysis of SystemC Models for Fast and Accurate Parallel Simulation Hybrid Analysis of SystemC Models for Fast and Accurate Parallel Simulation Tim Schmidt, Guantao Liu, and Rainer Dömer Center for Embedded and Cyber-physical Systems University of California, Irvine, USA

More information

Eliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure

Eliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure 1 Eliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure Weiwei Chen, Che-Wei Chang, Xu Han, Rainer Dömer Center for Embedded Computer Systems University of California,

More information

System-On-Chip Architecture Modeling Style Guide

System-On-Chip Architecture Modeling Style Guide Center for Embedded Computer Systems University of California, Irvine System-On-Chip Architecture Modeling Style Guide Junyu Peng Andreas Gerstlauer Rainer Dömer Daniel D. Gajski Technical Report CECS-TR-04-22

More information

RISC Compiler and Simulator, Release V0.5.0: Out-of-Order Parallel Simulatable SystemC Subset

RISC Compiler and Simulator, Release V0.5.0: Out-of-Order Parallel Simulatable SystemC Subset Center for Embedded and Cyber-physical Systems University of California, Irvine RISC Compiler and Simulator, Release V0.5.0: Out-of-Order Parallel Simulatable SystemC Subset Guantao Liu, Tim Schmidt, Zhongqi

More information

May-Happen-in-Parallel Analysis based on Segment Graphs for Safe ESL Models

May-Happen-in-Parallel Analysis based on Segment Graphs for Safe ESL Models May-Happen-in-Parallel Analysis based on Segment Graphs for Safe ESL Models Weiwei Chen, Xu Han, Rainer Dömer Center for Embedded Computer Systems University of California, Irvine, USA weiweic@uci.edu,

More information

RISC Compiler and Simulator, Release V0.4.0: Out-of-Order Parallel Simulatable SystemC Subset

RISC Compiler and Simulator, Release V0.4.0: Out-of-Order Parallel Simulatable SystemC Subset Center for Embedded and Cyber-physical Systems University of California, Irvine RISC Compiler and Simulator, Release V0.4.0: Out-of-Order Parallel Simulatable SystemC Subset Guantao Liu, Tim Schmidt, Zhongqi

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

Multicore Simulation of Transaction-Level Models Using the SoC Environment

Multicore Simulation of Transaction-Level Models Using the SoC Environment Transaction-Level Validation of Multicore Architectures Multicore Simulation of Transaction-Level Models Using the SoC Environment Weiwei Chen, Xu Han, and Rainer Dömer University of California, Irvine

More information

EE382N.23: Embedded System Design and Modeling

EE382N.23: Embedded System Design and Modeling EE382N.23: Embedded System Design and Modeling Lecture 3 Language Semantics Andreas Gerstlauer Electrical and Computer Engineering University of Texas at Austin gerstl@ece.utexas.edu Lecture 3: Outline

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Creating Explicit Communication in SoC Models Using Interactive Re-Coding

Creating Explicit Communication in SoC Models Using Interactive Re-Coding Creating Explicit Communication in SoC Models Using Interactive Re-Coding Pramod Chandraiah, Junyu Peng, Rainer Dömer Center for Embedded Computer Systems University of California, Irvine California, USA

More information

THE increasing complexity of embedded systems poses

THE increasing complexity of embedded systems poses IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 33, NO. 12, DECEMBER 2014 1859 Out-of-Order Parallel Discrete Event Simulation for Transaction Level Models Weiwei Chen,

More information

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Mian-Muhammad Hamayun, Frédéric Pétrot and Nicolas Fournel System Level Synthesis

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

CS377P Programming for Performance Multicore Performance Multithreading

CS377P Programming for Performance Multicore Performance Multithreading CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

SpecC Methodology for High-Level Modeling

SpecC Methodology for High-Level Modeling EDP 2002 9 th IEEE/DATC Electronic Design Processes Workshop SpecC Methodology for High-Level Modeling Rainer Dömer Daniel D. Gajski Andreas Gerstlauer Center for Embedded Computer Systems Universitiy

More information

Cycle-accurate RTL Modeling with Multi-Cycled and Pipelined Components

Cycle-accurate RTL Modeling with Multi-Cycled and Pipelined Components Cycle-accurate RTL Modeling with Multi-Cycled and Pipelined Components Rainer Dömer, Andreas Gerstlauer, Dongwan Shin Technical Report CECS-04-19 July 22, 2004 Center for Embedded Computer Systems University

More information

Quantitative Analysis of Transaction Level Models for the AMBA Bus

Quantitative Analysis of Transaction Level Models for the AMBA Bus Quantitative Analysis of Transaction Level Models for the AMBA Bus Gunar Schirner and Rainer Dömer Center for Embedded Computer Systems University of California, Irvine Motivation Higher productivity is

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING

DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING 1 DSP applications DSP platforms The synthesis problem Models of computation OUTLINE 2 DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: Time-discrete representation

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Offload acceleration of scientific calculations within.net assemblies

Offload acceleration of scientific calculations within.net assemblies Offload acceleration of scientific calculations within.net assemblies Lebedev A. 1, Khachumov V. 2 1 Rybinsk State Aviation Technical University, Rybinsk, Russia 2 Institute for Systems Analysis of Russian

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

Concurrent Programing: Why you should care, deeply

Concurrent Programing: Why you should care, deeply Concurrent Programing: Why you should care, deeply 1 Student Questions 1. it is said that user-level threads are implemented by a library at the user-level. we have POSIX for starting user threads in C++.

More information

RISC Compiler and Simulator, Alpha Release V0.2.1: Out-of-Order Parallel Simulatable SystemC Subset

RISC Compiler and Simulator, Alpha Release V0.2.1: Out-of-Order Parallel Simulatable SystemC Subset Center for Embedded and Cyber-physical Systems University of California, Irvine RISC Compiler and Simulator, Alpha Release V0.2.1: Out-of-Order Parallel Simulatable SystemC Subset Guantao Liu, Tim Schmidt,

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

A Parallelizing Compiler for Multicore Systems

A Parallelizing Compiler for Multicore Systems A Parallelizing Compiler for Multicore Systems José M. Andión, Manuel Arenaz, Gabriel Rodríguez and Juan Touriño 17th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2014)

More information

General introduction: GPUs and the realm of parallel architectures

General introduction: GPUs and the realm of parallel architectures General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

AN INTERACTIVE MODEL RE-CODER FOR EFFICIENT SOC SPECIFICATION

AN INTERACTIVE MODEL RE-CODER FOR EFFICIENT SOC SPECIFICATION AN INTERACTIVE MODEL RE-CODER FOR EFFICIENT SOC SPECIFICATION Center for Embedded Computer Systems University of California Irvine pramodc@uci.edu, doemer@uci.edu Abstract To overcome the complexity in

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

GOP Level Parallelism on H.264 Video Encoder for Multicore Architecture

GOP Level Parallelism on H.264 Video Encoder for Multicore Architecture 2011 International Conference on Circuits, System and Simulation IPCSIT vol.7 (2011) (2011) IACSIT Press, Singapore GOP Level on H.264 Video Encoder for Multicore Architecture S.Sankaraiah 1 2, H.S.Lam,

More information
