Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling
|
|
- Delilah Melton
- 5 years ago
- Views:
Transcription
1 Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling Tobias Schwarzer 1, Joachim Falk 1, Michael Glaß 1, Jürgen Teich 1, Christian Zebelein 2, Christian Haubelt 2 1 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany 2 University of Rostock, Germany
2 Fine-granular Modeling with Data Flow r CPU1 r CPU2 a 7 Advantages: Exposes all degrees of available parallelism High flexibility of mapping actors to resources Disadvantages: Scheduling overhead at runtime May sacrifize the code-generation efficiency of the compiler Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
3 Coarse-granular Modeling c r CPU1 r CPU2 c a 7 Advantages: High code-generation efficiency Low scheduling overhead at runtime Disadvantages: May sacrifize parallelism significantly Reduces mapping flexibility Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
4 Proposed Solution Input: Fine-grained (data flow) modelling of applications Model of the target platform Given: a clustering technique [FKHTB08] to automatically combine fine-grained actors to composite actors using Quasi-Static Scheduling (QSS) Key Idea: Perform a Design Space Exploration (DSE) to find the best combinations of clusters and their mapping to available resources in terms of application throughput Contributions: A genetic representation for the exploration of clusters A repair strategy to efficiently restrict the search to feasible clusters only Synthesis-in-the-Loop: Automatic synthesis of each implementation candidate to the target platform to determine quality numbers, capturing the effects of (a) scheduling overhead, (b) memory accesses, and (c) compiler optimizations FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
5 Proposed System Compilation Flow a 7 a8 Automatic extraction of the dataflow graph Dataflow application Input Read in executable specification a 7 r CPU1 r CPU2 Explore design space Step 1: Determine clustering Step 2: Perform binding Architecture graph Synthesis-in-the-loop Run executable Obtain throughput DSE Evaluate implementation on target hardware r CPU3 r CPU4 Non-dominated design points Output Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
6 Fundamentals
7 Classic System Synthesis Problem graph g p = (A, C): Architecture graph g r = (R, L): r CPU1 r CPU2 r CPU3 r CPU4 a 7 dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
8 Classic System Synthesis Problem graph g p = (A, C): Architecture graph g r = (R, L): r CPU1 r CPU2 Binding β r CPU3 r CPU4 a 7 dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
9 Clustering-aware System Synthesis Clustering: Replace a connected subgraph containing only static actors of a DFG by a single composite actor cluster 1 a 7 dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
10 Clustering-aware System Synthesis Clustering: Replace a connected subgraph containing only static actors of a DFG by a single composite actor Each composite actor contains a Quasi-Static Schedule (QSS) executing sequences of statically scheduled actor firings i 1 QSS 1 o 1 QSS is represented by a Finite State Machine (FSM) o 2 #i 1 1 & #o 2 1 / < > Details in [FKHTB08] q 0 q 1 FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , dynamic actors #i 1 1 & #o 1 1 & #o 2 1 / <,a 7, > static actors FSM state Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
11 Clustering-aware System Synthesis cluster 2 Clustering: Replace a connected subgraph containing only static actors of a DFG by a single composite actor Each composite actor contains a Quasi-Static Schedule (QSS) executing sequences of statically scheduled actor firings cluster 1 QSS is represented by a Finite State Machine (FSM) a 7 Details in [FKHTB08] FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
12 Clustering-aware System Synthesis cluster 2 Clustering: Replace a connected subgraph containing only static actors of a DFG by a single composite actor Each composite actor contains a Quasi-Static Schedule (QSS) executing sequences of statically scheduled actor firings cluster 1 QSS is represented by a Finite State Machine (FSM) a 7 Details in [FKHTB08] FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
13 Clustering-aware System Synthesis cluster 2 r CPU1 r CPU2 Binding β cluster 1 r CPU3 r CPU4 a 7 dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
14 Design Space Exploration
15 Design Space Exploration - Opt4J Custom optimization problems are typically defined by providing three main classes: Creator: Generate random genotype objects Decoder: Convert the genotype into a phenotype Evaluator: Determine the quality of one phenotype Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
16 Design Space Exploration - Creator Generate random genotype objects: Associate each channel of the problem graph with a Boolean value using the clusterization function γ: γ = C S {,false} channel should be contained in a cluster false channel should not be contained in a cluster Exclude channels connected to a dynamic actor from genotype false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
17 Design Space Exploration - Decoder How to convert the genotype (/false edges) to a phenotype (clusters)? cluster 2? cluster 1 false a 7 a 7 dynamic actors static actors Conversion consists of two steps: 1. Determine clustering candidates 2. Repair clustering candidates to obtain a feasible cluster Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
18 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
19 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
20 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with stop? false stop? a 7 stop? Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
21 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
22 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with stop? false stop? a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
23 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
24 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 clustering candidate 1 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
25 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 clustering candidate 1 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
26 Step 1: Determine Clustering Candidate clustering candidate 2 Collect reachable actors starting at channel associated with false a 7 clustering candidate 1 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
27 Step 2: Repair Strategy false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
28 Step 2: Repair Strategy a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
29 Step 2: Repair Strategy a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
30 Step 2: Repair Strategy Clustering condition: For each pair of cluster input and output actor, there exists a directed path from the input actor to the output actor. a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
31 Step 2: Repair Strategy a 5 a 7 Clustering condition: For each pair of cluster input and output actor, there exists a directed path from the input actor to the output actor. Repair strategy: Transform an infeasible clustering candidate into (multiple) feasible clusters cluster candidate input actors {a2,a6} clustering candidate output actors {a3,a4,a7} Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
32 Step 2: Repair Strategy Repair clustering candidate to obtain feasible cluster: a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
33 Step 2: Repair Strategy Repair clustering candidate to obtain feasible cluster: a 7 cluster candidate input actors {a2,a6} clustering candidate output actors {a3,a4,a7} Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
34 Step 2: Repair Strategy I = { } O= {, } Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor a 7 cluster candidate input actors {a2,a6} clustering candidate output actors {a3,a4,a7} Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
35 Step 2: Repair Strategy I = {, } O= {, } I = { } O= {, } I = {, } O= {, } Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor I = { } O= {,,a 7 } I = { } O= {,,a 7 } a 7 I = { } O= {, } I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
36 Step 2: Repair Strategy EC 4 I = {, } O= {, } EC 2 I = { } O= {, } EC 1 I = {, } O= {, } I = { } O= {,,a 7 } I = { } O= {,,a 7 } Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor Create equivalence classes (ECs) a 7 I = { } O= {, } EC 3 I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
37 Step 2: Repair Strategy EC 4 I = {, } O= {, } EC 2 I = { } O= {, } EC 1 I = {, } O= {, } I = { } O= {,,a 7 } I = { } O= {,,a 7 } a 7 Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor Create equivalence classes (ECs) Determine clashes between each EC I = { } O= {, } EC 3 I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
38 Step 2: Repair Strategy EC 4 I = {, } O= {, } I = { } O= {, } EC 1 I = { } O= {,,a 7 } I = { } O= {,,a 7 } a 7 Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor Clash: Detect I = {a the 2, } situation when an output actor is not connected O= {,a to 4 } all input actors, since this may result in the infinite token accumulation problem. (Formal details given in the paper) EC 2 Create equivalence classes (ECs) Determine clashes between each EC I = { } O= {, } EC 3 I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
39 Step 2: Repair Strategy I = { } O= {, } EC 1 EC 4 I = {, } O= {, } EC 2 I = {, } O= {, } I = { } O= {,,a 7 } I = { } O= {,,a 7 } Clash exists between EC 2 and EC 4! Repair strategy splits the cluster candidate between and or between and. a 7 I = { } O= {, } EC 3 I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
40 Design Space Exploration - Evaluator Determine the quality of one phenotype: Each implementation candidate is evaluated via synthesis-in-the-loop on the target platform with respect to the design objectives: Throughput [iterations/s] (MAX) Number of used processor cores (MIN) a 7 r CPU1 r CPU2 Explore design space Step 1: Determine clustering Step 2: Perform binding Architecture graph Synthesis-in-the-loop Run executable Obtain throughput Evaluate implementation on target hardware r CPU3 r CPU4 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
41 Experimental Results
42 Experimental results Test suite: Six synthetic DFGs using the SDF 3 tool Brake-by-wire control application (BBW) Target platforms: Intel(R) Core(TM) i GHz Xilinx Zynq-7000 All Programmable SoC with a dual-core ARM Cortex- A9 667MHz Approaches: classic (DSE without clustering) post QSS (DSE with clustering as a post-processing step) Direct combination of [FKHTB08] and DSE Proposed holistic DSE flow FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
43 Results for synthetic DFG g P C Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
44 Results Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
45 Results Holistic approach is always superior to classic DSE outperforms post QSS in the multi-processor case Speedups up to 9.91x for four processor cores of the intel platform 7.2x for two processor cores on the Zynq platform Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
46 Conclusion Holistic compilation flow combines system synthesis and the use of QSSs in a DSE to maximize application throughput for a given target platform DSE explores the clustering of static actors to fine-tune the tradeoff between exploitable degree of parallelism, dynamic scheduling overhead reduction, and compiler code-generation efficiency Novel repair strategy ensures the creation of feasible clusters Each implementation candidate is compiled to and measured on real target hardware (synthesis-in-the-loop) Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
47 Thank you for your attention! Questions? Acknowledgements: This work is supported in part by the German Science Foundation (DFG), HA 4463/3-2 and TE 163/18-2. Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES
SysteMoC. Verification and Refinement of Actor-Based Models of Computation
SysteMoC Verification and Refinement of Actor-Based Models of Computation Joachim Falk, Jens Gladigau, Christian Haubelt, Joachim Keinert, Martin Streubühr, and Jürgen Teich {falk, haubelt}@cs.fau.de Hardware-Software-Co-Design
More informationCombined System Synthesis and Communication Architecture Exploration for MPSoCs
This is the author s version of the work. The definitive work was published in Proceedings of Design, Automation and Test in Europe (DATE 2009), pp. 472-477, 2009. The work is supported in part by the
More informationRuntime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays
Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann
More informationGeneration of Multigrid-based Numerical Solvers for FPGA Accelerators
Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System
More informationReliable Embedded Multimedia Systems?
2 Overview Reliable Embedded Multimedia Systems? Twan Basten Joint work with Marc Geilen, AmirHossein Ghamarian, Hamid Shojaei, Sander Stuijk, Bart Theelen, and others Embedded Multi-media Analysis of
More informationDesign Space Exploration
Design Space Exploration SS 2012 Jun.-Prof. Dr. Christian Plessl Custom Computing University of Paderborn Version 1.1.0 2012-06-15 Overview motivation for design space exploration design space exploration
More informationNoC Simulation in Heterogeneous Architectures for PGAS Programming Model
NoC Simulation in Heterogeneous Architectures for PGAS Programming Model Sascha Roloff, Andreas Weichslgartner, Frank Hannig, Jürgen Teich University of Erlangen-Nuremberg, Germany Jan Heißwolf Karlsruhe
More informationHardware/Software Codesign
Hardware/Software Codesign SS 2016 Prof. Dr. Christian Plessl High-Performance IT Systems group University of Paderborn Version 2.2.0 2016-04-08 how to design a "digital TV set top box" Motivating Example
More informationAn FPGA Based Adaptive Viterbi Decoder
An FPGA Based Adaptive Viterbi Decoder Sriram Swaminathan Russell Tessier Department of ECE University of Massachusetts Amherst Overview Introduction Objectives Background Adaptive Viterbi Algorithm Architecture
More informationBy: Chaitanya Settaluri Devendra Kalia
By: Chaitanya Settaluri Devendra Kalia What is an embedded system? An embedded system Uses a controller to perform some function Is not perceived as a computer Software is used for features and flexibility
More informationControl Performance-Aware System Level Design
Control Performance-Aware System Level Design Nina Mühleis, Michael Glaß, and Jürgen Teich Hardware/Software Co-Design, University of Erlangen-Nuremberg Email: {nina.muehleis, glass, teich}@cs.fau.de Abstract
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationIndustrial Multicore Software with EMB²
Siemens Industrial Multicore Software with EMB² Dr. Tobias Schüle, Dr. Christian Kern Introduction In 2022, multicore will be everywhere. (IEEE CS) Parallel Patterns Library Apple s Grand Central Dispatch
More informationEfficient Symbolic Multi Objective Design Space Exploration
This is the author s version of the work. The definitive work was published in Proceedings of the 3th Asia and South Pacific Design Automation Conference (ASP-DAC 2008), pp. 69-696, 2008. The work is supported
More informationExtensions of Daedalus Todor Stefanov
Extensions of Daedalus Todor Stefanov Leiden Embedded Research Center, Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Overview of Extensions in Daedalus DSE limited to
More informationAutomatic Generation of System-Level Virtual Prototypes from Streaming Application Models
Automatic Generation of System-Level Virtual Prototypes from Streaming Application Models Philipp Kutzer, Jens Gladigau, Christian Haubelt, and Jürgen Teich Hardware/Software Co-Design, Department of Computer
More informationSYSTEMCODESIGNER An Automatic ESL Synthesis Approach by Design Space Exploration and Behavioral Synthesis for Streaming Applications
1 SYSTEMCODESIGNER An Automatic ESL Synthesis Approach by Design Space Exploration and Behavioral Synthesis for Streaming Applications JOACHIM KEINERT, MARTIN STREUBŪHR, THOMAS SCHLICHTER, JOACHIM FALK,
More informationA Parallelizing Compiler for Multicore Systems
A Parallelizing Compiler for Multicore Systems José M. Andión, Manuel Arenaz, Gabriel Rodríguez and Juan Touriño 17th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2014)
More informationMulti-Objective Distributed Run-time Resource Management for Many-Cores
Multi-Objective Distributed Run-time Resource Management for Many-Cores Stefan Wildermann, Michael Glaß, Jürgen Teich University of Erlangen-Nuremberg, Germany {stefan.wildermann, michael.glass, teich}@fau.de
More informationMulticore DSP Software Synthesis using Partial Expansion of Dataflow Graphs
Multicore DSP Software Synthesis using Partial Expansion of Dataflow Graphs George F. Zaki, William Plishker, Shuvra S. Bhattacharyya University of Maryland, College Park, MD, USA & Frank Fruth Texas Instruments
More informationMULTI-PROCESSOR SYSTEM-LEVEL SYNTHESIS FOR MULTIPLE APPLICATIONS ON PLATFORM FPGA
MULTI-PROCESSOR SYSTEM-LEVEL SYNTHESIS FOR MULTIPLE APPLICATIONS ON PLATFORM FPGA Akash Kumar,, Shakith Fernando, Yajun Ha, Bart Mesman and Henk Corporaal Eindhoven University of Technology, Eindhoven,
More informationHETEROGENEOUS MULTIPROCESSOR MAPPING FOR REAL-TIME STREAMING SYSTEMS
HETEROGENEOUS MULTIPROCESSOR MAPPING FOR REAL-TIME STREAMING SYSTEMS Jing Lin, Akshaya Srivasta, Prof. Andreas Gerstlauer, and Prof. Brian L. Evans Department of Electrical and Computer Engineering The
More informationESE532: System-on-a-Chip Architecture. Today. Programmable SoC. Message. Process. Reminder
ESE532: System-on-a-Chip Architecture Day 5: September 18, 2017 Dataflow Process Model Today Dataflow Process Model Motivation Issues Abstraction Basic Approach Dataflow variants Motivations/demands for
More informationFiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers
FiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers Rene Griessl, Meysam Peykanu, Lennart Tigges, Jens Hagemeyer, Mario Porrmann Center of Excellence Cognitive Interaction Technology
More informationMULTIDIMENSIONAL DATAFLOW GRAPH MODELING AND MAPPING FOR EFFICIENT GPU IMPLEMENTATION
In Proceedings of the IEEE Workshop on Signal Processing Systems, Quebec City, Canada, October 2012. MULTIDIMENSIONAL DATAFLOW GRAPH MODELING AND MAPPING FOR EFFICIENT IMPLEMENTATION Lai-Huei Wang 1, Chung-Ching
More informationA MODEL-BASED INTER-PROCESS RESOURCE SHARING APPROACH FOR HIGH-LEVEL SYNTHESIS OF DATAFLOW GRAPHS
A MODEL-BAED INTER-PROCE REOURCE HARING APPROACH FOR HIGH-LEVEL YNTHEI OF DATAFLOW GRAPH Christian Zebelein 1, Joachim Falk 2, Christian Haubelt 1, Jürgen Teich 2 1 University of Rostock 2 University of
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationMulti-Variant-based Design Space Exploration for Automotive Embedded Systems
Multi-Variant-based Design Space Exploration for Automotive Embedded Systems Sebastian Graf, Michael Glaß, and Jürgen Teich Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany Email: sebastian.graf,
More informationReliable Dynamic Embedded Data Processing Systems
2 Embedded Data Processing Systems Reliable Dynamic Embedded Data Processing Systems sony Twan Basten thales Joint work with Marc Geilen, AmirHossein Ghamarian, Hamid Shojaei, Sander Stuijk, Bart Theelen,
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationMapping of Applications to Multi-Processor Systems
Mapping of Applications to Multi-Processor Systems Peter Marwedel TU Dortmund, Informatik 12 Germany Marwedel, 2003 Graphics: Alexandra Nolte, Gesine 2011 年 12 月 09 日 These slides use Microsoft clip arts.
More informationClassification of General Data Flow Actors into Known Models of Computation
Classification of General Data Flow Actors into Known Models of Computation Christian Zebelein, Joachim Falk, Christian Haubelt, and Jürgen Teich Hardware/Software Co-Design University Erlangen-Nuremberg,
More informationPreliminary Discussion
Preliminary Discussion Multi-Core Architectures and Programming Oliver Reiche, Christian Schmitt, Michael Witterauf, Frank Hannig Hardware/Software Co-Design, Friedrich-Alexander University Erlangen-Nürnberg
More informationPetaBricks. With the dawn of the multi-core era, programmers are being challenged to write
PetaBricks Building adaptable and more efficient programs for the multi-core era is now within reach. By Jason Ansel and Cy Chan DOI: 10.1145/1836543.1836554 With the dawn of the multi-core era, programmers
More informationPetri Nets. Petri Nets. Petri Net Example. Systems are specified as a directed bipartite graph. The two kinds of nodes in the graph:
System Design&Methodologies Fö - 1 System Design&Methodologies Fö - 2 Petri Nets 1. Basic Petri Net Model 2. Properties and Analysis of Petri Nets 3. Extended Petri Net Models Petri Nets Systems are specified
More informationMODELING OF BLOCK-BASED DSP SYSTEMS
MODELING OF BLOCK-BASED DSP SYSTEMS Dong-Ik Ko and Shuvra S. Bhattacharyya Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies University of Maryland, College
More informationHardware-Software Codesign. 1. Introduction
Hardware-Software Codesign 1. Introduction Lothar Thiele 1-1 Contents What is an Embedded System? Levels of Abstraction in Electronic System Design Typical Design Flow of Hardware-Software Systems 1-2
More informationSoftware Synthesis from Dataflow Models for G and LabVIEW
Software Synthesis from Dataflow Models for G and LabVIEW Hugo A. Andrade Scott Kovner Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 andrade@mail.utexas.edu
More informationModelling, Analysis and Scheduling with Dataflow Models
technische universiteit eindhoven Modelling, Analysis and Scheduling with Dataflow Models Marc Geilen, Bart Theelen, Twan Basten, Sander Stuijk, AmirHossein Ghamarian, Jeroen Voeten Eindhoven University
More informationA System-Level Synthesis Approach from Formal Application Models to Generic Bus-Based MPSoCs
A System-Level Synthesis Approach from Formal Application s to Generic Bus-Based MPSoCs Jens Gladigau 1, Andreas Gerstlauer 2, Christian Haubelt 1, Martin Streubühr 1, and Jürgen Teich 1 1 Department of
More informationSTATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS
STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS Sukumar Reddy Anapalli Krishna Chaithanya Chakilam Timothy W. O Neil Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science The
More informationSelf-Adaptive FPGA-Based Image Processing Filters Using Approximate Arithmetics
Self-Adaptive FPGA-Based Image Processing Filters Using Approximate Arithmetics Jutta Pirkl, Andreas Becher, Jorge Echavarria, Jürgen Teich, and Stefan Wildermann Hardware/Software Co-Design, Friedrich-Alexander-Universität
More informationWaveScalar. Winter 2006 CSE WaveScalar 1
WaveScalar Dataflow machine good at exploiting ILP dataflow parallelism traditional coarser-grain parallelism cheap thread management memory ordering enforced through wave-ordered memory Winter 2006 CSE
More informationExtending Model-Based Design for HW/SW Design and Verification in MPSoCs Jim Tung MathWorks Fellow
Extending Model-Based Design for HW/SW Design and Verification in MPSoCs Jim Tung MathWorks Fellow jim@mathworks.com 2014 The MathWorks, Inc. 1 Model-Based Design: From Concept to Production RESEARCH DESIGN
More informationOhua: Implicit Dataflow Programming for Concurrent Systems
Ohua: Implicit Dataflow Programming for Concurrent Systems Sebastian Ertel Compiler Construction Group TU Dresden, Germany Christof Fetzer Systems Engineering Group TU Dresden, Germany Pascal Felber Institut
More informationMcRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime
McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime B. Saha, A-R. Adl- Tabatabai, R. Hudson, C.C. Minh, B. Hertzberg PPoPP 2006 Introductory TM Sales Pitch Two legs
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationAutomated Space/Time Scaling of Streaming Task Graphs. Hossein Omidian Supervisor: Guy Lemieux
Automated Space/Time Scaling of Streaming Task Graphs Hossein Omidian Supervisor: Guy Lemieux 1 Contents Introduction KPN-based HLS Tool for MPPA overlay Experimental Results Future Work Conclusion 2 Introduction
More informationCo-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,
Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms SAMOS XIV July 14-17, 2014 1 Outline Introduction + Motivation Design requirements for many-accelerator SoCs Design problems
More informationOptimal Architectures for Massively Parallel Implementation of Hard. Real-time Beamformers
Optimal Architectures for Massively Parallel Implementation of Hard Real-time Beamformers Final Report Thomas Holme and Karen P. Watkins 8 May 1998 EE 382C Embedded Software Systems Prof. Brian Evans 1
More informationFine-grained Transaction Scheduling in Replicated Databases via Symbolic Execution
Fine-grained Transaction Scheduling in Replicated Databases via Symbolic Execution Raminhas pedro.raminhas@tecnico.ulisboa.pt Stage: 2 nd Year PhD Student Research Area: Dependable and fault-tolerant systems
More informationFault Tolerance Analysis of Distributed Reconfigurable Systems Using SAT-Based Techniques
In Field-Programmable Logic and Applications by Peter Y. K. Cheung, George A. Constantinides, and Jose T. de Sousa (Eds.). In Lecture Notes in Computer Science (LNCS), Volume 2778, c Springer, Berlin,
More informationLOGIC SYNTHESIS AND VERIFICATION ALGORITHMS. Gary D. Hachtel University of Colorado. Fabio Somenzi University of Colorado.
LOGIC SYNTHESIS AND VERIFICATION ALGORITHMS by Gary D. Hachtel University of Colorado Fabio Somenzi University of Colorado Springer Contents I Introduction 1 1 Introduction 5 1.1 VLSI: Opportunity and
More informationPartitioning Methods. Outline
Partitioning Methods 1 Outline Introduction to Hardware-Software Codesign Models, Architectures, Languages Partitioning Methods Design Quality Estimation Specification Refinement Co-synthesis Techniques
More informationFiona A Tool to Analyze Interacting Open Nets
Fiona A Tool to Analyze Interacting Open Nets Peter Massuthe and Daniela Weinberg Humboldt Universität zu Berlin, Institut für Informatik Unter den Linden 6, 10099 Berlin, Germany {massuthe,weinberg}@informatik.hu-berlin.de
More informationAutomatic Graph-based Success Tree Construction and Analysis
Automatic Graph-based Success Tree Construction and Analysis Hananeh Aliee, Michael Glaß, Dr.-Ing., Rolf Wanka, Dr., Jürgen Teich, Dr.-Ing., Key Words: Success tree, fault tree, transient fault, permanent
More informationFILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas
FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given
More informationEmbedded Systems CS - ES
Embedded Systems - 1 - Synchronous dataflow REVIEW Multiple tokens consumed and produced per firing Synchronous dataflow model takes advantage of this Each edge labeled with number of tokens consumed/produced
More informationCover Page. The handle holds various files of this Leiden University dissertation
Cover Page The handle http://hdl.handle.net/1887/32963 holds various files of this Leiden University dissertation Author: Zhai, Jiali Teddy Title: Adaptive streaming applications : analysis and implementation
More informationPerformance Analysis of Weakly-Consistent Scenario-Aware Dataflow Graphs
Performance Analysis of Weakly-Consistent Scenario-Aware Dataflow Graphs Marc Geilen 1, Joachim Falk 2, Christian Haubelt 3, Twan Basten 1,4, Bart Theelen 4 and Sander Stuijk 1 1 Electrical Engineering
More informationSystemCoDesigner: Performance Modeling Tutorial
SystemCoDesigner: Performance Modeling Tutorial Martin Streubühr Sebastian Graf Hardware/Software Co-Design Department of Computer Science University of Erlangen-Nuremberg Germany Version 1.4 Christian
More informationFrom Temporal Partitioning and Temporal Placement to Algorithmic Skeletons
From Temporal Partitioning and Temporal Placement to Algorithmic Skeletons Florian Dittmann, Franz J. Rammig Heinz Nixdorf Institute University of Paderborn, Germany Motivation Making reconfigurable computing
More informationSystem Level Modeling and Performance Simulation for Dynamic Reconfigurable Computing Systems in SystemC
In Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen. by Christian Haubelt, Jürgen Teich (Eds.). GI/ITG/GMM-Workshop, Shaker Verlag, Aachen, March 5-7, 2007
More informationLars Schor, and Lothar Thiele ETH Zurich, Switzerland
Iuliana Bacivarov, Wolfgang Haid, Kai Huang, Lars Schor, and Lothar Thiele ETH Zurich, Switzerland Efficient i Execution of KPN on MPSoC Efficiency regarding speed-up small memory footprint portability
More informationAn Evaluation of Domain-Specific Language Technologies for Code Generation
An Evaluation of Domain-Specific Language Technologies for Code Generation Christian Schmitt, Sebastian Kuckuk, Harald Köstler, Frank Hannig, Jürgen Teich Hardware/Software Co-Design, System Simulation,
More informationDataflow: The Road Less Complex
Dataflow: The Road Less Complex Steven Swanson Ken Michelson Andrew Schwerin Mark Oskin University of Washington Sponsored by NSF and Intel Things to keep you up at night (~2016) Opportunities 8 billion
More informationSystemCoDesigner: System Programming Tutorial
SystemCoDesigner: System Programming Tutorial Martin Streubühr Falk Christian Zebelein Hardware/Software Co-Design Department of Computer Science University of Erlangen-Nuremberg Germany Version 1.2 Christian
More informationSAT, SMT and QBF Solving in a Multi-Core Environment
SAT, SMT and QBF Solving in a Multi-Core Environment Bernd Becker Tobias Schubert Faculty of Engineering, Albert-Ludwigs-University Freiburg, 79110 Freiburg im Breisgau, Germany {becker schubert}@informatik.uni-freiburg.de
More informationParameterized Modeling and Scheduling for Dataflow Graphs 1
Technical Report #UMIACS-TR-99-73, Institute for Advanced Computer Studies, University of Maryland at College Park, December 2, 999 Parameterized Modeling and Scheduling for Dataflow Graphs Bishnupriya
More informationMapping of Applications to Multi-Processor Systems
Springer, 2010 Mapping of Applications to Multi-Processor Systems Peter Marwedel TU Dortmund, Informatik 12 Germany 2014 年 01 月 17 日 These slides use Microsoft clip arts. Microsoft copyright restrictions
More informationFirst Experiences with Intel Cluster OpenMP
First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May
More informationDepartment Informatik. EMPYA: An Energy-Aware Middleware Platform for Dynamic Applications. Technical Reports / ISSN
Department Informatik Technical Reports / ISSN 2191-58 Christopher Eibel, Thao-Nguyen Do, Robert Meißner, Tobias Distler EMPYA: An Energy-Aware Middleware Platform for Dynamic Applications Technical Report
More informationA 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation
A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,
More informationOperating system integrated energy aware scratchpad allocation strategies for multiprocess applications
University of Dortmund Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications Robert Pyka * Christoph Faßbach * Manish Verma + Heiko Falk * Peter Marwedel
More informationEmbedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi
Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Lecture - 34 Compilers for Embedded Systems Today, we shall look at the compilers, which
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationEE382N: Embedded System Design and Modeling
EE382N: Embedded System Design and Modeling Lecture 11 Mapping & Exploration Andreas Gerstlauer Electrical and Computer Engineering University of Texas at Austin gerstl@ece.utexas.edu Lecture 11: Outline
More informationActor-oriented Modeling and Simulation of Cut-through Communication in Network Controllers
In Proceedings of Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV), Kaiserslautern, Germany, March 05-07, 2012 Actor-oriented Modeling and Simulation
More informationE-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems
E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems Rebecca Taft, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J. Elmore, Ashraf Aboulnaga, Andrew Pavlo, Michael
More informationGPU Implementation of a Multiobjective Search Algorithm
Department Informatik Technical Reports / ISSN 29-58 Steffen Limmer, Dietmar Fey, Johannes Jahn GPU Implementation of a Multiobjective Search Algorithm Technical Report CS-2-3 April 2 Please cite as: Steffen
More informationComputer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics
Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing
More informationA Thermal-aware Application specific Routing Algorithm for Network-on-chip Design
A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design Zhi-Liang Qian and Chi-Ying Tsui VLSI Research Laboratory Department of Electronic and Computer Engineering The Hong Kong
More informationFeature Detection Plugins Speed-up by
Feature Detection Plugins Speed-up by OmpSs@FPGA Nicola Bettin Daniel Jimenez-Gonzalez Xavier Martorell Pierangelo Nichele Alberto Pomella nicola.bettin@vimar.com, pierangelo.nichele@vimar.com, alberto.pomella@vimar.com
More informationCompositionality in Synchronous Data Flow: Modular Code Generation from Hierarchical SDF Graphs
Compositionality in Synchronous Data Flow: Modular Code Generation from Hierarchical SDF Graphs Stavros Tripakis Dai Bui Marc Geilen Bert Rodiers Edward A. Lee Electrical Engineering and Computer Sciences
More informationEmbedded Systems. 7. System Components
Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic
More informationudirec: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults
1/45 1/22 MICRO-46, 9 th December- 213 Davis, California udirec: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer
More informationINTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016
NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering
More informationOptimization of Behavioral IPs in Multi-Processor System-on- Chips
Optimization of Behavioral IPs in Multi-Processor System-on- Chips Yidi Liu and Benjamin Carrion Schafer # Department of Electronic and Information Engineering b.carrionschafer@polyu.edu.hk # Outline High-Level
More informationFunctional modeling style for efficient SW code generation of video codec applications
Functional modeling style for efficient SW code generation of video codec applications Sang-Il Han 1)2) Soo-Ik Chae 1) Ahmed. A. Jerraya 2) SD Group 1) SLS Group 2) Seoul National Univ., Korea TIMA laboratory,
More informationElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer
More informationSequential Circuit Test Generation Using Decision Diagram Models
Sequential Circuit Test Generation Using Decision Diagram Models Jaan Raik, Raimund Ubar Department of Computer Engineering Tallinn Technical University, Estonia Abstract A novel approach to testing sequential
More informationSoC Systeme ultra-schnell entwickeln mit Vivado und Visual System Integrator
SoC Systeme ultra-schnell entwickeln mit Vivado und Visual System Integrator Embedded Computing Conference 2017 Matthias Frei zhaw InES Patrick Müller Enclustra GmbH 5 September 2017 Agenda Enclustra introduction
More informationPartitioning of computationally intensive tasks between FPGA and CPUs
Partitioning of computationally intensive tasks between FPGA and CPUs Tobias Welti, MSc (Author) Institute of Embedded Systems Zurich University of Applied Sciences Winterthur, Switzerland tobias.welti@zhaw.ch
More informationA High Performance Bus Communication Architecture through Bus Splitting
A High Performance Communication Architecture through Splitting Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University,West Lafayette, IN, 797, USA {lur, chengkok}@ecn.purdue.edu
More informationSymbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs
Symbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs Anan Bouakaz Pascal Fradet Alain Girault Real-Time and Embedded Technology and Applications Symposium, Vienna April 14th, 2016
More informationGRAph Parallel Actor Language A Programming Language for Parallel Graph Algorithms
GRAph Parallel Actor Language A Programming Language for Parallel Graph Algorithms Thesis by Michael delorimier In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy California
More informationHabermann, Philipp Design and Implementation of a High- Throughput CABAC Hardware Accelerator for the HEVC Decoder
Habermann, Philipp Design and Implementation of a High- Throughput CABAC Hardware Accelerator for the HEVC Decoder Conference paper Published version This version is available at https://doi.org/10.14279/depositonce-6780
More informationSearch Space Reduction for E/E-Architecture Partitioning
Search Space Reduction for E/E-Architecture Partitioning Andreas Ettner Robert Bosch GmbH, Corporate Reasearch, Robert-Bosch-Campus 1, 71272 Renningen, Germany andreas.ettner@de.bosch.com Abstract. As
More informationGPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC
GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT
More informationMassively Parallel Phase Field Simulations using HPC Framework walberla
Massively Parallel Phase Field Simulations using HPC Framework walberla SIAM CSE 2015, March 15 th 2015 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich
More information