Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling

Size: px

Start display at page:

Download "Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling"

Delilah Melton
5 years ago
Views:

Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling Tobias Schwarzer 1, Joachim Falk 1, Michael Glaß 1,

1 Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling Tobias Schwarzer 1, Joachim Falk 1, Michael Glaß 1, Jürgen Teich 1, Christian Zebelein 2, Christian Haubelt 2 1 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany 2 University of Rostock, Germany

2 Fine-granular Modeling with Data Flow r CPU1 r CPU2 a 7 Advantages: Exposes all degrees of available parallelism High flexibility of mapping actors to resources Disadvantages: Scheduling overhead at runtime May sacrifize the code-generation efficiency of the compiler Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

Coarse-granular Modeling c r CPU1 r CPU2 c a 7 Advantages: High code-generation efficiency Low scheduling overhead at runtime Disadvantages: May sacrifize parallelism

3 Coarse-granular Modeling c r CPU1 r CPU2 c a 7 Advantages: High code-generation efficiency Low scheduling overhead at runtime Disadvantages: May sacrifize parallelism significantly Reduces mapping flexibility Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

4 Proposed Solution Input: Fine-grained (data flow) modelling of applications Model of the target platform Given: a clustering technique [FKHTB08] to automatically combine fine-grained actors to composite actors using Quasi-Static Scheduling (QSS) Key Idea: Perform a Design Space Exploration (DSE) to find the best combinations of clusters and their mapping to available resources in terms of application throughput Contributions: A genetic representation for the exploration of clusters A repair strategy to efficiently restrict the search to feasible clusters only Synthesis-in-the-Loop: Automatic synthesis of each implementation candidate to the target platform to determine quality numbers, capturing the effects of (a) scheduling overhead, (b) memory accesses, and (c) compiler optimizations FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

5 Proposed System Compilation Flow a 7 a8 Automatic extraction of the dataflow graph Dataflow application Input Read in executable specification a 7 r CPU1 r CPU2 Explore design space Step 1: Determine clustering Step 2: Perform binding Architecture graph Synthesis-in-the-loop Run executable Obtain throughput DSE Evaluate implementation on target hardware r CPU3 r CPU4 Non-dominated design points Output Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

6 Fundamentals

7 Classic System Synthesis Problem graph g p = (A, C): Architecture graph g r = (R, L): r CPU1 r CPU2 r CPU3 r CPU4 a 7 dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

8 Classic System Synthesis Problem graph g p = (A, C): Architecture graph g r = (R, L): r CPU1 r CPU2 Binding β r CPU3 r CPU4 a 7 dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

9 Clustering-aware System Synthesis Clustering: Replace a connected subgraph containing only static actors of a DFG by a single composite actor cluster 1 a 7 dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

10 Clustering-aware System Synthesis Clustering: Replace a connected subgraph containing only static actors of a DFG by a single composite actor Each composite actor contains a Quasi-Static Schedule (QSS) executing sequences of statically scheduled actor firings i 1 QSS 1 o 1 QSS is represented by a Finite State Machine (FSM) o 2 #i 1 1 & #o 2 1 / < > Details in [FKHTB08] q 0 q 1 FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , dynamic actors #i 1 1 & #o 1 1 & #o 2 1 / <,a 7, > static actors FSM state Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

11 Clustering-aware System Synthesis cluster 2 Clustering: Replace a connected subgraph containing only static actors of a DFG by a single composite actor Each composite actor contains a Quasi-Static Schedule (QSS) executing sequences of statically scheduled actor firings cluster 1 QSS is represented by a Finite State Machine (FSM) a 7 Details in [FKHTB08] FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

12 Clustering-aware System Synthesis cluster 2 Clustering: Replace a connected subgraph containing only static actors of a DFG by a single composite actor Each composite actor contains a Quasi-Static Schedule (QSS) executing sequences of statically scheduled actor firings cluster 1 QSS is represented by a Finite State Machine (FSM) a 7 Details in [FKHTB08] FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

13 Clustering-aware System Synthesis cluster 2 r CPU1 r CPU2 Binding β cluster 1 r CPU3 r CPU4 a 7 dynamic actors static actors Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

14 Design Space Exploration

Design Space Exploration - Opt4J Custom optimization problems are typically defined by providing three main classes: Creator: Generate random genotype objects Decoder: Convert the genotype into a

15 Design Space Exploration - Opt4J Custom optimization problems are typically defined by providing three main classes: Creator: Generate random genotype objects Decoder: Convert the genotype into a phenotype Evaluator: Determine the quality of one phenotype Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

Design Space Exploration - Creator Generate random genotype objects: Associate each channel of the problem graph with a Boolean value using the clusterization function γ: γ = C S {,false} channel

16 Design Space Exploration - Creator Generate random genotype objects: Associate each channel of the problem graph with a Boolean value using the clusterization function γ: γ = C S {,false} channel should be contained in a cluster false channel should not be contained in a cluster Exclude channels connected to a dynamic actor from genotype false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

17 Design Space Exploration - Decoder How to convert the genotype (/false edges) to a phenotype (clusters)? cluster 2? cluster 1 false a 7 a 7 dynamic actors static actors Conversion consists of two steps: 1. Determine clustering candidates 2. Repair clustering candidates to obtain a feasible cluster Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

18 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

19 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

20 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with stop? false stop? a 7 stop? Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

21 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

22 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with stop? false stop? a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

23 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

24 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 clustering candidate 1 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

25 Step 1: Determine Clustering Candidate Collect reachable actors starting at channel associated with false a 7 clustering candidate 1 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

26 Step 1: Determine Clustering Candidate clustering candidate 2 Collect reachable actors starting at channel associated with false a 7 clustering candidate 1 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

27 Step 2: Repair Strategy false a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

28 Step 2: Repair Strategy a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

29 Step 2: Repair Strategy a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

30 Step 2: Repair Strategy Clustering condition: For each pair of cluster input and output actor, there exists a directed path from the input actor to the output actor. a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

31 Step 2: Repair Strategy a 5 a 7 Clustering condition: For each pair of cluster input and output actor, there exists a directed path from the input actor to the output actor. Repair strategy: Transform an infeasible clustering candidate into (multiple) feasible clusters cluster candidate input actors {a2,a6} clustering candidate output actors {a3,a4,a7} Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

32 Step 2: Repair Strategy Repair clustering candidate to obtain feasible cluster: a 7 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

33 Step 2: Repair Strategy Repair clustering candidate to obtain feasible cluster: a 7 cluster candidate input actors {a2,a6} clustering candidate output actors {a3,a4,a7} Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

34 Step 2: Repair Strategy I = { } O= {, } Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor a 7 cluster candidate input actors {a2,a6} clustering candidate output actors {a3,a4,a7} Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

35 Step 2: Repair Strategy I = {, } O= {, } I = { } O= {, } I = {, } O= {, } Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor I = { } O= {,,a 7 } I = { } O= {,,a 7 } a 7 I = { } O= {, } I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

36 Step 2: Repair Strategy EC 4 I = {, } O= {, } EC 2 I = { } O= {, } EC 1 I = {, } O= {, } I = { } O= {,,a 7 } I = { } O= {,,a 7 } Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor Create equivalence classes (ECs) a 7 I = { } O= {, } EC 3 I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

37 Step 2: Repair Strategy EC 4 I = {, } O= {, } EC 2 I = { } O= {, } EC 1 I = {, } O= {, } I = { } O= {,,a 7 } I = { } O= {,,a 7 } a 7 Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor Create equivalence classes (ECs) Determine clashes between each EC I = { } O= {, } EC 3 I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

38 Step 2: Repair Strategy EC 4 I = {, } O= {, } I = { } O= {, } EC 1 I = { } O= {,,a 7 } I = { } O= {,,a 7 } a 7 Repair clustering candidate to obtain feasible cluster: Determine set of predecessor input actors and set of successor output actors for each actor Clash: Detect I = {a the 2, } situation when an output actor is not connected O= {,a to 4 } all input actors, since this may result in the infinite token accumulation problem. (Formal details given in the paper) EC 2 Create equivalence classes (ECs) Determine clashes between each EC I = { } O= {, } EC 3 I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

39 Step 2: Repair Strategy I = { } O= {, } EC 1 EC 4 I = {, } O= {, } EC 2 I = {, } O= {, } I = { } O= {,,a 7 } I = { } O= {,,a 7 } Clash exists between EC 2 and EC 4! Repair strategy splits the cluster candidate between and or between and. a 7 I = { } O= {, } EC 3 I = { } O= {,,a 7 } Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

Design Space Exploration - Evaluator Determine the quality of one phenotype:

target platform with respect to the design objectives: Throughput

Explore design space Step 1: Determine clustering Step 2: Perform binding

Evaluate implementation on target hardware r CPU3 r CPU4 Schwarzer et al.

40 Design Space Exploration - Evaluator Determine the quality of one phenotype: Each implementation candidate is evaluated via synthesis-in-the-loop on the target platform with respect to the design objectives: Throughput [iterations/s] (MAX) Number of used processor cores (MIN) a 7 r CPU1 r CPU2 Explore design space Step 1: Determine clustering Step 2: Perform binding Architecture graph Synthesis-in-the-loop Run executable Obtain throughput Evaluate implementation on target hardware r CPU3 r CPU4 Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

41 Experimental Results

42 Experimental results Test suite: Six synthetic DFGs using the SDF 3 tool Brake-by-wire control application (BBW) Target platforms: Intel(R) Core(TM) i GHz Xilinx Zynq-7000 All Programmable SoC with a dual-core ARM Cortex- A9 667MHz Approaches: classic (DSE without clustering) post QSS (DSE with clustering as a post-processing step) Direct combination of [FKHTB08] and DSE Proposed holistic DSE flow FKHTB08: J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S.Bhattacharyya: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In Proc. of EMSOFT, pages , Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

43 Results for synthetic DFG g P C Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

44 Results Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

45 Results Holistic approach is always superior to classic DSE outperforms post QSS in the multi-processor case Speedups up to 9.91x for four processor cores of the intel platform 7.2x for two processor cores on the Zynq platform Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

46 Conclusion Holistic compilation flow combines system synthesis and the use of QSSs in a DSE to maximize application throughput for a given target platform DSE explores the clustering of static actors to fine-tune the tradeoff between exploitable degree of parallelism, dynamic scheduling overhead reduction, and compiler code-generation efficiency Novel repair strategy ensures the creation of feasible clusters Each implementation candidate is compiled to and measured on real target hardware (synthesis-in-the-loop) Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

47 Thank you for your attention! Questions? Acknowledgements: This work is supported in part by the German Science Foundation (DFG), HA 4463/3-2 and TE 163/18-2. Schwarzer et al. Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling SCOPES

SysteMoC. Verification and Refinement of Actor-Based Models of Computation

SysteMoC. Verification and Refinement of Actor-Based Models of Computation SysteMoC Verification and Refinement of Actor-Based Models of Computation Joachim Falk, Jens Gladigau, Christian Haubelt, Joachim Keinert, Martin Streubühr, and Jürgen Teich {falk, haubelt}@cs.fau.de Hardware-Software-Co-Design