Efficient Application Mapping on CGRAs Based on Backward Simultaneous Scheduling / Binding and Dynamic Graph Transformations



1 Efficient Application Mapping on CGRAs Based on Backward Simultaneous Scheduling / Binding and Dynamic Graph Transformations
T. Peyret (1), G. Corre (1), M. Thevenin (1), K. Martin (2), P. Coussy (2)
(1) CEA, LIST, Electronic Architectures and Sensors Laboratory (LCAE), F-91191 Gif-sur-Yvette, France
(2) Université de Bretagne-Sud, Lab-STICC, Lorient, France
ASAP 2014 Conference

2 COARSE-GRAINED RECONFIGURABLE ARCHITECTURE (CGRA)
- Processing Elements (PE) / Tiles
  - Homogeneous/heterogeneous
  - Register Files (RF)
  - Operators
- Interconnection network
  - Mesh 1D, 2D, Torus, Segmented
- Example: 4×4 CGRA, torus 2D mesh
[Figure: 4×4 grid of PEs; each PE contains a functional unit (FU) and a local RF, with inputs from and outputs to neighbours and memory]
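To illustrate the 4×4 torus topology named on this slide, here is a minimal Python sketch. It assumes tiles are indexed by (row, column) and that each tile is linked to its four nearest neighbours with wrap-around at the edges. The function name `torus_neighbours` is hypothetical and does not come from the presentation.

```python
def torus_neighbours(row, col, rows=4, cols=4):
    """Return the four neighbours of tile (row, col) on a 2D torus mesh."""
    return [
        ((row - 1) % rows, col),  # north (row 0 wraps to the last row)
        ((row + 1) % rows, col),  # south (last row wraps to row 0)
        (row, (col - 1) % cols),  # west (column 0 wraps to the last column)
        (row, (col + 1) % cols),  # east (last column wraps to column 0)
    ]
```

On a torus every tile has the same number of neighbours, including corner tiles, which is what distinguishes it from a plain 2D mesh.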

3 MAPPING ON CGRA
- Scheduling and binding are two NP-complete problems
- Separate resolution
  - Heuristic and meta-heuristic (e.g. EMS, VPR)
  - Heuristic and exact method (e.g. EPIMap, REGIMap)
- Merged resolution
  - Exact methods (e.g. ILP-based)
  - Meta-heuristic (e.g. DRESC)
- Purpose: a mapping flow that deeply explores the solution space for the entire application code

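4 MAPPING FLOW
[Figure: mapping flow. Application & CGRA models: the C code is compiled into a CDFG, alongside a CGRA model. Mapping tool: schedule and bind the highest-priority node. If a solution is found ("Solutions?" = Yes), prune the mappings and move on until the last node is reached ("Last Node?" = Yes). If no solution is found, apply a graph transformation and retry when the graph has changed ("Changes?" = Yes). Output: list of mappings.]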


5 APPLICATION & CGRA MODELS
- Compilation: C → Control Data Flow Graph (CDFG) with GCC
- A CDFG is composed of basic blocks and a control part
- Basic blocks are represented by Data Flow Graphs (DFG)
- New kind of node: memorization operation nodes

6 APPLICATION & CGRA MODELS
- Example of a 2-tile CGRA with RF
- Memorization operators are introduced to cope with the RF
[Figure: DFG operations bound to tiles A and B of the 2-tile CGRA, with RF nodes, over cycles i, i + 1 and i + 2]

7 APPLICATION & CGRA MODELS
- Homomorphic CGRA and DFG models
  - Memorization nodes: keep data dependencies
- Equivalence between nodes:
  - Operators ↔ Operations
  - Registers ↔ Data
- Binding ⇔ finding the DFG in the CGRA model
[Figure: DFG bound onto the time-extended 2-tile CGRA model over cycles i, i + 1 and i + 2]
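The idea that binding amounts to finding the DFG inside the CGRA model can be sketched as a simple validity check. This is a simplified illustration, not the authors' model: it assumes each DFG node (operation or memorization node) is bound to a (resource, cycle) slot, that a consumer runs exactly one cycle after its producer, and that `cgra_links` lists the allowed (source, destination) resource pairs, self-links included. All names are hypothetical.

```python
def is_valid_binding(dfg_edges, binding, cgra_links):
    """Check that `binding` embeds the DFG into the time-extended CGRA model.

    dfg_edges:  list of (producer, consumer) node pairs
    binding:    dict node -> (resource, cycle)
    cgra_links: set of (src_resource, dst_resource) pairs (assumed model)
    """
    used = set()
    for slot in binding.values():
        if slot in used:  # two nodes on the same resource in the same cycle
            return False
        used.add(slot)
    for producer, consumer in dfg_edges:
        (res_p, cyc_p), (res_c, cyc_c) = binding[producer], binding[consumer]
        if cyc_c != cyc_p + 1:  # data must be consumed on the next cycle
            return False
        if (res_p, res_c) not in cgra_links:  # no path between the resources
            return False
    return True
```

In this simplified view, data that must live longer than one cycle would be carried by a chain of memorization nodes, each bound like any other node.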

8 MAPPING FLOW
- Application & CGRA models
- Simultaneous scheduling and binding
  - Binding method
  - Backward list-scheduling-based scheduling
- Formal graph transformations
- Pruning step
[Figure: mapping flow as on slide 4, with a "Fail" exit when a graph transformation produces no change ("Changes?" = No)]

9 SIMULTANEOUS SCHEDULING/BINDING
- Purpose: check whether at least one binding solution exists for each node schedule
  - Avoids dead-ends due to the dependence between these two problems
  - Allows transforming the graph only when needed, and with the right transformation
- Based on Levi's algorithm
  - Solves the maximum common subgraph problem for homomorphic graphs
- Incremental version
  - Relies on previously found partial bindings
  - Adds the newly scheduled node (and its data node) to the previously considered subgraph
  - Finds every possible partial mapping
- Exhaustive method
  - If there is no binding solution, a graph transformation is required
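The incremental, exhaustive step described on this slide can be sketched as follows. This is a simplified stand-in and does not implement Levi's algorithm itself: it assumes a backward traversal (the consumers of the new node are already bound), one operator per tile, and a set `links` of allowed (source tile, destination tile) pairs with self-links included. The function and parameter names are hypothetical.

```python
def extend_partial_bindings(partials, node, cycle, tiles, successors, links):
    """Try to bind `node` at `cycle` in every previously found partial binding.

    partials:   list of dicts {op: (tile, cycle)} found so far
    successors: already-bound consumers of `node`
    links:      set of (src_tile, dst_tile) pairs (assumed CGRA model)
    """
    extended = []
    for partial in partials:
        busy = set(partial.values())
        for tile in tiles:
            if (tile, cycle) in busy:
                continue  # this operator is already used in this cycle
            # The new node must be able to reach every bound consumer.
            if all((tile, partial[s][0]) in links for s in successors):
                extended.append({**partial, node: (tile, cycle)})
    # An empty list means no binding exists: a graph transformation is needed.
    return extended
```

For example, on a fully connected 2-tile CGRA, two operations can share a cycle but a third cannot, so the third call returns an empty list and signals that the graph must be transformed.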

10 GRAPH TRANSFORMATIONS
Three dynamic transformations are proposed:
- Operation splitting
- Simple routing
- Memorization splitting

11 PRUNING STEP
- Idea: remove mappings with the same operator utilization to limit the number of partial mappings
- Executed at the end of each scheduling cycle
- Removes redundant partial mappings
- Still exhaustive
- Example: on a 2-tile CGRA
[Figure: two partial mappings of operations 1 to 4 onto tiles A and B over cycles i, i + 1 and i + 2]
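One possible reading of the pruning idea is sketched below. The slide does not spell out the exact equivalence criterion, so this sketch assumes that two partial mappings are redundant when they occupy the same set of (tile, cycle) slots, regardless of which operation sits in which slot. Whether such mappings are truly interchangeable depends on the architecture. The function name `prune` is hypothetical.

```python
def prune(partials):
    """Keep one partial mapping per distinct set of occupied (tile, cycle) slots.

    partials: list of dicts {op: (tile, cycle)}
    """
    kept = {}
    for partial in partials:
        signature = frozenset(partial.values())  # operator utilization
        kept.setdefault(signature, partial)  # first mapping with this usage wins
    return list(kept.values())
```

Because the signature ignores operation identities, swapping two operations between tiles in the same cycle yields a mapping that is dropped as redundant.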

12 EXPERIMENTS & RESULTS
- Compared with two other methods:
  - Method 1: forward list scheduling with only the routing transformation, using Levi's algorithm to bind
  - Method 2: heuristic described in EPIMap, which applies static a priori transformations (routing & splitting) to schedule, using Levi's algorithm to bind
- 4 metrics:
  - Success rate
  - Latency
  - Exploration quality
  - Exploration efficiency
- 9 application codes (FFT, DCT, …)
- 16 constraint sets per code

13 EXPERIMENTS & RESULTS
Success rate: 99% for Proposed Approach (vs 37% and 62%)
[Figure: bar chart of success rate (0 to 1) for Method 1, Method 2 and Proposed Approach on DC Filter, DCT 2D, Elliptic Filter, EMA Filter, FFT, Manhattan Distance, Matrix Product, MWD Filter, Unsharp Mask, and their average]

14 EXPERIMENTS & RESULTS
Best latency rate (percentage of time a mapping has the best latency): 90% for Proposed Approach (vs 31% and 42%)
[Figure: bar chart of best latency rate (0 to 1) for Method 1, Method 2 and Proposed Approach on DC Filter, DCT 2D, Elliptic Filter, EMA Filter, FFT, Manhattan Distance, Matrix Product, MWD Filter, Unsharp Mask, and their average]

15 CONCLUSION & PROSPECTS
- Mapping flow: C → DFGs → CGRA
  - Simultaneous scheduling / exhaustive-based binding
  - Dynamic graph transformations
- Very promising results
  - Success rate
  - Latency
  - Exploration quality and efficiency
- Future work
  - Improve pruning step
  - Improve scalability


16 Thank you for your attention
Commissariat à l'énergie atomique et aux énergies alternatives
Institut Carnot CEA LIST, Sensors and Electronic Architectures Laboratory
Centre de Saclay | bâtiment PC 72 | Gif-sur-Yvette Cedex
T. +33 (0) | Thomas.peyret@cea.fr
Etablissement public à caractère industriel et commercial | RCS Paris B

17 INTRODUCTION
Performance vs Flexibility vs Design Cost
Source: Raffin E., Déploiement d'applications multimédia sur architecture reconfigurable à gros grain : modélisation avec la programmation par contraintes, 2011

18 INTRODUCTION
- Many architectures: Morphosys, DART, MORA, ADRES, etc.
- Less automated compilation flow
  - Dedicated to an architecture
  - Not scalable (e.g. ILP-based)
  - Not versatile or with limitations (e.g. no RF or manual partitioning)
  - Only for kernel loop acceleration

19 SIMULTANEOUS SCHEDULING/BINDING
- Purpose: check whether at least one binding solution exists for each node schedule
  - Avoids dead-ends due to the dependence between these two problems
  - Allows transforming the graph only when needed, and with the right transformation
- Example: map this DFG on this CGRA
[Figure: example DFG and CGRA]

20 SIMULTANEOUS SCHEDULING/BINDING
Schedule example:
[Table: Cycle / Operation; entries not recoverable]

21 SIMULTANEOUS SCHEDULING/BINDING
Binding is impossible: ? & 14 are conflicting on tile C
[Table: binding per cycle on tiles A to E; entries not recoverable]

22 SIMULTANEOUS SCHEDULING/BINDING
Other example: ? & 14 are conflicting on tile C
[Table: binding per cycle on tiles A to E; entries not recoverable]


23 BACKWARD TRAVERSING
- Makes it possible to know whether a transformation is relevant, and which one
  - The scheduling and binding of successor nodes are already done
  - So the real needs of the current node are known
- Example: forward (without a priori transformations) vs backward
[Tables: Cycle / Operations schedules for the forward and backward traversals; entries only partly recoverable]
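The backward traversal can be illustrated with a small list-scheduling sketch. It is a simplification under stated assumptions: the DFG is acyclic, every operation takes one cycle, at most `n_tiles` operations fit in a cycle, and ready operations are taken in sorted order instead of by the priority function of the presentation. Steps are counted from the end: step 0 is the last cycle, and larger steps are earlier cycles. The function name is hypothetical.

```python
from collections import defaultdict


def backward_list_schedule(successors, n_tiles):
    """Schedule sinks first, then their producers.

    successors: dict {op: [consumer ops]}; every consumer must also be a key.
    Returns {op: step}, where step 0 is the last cycle.
    """
    step = {}
    load = defaultdict(int)  # number of operations already placed in each step
    remaining = set(successors)
    while remaining:
        # An operation is ready once all of its consumers are scheduled.
        ready = sorted(op for op in remaining
                       if all(s in step for s in successors[op]))
        if not ready:
            raise ValueError("cyclic graph: no operation is ready")
        for op in ready:
            # Must run at least one cycle before its earliest consumer.
            candidate = max((step[s] + 1 for s in successors[op]), default=0)
            while load[candidate] >= n_tiles:  # step full: go one cycle earlier
                candidate += 1
            step[op] = candidate
            load[candidate] += 1
            remaining.discard(op)
    return step
```

Because consumers are placed before producers, the needs of each operation (where its results must arrive, and when) are known at the moment it is scheduled.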


24 BACKWARD / FORWARD TRAVERSING
- Example: forward (a priori transformations) vs backward
[Tables: Cycle / Operations schedules for the forward and backward traversals; entries only partly recoverable]

25 LEVI'S ALGORITHM
- Determining the maximum common subgraph of 2 graphs is NP-complete
- Based on characteristic matrices of the graphs
  - Adjacency matrix, compatibility matrix
- Example of an adjacency matrix
[Figure: adjacency matrix example]
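The two matrices named on this slide can be sketched in a few lines. This is only an illustration of the usual compatibility test behind common-subgraph search (two node pairings are compatible when they preserve adjacency in both graphs). The variant used in the presentation for mapping a DFG onto a CGRA may use a different condition. Function names are hypothetical, and nodes are assumed to be numbered from 0.

```python
def adjacency_matrix(n, edges):
    """Return the n x n 0/1 adjacency matrix of a directed graph."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix


def compatible(pair1, pair2, adj_g, adj_h):
    """Each pair is (node of G, node of H).

    Two pairings are compatible when they use distinct nodes in both graphs
    and the adjacency between the G nodes matches that between the H nodes.
    """
    (g1, h1), (g2, h2) = pair1, pair2
    if g1 == g2 or h1 == h2:
        return False
    return (adj_g[g1][g2] == adj_h[h1][h2]
            and adj_g[g2][g1] == adj_h[h2][h1])
```

Evaluating `compatible` over all pairs of pairings fills a compatibility matrix; sets of mutually compatible pairings then correspond to common subgraphs.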

26 LEVI'S ALGORITHM
Complete example: adjacency matrix

27 LEVI'S ALGORITHM
Complete example: reduced compatibility matrix

28 LEVI'S ALGORITHM
Complete example: maximum compatibility classes

29 LEVI'S ALGORITHM
Complete example: connected maximum subgraphs

30 LEVI'S ALGORITHM
Complete example: result

31 EXPERIMENTS & RESULTS
Number of different mappings found: 3.7 and 2.4 times higher for Proposed Approach
[Figure: bar chart of the number of different mappings found by Method 1, Method 2 and Proposed Approach on DC Filter, DCT 2D, Elliptic Filter, EMA Filter, FFT, Manhattan Distance, Matrix Product, MWD Filter, Unsharp Mask, and their average]

32 EXPERIMENTS & RESULTS
Number of different mappings found per second: 2.6 and 2.2 times higher for Proposed Approach
[Figure: bar chart of the number of different mappings generated per second (0 to 1.6) by Method 1, Method 2 and Proposed Approach on DC Filter, DCT 2D, Elliptic Filter, EMA Filter, FFT, Manhattan Distance, Matrix Product, MWD Filter, Unsharp Mask, and their average]
