Automated RTR Temporal Partitioning for Reconfigurable Embedded Real-Time System Design

Size: px

Start display at page:

Download "Automated RTR Temporal Partitioning for Reconfigurable Embedded Real-Time System Design"

Beatrix Newton
6 years ago
Views:

1 Automated RTR Temporal Partitioning for Reconfigurable Embedded Real-Time System Design C. Tanougast, Y. Berviller, P. Brunet and S. Weber L. I. E. N. Laboratoire d Instrumentation Electronique de Nancy Faculté des Sciences de Nancy I, BP Vandoeuvre-lès-Nancy, cedex, France philippe.brunet@lien.uhp-nancy.fr

2 Outline I. Introduction II. Automatic RTR temporal partitioning III. Application example IV. Conclusion & Future Works

3 I. Introduction Application development approach constrained area Application specific system design approach area = design parameter Application algorithm [Time Constraint] Host CPU Application Algorithm & time constraint Embedded CPU Optimal implementation Reconfigurable unit Optimal implementation Reconfigurable array Minimize total processing time, memory bandwidth or number of reconfigurations Minimize area of the reconfigurable array which implements the datapath of the application

4 Purposes Area 100 FPGA ASIC Full Custom 1 10 ASIC Standard Cell Technologies Silicon area for a same operator type in different technologies

5 Purposes Optimization of the FPGA logic resources Prevent memory bandwidth overhead Respect of a Time Constraint Speed up conception choices Computer aided method to maximize efficiency of reconfigurable hardware

6 How to partition in RTR? 1st spatial floorplan Comput p 1 1 Reconfiguration (temporal) Configuration DFG 1st config. 2nd reconfig. N-th reconfig. Time A B C * * p 1 RTR temporal partitioning p j < p x X

7 How to partition in RTR? How many temporal partitions are possible? Where are the partition boundaries?

8 Outline I. Introduction II. Automatic RTR temporal partitioning III. Application example IV. Conclusion & Future Works

9 Automation using Operator Library Time constraint Data-block size Data-Flow Graph Operator Library (target) Estimating the num. of partitions n Partitioning in n n <= n-1 partitions n <= n+1 Implementation (place & route) No No T remaind 0? Yes T remaind <T step? Yes End

10 Problem formulation Constrai -nts DFG Estim. of n Op. Lib. Algorithm modeled as a DFG : G(V, E) V : Arithmetic and logic operators E : Data dependencies Partitioning in n partitions Implementation (place & route) Time constraint Reconfiguration speed T VC

11 Characterization of operators 1. Resources of operators Constraints DFG Estim. of n Partitioning in n partitions Op. Lib. operator a b n + s n + 1 a0 b0 r0 a1 b1 rn-1anbn Cell 1 Cell 2 Cell n s0 s1 sn Resources of operator Implementation (place & route) Technology Sum Arithmetic mode Sum (registered) Carry n + 1 logic cells

12 Characterization of operators 2. Execution time of operators Constraints DFG Op. Lib. 1 st cellule Input of cells N th cellule Estim. of n Partitioning in n partitions t c t c t c t c t r t r t r Output of cells t setup synchronization of the carry Implementation (place & route) Execution time is function of: Type of operation Size of data to process Characteristical parameters of target technology

13 Automation using Operator Library Time constraint Data-block size Data-Flow Graph Operator Library (target) Estimating the num. of partitions n Partitioning in n n <= n-1 partitions n <= n+1 Implementation (place & route) No No T remaind 0? Yes T remaind <T step? Yes End

14 Estimation of DFG performances DFG Characterization of operators (target) Total resources evaluation Sum of each operator resources Execution time Memory needs evaluation Slowest operator execution time Sum of edges properties along the graph

15 Evaluation of partitions number V, N, T <= constants //Capture constant parameters of //target and constraints G <= DFG // DFG capture of the application C <= 0 // Total area variable TO <= 0 //Maximal operator execution time for each node NDi in G TO <= max (TO, NDi.t i ) C <= C + NDi.Area end for n <= T / [(N. TO) + rt( )] C n <= C / n //return current max execution time //add area of current node // compute n and C n n : number of partitions, Cn: optimal area on each partition

16 Partition number discussion 1)n > 2: Possible RTR partitioning Integer part of n! number of partitions 2)n < 2: RTR partitioning is not possible. If n > 1: Only static implementation is possible. If n < 1: To ensure the constraint it is necessary to modify the algorithm to add a processing parallelism. Integer part of 1/n gives the degree of processing parallelism.

17 Automation using Operator Library Time constraint Data-block size Data-Flow Graph Operator Library (target) Estimating the num. of partitions n Partitioning in n <= n-1 n <= n+1 n partitions Implementation (place & route) No No T remaind 0? Yes T remaind <T step? Yes End

18 DFG First partitioning 1 st Partition * * Optimal area 2 nd Partition * - + < + < Optimal area * j th Partition * + + S th Partition - < * Optimal area

19 Memory bandwidth evaluation DFG (edge. data_size) (edge. data_size) (edge. data_size) (edge. data_size) (edge. data_size) (edge. data_size)

20 Final partitioning (refinement) After this first partitioning step, our tool allow to move manually or automatically the splitting boundary to reduce future needs of memory bandwidth. The automated refinement is limited in an adjustable neighborhood of the first splitting to keep future partition s area as homogeneous as possible. This refinement is done by finding the local minimum of the bus width sum computed before along the DFG.

21 Configuration and memories controller Read Address control (temporary data) Clock Initialization Initialization B Reconfiguration Choice and access of scratch memories Counter Counter Controller (State machine ) Register of Partition number A End process C End partition Execution partition Write / Read End partition D Execution partition Read / Write Write Address control (temporary data) Reconfiguration E Incrementation partition indice Dynamic configuration and memories management State machine of controller

22 Outline I. Introduction II. Automatic RTR temporal partitioning III. Application example IV. Conclusion & Future Works

23 Hardware plate-form «ARDOISE» reconfigurable module ATMEL AT40K40 FPGA 2304 cells, 1365 cells/ms Two scratch memories 256K x 32

24 Implementation example Application to image processing Edges motion estimator: Static implementation T (time constraint) 896 cells 40 ms N(data block size) 512 ²

25 Implementation example Partition Total number of Cells Operator execution time (ns) Total reconfigura tion time (µs) Partition processing time (ms)

26 Implementation example Total resources: Static: 896 cells Max RTR: 294 cells Total processing time 38 ms Silicon Efficiency improvement: 896/294 3 Time remaining 2 ms => To short for another partition

27 Outline I. Introduction II. Automatic RTR temporal partitioning III. Application example IV. Conclusion & Future Works

28 Conclusion & Future Works We propose a temporal partitioning methodology and tool which: Leads to Better usage of the logical area: - Increase the logic efficiency, - Reduction of physical parameters (area), - Keep application flexibility provided by FPGAs Help designer to: - Quickly specify his architecture needs - Quickly estimate if his application can be implemented on his platform. Can be used for SoC including FPGA array

29 Future Works More work is needed to simplify RTR design flow : Auto-Synthesis of memory and configuration controller Auto-generation of the VHDL code associated with each partition. Give an estimation of the power consumption and include it in the partitioning. Studying and including non-regular DFG

30 Thank you for your attention.

31 Automated RTR Temporal Partitioning for Reconfigurable Embedded Real-Time System Design C. Tanougast, Y. Berviller, P. Brunet and S. Weber L. I. E. N. Laboratoire d Instrumentation Electronique de Nancy Faculté des Sciences de Nancy I, BP Vandoeuvre-lès-Nancy, cedex, France philippe.brunet@lien.uhp-nancy.fr

Towards Optimal Custom Instruction Processors

Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT CHIPS 18 Overview 1. background: extensible processors