Optimization of Run-time Reconfigurable Embedded Systems


Michael Eisenring and Marco Platzner
Swiss Federal Institute of Technology (ETH) Zurich, Switzerland
{eisenring

Abstract. Run-time reconfigurable approaches for FPGAs are gaining interest as they enlarge the design space for system implementation by sequential execution of temporally exclusive system parts on one or several FPGA resources. In [7], we introduced a novel methodology and a design tool for communication synthesis in reconfigurable embedded systems. In [5], this work was extended by a hierarchical reconfiguration structure that implements reconfiguration control. In this paper, we describe techniques that are employed to optimize the reconfiguration structure and its communication requirements. The optimizations reduce the required FPGA area and I/O pins.

1 Introduction

Embedded systems are the backbone of a wide range of application domains: simple home appliances, PDAs (personal digital assistants), portable phones, and complex autopilots in airplanes. The rising design complexity of such systems is tackled by sophisticated computer-aided design (CAD) tools [8, 10, 12] that guide the designer from system specification to implementation. Examples of decisions to be taken during design are component selection, i.e., selecting computing resources among different types of processors and programmable hardware devices, and communication type selection, e.g., using standard protocols such as the CAN bus or dedicated links based on built-in facilities such as DMA channels.

Run-time reconfiguration (RTR) of FPGAs is gaining interest for embedded system design as it enlarges the design space by allowing time-exclusive system parts to be executed sequentially [9, 14] on a moderately sized FPGA device. The challenge is to design efficient RTR systems in terms of reconfiguration time (system performance) and area overhead (implementation cost).
A further important issue is the synthesis of communication channels between heterogeneous components [11, 12] on the one hand and between time-exclusive system parts on the other hand (interconfiguration communication [9, 7, 6]).

Most current approaches for reconfigurable system design focus on aspects such as optimal partitioning [3], scheduling [4, 13], or communication synthesis [12]. We learned through our previous work [7] that dynamic reconfiguration of FPGAs sometimes leads to remarkable overheads in terms of FPGA area and memory. Consequently, we now consider communication and reconfiguration issues at the same time to reflect their strong interdependence. In this paper, we propose two optimization techniques that are employed to minimize the overhead added by the reconfiguration structure and its communication requirements.

2 Hierarchical Reconfiguration

In our framework [5], an application is captured by a problem specification consisting of (i) a problem graph PG, (ii) an architecture graph AG, and (iii) a mapping M. The problem graph PG represents the application's behavior in the form of communicating tasks and buffers. Each problem graph node has an associated control port for receiving control commands (e.g., start) and replying with status messages (e.g., done). The architecture graph AG describes the target, which may consist of connected computing resources (general- and special-purpose processors, ASICs, FPGAs) and memories. The mapping M determines the implementation of the problem graph on the target architecture. A mapping includes spatial partitioning (assignment of tasks and buffers to computing resources and memories), temporal partitioning (assignment of tasks and buffers to FPGA configurations), scheduling of tasks and buffers inside configurations, and the scheduling of complete configurations.

In our work, we assume that a mapping has been derived by a set of front-end tools specific to the specification model used (task graphs, SDF graphs, etc.) or by user intervention. We focus on the back-end, which is formed by the subsequent steps of generating an appropriate reconfiguration structure and communication channels. In [7], we presented the CORES/HASIS tool set providing communication synthesis for reconfigurable embedded systems. These tools establish the required communication infrastructure and automatically generate device drivers and interface circuitry for each FPGA configuration and the host.
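The problem specification (PG, AG, M) described above can be pictured as three small data structures. The following Python sketch is only an illustration under stated assumptions: the class names, field names, and the example nodes are hypothetical and are not taken from CORES/HASIS, and scheduling information is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemGraph:
    """PG: tasks and buffers (nodes) plus their communication edges."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)   # (source, target) pairs


@dataclass
class ArchitectureGraph:
    """AG: computing resources and memories plus their physical links."""
    resources: set = field(default_factory=set)
    links: set = field(default_factory=set)   # frozenset({res_a, res_b})

    def adjacent(self, a, b):
        """True if the two resources are directly connected."""
        return frozenset({a, b}) in self.links


@dataclass
class Mapping:
    """M: spatial and temporal partitioning (scheduling is omitted here)."""
    resource_of: dict = field(default_factory=dict)       # node -> resource
    configuration_of: dict = field(default_factory=dict)  # node -> configuration

    def nodes_in(self, configuration):
        """All problem graph nodes assigned to one FPGA configuration."""
        return {n for n, c in self.configuration_of.items() if c == configuration}


# Hypothetical example: three tasks, a host, and one FPGA with two configurations.
pg = ProblemGraph(nodes={"a", "b", "c"}, edges={("a", "b"), ("b", "c")})
ag = ArchitectureGraph(resources={"host", "F1"}, links={frozenset({"host", "F1"})})
m = Mapping(
    resource_of={"a": "host", "b": "F1", "c": "F1"},
    configuration_of={"b": "c11", "c": "c12"},
)
```

Keeping the temporal partitioning as a separate dictionary mirrors the text: only nodes bound to an FPGA carry a configuration, while host tasks such as `a` do not.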
In [5], we extended CORES/HASIS by a hierarchical reconfiguration structure. In this paper, we discuss optimizations that can be applied to save FPGA area and I/O pins. These optimizations are based on the sharing of objects and reduce the overhead caused by the automatic generation of reconfiguration control.

The added reconfiguration structure is hierarchical and consists of two layers:
1. a top layer, where one or several configurator tasks supervise a set of dynamically reconfigured FPGAs by downloading and starting complete configurations, and
2. a bottom layer, where each configuration c_ij for each FPGA F_i includes a dispatcher task d_ij that starts and stops the nodes of the configuration using their control ports.

Example 1 (Hierarchical reconfiguration structure). Figure 1a) shows the hierarchical reconfiguration structure for an FPGA F_1 with two configurations c_11 and c_12. The top layer consists of the configurator task running on the host that supervises the two FPGA configurations by controlling the corresponding dispatcher tasks d_11 and d_12. The bottom layer for configuration c_11 is shown in Figure 1b), where the dispatcher task d_11 supervises the local nodes b, q_3, c, and d using their control ports. Figure 1c) outlines a possible dispatcher function that implements a local schedule.

Fig. 1. Hierarchical reconfiguration structure: a) top layer (configurator on the CPU, FPGA configurations); b) bottom layer (FPGA F_1 with nodes b, q_3, c, and d); c) dispatcher. Node features (number of node invocations / execution cycles): b 1/30, c 1/30, d 3/18. Dispatcher function:

    Dispatcher_d11() {
      state = s0
      counterd = 1
      start b, d, q_3, c
      loop
        if d.done and (counterd < 3)
          start d, counterd++
        case state
          s0: if b.done start c, state = s1
          s1: if c.done configuration done
    }

Our tool inserts an edge (communication channel) from the configurator on the host to each dispatcher task d_ij. For FPGAs not directly connected to the host, routing nodes have to be inserted on intermediate FPGAs, so that the channel runs through a routing node: r_klm → d_ij.

Our framework supports both off-line and on-line scheduling techniques. The actual schedule is implemented in a distributed manner by the configurator and all the dispatcher tasks. Each FPGA executes only one of its configurations at any given time. However, the configuration schedule, i.e., the total order of all the configurations, may not be known at compile time. In this case, we have to add a routing node r_klm for each communication channel m into each configuration c_kl on each intermediate FPGA F_k. This process is called routing node distribution and inserts n · c routing nodes for an intermediate FPGA in the problem graph [7], where n is the number of communication channels that must be routed and c is the overall number of configurations for the FPGA. If the configuration schedule is known, the number of inserted routing nodes can be reduced.

Example 2 (Routing node distribution).
Figure 2a) shows an architecture graph with a host and two FPGAs. Two different configuration schedules that lead to different routing node distributions are shown in Figure 2b) and Figure 2c), respectively. In both mappings, FPGA F_1 has two configurations, c_11 and c_12, and FPGA F_2 implements three configurations, c_21, c_22, and c_23. The configuration schedule in Figure 2b2) gives the order of configurations for each FPGA. As there is no information about the relative timing of configurations on F_1 and F_2, 6 routing nodes (3 channels in each of the 2 configurations) are inserted on FPGA F_1. These 6 routing nodes require 12 I/O ports on FPGA F_1 and 6 I/O ports on FPGA F_2. Figure 2c2) shows a configuration schedule with an additional dependency: one configuration is executed only after two other configurations have completed. Consequently, the configurations on FPGA F_1 do not have to implement all communication channels. Only 3 routing nodes are required, leading to 6 and 3 I/O ports for FPGAs F_1 and F_2, respectively.

Fig. 2. Routing node distribution: a) architecture graph; b1) problem specification for configuration schedule A; b2) configuration schedule A; c1) problem specification for configuration schedule B; c2) configuration schedule B.
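The counts in Example 2 can be reproduced with a few lines of Python. This is a sketch under assumptions, not the tool's implementation: the function names are hypothetical, the assignment of channels to configurations for schedule B is an illustrative guess, and the port model (two I/O ports on F_1 and one on F_2 per routing node) is inferred from the numbers quoted in the example.

```python
def distribute_unknown_schedule(channels, configurations):
    """Without schedule information, every configuration of the intermediate
    FPGA must route every channel: n * c routing nodes."""
    return [(cfg, ch) for cfg in configurations for ch in channels]


def distribute_known_schedule(live_channels):
    """With a known schedule, a configuration only routes the channels that
    can be live while it runs (live_channels: configuration -> channels)."""
    return [(cfg, ch) for cfg, chs in live_channels.items() for ch in chs]


def io_ports(routing_nodes):
    """Assumed port model: each routing node on F1 needs one port towards the
    host and one towards F2, and occupies one port on F2."""
    return {"F1": 2 * len(routing_nodes), "F2": len(routing_nodes)}


channels = ["d21", "d22", "d23"]          # one channel per dispatcher on F2
schedule_a = distribute_unknown_schedule(channels, ["c11", "c12"])
# Assumed dependency information for schedule B (illustrative only).
schedule_b = distribute_known_schedule({"c11": ["d21", "d22"], "c12": ["d23"]})

print(len(schedule_a), io_ports(schedule_a))
print(len(schedule_b), io_ports(schedule_b))
```

Under these assumptions the first call yields the 6 routing nodes of schedule A and the second the 3 routing nodes of schedule B, with the corresponding I/O port counts.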

Our back-end tools analyze the configuration schedule and insert routing nodes into configurations only where necessary, which reduces the required FPGA area and I/O pins. For the remaining routing nodes and I/O ports, optimization strategies are applied to reduce their number even further.

3 Optimization by Object Sharing

Objects that can be shared comprise problem graph nodes and I/O ports. Although we consider here only routing nodes that have been added by our back-end tools, object sharing could be applied to the original nodes of the problem graph as well.

Nodes can be divided into three groups that demand different interface circuitry (see Figure 3). Nodes of type a are connected only to nodes in the same configuration and are thus not amenable to the proposed optimizations. Nodes of type b are connected to nodes in the same configuration as well as to nodes on other computing resources. Nodes of type c have only connections to nodes bound to other computing resources. Node types b and c require I/O ports and are the main targets of object sharing. The routing nodes, which are of type c, benefit most from object sharing optimizations.

The optimizations are split into two groups:
1. Port sharing: Several communication channels are routed over the same FPGA I/O pins in order to save pins.
2. Node sharing: The implementation of a routing node is shared between several communication channels in one configuration. This reduces the required FPGA area and routing resources.

Fig. 3. Node types: a) only internal edges; b) internal and external edges; c) only external edges.

3.1 Port Sharing

A port is a set of I/O pins that enables data communication between two connected computing resources. Sharing ports among several channels eases the problem of limited I/O pin resources.
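Because port sharing concerns only external edges, it helps to determine the type of each node first. The classification of Figure 3 can be sketched as follows; the function name, the `location_of` dictionary, and the example graph are hypothetical, and any neighbor whose location differs from the node's own configuration is treated as external, which is a simplifying assumption.

```python
def node_type(node, edges, location_of):
    """Classify a node as type 'a', 'b', or 'c' following Figure 3.

    edges is an iterable of (source, target) pairs; location_of maps each node
    to the configuration or computing resource it is bound to.
    """
    internal = external = False
    for src, dst in edges:
        if node not in (src, dst):
            continue
        other = dst if src == node else src
        if location_of[other] == location_of[node]:
            internal = True      # neighbor in the same configuration
        else:
            external = True      # neighbor on another computing resource
    if internal and external:
        return "b"
    return "c" if external else "a"


# Hypothetical configuration c11 with three tasks and one routing node r.
edges = [("x", "y"), ("y", "z"), ("z", "host_task"),
         ("r", "host_task"), ("r", "remote")]
location_of = {"x": "c11", "y": "c11", "z": "c11", "r": "c11",
               "host_task": "host", "remote": "F2"}
types = {n: node_type(n, edges, location_of) for n in ("x", "y", "z", "r")}
```

In this example only `z` (type b) and `r` (type c) own external edges and therefore need I/O ports.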
Port sharing applies to external edges of the node types b and c and has been extensively used in the Virtual Wires project [2, 1]. We extend port sharing to time-exclusive configurations of run-time reconfigurable systems. Therefore, we differentiate between two types of port sharing:

Intraconfiguration port sharing allows communication channels in the same configuration to share I/O ports by adding multiplexing and demultiplexing interface circuitry [1]. This technique can be applied to both compile-time reconfigurable (CTR) and RTR systems.

Interconfiguration port sharing allows communication channels in different configurations of one FPGA to share I/O ports. These channels would otherwise be mapped to different ports, as configurations of other FPGAs could require that both channels exist at the same time. Interconfiguration port sharing applies only to RTR systems.

Example 3 (Intraconfiguration port sharing). Figure 4a) shows part of the problem graph of Example 2b1). Applying intraconfiguration port sharing to the edges from the configurator to r_111, r_112, and r_113 implies the multiplexing/demultiplexing interface circuits i_3 and i_4 (see Figure 4b).

Fig. 4. Intraconfiguration port sharing: a) problem specification; b) implementation.

Example 4 (Interconfiguration port sharing). Figure 5a) shows again a part of the problem graph of Example 2b1). The configurator controls the dispatchers via the edges to d_11 and d_12. As the dispatchers reside in time-exclusive configurations, the corresponding ports can be shared. In Figure 5b) this is implemented by the interface circuits i_1 and i_2.

3.2 Node Sharing

Node sharing is a technique where several identical problem graph nodes assigned to the same configuration share a single physical implementation on the target architecture. Generally, node sharing can be applied to all three node types a, b, and c. Applying node sharing to original problem graph nodes trades off required FPGA area against execution time and could require additional buffers.
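The definition of node sharing above amounts to grouping identical nodes per configuration and instantiating each group once. A minimal Python sketch, assuming each node is described by a hypothetical (name, configuration, kind) triple and that two nodes are "identical" when their kind matches:

```python
from collections import defaultdict


def group_shareable_nodes(nodes):
    """Group nodes that may share one physical implementation.

    nodes is an iterable of (name, configuration, kind) triples. Nodes with
    the same kind in the same configuration end up in the same group.
    """
    groups = defaultdict(list)
    for name, configuration, kind in nodes:
        groups[(configuration, kind)].append(name)
    return dict(groups)


# Routing node names follow the r_klm scheme of the text; the data is illustrative.
nodes = [
    ("r111", "c11", "routing"),
    ("r112", "c11", "routing"),
    ("r113", "c11", "routing"),
    ("r121", "c12", "routing"),
    ("b", "c11", "task_b"),
]
groups = group_shareable_nodes(nodes)
implementations = len(groups)   # one physical implementation per group
```

Here five problem graph nodes map to three physical implementations. Whether sharing is actually safe also depends on the channels never being active at the same time, which this sketch does not check.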

Fig. 5. Interconfiguration port sharing: a) problem specification; b) implementation.

Node sharing for original problem graph nodes actually means modifying the mapping determined by the front-end tools. Hence, we consider node sharing only for routing nodes (nodes of type c), which is extremely valuable if combined with port sharing.

Example 5 (Node and port sharing for routing nodes). Figure 6a) shows a problem specification including FPGA F_1 with one configuration and FPGA F_2 with three configurations. The communication channels r_111 → d_21, r_112 → d_22, and r_113 → d_23 are routed over FPGA F_1. The three routing nodes can share one physical implementation, as the three configurations of FPGA F_2 never exist at the same time. This sharing of nodes can be combined with intraconfiguration port sharing of the three channels between the host and FPGA F_1 and interconfiguration port sharing between FPGA F_1 and FPGA F_2 (see Figure 6b).

Fig. 6. Node and port sharing: a) problem specification; b) implementation (node sharing, intraconfiguration port sharing, interconfiguration port sharing).

4 DYNAMITE algorithm

The proposed optimization techniques have been implemented in our synthesis tool CORES/HASIS [6]. The corresponding algorithm DYNAMITE (DYNAMIc system implementation) consists of two parts:

DYNAMITE_1 is executed for each host and extends the original problem graph with the hierarchical reconfiguration structure. The procedure has two main steps: (i) extension of the problem graph by introducing configurator and dispatcher tasks and inserting routing tasks by routing node distribution (lines 2-10) and (ii) application of port and node sharing optimizations (lines 11-17).

DYNAMITE_2 performs the actual communication synthesis, i.e., introduces interface circuits, by executing the HASIS tool set for each configuration (lines 18-21).

The pseudo-code shown below outlines the basic steps of the DYNAMITE algorithm and is based on the following assumptions:

A host h supervises a set of FPGAs F_h. Single FPGAs are denoted by F_i.

The problem graph PG consists of nodes and edges. The nodes are further split into subsets, depending on where they are mapped to. For example, V_h denotes the problem graph nodes mapped to the host, and V_cij denotes the nodes mapped to configuration c_ij. C_Fi is the set of configurations of FPGA F_i.

All edges between the dispatchers of an FPGA and the configurator on the host are routed over the same set of FPGAs.

Lines 2-8 insert and connect the configurator and dispatcher nodes. In lines 9-10, routing node distribution is performed for FPGAs that are not directly connected to the host. This step inserts routing nodes into several configurations, depending on information that is extracted from the configuration schedule. Lines 11-17 perform port and node sharing optimizations. In line 12, all edges connecting to dispatcher nodes share their ports. This includes intra- as well as interconfiguration port sharing. In line 15, all routing nodes mapped to one configuration share their node implementations.

 1. DYNAMITE_1(PG, AG, M, h)
 2.   V_h ← V_h ∪ {configurator}                -- insert configurator
 3.   for all F_i ∈ F_h
 4.     for all c_ij ∈ C_Fi
 5.       V_cij ← V_cij ∪ {d_ij}                -- insert dispatcher
 6.       E ← E ∪ {(configurator, d_ij)}        -- insert edge
 7.       for all nodes n ∈ V_cij \ {d_ij}
 8.         E ← E ∪ {(d_ij, n)}                 -- insert edge
 9.     if non_adjacent(F_i, h)
10.       routing_node_distribution(PG, AG, M, F_i, h)
11.     E_temp ← {(x, d_ij) | j = 1 ... |C_Fi|}
12.     share_ports(E_temp)
13.     while (∃(x, y) ∈ E_temp), x ≠ configurator
14.       V_routing ← {v | v = source(e), e ∈ E_temp}
15.       share_nodes(V_routing)
16.       E_temp ← {(x, y) | y ∈ V_routing}
17.       share_ports(E_temp)
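Lines 2-8 of DYNAMITE_1, which insert the configurator, the dispatchers, and their control edges, can be sketched in Python as shown below. The function name, the dictionary-based graph representation, and the node naming (`"configurator"`, `"d_" + configuration`) are assumptions made for illustration; routing node distribution and the sharing steps (lines 9-17) are deliberately left out.

```python
def insert_reconfiguration_structure(host_nodes, nodes_of, configurations_of,
                                     edges, configurator="configurator"):
    """Sketch of DYNAMITE_1, lines 2-8.

    host_nodes:        set of nodes mapped to the host (V_h)
    nodes_of:          configuration -> set of nodes mapped to it (V_cij)
    configurations_of: FPGA -> list of its configurations (C_Fi)
    edges:             set of (source, target) pairs (E)
    """
    host_nodes.add(configurator)                        # line 2
    for fpga, configurations in configurations_of.items():   # line 3
        for cfg in configurations:                      # line 4
            dispatcher = "d_" + cfg
            nodes_of[cfg].add(dispatcher)               # line 5
            edges.add((configurator, dispatcher))       # line 6
            for n in nodes_of[cfg] - {dispatcher}:      # line 7
                edges.add((dispatcher, n))              # line 8
    return host_nodes, nodes_of, edges


# Hypothetical input: one FPGA with two configurations.
host_nodes, edges = set(), set()
nodes_of = {"c11": {"b", "c"}, "c12": {"e"}}
configurations_of = {"F1": ["c11", "c12"]}
insert_reconfiguration_structure(host_nodes, nodes_of, configurations_of, edges)
```

After the call, every configuration contains its own dispatcher, the configurator reaches each dispatcher over one edge, and each dispatcher reaches the control port of every local node over one edge.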

18. DYNAMITE_2(PG, AG, M)
19.   for all F_i ∈ F_h
20.     for all c_ij ∈ C_Fi
21.       HASIS.start(V_cij, AG, M)

Figure 7 shows the optimized implementation for Figure 2b1) after DYNAMITE has been applied. The hierarchical reconfiguration structure consists of a configurator node on the host, one dispatcher node per FPGA configuration, and two routing nodes on FPGA F_1. Overall, three I/O ports are required on FPGA F_1 and one I/O port on FPGA F_2 to implement the reconfiguration structure.

Fig. 7. Object sharing optimizations.

5 Conclusions

The paper discussed optimization strategies for the hierarchical reconfiguration structure introduced in [5]. The optimizations are based on object sharing and include port sharing and node sharing. Port sharing saves I/O pins by multiplexing several communication channels over one physical channel. Node sharing saves FPGA area by reusing one physical node implementation for several routing nodes.

References

[1] J. Babb, R. Tessier, and A. Agarwal. Virtual Wires: Overcoming Pin Limitations in FPGA-based Logic Emulators. In IEEE Workshop on FPGA-based Custom Computing Machines, Napa, CA, April 1993.

[2] J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal. Logic emulation with virtual wires. IEEE Transactions on Computer-Aided Design of Integrated Circuit and Systems, 16(6), June.
[3] K. Chatha and R. Vemuri. Hardware-Software Codesign for Dynamically Reconfigurable Architectures. In 9th International Workshop on Field-Programmable Logic and Applications, FPL '99, Lecture Notes in Computer Science, 1673, Glasgow, UK, August/September.
[4] R. Dick and N. Jha. CORDS: Hardware-Software Co-Synthesis of Reconfigurable Real-Time Distributed Embedded Systems. In IEEE/ACM International Conference on Computer-Aided Design, pages 62-68, 8-12 November.
[5] M. Eisenring and M. Platzner. An Implementation Framework for Run-time Reconfigurable Systems. In The Second International Workshop on Engineering of Reconfigurable Hardware/Software Objects, ENREGLE 2000, Monte Carlo Resort, Las Vegas, Nevada, USA, June.
[6] M. Eisenring and M. Platzner. Synthesis of Interfaces and Communication in Reconfigurable Embedded Systems. To appear in IEE Proceedings - Computers and Digital Techniques.
[7] M. Eisenring, M. Platzner, and L. Thiele. Communication Synthesis for Reconfigurable Embedded Systems. In 9th International Workshop on Field-Programmable Logic and Applications, Glasgow, UK, August/September.
[8] R. Ernst, J. Henkel, T. Benner, W. Ye, U. Holtmann, and D. Herrmann. The COSYMA environment for hardware/software cosynthesis of small embedded systems. Microprocessors and Microsystems, 20(3), May.
[9] B. Hutchings and M. Wirthlin. Implementation Approaches for Reconfigurable Logic Applications. In International Workshop on Field-Programmable Logic and Applications.
[10] A. Kirschbaum and M. Glesner. Rapid Prototyping of Communication Architectures. In 8th IEEE Int. Workshop on Rapid System Prototyping, June, Chapel Hill, North Carolina.
[11] M. O'Nils and A. Jantsch. Communication in Hardware/Software Embedded Systems - A Taxonomy and Problem Formulation. In 15th NORCHIP Seminar, Copenhagen, Denmark, pages 67-74, November.
[12] R. Ortega and G. Borriello. Communication Synthesis for Distributed Embedded Systems. In IEEE/ACM Int'l Conf. on Computer-Aided Design, San Jose, CA, November.
[13] K. Purna and D. Bhatia. Temporal Partitioning and Scheduling Data Flow Graphs for Reconfigurable Computers. IEEE Transactions on Computers, 48(6), June.
[14] E. Sanchez, M. Sipper, J. Haenni, J. Beuchat, A. Stauffer, and A. Perez-Uribe. Static and Dynamic Configurable Systems. IEEE Transactions on Computers, 48(6), June 1999.
