ARTICLE IN PRESS. Analyzing communication overheads during hardware/software partitioning

Size: px

Start display at page:

Download "ARTICLE IN PRESS. Analyzing communication overheads during hardware/software partitioning"

Jonah Parsons
5 years ago
Views:

1 Microelectronics Journal xx (2003) xxx xxx Analyzing communication overheads during hardware/software partitioning J. Javier Resano*, M. Elena Pérez, Daniel Mozos, Hortensia Mecha, Julio Septién Departamento de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Ciudad Universitaria, Madrid 28040, Spain Abstract Current partitioning codesign tools often simplify the communication channel features by working with generic abstract channels, which in a following step, are mapped into the actual ones. However, this mapping process can critically affect the performance of a solution. Hence, we have developed a novel methodology that studies the communications in depth, taking into account the actual channel features from the first step. With this methodology we have optimised the design-space exploration of a partitioning tool, achieving up to a 2.5 performance speed-up, while increasing the time needed to perform the partitioning by less than 20%. q 2003 Published by Elsevier Ltd. Keywords: Hardware/software partitioning; Codesign system; Estimations 1. Introduction and previous work Hybrid Architectures, composed by a software (SW) processor and a configurable hardware (HW) coprocessor, are becoming more and more common. One of the greatest challenges when dealing with this kind of hybrid architectures is how to optimally distribute the computation between the HW and the SW processing elements. This problem is referred in literature as Hardware/Software (HW/SW) partitioning. Hybrid systems present a trade-off between the high performance of specific HW circuits and the low cost of SW processors. However, in such systems, communications are often a system performance bottleneck, even more when both HW and SW performance are improving much faster than communication channels do. Most of the existing work on HW/SW partitioning considers that abstract channels carry out communications between tasks in different platforms [1]. Hence, during the partitioning process, the actual features of the communication channels are neglected. Once the partitioning has been done, the abstract channels have to be mapped into one or more physical channels (e.g. a system bus). This * Corresponding author. addresses: javier1@dacya.ucm.es (J.J. Resano), eperez@dacya. ucm.es (M.E. Pérez), mozos@dacya.ucm.es (D. Mozos), horten@dacya. ucm.es (H. Mecha), jseptien1@dacya.ucm.es (J. Septién). process is called communication synthesis. In Ref. [2], the best bus bandwidth for mapping the abstract channels is selected. In Refs. [3,4], more complex interconnection topologies are allowed. In Ref. [5], initially a bus is selected, and then communications are scheduled, trying to minimize both the HW cost and the execution time. [6] addresses the problem of generating the communication topology for systems that should meet hard real-time constraints. All these approaches work with abstract communication channels, although they consider some physical features during the communication synthesis process. Once obtained, the abstract interconnection topology has to be mapped into an actual one. In Ref. [7], a useful tool for this mapping is presented, that generates automatically the drivers and a DMA controller for a given communication application. Ignoring the properties of communication channels often leads to an inefficient design space exploration, since it is not possible to estimate correctly the communication time, or the impact of the access conflicts to these channels on the system performance. In Ref. [8], simple models for some communication platforms are presented (USB, PCI buses, packing, burst transmission mode, etc.). A model is also given to estimate the area needed to implement the communication drivers. Moreover, these models are integrated within the HW/SW partitioning, so communication timing and area trade-offs are studied during the design space exploration. However, /03/$ - see front matter q 2003 Published by Elsevier Ltd. doi: /s (03)00168-x

2 J.J. Resano et al. / Microelectronics Journal xx (2003) xxx xxx communications are not scheduled during partitioning, so access collisions on the system bus are not taking into account.

2 2 J.J. Resano et al. / Microelectronics Journal xx (2003) xxx xxx communications are not scheduled during partitioning, so access collisions on the system bus are not taking into account. The approach proposed in this paper aims to: (a) (b) (c) Consider the physical features of the channel in order to accomplish accurate time estimations. Schedule the communications in order to avoid conflicts on the communication channel and trying to minimize the global execution time. Integrate all these features into a HW/SW partitioning tool without increasing significantly the time needed for the design space search. This work has been developed to deal with data dominated applications, trying to maximize the system performance for a given platform. We will illustrate our approach with a real multimedia application (used for pattern recognition over a pixel matrix) evincing how communication scheduling can become a critical factor in the system performance. 2. Initial specification The initial specification is modelled by a Direct Acyclic Graph (DAG), where each node represents a coarse grain computational task, and the edges correspond to dependencies among the nodes. Three different dependencies are considered: communication dependencies, internal dependencies, and temporal dependencies. A communication dependency edge (CDE) connects two nodes of different partitions. It represents a data transfer between nodes that will be carried out upon a communication channel. An internal dependency edge (IDE) connects two nodes in the same partition. It also represents a data transfer between the nodes, but in this case we assume that they can share the data internally using their private memories. Therefore, the communication channel is not used. Sometimes this assumption is not valid. For instance, when the size of the shared data exceeds the size of the internal memory, the data must be stored externally using the communication channel. Hence, the dependency must be tagged as CDE in spite of it communicates two nodes in the same partition. Therefore, the main different between a CDE and a IDE is that the first one represents an access to the communication channel, whereas the second represents a internal data transfer. There is a third type of edge called temporal dependency edge (TDE). A TDE represents a dependency between two nodes in the same partition that has been imposed by the scheduler. Each node of the graph contains estimations for its execution time and HW area (if needed). Each CDE is tagged with the amount of data to be transferred, and an estimation of the communication time needed. A specification graph example is shown in Fig. 1 (HW area is not included for readability). 3. Cost function choice Fig. 1. Specification graph example. The cost function of a codesign system includes generally the area and the execution time of the solution. One of the more difficult issues when designing a partitioning system is to mix rather different magnitudes such as time and area into a cost function that should be able to lead the design space exploration in a near optimal fashion. This work has been developed for a HW/SW system where the HW partition is assigned to a FPGA. Since a given FPGA has a constant area, the simplest way to obtain a straightforward cost function is simply removing the area. Thus, the cost function can be identified with the execution time. The area is now considered as a constraint that every solution has to meet. The constraint must take into account that only the 90/95% of the FPGA area must be used, since otherwise routings problems may occur. Once the area has been removed and the cost function has been directly identified with the execution time, the design space exploration will try to find the fastest solution that meets the area constraint. 4. Execution time estimation Since the execution time is the main optimization parameter of our partitioning approach, it is critical to

3 J.J. Resano et al. / Microelectronics Journal xx (2003) xxx xxx 3 estimate it as accurately as possible. To this end, there is a time estimator module that schedules all the tasks and communications taking into account the dependencies of the DAG and the limitations of the platform. Our target platform is composed by a SW processor, a configurable HW coprocessor, shared memory resources and a communication channel (e.g. a system bus) that allows communications between the HW and the SW processing elements as well as data transfer among the processing elements and the shared memory. In such a system there are two main limitations, namely, the sequential execution of the nodes assigned to SW and the sequential execution of the communications and data transfers that need to use the communication channel. Once a HW/SW partitioning is selected, the time estimator module follows the next steps to estimate its execution time: (A) Assign a weight to each node. (B) Choose the execution order for the nodes assigned to SW. (C) Recalculate the weights computed in step A taking into account the new dependencies. (D) Schedule those nodes that are not waiting for data coming from a CDE. (E) While there is a CDE waiting for being scheduled do: (E1) Choose one CDE and schedule it. (E2) Schedule those nodes that are not waiting for data coming from a CDE Step A. Assign a weight to each node In the first step, a weight is assigned to each node of the DAG. The weights of the nodes are used to steer the scheduling process trying to minimize the global execution time. This is accomplished giving higher priorities to the weightier nodes. For instance, if there is a node that must carry out several communications, the execution order of these communications must be decided. Most of the algorithms in literature take these decisions according to local considerations, e.g. in Ref. [5], the longest communication is scheduled first. Since local considerations may lead to inefficient decisions in our algorithm, a global consideration has been selected for weighting the nodes. We have chosen as a global weight the maximum distance (in time units) from the moment that a node starts its execution to the end of the execution of the whole graph. For instance, in the graph of Fig. 2, the weight of node 4 is its own execution time; the weight of nodes 2 and 3, are their own execution time plus the execution time of node 4; and the weight of node 1 is the path with maximum execution time plus its own execution time. In this case, there are two possible paths. Since the first path has 53 (33 þ 20) time units, and the second has 61 (53 þ 8) time units, the second one is selected. Hence, the weight of node 1 is 82 (61 þ 21). Fig. 2. Global weight example. This distance is computed carrying out an As Late As Possible (ALAP) scheduling. This scheduling takes into account all dependencies, including HW/SW communication times. Table 1 shows the results of scheduling the graph in Fig. 2 according to both a local consideration and our global weight. The local consideration applied is the one proposed in Ref. [5], i.e. the longest communication is scheduled first. In this example node 1 must carry out two communications on the same communication channel. Therefore, both communications must be scheduled. If the decision were taken based on the local weight, CDE2 would be scheduled first. Hence, CDE1 should wait until CDE2 ends. Since CDE1 is in the critical path of this graph, if it is delayed the system decreases its performance. Nevertheless, with our global weight, those nodes in the critical path are weightier, hence Com1 is scheduled first and no unnecessary delays are created Steps B and C. Choose the execution order for the SW assigned nodes, and recalculate the weights The choice of the SW execution order is the first decision taken to estimate the execution time. The specification graph allows parallel execution between their nodes, but those nodes assigned to SW must be executed sequentially. Table 1 Scheduling of the graph in Fig. 2 Node1 Node 2 Node 3 Node 4 L G L G L G L G G. Weight t start t end L: scheduling based on local considerations; G: scheduling based on our global weight; G. Weight: global weight assigned to the node; t start : start of the node execution; t end : end of the node execution.

4 4 J.J. Resano et al. / Microelectronics Journal xx (2003) xxx xxx Table 2 Comparison between weights in steps (A) and (C) for the graph in Fig. 1 Node Weight (A) Weight (C) The order is selected sorting the nodes regarding their weights, thus, the more weighted nodes are executed first. Afterwards, the new dependencies are added as TDEs to the specification graph as it is shown in Fig. 1. It is easy to prove that this SW execution order does not allow the new dependencies to create cycles in the graph. Since these new dependencies can significantly alter the initial graph, new weights are assigned to each node. These weights are calculated as in step A, but considering the dependencies derived from the sequential execution of the nodes assigned to SW. Table 2 shows how the TDE dependencies change the weights of the nodes in the graph of Fig Steps D and E. Scheduling A heuristic that attempts to minimize the execution time of a given solution has been developed for the scheduling process. The heuristic decides when each node and each communication is going to be executed, assigning to them a t start and a t end times. The aim of this process is to detect access conflicts to the communication channels, thus, the execution time estimation will also consider the delays caused by them. This heuristic starts after the SW execution has been sorted, and the weights have been recalculated. It has been developed for a platform with just one communication channel. This channel must be previously modelled in order to obtain accurate estimations of the execution time for each communication. The scheduling starts assigning t start ¼ 0; and t end ¼ t ex to the first node, where t ex is its execution time in the partition where it has been assigned to. Then, the algorithm continues scheduling the successors of the first node. As long as there are nodes that can start their execution without waiting for data coming from a CDE, a greedy policy is followed to schedule them. When a scheduled node is connected to some of its successors with a CDE, a communication is requested. This request is stored in a list. When all the nodes that do not have to wait for a CDE have been scheduled, one of the requested communications is selected and scheduled. There are two criteria for selecting the communication to schedule (E1): If at a given time t the communication channel is free and there is just one previous request, the communication channel is assigned to this request and the bus is tagged as busy until this communication ends. Therefore, no other communication can use it until this one finishes. Otherwise, if there are more than one request, the more weighted one is selected. The weight of a communication is computed as the weight of the destination node plus the time needed to execute the communication. Once the selected communication has been scheduled the graph is examined (E2) and all the nodes that can start their execution without waiting for data from a CDE are also scheduled. The loop continues until all the communications are scheduled. Fig. 3 shows in detail how the graph of Fig. 1 is scheduled. The complexity of the scheduling is OðC p NÞ; where C is the number of HW/SW communications, and N is the number of nodes. The algorithm can be easily extended for systems with more than one communication channel if the number of channels is prefixed, and all the nodes can access to all the communication channels. To this end it is enough to update the first criteria to: If at a given time t any of the communication channels is free and there is just one previous request, the communication channel is assigned to this request and this channel is tagged as busy until this communication ends., and the second criteria to: Otherwise, if there are more than one request, the more weighted one is selected and assigned to the communication channel that becomes free first. Fig. 3. Scheduling execution example corresponding to the graph in Fig. 1. ðt1; T2Þ represents starting and ending execution times.

5 J.J. Resano et al. / Microelectronics Journal xx (2003) xxx xxx 5 5. Area estimation We apply the following equation to estimate the area needed to implement the nodes assigned to HW in the FPGA: Area ¼ X N21 i¼0 A i þ A driver þ A control þ A storage A i is the area of the node i: A driver is the area needed to implement the communication driver. A i and A driver are taken from a core library. When a new core is added to the library its area is estimated using a synthesis tool. A control is the area needed for the control logic that schedules the communications. In this approach the scheduling control is assumed by a state machine, so the area requested is estimated as a function of the number of communications. A storage is the area needed for storing the data to transfer until a communication is executed. This storage space is computed during the communication scheduling. During this process a record keeps the maximum storage space required. The SW area is not taken into account since it is supposed that the controller has enough instructions and data memory for all the allocated nodes. If it is possible to exceed the data or the instruction memory of the SW processor, the SW area would be included as a second constraint that every solution should meet. 6. Design space exploration The scheduling algorithm has been integrated into a partitioning tool based on genetic algorithms (GA) [10]. Our tool creates a random initial population of valid solutions. A solution is said to be valid if meets the area constraint. Invalid solutions are rejected in order to save computational time. The solutions consist of a HW/SW partition and a communication channel. During the design space exploration solutions evolve by reproducing themselves, generating new solution s offspring. The crossover and the mutation operators carry out reproduction. The crossover operator generates new solutions combining the existing ones, whereas the mutation operator generates new solutions by introducing random changes in the population. After generating the new offspring, the population is kept constant deleting the solution surplus. 80% of the survivors are selected choosing the best solutions according to the cost function, and the 20% remaining is randomly selected in order to prevent a premature convergence. If invalid solutions are created, these solutions are immediately discarded. Rejecting invalid solutions improves the performance of the partitioning tool, since a fast evaluation of the area of a solution just involves an add operation, so all the time needed for the scheduling is saved. Furthermore, since a genetic algorithm is able to reach all the design space ð1þ points through crossing and mutating the initial solutions, thereisnodangerthatsuchdecisionwillmakethealgorithmto fall into local minima, as it may happen if the exploration is performed with a neighbourhood algorithm. We have compared the execution of a partitioning tool based on a genetic algorithm that discard the invalid solutions with another partitioning tool based a neighborhood algorithm (simulated annealing), showing up that in this case the genetic algorithm obtains better solutions in less computational time. 7. Experimental results The following experiment illustrates the impact of communication access conflicts over the global execution time. The experiment has been performed using an application that implements the Hough Transform [10] over a matrix of pixels in order to find simple geometric figures. The Hough transform is commonly used in image processing for pattern recognition. It finds many applications in astronomical data analysis [11] and robotics. Communications in this design are a critical factor since nodes have to interchange large data structures. The initial specification is composed of fifteen nodes. The target platform for this design is a XILINX 4010 board that contains an FPGA with 400 Configurable Logic Blocks (CLB), i.e logic gates, an 8051microcontroller, a RAM memory accessible to both of them and a system bus. In this platform the FPGA can also communicate with the microcontroller using its interruption lines. We designed a communication protocol for this platform, and implemented the communication drivers. A VHDL model of the bus, drivers and protocol was developed in order to accomplish cycle-accurate execution time estimations for each given communication. In addition, a configurable communication controller was implemented in the FPGA. This controller imposes to the system the final communication scheduling. The controller communicates with the microcontroller via the interruption lines. An interruption handler was also developed to update the scheduling events in the microcontroller. The communication protocol and the drivers that were implemented are described in detail in Ref. [12]. We estimate the HW area and execution time of each node using XILINX Foundation implementation and verification tools [13]. For the SW execution-time estimations we used an automatic assembler 8051/C translator and an 8051 simulator. After completing the initial specification graph with all the needed estimations, we use it as the input for the partitioning tool. In order to obtain different measures we run the tool with several area constraints (The full design implemented in HW occupies 384 CLBs). In order to illustrate our approach, the design space has been searched running the partitioning tool in two different

6 J.J. Resano et al. / Microelectronics Journal xx (2003) xxx xxx Fig. 4. Result comparison. ways: first ignoring access conflicts on the communication channel, and then scheduling the communications.

6 6 J.J. Resano et al. / Microelectronics Journal xx (2003) xxx xxx Fig. 4. Result comparison. ways: first ignoring access conflicts on the communication channel, and then scheduling the communications. Fig. 4 shows the best solutions found in both cases. The first column corresponds to the best execution time found by the partitioning tool when the communications were not scheduled, hence, access conflicts to the shared bus are neglected. The second column shows the execution time of the same solution once the delays caused by the access conflicts have been included. The third column shows the best solution found when the partitioning tool schedules the communications. The results that emerge from this experiment confirm how communication access conflicts become a critical performance feature. It is remarkable how they increase execution time even up to three times, so ignoring them can make difficult to distinguish the goodness of a solution. It is not surprising that better solutions are obtained once access conflicts are taken into account during the design space exploration. In this experiment up to a 2.5 speed-up has been achieved (for Area restriction ¼ 300 there is no improvement since the best solution found does not present any access conflict). One of the main objectives of this work was to accomplish accurate time estimations during the design space exploration without increasing significantly the exploration time needed to carry out the partitioning process. Hence, we have measured the execution times for the previous experiment. Fig. 5 compares the time needed for the partitioning with and without scheduling the communications. Fig. 5 shows that the increment of computational time due to the communication scheduler goes from 4 to 17%. Hence, these results confirm that it is possible to accomplish more accurate time estimations without incrementing significantly the computational time. As a final measurement we ran the partitioning tool in a third mode that schedules the communications with a branch&bound (b&b) algorithm, and we compared the results with those obtained using our scheduling heuristic. Fig. 5. Design space exploration time. The b&b algorithm guaranties that it always finds the best solution. To this end, it starts evaluating each possible scheduling, and keeps a record of the best solution found. When the execution time of a given scheduling is bigger than the best execution time found, the b&b algorithm discards this scheduling saving computational time. Thus, the algorithm guaranties that it will find the best possible solution without carrying out an exhaustive design space search. The results obtained using the b&b are slightly better (on average 10% better). However, the time consumed during the partitioning process increases 800 times. Moreover, while our heuristic has a quadratic complexity, the b&b algorithm has an exponential complexity. Hence, it may be not feasible to apply it to systems with greater number of nodes. This experiment confirms that our heuristic achieves almost optimal scheduling with a small overhead. 8. Conclusions and future work The physical features of communication channels often are one of the key factors of a codesign system performance. Hence, it is necessary to take them into account to perform accurate execution time estimations. Moreover, even when accurate estimations for each communication have been done, our experiments have shown that access conflicts can drastically increase the execution time of a solution. We have developed a heuristic that allows the partitioning tool to perform more accurate time estimations, including these access conflicts, without increasing significantly the computational time needed for the design space search. With this heuristic up to a 2.5 speedup has been achieved increasing only a 10% on average the computational time needed to carry out the partitioning process. One of the main drawbacks of our approach is that it uses as input a DAG. DAGs are commonly used in codesign platforms, but they are not suitable for modelling dynamic

7 J.J. Resano et al. / Microelectronics Journal xx (2003) xxx xxx 7 non-deterministic behaviour. Hence, although DAGs can successfully model most of the data-domain applications, they are not suitable to model applications developed for the more recent multimedia standards (such as mpeg-4) since they cannot support highly dynamic behaviour. We are currently trying to overcome this problem adopting one of the more promising scheduling approaches called Matador Task Concurrency Management Methodology (TCM) [14]. TCM uses two hierarchical levels to model an application. In the highest level the dynamic and non-deterministic behaviour among different tasks is described with a model called MTG p. This model has been especially developed for tackling highly dynamic applications. However, in the lowest level just very restricted non-deterministic behaviour is allowed. DAGs can be used to model this level. Hence, our approach can be reused in this new context. Acknowledgements This work has been partially supported by TIC References [1] J.A. Rowson, A. Sangiovanni-Vicentelli, Interface-based design, DAC 97, 183, Jun 1997, pp [2] S. Narayan, D. Gajski, Synthesis of system level bus interfaces, Proceedings of the European Design and Test Conference, 94, Feb 1994, pp [3] M. Gasteier, M. Munich, M. Glesner, Generation of interconnect topologies for communication synthesis, DATE 98, Feb 1998, pp [4] M. Gastier, M. Glesner, Bus-based communication synthesis on system level, ACM Trans. Des. Autom. Electron. Syst. (1999) [5] R. Ortega, G. Borriello, Communication synthesis for embedded systems with global considerations, Proceedings of the CODES/ CACHE, Mar 1997, pp [6] S. Vercauteren, J. Van Der Steen, D. Verkest, Combining software synthesis and hardware/software interface generation to meet hard real time constrains, DATE 99, Feb 1999, pp [7] M. O Nils, A. Jantsch, Device driver and DMA controller synthesis from HW/SW communication protocol specifications, Des. Autom. Embedded Syst. 6 (2001) [8] P.V. Knudsen, J. Madsen, Integrating communication protocol selection with partitioning in hardware/software, Codes. Trans. CAD (Aug 1999) [9] J. Holland, Adaptation in natural and artificial systems, second ed., MIT Press, [10] F. Thomson, Introduction to Parallel Algorithms and Architectures, Morgan Kaufmann, 1992, pp [11] P. Ballester, Applications of the Hough Transform, Astronomical Data Analysis Software and Systems III, ASP Conference Series, vol. 61, [12] M.E. Perez Ramo, HW/SW partitioning based on Taboo Search algorithm, MSc Thesis, Universidad Complutense de Madrid, [13] [14] P. Yang, et al., Energy-aware runtime scheduling for embeddedmultiprocessors SOCs, IEEE J. Des. Test Comput. (2001)

A New Approach to Execution Time Estimations in a Hardware/Software Codesign Environment

A New Approach to Execution Time Estimations in a Hardware/Software Codesign Environment JAVIER RESANO, ELENA PEREZ, DANIEL MOZOS, HORTENSIA MECHA, JULIO SEPTIÉN Departamento de Arquitectura de Computadores