ED&TC /96 $ IEEE


Thread-Based Software Synthesis for Embedded System Design

Youngsoo Shin, Kiyoung Choi
Department of Electronics Engineering, Seoul National University, Seoul, Korea

Abstract

We propose in this paper a thread-based software synthesis technique that reduces the communication overhead incurred by the hardware-software interface in a system. We start from a CDFG that models the system. The CDFG is analyzed and partitioned into a set of threads. We then generate a mixed static-dynamic thread scheduler. The scheduler statically schedules as many threads as possible to minimize the scheduling overhead, and dynamically schedules the remaining threads. Reduction of the total execution time, including the communication overhead, is demonstrated with some examples.

1 Introduction

In this paper we present an efficient software synthesis technique for mixed hardware-software systems. Such heterogeneous systems frequently involve unbounded or unknown delays caused by data dependent loops or interactions with environments. Therefore, some kind of synchronization mechanism is required. In [5], idle-wait type polling synchronization is used to perform blocking i/o. In [2], polling is implemented as an investigation of trigger signals in sequence. But these strategies raise the problem of communication overhead when delays are long. In [3], communication overhead of up to 50% is reported, under the assumption of mutual exclusion between software running on a processor and hardware implemented as a type of coprocessor.

For this reason, we focus on minimizing the communication overhead caused by this synchronization requirement. The problem can also be viewed as maximization of resource utilization. To achieve this objective, we generate program routines in the form of multiple threads. Multithreading has been widely used in parallel processor systems.
Heterogeneous systems have similar problems in that processors running software and other hardware components are inherently executable in parallel. Communication overhead is reduced by overlapping the execution of threads with hardware execution, which is made possible by thread partitioning and scheduling.

The overall structure of our software synthesis is as follows. First, the system specification is transformed to a control data flow graph (CDFG) model. Currently, we are experimenting with system specifications described in VHDL, but conceptually, specifications are not limited to VHDL. Our future work includes allowing mixed descriptions in VHDL, Ptolemy [1], and/or C. The CDFG is partitioned into two parts to be implemented in hardware and software. Then the interface between the two parts is generated and annotated to the partitioned CDFG. Currently, the partitioning and interface annotation are performed manually.

The CDFG representing the software part consists of basic operations, control constructs, and system interface operations. It is then partitioned into a set of threads from which a thread scheduler is generated. The scheduling is a mixture of static and dynamic scheduling but is maximally static in that all threads that can be given a static ordering are scheduled statically. Dynamic scheduling is performed when there is a dynamic control flow such as a data dependent loop or an unbounded delay. Examples of unbounded delays are delays incurred by environments or hardware components.

The rest of this paper is organized as follows. In section 2, we define CDFG and thread and discuss some related issues. Section 3 describes the thread generation procedure. Section 4 discusses thread scheduling. In section 5, we present some experimental results and the effectiveness of our methodology. Finally, section 6 concludes with some remarks and the future work.

2 Preliminaries

In this section we present some terminology and the related issues.
2.1 CDFG

A CDFG is defined as a directed graph G = <N, E>, where N is a set of nodes and E = {n_i > n_j | n_i, n_j ∈ N} is a set of directed edges. n_i > n_j denotes a directed edge from n_i to n_j. There are four types of nodes as shown in Table 1.

Table 1: Types of nodes

node types    description
n_s, n_e      start node, end node
n_op          operation node
n_ct          control node
n_r, n_w      interface operation nodes

A pair of a start node and an end node is introduced as polar nodes to the whole CDFG, to each condition clause, and to each conditional branch. By using these polar nodes we can take advantage of hierarchical and recursive graph structures, thereby simplifying graph-theoretic problems. n_ct is a dummy node introduced to each control construct. It connects subgraphs: a G_c for a condition clause and G_s's for conditional branches. This allows the subgraphs to be hierarchically nested in G. n_r and n_w represent abstract read and write operations. They represent any sequence of operations required to satisfy the selected interface protocol, so their granularity can vary according to the complexity of the interface protocol. We have experimented with the handshaking protocol used in our prototyping system [6].

There are two types of edges between node pairs: data dependency edges and control dependency edges. When n_j is data dependent on n_i, we represent the relation as n_i >d n_j. When n_j is control dependent on n_i, we represent the relation as n_i >c n_j. Figure 1 shows an example of a CDFG, where solid arrows specify data dependency and dotted arrows specify control dependency.

Figure 1: An example of a CDFG.

We define P(n_i) to denote the set of nodes which have paths neither from n_i nor to n_i along data dependency and control dependency edges. P(n_i) can be recursively defined as follows.

\[
\mathrm{Pred}(n_i) = \bigcup_{n_j > n_i} \mathrm{Pred}(n_j) \cup \{n_i\}
\]
\[
\mathrm{Succ}(n_i) = \bigcup_{n_j < n_i} \mathrm{Succ}(n_j) \cup \{n_i\}
\]
\[
P(n_i) = N - \mathrm{Pred}(n_i) - \mathrm{Succ}(n_i)
\]

Using the above formulas we can recursively find P(n_r) in the CDFG. Nodes in P(n_r) are candidate operations to be executed concurrently with hardware components.
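The definition of P(n_i) can be illustrated with a short sketch. It is not the authors' implementation (the paper reports a C implementation); the function name `parallel_candidates`, the edge-list representation, and the five-node graph are assumptions made here for illustration, and the graph is not the one in Figure 1. Data and control dependency edges are treated alike, as in the definition.

```python
from collections import defaultdict

def parallel_candidates(nodes, edges, target):
    """Return P(target): nodes with no dependency path to or from target.

    edges is a list of (src, dst) pairs meaning src > dst, i.e. dst
    depends on src (data or control dependency, not distinguished here).
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)

    def closure(start, neighbours):
        # Iterative form of the recursive Pred/Succ definition:
        # the start node plus everything reachable through neighbours.
        seen = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # P(n_i) = N - Pred(n_i) - Succ(n_i)
    return set(nodes) - closure(target, pred) - closure(target, succ)

# Hypothetical graph: a > r > c is one chain, b > d is an unrelated chain.
nodes = ["a", "b", "r", "c", "d"]
edges = [("a", "r"), ("r", "c"), ("b", "d")]
print(sorted(parallel_candidates(nodes, edges, "r")))  # ['b', 'd']
```

Here `r` plays the role of an interface node: `b` and `d` have no path to or from it, so they are the candidates that could run while the hardware operation is pending.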
2.2 Thread

A thread is defined as a sequence of successively connected nodes with the property that once the first node fires, the remaining nodes execute to the end in fixed latency. Therefore, we do not support preemption of threads. The execution of threads is performed by a thread scheduler. We distinguish three thread types as presented in Table 2. T_s is a thread whose nodes are neither in G_c nor G_s. T_c is a thread whose nodes are all condition evaluation operations. T_b is a thread whose nodes are all operations in a conditional branch.

We define T(n_i) as a thread starting from node n_i. ST(n_i) is defined as a set of threads constructed with nodes in P(n_i), i.e. a set of threads whose nodes can execute in parallel with T(n_i). Notation of dependency relations between threads is defined in the same way as node relations, i.e. T_i > T_j specifies that T_j is dependent on T_i. There are two types of dependency: T_i >d T_j when T_j is data dependent on T_i, and T_i >c T_j when T_j is control dependent on T_i.

A thread consists of only n_op, n_r, and n_w. n_ct affects only thread scheduling. n_s and n_e are dummy nodes and are not included in a thread. n_r always starts a new thread. T(n_r) contains only elements between n_s and n_e of the polar subgraph which directly encloses n_r. T(n_r) does not contain another n_r. We schedule threads of ST(n_r) only after the predecessors of all threads in ST(n_r) have been scheduled. To guarantee this scheduling scheme, we introduce the following lemma.

Table 2: Types of threads

thread type   description
T_s           simple thread
T_c           condition clause thread
T_b           conditional branch thread

Lemma 1 For any T_j, T_k ∈ ST(n_i), if a thread T_l satisfies both T_j > T_l and T_l > T_k, then T_l ∈ ST(n_i).

Proof: We prove it by contradiction. Assume T_l ∉ ST(n_i). Then either n_i > n_m or n_m > n_i for some n_m ∈ T_l. First, assume n_i > n_m. Then there is a path from n_i to nodes of T_k because T_l and T_k satisfy T_l > T_k. Therefore, T_k cannot be included in ST(n_i). Now, assume n_m > n_i. Then there is a path from nodes of T_j to n_i because T_j and T_l satisfy T_j > T_l. Therefore, T_j cannot be included in ST(n_i). Either case contradicts T_j, T_k ∈ ST(n_i).

3 Thread Generation

There is a tradeoff between the number of threads and the average length of threads. To reduce the cost of thread switching, it is important to keep the average length of threads long. But it is also important to have a sufficient number of threads which are neither directly nor indirectly dependent on T(n_r), so that they can be executed while n_r is waiting for the completion of the corresponding hardware operation having unbounded delay. To achieve this objective, we construct threads using nodes in P(n_r). Thread generation consists of thread partitioning, thread clustering, and variable assignment.

3.1 Thread partitioning

The thread partitioning process consists of the following four steps.

Step 1: for each n_r, construct a node set P(n_r)
Construction of P(n_r) is accomplished through the analysis of data dependency and control dependency of n_r. For example, in Figure 1, we identify {n_12, n_13, n_14, n_15, n_16} as data dependent successors of n_r, {n_w1, n_w2, n_8, n_9, n_2, n_3, n_4} as data dependent predecessors of n_r, {n_5, n_6} as control dependent predecessors of n_r, and {n_16, n_17} as control dependent successors of n_r. Note that n_16 is both a control dependent and a data dependent successor of n_r. So P(n_r) = {n_1, n_7, n_11, n_10}.
Step 2: from P(n_r), construct a thread set ST(n_r)
Construction of a thread set is rather straightforward. From P(n_r), we identify successively connected nodes and statically order (topologically sort) the nodes in each set. Priority is assigned to each thread based on its dependency relation and the length of the thread. We give high priority to a thread which has long latency. This strategy is based on the fact that scheduling long latency threads is more efficient when the delay due to hardware is relatively long. However, other priority criteria can be considered based on system requirements. In Figure 1, ST(n_r) = {T_1, T_2, T_5}, T_1 = {n_7, n_11}, T_2 = {n_10}, and T_5 = {n_1} are identified.

Step 3: construct threads starting from n_r
Once ST(n_r) is established, we start construction of a thread from n_r. The thread T(n_r) is cut when we visit a node which is a successor of any node of P(n_r) or when there exist no more elements to make the thread. This is necessary for T(n_r) not to break its nondependency relation with ST(n_r).

Step 4: construct threads with the remaining nodes
Construction of a thread starts from each successor of the n_s nodes. Note that a thread cannot cross a subgraph boundary. Traversing along the edges, we append nodes to the thread until no node remains that has not already been included in another thread.

In Figure 2(a), we show the results of the above four steps run on the example in Figure 1.

Figure 2: Constructed thread set. (a) initial thread construction (b) threads after clustering

3.2 Thread clustering

The initial construction of threads can be improved through thread clustering, by which we can reduce the number of context switches. A clustering rule has to be well established to ensure that the resultant threads are deadlock-free. We modified and used the clustering rule defined in [7] as follows. Clustering is performed exclusively between ST(n_r) and a set of threads not in ST(n_r).

Rule 1: Threads T_1 and T_2 of the same type can be clustered into one thread if (a) all output arcs from T_1 go to T_2 or (b) T_1 and T_2 have no input arcs from other threads.

Figure 2(b) shows the results after thread clustering. We omit some data dependency relations for the purpose of clarity.

3.3 Variable assignment

We perform static analysis for variable assignment. Variables defined and used only within a single thread scope can be accessed through registers. Variables used in inter-thread scope need variable lifetime analysis for an efficient assignment to registers. However, the dynamic scheduling used in our system makes this analysis difficult, and it is not considered in this paper.

4 Thread Scheduling

We generate a mixed static and dynamic scheduler which statically schedules threads that can be given static orders, then dynamically schedules threads which are in dynamic control flows or which can be executed in parallel with unbounded delay operations. Currently, we do not take into account timing constraints during scheduling, but focus on enhancing the performance of a mixed system by inserting some threads in the idle interval caused by execution of hardware. Our future work includes software synthesis under timing constraints.

For the dynamic scheduling we distinguish three types of threads: threads that consist of condition evaluation nodes, threads that consist of nodes in conditional branches, and thread T(n_r) together with the threads in ST(n_r). Schedules of all other threads are statically determined.

Before scheduling T(n_r) and the threads in ST(n_r), we schedule all threads that are not in ST(n_r) but are predecessors of T(n_r) or predecessors of threads in ST(n_r). Because those threads are not in ST(n_r), by Lemma 1, they are not successors of threads in ST(n_r) and therefore are guaranteed to be fired before any thread in ST(n_r). This implies that they can be scheduled statically.
With this strategy, we can maximize static scheduling, thereby minimizing scheduling overhead. This also allows us to reduce communication overhead by executing threads in ST(n_r) while T(n_r) is waiting for a completion signal from hardware. The scheduler maintains ST(n_r) for each T(n_r) in the form of a thread list sorted by the priority assigned to each thread. To schedule thread T(n_r), we consider three different cases.

Case 1: T(n_r) is a T_s.
The scheduler selects and fires a thread from ST(n_r) based on its priority and then checks the handshake signal from hardware. If the signal is asserted, T(n_r) is fired and the remaining threads in ST(n_r) are all fired. When the signal is not asserted, another thread in ST(n_r) is selected and fired. This dynamic scheduling is repeated until the handshake signal is asserted or all the threads in ST(n_r) are fired. In the latter case, we start polling the signal.

Case 2: T(n_r) is a T_b.
When T(n_r) is in a conditional branch, we distinguish T_b's and T_s's in ST(n_r) because T_s's are executed once but T_b's are executed the same number of times as T(n_r). Note that the T_b's and T(n_r) are in the same branch. We do not put T_b's in other branches into ST(n_r). For the example of Figure 2, the static scheduling schedules T_11 → T_7. If the boolean value of T_7 is true, then the dynamic scheduling schedules T_6 and T_8. If the completion signal from hardware is not asserted, then the scheduler schedules T_5 → T_1 → T_2 in sequence until the signal is asserted. If the signal is not asserted even after the execution of the three threads, the scheduler starts polling. In this example, because T_5 is outside the branch containing n_r, it can be fired only during the first iteration of the loop. That is, ST(n_r) = {T_1, T_2, T_5} for the first iteration and ST(n_r) = {T_1, T_2} for the second iteration.

Case 3: T(n_r) is a T_c.
When T(n_r) is in a condition evaluation clause, we distinguish T_c's and T_s's in ST(n_r) in the same way as in case 2. In this case, T_b's cannot be included in ST(n_r) because of the control dependency between T(n_r) and the T_b's.
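The dynamic scheduling of Case 1 can be sketched as a small loop. This is only an illustration under stated assumptions, not the scheduler generated by the authors' tool: the names `run_case1` and `signal_after`, the use of thread latency as the priority, and the three threads A, B, and C are all hypothetical. The handshake signal is modelled as a callable that becomes true after a given number of checks.

```python
def signal_after(n_checks):
    """Hypothetical handshake model: asserted from the n-th check onwards."""
    state = {"calls": 0}

    def hardware_done():
        state["calls"] += 1
        return state["calls"] >= n_checks

    return hardware_done

def run_case1(st_threads, hardware_done, interface_thread="T(n_r)"):
    """Return the firing order for Case 1.

    st_threads is a list of (name, latency) pairs standing in for ST(n_r);
    longer latency is taken as higher priority (an assumption).
    """
    pending = sorted(st_threads, key=lambda t: t[1], reverse=True)
    trace = []
    while pending:
        name, _ = pending.pop(0)
        trace.append(name)            # fire one thread from ST(n_r)
        if hardware_done():           # then check the handshake signal
            break
    else:
        # Every thread in ST(n_r) has fired and the signal is still low.
        while not hardware_done():
            trace.append("poll")
    trace.append(interface_thread)    # signal asserted: fire T(n_r)
    trace.extend(name for name, _ in pending)  # then the rest of ST(n_r)
    return trace

threads = [("A", 3), ("B", 2), ("C", 1)]
print(run_case1(threads, signal_after(2)))  # ['A', 'B', 'T(n_r)', 'C']
print(run_case1(threads, signal_after(5)))  # ['A', 'B', 'C', 'poll', 'T(n_r)']
```

In the first call the signal arrives while ST(n_r) still has work left, so no polling occurs. In the second call the hardware delay outlasts all three threads and the loop falls back to polling, which is the situation the thread set is meant to make rare.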
5 Experimental Results

We implemented our software synthesis algorithm in the C programming language on a SUN Sparc workstation. Multithreading is achieved via the SUN OS lightweight process library. However, our software synthesis techniques can be extended to embedded real-time systems because the methodology does not depend on this library. We have experimented with some examples in our codesign environment [6], which consists of a SUN Sparc processor, SBus, and an FPGA prototyping board. Hardware components which are synthesized and prototyped with an FPGA communicate with software components via SBus.

Figure 3 shows experimental results of an elliptical wave filter where a multiplication operation with delay elements is implemented with the FPGA. Delays have been intentionally inserted into the hardware part to mimic a more complicated system.

Figure 3: Performance comparison. (a) relative execution time (b) relative communication overhead. The horizontal axis of both panels is the number of polling operations.

Figure 3 compares the performance of a straight line code implementation and the code generated by our synthesis algorithm. The straight line code assumes mutual exclusion between software components and hardware components as in [3]. Therefore, there exists no overlap between software and hardware execution. The execution time and the communication overhead of the generated code are measured relative to the straight line code while changing the delay of the hardware component. The number of polling operations in Figure 3 indicates the number of calls issued by the code generated by our algorithm to check the completion signal from the hardware. The number of calls issued by the straight line code is much larger because it performs simple polling during the hardware execution.

We define communication overhead as the time devoted to i/o operations over the total execution time. Figure 3 shows the relative communication overhead of the synthesized code compared to the straight line code. If the completion signal arrives before the first polling operation, our approach exhibits slightly worse performance because there is an overhead of context switching. However, if the hardware delay is long enough that more polling operations occur, then the relative execution time and the relative communication overhead are drastically reduced. Figure 3(a) shows saturated reduction of the relative execution time. This is because the example is small. We expect much more reduction when the application is large, so that P(n_r) and the hardware delay are also large.
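The overhead metric defined above can be written compactly. The symbols t_io, t_total, C, and C_rel are notation introduced here for illustration, and reading "relative" as a plain ratio between the two implementations is an assumption rather than something the text states.

```latex
\[
C = \frac{t_{\mathrm{io}}}{t_{\mathrm{total}}}, \qquad
C_{\mathrm{rel}} = \frac{C_{\mathrm{synth}}}{C_{\mathrm{line}}}
\]
```

Under this reading, C_rel below 1 means the synthesized code spends a smaller fraction of its run time on i/o than the straight line code does.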
For the example of an MPEG2 decoder, if the IDCT block is implemented in hardware and synchronization is performed by busy-wait polling, then the performance becomes worse than an all-software implementation because the communication overhead is too large. In such situations, we expect our algorithm to be of great help. Currently, we are experimenting with this example.

6 Conclusions

In this paper, we have presented a software synthesis technique which generates code based on threads. Our methodology tries to execute as many operations as possible before the completion of unbounded delay operations of hardware or environment, thereby reducing the total execution time. The operations are scheduled efficiently through thread partitioning and thread scheduling. It has been experimentally shown that the total execution time can be effectively reduced. We are currently experimenting with several embedded system examples. We plan to extend our work to hardware-software codesign where a system is specified with mixed VHDL and Ptolemy, timing constraints are given, and/or an interrupt driven i/o protocol is used. We are also exploring generation of some operating system kernel code including a scheduler and device drivers.

References

[1] J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, "Ptolemy: a framework for simulating and prototyping heterogeneous systems," International Journal of Computer Simulation, Vol. 4, Apr. 1994, pp.
[2] Massimiliano Chiodo et al., "Synthesis of Software Programs for Embedded Control Applications," Proc. of the 32nd DAC, June 1995, pp.
[3] R. Ernst, J. Henkel, and T. Benner, "Hardware-Software Cosynthesis for Microcontrollers," IEEE Design & Test of Computers, December 1993, pp.
[4] Daniel D. Gajski et al., "Specification and Design of Embedded Systems," Prentice-Hall, Inc,
[5] Rajesh K. Gupta, Giovanni De Micheli, "System Synthesis via Hardware-Software Co-Design," CSL Technical Report CSL-TR-92-548, Stanford University, October
[6] Y. Kim, Y. Shin, K. Kim, J. Won, and K. Choi, "Efficient prototyping system based on incremental design and module-by-module verification," Proc. of 1995 International Symposium on Circuits and Systems, May 1995, pp.
[7] K. E. Schauser et al., "Compiler-Controlled Multithreading for Lenient Parallel Languages," Proc. Fifth ACM Conf. Functional Programming Languages and Computer Architecture, ACM, New York, 1991, pp. 50-72.
