BRNO UNIVERSITY OF TECHNOLOGY Faculty of Information Technology Department of Computer Systems. Ing. Azeddien M. Sllame


BRNO UNIVERSITY OF TECHNOLOGY
Faculty of Information Technology
Department of Computer Systems
Ing. Azeddien M. Sllame
DESIGN SPACE EXPLORATION OF HIGH-PERFORMANCE DIGITAL SYSTEMS
PROZKOUMÁVÁNÍ PROSTORU NÁVRHU PRO VYSOCE VÝKONNÉ ČÍSLICOVÉ SYSTÉMY
SHORT VERSION OF PHD THESIS
Study field: Information Technology
Supervisor: Doc. Ing. Vladimír Drábek, CSc.
Opponents: Prof. Ing. Jaromír Krejčíček, CSc.; Doc. Ing. Jiří Douša, CSc.; Doc. Ing. Karel Vlček, CSc.
Presentation date:

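KEY WORDS
digital design, pipelining, synthesis, design space, module selection
KLÍČOVÁ SLOVA (key words in Czech)
návrh číslicových obvodů, zřetězení, syntéza, prostor návrhu, výběr modulů
THESIS DEPOSITED AT (MÍSTO ULOŽENÍ PRÁCE)
Department of Computer Systems, Faculty of Information Technology, Brno University of Technology (Ústav počítačových systémů FIT VUT v Brně)
Azeddien M. Sllame, 2003
ISBN ISSN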

CONTENTS
1 Introduction
2 The component-based approach
3 Thesis contributions
4 The approach
5 The results
5.1 A new list-based scheduling algorithm
5.2 New evolutionary-based module selection algorithms
5.2.1 Module selection algorithm without resource sharing
5.2.2 Module selection algorithm with resource sharing
5.3 New pipeline-scheduling algorithm
5.4 A design space exploration methodology
Reusable component model
Conclusions
Future work

Abstract
As digital systems become increasingly complex, a higher abstraction level is required to describe them. Consequently, searching the corresponding large design space in a manageable time and finding the best possible implementation efficiently is becoming a critical factor in the design process. The design space can be defined as a multidimensional space measured by different design characteristics such as performance, area and architecture style. A point in that space defines one possible implementation of a given design that exploits some design features. At the same time, to manage recent advances in semiconductor technologies, which offer millions of transistors on a single chip, the design flow employed (after the system-level partitioning process) in current computer-aided design tools has evolved into three distinct phases: behavioral, logic and physical synthesis. The behavioral level takes as input the system blocks that are intended to be realized as hardware and that represent the most critical parts at system level. Here, a block means a complex hardware component such as a discrete cosine transform cell, which is one of the building blocks of image processing systems. The component is constructed from a set of sub-components (we call them modules) such as adders and multipliers. Resource usage can be used to characterize the design space at this level, because the circuit objectives (area, delay) and the exploitation of design features such as performance or architecture style depend on resource usage. Therefore, in this thesis we propose an efficient design space exploration methodology based on a component point of view. The component is described behaviorally in VHDL and then, to reach the final implementation, the design process goes through architecture selection, scheduling, pipelining and module selection. As the design enters each phase, it is explored by a local exploration scheme incorporated within that phase.
Inclusion of architecture selection enables designers to efficiently allocate proper modules to realize the design. Hence, a suitable design structure is assured, while pipelining at the functional level increases design performance. Moreover, module selection adds another level of exploration, which permits the use of slow (cheaper) modules on non-critical paths, while faster (more expensive) modules are used on critical paths and only when necessary. In addition, the pipelining and scheduling processes are supported by resource sharing to decrease the design cost whenever possible. In the scheduling phase, we have developed list-based scheduling algorithms that use different priority selection techniques for the nodes to be scheduled next. In the pipelining phase, these scheduling algorithms are extended to handle pipelining at the functional level of the component. Novel evolutionary-based module selection algorithms have been developed to further refine the design cost, either with or without resource sharing and with or without functional pipelining. The algorithms applied to solve the successive problems of the constructed methodology have formed the basis for building a prototyping tool that aims to support the design of high-performance digital systems. To illustrate the efficiency of the proposed methodology, the set of developed algorithms has been tested with standard benchmarks. Moreover, the assumptions needed to generalize the presented methodology to cope with system-level designs are highlighted. Furthermore, a virtual component model is proposed in order to make the proposed methodology useful to producers of IP cores.
Using the proposed methodology, which reflects the current state-of-the-art behavioral synthesis structure, the designer can explore the design space by varying the design architecture, pipelining the design in different ways and into a different number of stages, selecting different modules configuration sets to implement the design and applying resource sharing in different ways. At minimum, exploration of a 3D design space is always guaranteed.

1 INTRODUCTION
Digital design can be defined as the process of converting an abstract specification of a system into a detailed implementation in a way that best satisfies the specified design constraints on performance, cost, power dissipation, testability and so on. Although the capabilities of current general-purpose processors allow most digital functions to be implemented as software (SW) programs, pure SW implementations of a system design are often too slow to meet the imposed performance constraints. Therefore, dedicated hardware (HW) chips are often needed to complement or assist the re-programmable components on certain performance-critical tasks. This approach retains flexibility in the system behavior through SW reprogrammability, while reducing the size of the synthesis task by using application-specific chips only in the critical parts of the system. Thus, the final implementation of such systems always contains interacting HW cores and SW components, such as application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) and processors. In addition, the complexity of such systems and the short time-to-market require the use of automated techniques during specification, determination of the boundary between HW and SW components, synthesis of HW blocks, implementation of the HW/SW interface and testing. As a result, such systems make the codesign of HW and SW a major topic in the design automation of embedded systems and impose the use of reusable cores in current design flows. System-on-a-chip (SOC) is a recent typical case of such a design paradigm.
At the same time, to manage the recent advances in semiconductor technologies, which offer millions of transistors on a single chip, the design flow employed in current computer-aided design (CAD) tools has evolved into three distinct phases: behavioral, logic and physical synthesis. Each of these processes has its own design space, and the higher the process level, the larger the corresponding design space. Consequently, searching the corresponding large design space in a manageable time and finding the best possible implementation efficiently is becoming a critical factor in the design process. To achieve this goal, the design space needs to be properly characterized. Characterization is the process of identifying the most important features of the design in order to guide designers (or expert systems) in exploring the design space systematically. In this thesis, we are interested in behavioral synthesis only. The system-level design space can be characterized by the partitioning process and the system-level components, which include FPGAs, ASICs, digital signal processors (DSPs) and intellectual property (IP) cores. The design space size depends upon the selected components and the main system architecture as one complete unit. The behavioral level takes as input the system blocks that are intended to be realized as HW and that represent the most critical parts at system level. Specifically, designers at this level look for correct choices and efficient implementations for HW cores (components). Typically, at this level, each component is composed of a set of sub-components called modules. Different measures can be given for the size of the design space. Resource usage can be used to characterize the design space at this level, because the circuit objectives (area, delay) and the exploitation of design features such as performance or architecture style depend on resource usage. Therefore, for efficient design space exploration, the set of modules that makes up the component first needs to be scheduled in an efficient order. Second, the modules that compose the component must be selected correctly, in such a way that the component meets the imposed throughput requirements. Third, the area occupied by the component is minimized by assigning costly modules to critical paths and less costly modules to non-critical paths. Finally, modules need to be selected with efficient implementation styles (such as pipelining) to produce a high-performance system component. Consequently, any tool that is used to estimate a performance figure for a system component at this level must incorporate scheduling, module selection and structure style selection. In this thesis, we propose and describe an efficient design space exploration methodology oriented to realizing the high-performance HW components that are needed to support systems in their critical parts.
Since this methodology works at the behavioral level, we have decided to use the high-level synthesis (HLS) [Gajski94] structure, so that the results of the methodology can be integrated with any system-level design methodology and remain in line with current trends in the digital design process.
2 THE COMPONENT-BASED APPROACH
The proposed methodology follows the component-based approach, a well-known approach that allows a natural way of problem decomposition and enhances component reusability. Fundamental principles of component reusability are discussed in [Keating99]. The reasons behind the use of the component-based approach include, but are not limited to, the following:
- It excludes from the design path the complexity of the partitioning process, which is still highly influenced by the designer's knowledge. Consequently, it allows designers to concentrate on the component-based design space, which can be defined and explored according to component characteristics; hence, it enables managing system complexity.
- It reduces the design risks, since system integrators deal with verifiable and documented components: every component is created and tested separately. Moreover, any component can be redesigned alone (in the worst case) or replaced without affecting other system components (upgradability in the reuse context). Therefore, it increases the designer's productivity and shortens the time-to-market.
- Describing the component at the behavioral level allows it to benefit from developments in the field of HLS and system synthesis. Furthermore, it enables switching the implementation from HW to SW if a new processor becomes capable of performing the task.
3 THESIS CONTRIBUTIONS
The work presented in this thesis makes the following contributions:
- It discusses the reusability features of VHDL as a system design specification language.
- It proposes a new reusable virtual component model.
- It proposes an efficient scheduling algorithm that is useful for some classes of data flow graphs, such as those found in DSP applications.
- It proposes a scheduling algorithm for functional pipelining.
- It proposes novel evolutionary-based module selection algorithms: one for general solutions with no resource sharing, and another that considers resource sharing.
- It proposes a well-structured design space exploration methodology for high-performance, cost-efficient HW components.
4 THE APPROACH
The work presented in this thesis concentrates on design space exploration techniques and algorithms at the behavioral level, since the input specification of the HW component is described behaviorally using VHDL. The structure of the methodology reflects the state-of-the-art structure of the behavioral synthesis phase in current digital design flows.
These algorithms and techniques are employed in a design space exploration methodology aimed at designing high-performance and cost-efficient reusable HW cores. Specifically, we are targeting signal and image processing systems. While designing the methodology, we have followed the specify-explore-refine design paradigm [Gajski95], in which the component is described behaviorally in VHDL and then translated into an internal data representation, i.e. a data flow graph (DFG) structure, which captures all control and data flow dependencies of the given behavioral description. To reach the final implementation, the design process goes through scheduling, pipelining and module selection. As the design enters each phase, it is explored by a local exploration scheme incorporated within that phase. In the scheduling phase, we have developed list-based scheduling algorithms that use different selection processes for the nodes to be scheduled next. In the pipelining phase, these scheduling algorithms are extended to handle pipelining at the functional level of the component. Moreover, the pipelining and scheduling processes are supported by resource sharing to decrease the design cost whenever possible. Novel evolutionary-based module selection algorithms have been developed to further refine the design cost, either with or without resource sharing and with or without pipelining.
5 THE RESULTS
5.1 A NEW LIST-BASED SCHEDULING ALGORITHM
List-based scheduling techniques [Gajski94] are adapted to HLS systems to solve the resource-constrained scheduling (RCS) problem. In the RCS problem, we specify resource constraints for each operation type that exists in the DFG, and the objective of the employed algorithms is to minimize the total execution time. List scheduling processes each control step (c-step) sequentially and, at each c-step, subject to the resource constraints, tries to choose the best operations from all the candidate operations to place into the current c-step.
List scheduling uses a ready-list, which keeps all nodes whose predecessors have all been scheduled and which is always sorted with respect to a priority function. The priority function resolves resource contention among operations, i.e. operations with lower priority are deferred to the next or later c-steps. List-based scheduling algorithms therefore depend predominantly on their priority function. In some cases, especially in DSP-like algorithms (see Figure 1), the priority function assigns equal priority values to some of the nodes in the ready-list. These equal priority values complicate the selection process, because they do not guide the scheduler toward the proper operation to schedule first in the current c-step.
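The ready-list mechanism described above can be sketched as follows. This is a minimal illustration rather than the thesis implementation: it assumes unit-latency operations and at least one unit of every operation type, and the function name `list_schedule`, its argument layout and the small example DFG are all hypothetical.

```python
from collections import defaultdict

def list_schedule(ops, preds, resources, priority):
    """Resource-constrained list scheduling (unit-latency operations assumed).

    ops:       dict op name -> operation type, e.g. {"m1": "*"}
    preds:     dict op name -> list of predecessor op names
    resources: dict operation type -> units available in every c-step (>= 1)
    priority:  dict op name -> number; a lower value is scheduled first
    Returns a dict op name -> c-step (1-based).
    """
    step_of = {}
    step = 0
    while len(step_of) < len(ops):
        step += 1
        # Ready-list: unscheduled ops whose predecessors finished in earlier c-steps.
        ready = [o for o in ops
                 if o not in step_of
                 and all(p in step_of and step_of[p] < step
                         for p in preds.get(o, []))]
        # Sort by priority; the name is only a deterministic tie-breaker here.
        ready.sort(key=lambda o: (priority[o], o))
        used = defaultdict(int)
        for o in ready:
            if used[ops[o]] < resources[ops[o]]:
                used[ops[o]] += 1
                step_of[o] = step
            # Otherwise the operation is deferred to a later c-step.
    return step_of

# Hypothetical DFG: a1 = m1 + m2, a2 = m3 + m4, with 2 multipliers and 1 adder.
ops = {"m1": "*", "m2": "*", "m3": "*", "m4": "*", "a1": "+", "a2": "+"}
preds = {"a1": ["m1", "m2"], "a2": ["m3", "m4"]}
schedule = list_schedule(ops, preds, {"*": 2, "+": 1}, {o: 0 for o in ops})
```

With all priorities equal, the order inside the ready-list is decided purely by the tie-breaker, which is exactly the situation the refined priority function is meant to improve.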

Such incorrect node ordering forces the scheduler to make decision errors, which translate into a sub-optimal schedule, i.e. more c-steps than necessary in the case of an RCS problem.
    A := (X1 * X2) + (X3 * X4);
    B := (X5 * X6) + (X7 * X8);
    C := (X1 * X4) + (X2 * X3);
    D := (X5 * X8) + (X6 * X7);
    F := A - C;
    G := B - D;
Figure 1: (a) Code, (b) DFG, (c) Schedule with mobility, (d) Improved schedule.
To overcome this problem, we have proposed a new list-based scheduling algorithm which exploits some inherent features of data-driven digital systems (i.e. signal and image processing systems). The DFGs of these kinds of algorithms exhibit regularity and symmetry, for example the butterfly computational structure found in the discrete cosine transform (DCT) algorithm, as well as the regular structures that form the essential computational cores of a wide variety of filtering algorithms. Regularity in a DFG means the existence of sub-graphs (sub-dfgs) called templates, which have multiple instances in the DFG. In other words, the DFG can be decomposed into several similar sub-dfgs that, when suitably replicated, form the complete DFG. By symmetry, we mean that the template uses a set of functionally equivalent operations. The proposed scheduling algorithm starts with a preprocessing phase in which it reads the HDL description code (e.g. VHDL) and then constructs the corresponding DFG structure. The data structure used in this phase stores all valuable data about the given design behavior. This information is included in every node data structure so that the scheduler has more information about the node during the selection process (such as successor, predecessor, number of successors, tree-id(s), depth of tree and other nodes contributing to the same successor). Therefore, after construction of the DFG data structure, every node knows its successors and its predecessors. The algorithm then chooses the nodes which have no successors (the last operations in the DFG) and constructs trees beginning from them as roots. Each tree is constructed in a simple way: starting from the last operation in the DFG (i.e. the one selected as a root), each node passes a tree-id to its predecessors, and the distance (tree-depth) is accumulated with every node until an input operation is reached. This tree-like graph contains all nodes reachable from that root. Some nodes may therefore be included in more than one tree, which allows them to be considered as critical nodes or to be given the tree-id of the largest tree, since distances are accumulated from each tree root. The scheduler uses another priority function (mobility in our case) as the main priority function to generate priority values for all operations in the ready-list. The ready-list is then sorted as follows: among the operations that have an equal main priority value (the same mobility), the scheduler selects those operations that belong to the same tree, i.e. those contributing to the same path, using the tree-id value stored in each node data structure. Then, among the operations which have the same tree-id value, the scheduler chooses those operations contributing to the same successor, i.e. the same subtree (same-successor). This simple technique is able to guide the scheduler to select the proper operation and to produce a correct schedule more quickly and efficiently than the approach described in [Govind97], which is based on graph clustering techniques. Table 1 presents the scheduling results of the DFG shown in Figure 1. The proposed algorithm supports variable execution time of functional units (multicycling, chaining) and the usage of pipelined functional units.
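The tree labelling and the three-level ordering of the ready-list (mobility, then tree-id, then same successor) might be sketched as below. This is only an illustration under stated assumptions: a node reachable from several roots is given the tree-id of the root from which its accumulated distance is largest, which is one possible reading of "the largest tree", and the names `label_trees` and `order_ready_list` as well as the example graph are hypothetical.

```python
def label_trees(succs):
    """succs: dict node -> list of successor nodes (every node is a key).

    Returns (tree_id, depth). tree_id[n] is the root whose tree reaches n with
    the greatest accumulated distance; depth[n] is that distance.
    """
    preds = {n: [] for n in succs}
    for n, ss in succs.items():
        for s in ss:
            preds[s].append(n)
    roots = sorted(n for n, ss in succs.items() if not ss)  # last operations
    tree_id, depth = {}, {}
    for root in roots:
        stack = [(root, 0)]
        while stack:
            node, d = stack.pop()
            if node in depth and depth[node] >= d:
                continue  # already labelled with an equal or deeper path
            tree_id[node], depth[node] = root, d
            stack.extend((p, d + 1) for p in preds[node])
    return tree_id, depth

def order_ready_list(ready, mobility, tree_id, succs):
    """Sort by mobility first, then group by tree-id, then by shared successor."""
    def key(n):
        same_successor = min(succs[n]) if succs[n] else ""
        return (mobility[n], tree_id[n], same_successor, n)
    return sorted(ready, key=key)

# Hypothetical DFG: x1 and x3 feed A (tree F); x2 and x4 feed B (tree G).
succs = {"x1": ["A"], "x2": ["B"], "x3": ["A"], "x4": ["B"],
         "A": ["F"], "B": ["G"], "F": [], "G": []}
tree_id, depth = label_trees(succs)
mobility = {n: 0 for n in succs}  # all equal: the tie-breakers decide
order = order_ready_list(["x1", "x2", "x3", "x4"], mobility, tree_id, succs)
```

In this example a name-ordered ready-list would interleave the two trees (x1, x2, x3, x4), whereas the tree-aware key keeps x1 and x3 together so that their common successor A becomes ready as early as possible.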
The main results of the presented list-based scheduling algorithm are:
- The schedules produced by the proposed algorithm are always structured in such a way that all operations which contribute to the same path are scheduled as close together as possible, respecting the availability of resources.
- Results of the new algorithm are given in comparison with other well-known algorithms. In the worst case, the algorithm produces schedules that are similar to those obtained using mobility alone as the priority function.
- Different variants of the algorithm have been developed.
- The proposed algorithm significantly enhances the design space exploration process at the scheduling level, since it is able to produce optimal schedules for a set of DFGs such as the one illustrated in Figure 1.
- Finally, the approach demonstrates how application-specific synthesis can benefit from exploiting the underlying structure of the DFG being synthesized, and shows that the more information about the underlying DFG structure of the given design behavior is incorporated, the more accurate and optimal or near-optimal the scheduling results become.
Table 1: Results of the DFG illustrated in Figure 1.
Columns: Res. set (+, -, *) | No. of c-steps: List-based (mobility) | Kollig's algorithm (results taken from [Govind97]) | List-based new approach | Optimal
5.2 NEW EVOLUTIONARY-BASED MODULE SELECTION ALGORITHMS
The module selection problem is an optimization problem. It can be formulated using different optimization methods [Eles98], [Gajski94], or it can be solved using heuristic techniques. Evolutionary algorithms have, in recent years, been successfully applied to optimization [Back96]. They are inspired by and based upon evolution in nature, and they consider a large collection of solutions at once instead of working with one solution at a time in the search space. We have defined the module selection process as a performance-driven problem, since we perform module selection on ready schedules that are produced by RCS type algorithms, such as those described in section 5.1. Therefore, the objective of both proposed algorithms is to search for the modules configuration set which has the minimum implementation cost (design area) for the specified design delay. The design cost (area) is estimated only from the cost of the modules in the modules configuration set, and the performance of the final design is measured as the total design delay of the produced implementation. A real component library (CL) is employed with the proposed algorithms. The CL contains different alternative implementations for each resource type, characterized by different area and latency estimates. The term modules configuration set means the complete set of modules that are selected from the CL to implement the design schedule such that it satisfies the required design delay.
This set may include no instance or many instances of any module that exists in the CL.

The algorithms are designed to produce the upper bound and the lower bound of the design cost in the initial population. The upper bound design is constructed from the fastest (most expensive) modules, while the lower bound is constructed from the slowest (cheapest) modules. This allows designers to explore the design space in between, and lets them know the size of the design space of the design under development. One-point crossover is applied with probability p_c = 60 %. Two randomly selected genes are mutated per chromosome if crossover is not used. Both operators produce correct implementations according to the schedule. In addition, tournament selection with base 2 and elitism are employed. The initial population is generated from a 50:50 combination of the fastest and the slowest modules, as we have found from experiments that a 50:50 combination yields better convergence than starting from either the fastest or the slowest combination alone. The fitness function assigns higher values to chromosomes that exhibit a design delay (L) equal to the required design delay (RL) and that minimize the area (number of gates) (A) needed for the implementation:

$$
\text{Fitness value} =
\begin{cases}
1 & \text{if } L > RL, \\
MAX - A - 5\,|L - RL| & \text{otherwise,}
\end{cases}
\tag{1}
$$

where MAX is a sufficiently high value. The design delay L is calculated as the sum of the latencies l_i of the slowest modules used by the chromosome in each scheduled c-step; that is, the module with the maximum latency in a c-step determines the delay of that c-step in the schedule:

$$
L = \sum_{i=1}^{n} l_i \tag{2}
$$

where n is the number of scheduled c-steps.
5.2.1 MODULE SELECTION ALGORITHM WITHOUT RESOURCE SHARING
This algorithm provides a general solution to the module selection problem: it produces implementations which have the minimum design cost while meeting RL, with no resource sharing. The algorithm starts by reading the following inputs: (i) an initial schedule that is produced by an RCS type algorithm, e.g. the list-based scheduler described in section 5.1; (ii) the required design delay RL for the final implementation; and (iii) the CL, which is used by the algorithm to search for the best modules configuration set. The algorithm outputs a modules configuration set whose area A is estimated from all the resources of the produced implementation, while the design delay L is computed according to formula (2).
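The evaluation of a single chromosome can be illustrated with the short sketch below. It covers only the fitness computation, not the evolutionary loop, and it assumes that formula (1) reads as MAX minus the area minus five times the gap between L and RL. The library entries, their area and latency figures and all function names are hypothetical.

```python
MAX = 10**6  # assumed "sufficiently high value"

def design_delay(schedule, chromosome, library):
    """Formula (2): sum over c-steps of the slowest module latency in the c-step.

    schedule:   list of c-steps, each a list of operation names
    chromosome: dict operation name -> module name chosen from the library
    library:    dict module name -> (area, latency)
    """
    return sum(max(library[chromosome[op]][1] for op in cstep)
               for cstep in schedule)

def design_area(chromosome, library):
    """Without resource sharing, every operation has its own module instance."""
    return sum(library[module][0] for module in chromosome.values())

def fitness(schedule, chromosome, library, required_delay):
    """Assumed reading of formula (1)."""
    delay = design_delay(schedule, chromosome, library)
    if delay > required_delay:
        return 1
    area = design_area(chromosome, library)
    return MAX - area - 5 * abs(delay - required_delay)

# Hypothetical component library: module -> (area in gates, latency).
library = {"mul_fast": (400, 2), "mul_slow": (200, 4),
           "add_fast": (80, 1), "add_slow": (40, 2)}
schedule = [["m1", "m2"], ["a1"]]
fastest = {"m1": "mul_fast", "m2": "mul_fast", "a1": "add_fast"}  # upper bound
slowest = {"m1": "mul_slow", "m2": "mul_slow", "a1": "add_slow"}  # lower bound
mixed = {"m1": "mul_slow", "m2": "mul_slow", "a1": "add_fast"}
scores = {name: fitness(schedule, c, library, required_delay=5)
          for name, c in [("fastest", fastest), ("slowest", slowest), ("mixed", mixed)]}
```

For a required delay of 5, the slowest design misses the constraint, while the mixed design meets it exactly with less area than the fastest design and therefore scores highest.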

5.2.2 MODULE SELECTION ALGORITHM WITH RESOURCE SHARING
This algorithm provides a solution to the module selection problem with resource sharing. Resource sharing is commonly employed in HLS systems to reduce the design cost as much as possible, provided that performance and other design constraints can be satisfied. Different datapath operations can share the same resource if they are not executed during the same clock cycle. The main advantage of the second algorithm over the first is its ability to evolve both the module types taken from the CL and their exact positions in the final schedule. That is, when a given c-step needs fewer modules than are available, the algorithm decides automatically which of the already selected modules will implement each operation in the schedule. To do so, the algorithm requires as an input the number of modules of each type that will appear in the modules configuration set of the implementation, in addition to the inputs specified above for the first algorithm. The term resource set, written for example as (+2, *3), means that the maximum allowable number of adders and multipliers in every c-step of the given schedule is 2 adders and 3 multipliers. The resource set is used by RCS algorithms and, when resource sharing is employed, represents the maximum number of modules of each type that will appear in the final implementation. In this case, the total implementation area A is the sum of the areas of the modules in the modules configuration set, which reflects the numbers provided in the schedule resource set, while the design delay L is computed according to formula (2).
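The effect of resource sharing on the area estimate can be illustrated with a small sketch. Note that it replaces the evolutionary search by a plain first-fit binding, purely to show that operations in different c-steps may reuse one module instance while operations in the same c-step may not. The names `bind` and `shared_area`, the library contents and the example schedule are hypothetical.

```python
def shared_area(config_set, library):
    """Area with sharing: each module instance in the set is counted once."""
    return sum(library[module][1] for module in config_set)

def bind(schedule, op_type, config_set, library):
    """First-fit binding of operations to module instances (illustration only).

    schedule:   list of c-steps, each a list of operation names
    op_type:    dict operation name -> operation type, e.g. "+"
    config_set: list of module names, one entry per physical instance
    library:    dict module name -> (operation type, area)
    Returns dict operation name -> index of the instance in config_set.
    """
    binding = {}
    for cstep in schedule:
        free = list(range(len(config_set)))  # every instance is free again
        for op in cstep:
            match = next((i for i in free
                          if library[config_set[i]][0] == op_type[op]), None)
            if match is None:
                raise ValueError(f"c-step needs more '{op_type[op]}' modules than provided")
            free.remove(match)  # no two ops of one c-step share an instance
            binding[op] = match
    return binding

# Hypothetical library and a schedule produced for resource set (+1, *2).
library = {"mul_fast": ("*", 400), "mul_slow": ("*", 200), "add_slow": ("+", 40)}
op_type = {"m1": "*", "m2": "*", "m3": "*", "a1": "+", "a2": "+"}
schedule = [["m1", "m2"], ["m3", "a1"], ["a2"]]
config_set = ["mul_fast", "mul_slow", "add_slow"]
binding = bind(schedule, op_type, config_set, library)
area = shared_area(config_set, library)
```

Five operations are served by three module instances here, so the area is the sum over those three instances rather than over five.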
The main results of both evolutionary-based algorithms are:
- As the number of modules in the CL was increased, the design space tended to become larger and the possibility of producing high-quality designs in terms of design cost increased, since adding more modules to the CL made better tradeoffs possible during the module selection process.
- The cheapest designs are those obtained by using a complete CL and including resource sharing in the design process.
- The obtained results clearly demonstrate the suitability of evolutionary algorithms for solving the module selection problem in the HLS process.
5.3 NEW PIPELINE-SCHEDULING ALGORITHM
The pipeline-scheduling algorithm is based on the list-based scheduling algorithm described in section 5.1 above. The input to the algorithm consists of the DFG, the CL, the clock cycle, the pipe stage delay and the design constraints, specified either as a required resource set or as a data introduction interval (DII). The output of the algorithm consists of a mapped and partitioned DFG, where each node is mapped to a module of the corresponding type and the DFG is partitioned into the minimum number of pipe stages, each with a delay no larger than the specified pipe stage delay. Time constraints in this algorithm are specified as a constraint on the DII, while the resource set represents the design area constraints. The proposed algorithm has two different pipelining strategies: forward scheduling and backward scheduling. Each has a different priority function. The scheduling priority of operations in the backward approach is based on urgency measures of the operations. Urgency is derived from the critical paths starting from each node, i.e. from the length of the computation path, including the node itself, toward the DFG input nodes; this can be calculated because the modules configuration set is selected before the pipelining process [Park88]. The priority function of the forward approach is based on the graph construction technique presented in section 5.1 above. The algorithm is supported by a function that uses a real CL to choose a proper modules configuration set which is able to perform the DFG under the specified pipe stage delay. Following this, the pipelining (forward or backward) and scheduling iteration is performed, which partitions the DFG into stages, each with a delay of at most the specified pipe stage delay. Scheduling is performed concurrently with the help of an allocation table. The scheduling process uses the following rule: schedule the current node (selected from the ready-list) in the current pipe stage if adding its latency does not violate the pipe stage delay and if a free resource is available to execute it without any resource conflict with other nodes in concurrently running stages.
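The scheduling rule stated above might be sketched as follows for the forward direction. This is a simplified illustration, not the thesis algorithm: it assumes that each pipe stage occupies one time slot, that two stages run concurrently when their indices are equal modulo the DII (so they cannot share a module), and that operations inside one stage may be chained. The function name and all parameters are hypothetical.

```python
from collections import Counter

def pipeline_schedule(order, preds, latency, op_type, stage_delay, resources, dii):
    """Assign each node to a pipe stage (0-based), processing nodes in `order`.

    order must list predecessors before their successors (priority order).
    A Counter plays the role of the allocation table: it counts the units of
    each type in use per (stage mod dii) slot.
    """
    demand = Counter(op_type[n] for n in order)
    for kind, count in demand.items():
        if count > dii * resources[kind]:
            raise ValueError(f"resource set too small for type '{kind}' at this DII")
    if any(latency[n] > stage_delay for n in order):
        raise ValueError("a module latency exceeds the pipe stage delay")

    stage, finish, usage = {}, {}, Counter()
    for n in order:
        s = max((stage[p] for p in preds.get(n, [])), default=0)
        while True:
            # Chaining: start after the predecessors placed in the same stage.
            start = max((finish[p] for p in preds.get(n, []) if stage[p] == s),
                        default=0)
            slot = (s % dii, op_type[n])
            fits = start + latency[n] <= stage_delay
            free = usage[slot] < resources[op_type[n]]
            if fits and free:
                break
            s += 1  # the rule fails: try the next pipe stage
        stage[n], finish[n] = s, start + latency[n]
        usage[slot] += 1
    return stage

# Hypothetical example: a1 = m1 + m2 with one multiplier and one adder, DII = 2.
stages = pipeline_schedule(
    order=["m1", "m2", "a1"],
    preds={"a1": ["m1", "m2"]},
    latency={"m1": 2, "m2": 2, "a1": 1},
    op_type={"m1": "*", "m2": "*", "a1": "+"},
    stage_delay=2,
    resources={"*": 1, "+": 1},
    dii=2,
)
```

In the example, m2 cannot join m1 in stage 0 because the only multiplier is taken, and a1 cannot be chained after m2 in stage 1 without exceeding the stage delay, so the DFG is partitioned into three stages.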
The pipelining and scheduling iteration is repeated using that rule until all DFG nodes have been processed. The main result of the presented pipeline-scheduling algorithm is that the choice between forward and backward pipelining and resource sharing, combined with clock cycle selection, pipe stage delay determination and module selection, allows designers to make efficient area-performance tradeoffs by using the different strategies offered by the flexible algorithm procedure.

5.4 A DESIGN SPACE EXPLORATION METHODOLOGY

The possible design space boundaries of any design are depicted in Figure 2. This figure illustrates the tradeoff process, which is governed by either the maximum allowed design cost or the minimum required performance. Exploring such a large design space randomly takes up a lot of a designer's time and is likely to produce inefficient designs.

[Figure 2 plot: design area (cheap to expensive) versus design delay (slow to fast), showing the possible design space, the feasible design space with constraints, the maximum allowed cost, the minimum required performance, and points A to D.]

Figure 2: Design space boundaries: A: The fastest design; B: The cheapest design; C: The fastest design within cost constraint; and D: The cheapest design satisfying minimum required performance.

We therefore propose a design space exploration methodology which operates in at least a 3D space. The individual algorithms that make up the different phases of the methodology have been described in the previous sections; here we unify them. The intention of the methodology is to explore the design space systematically with respect to different design constraints. As a result, fast and sufficiently accurate statements concerning a possible implementation can be obtained. Moreover, using the proposed techniques, designers are guided toward the next steps without making bad design decisions.

We assume that the system specification is first spatially partitioned into HW blocks and SW components and that a HW implementation is required for each of the HW blocks. After this, the process represented by our design space exploration methodology is started for each HW-oriented system block. The methodology consists of one preliminary step and three main steps. In the preliminary step, designers explore the component's initial specification by altering the component's VHDL construction and verifying it with a test bench until the specification which best captures the component behavior is found. The initial specification is then translated into an internal data representation, i.e., a DFG structure, so that the other phases of the methodology can operate on this DFG. For example, scheduling partitions the DFG into sub-DFGs so that each sub-DFG is executed in one c-step. The three main steps are scheduling, pipelining and module selection. Scheduling is considered a trivial design space. Pipelining is used to seek high-performance implementations, while module selection is applied to reduce the implementation cost. The outline structure of the methodology is given in Figure 3.

[Figure 3 diagram contents]
Initial specification of the HW component
Pipelining: support allocation with scheduling; support of forward pipelining; support of backward pipelining; support of resource sharing
Scheduling: support allocation with scheduling; support of resource constraints; support of time constraints; support of structural pipelining; support of multicycled operations
Module selection: evolutionary approach is used; support of resource sharing
Implementations with pipelining and scheduling; implementations with pipelining and module selection; implementations with scheduling and module selection

Figure 3: Conceptual structure of the design methodology.

The designer using the proposed methodology can explore the design space using one or a combination of four approaches: (1) varying the architecture of the design and changing the corresponding resource set; (2) selecting different modules configuration sets to implement the design; (3) pipelining the design in different ways and into a different number of stages with different modules configuration sets; and (4) sharing resources in different ways.

Scheduling phase

The methodology, as seen in Figure 3, contains a set of scheduling algorithms, which can start to explore the selected VHDL behavior of the HW component specification either from the time axis (starting from point D of Figure 2) or from the area axis (ending at point C of Figure 2) of the prescribed trivial design space by using time-constrained scheduling (TCS) or resource-constrained scheduling (RCS) approaches respectively. The set of scheduling algorithms integrated in the first phase of the methodology includes: as soon as possible (ASAP), as late as possible (ALAP), force-directed scheduling (FDS) [Paulin89b], a static-list scheduling algorithm and a list-based scheduling algorithm. The FDS algorithm is included for comparison purposes only. The list-based scheduling algorithm was created in five different variants, each with a distinct priority function. The priority functions associated with the list-based scheduler are mobility alone, number-of-successors alone, mobility+number-of-successors, mobility+tree-structuring, and mobility+treeid+same_successor. The purpose of using different priority functions with a list-based scheduling algorithm is to further explore the underlying structure of the schedules produced by each selection method employed in those priority functions. For instance, the priority functions mobility+tree-structuring and mobility+treeid+same_successor are able to produce efficient structured schedules, as has been explained in section 5.1. The employed algorithms support pipelined functional units, multicycling and resource sharing. At this phase, the resource set size has a large impact on the scheduling results. The larger the resource set, the more parallel execution of operations can be exploited, so that higher performance can be achieved at the expense of a higher area cost. By adjusting the design constraints and the resource set, designers at this level can quickly evaluate multiple implementation alternatives with different scheduling algorithms.
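The interplay between ASAP/ALAP bounds, mobility and the resource set can be illustrated with a minimal Python sketch. It is not the thesis implementation: the five-operation DFG, the unit-delay assumption (every operation takes one c-step) and the function names are hypothetical, and only the "mobility alone" priority function is shown.

```python
from collections import defaultdict

# Hypothetical DFG: operation -> (operation type, list of predecessors).
# Assumption: every operation takes exactly one c-step.
DFG = {
    "m1": ("*", []), "m2": ("*", []), "m3": ("*", []),
    "a1": ("+", ["m1", "m2"]),
    "a2": ("+", ["a1", "m3"]),
}

def asap(dfg):
    """Earliest possible c-step of each operation (c-steps start at 1)."""
    step = {}
    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in dfg[op][1]), default=0)
        return step[op]
    for op in dfg:
        visit(op)
    return step

def alap(dfg, latency):
    """Latest possible c-step of each operation for a given total latency."""
    succs = defaultdict(list)
    for op, (_, preds) in dfg.items():
        for p in preds:
            succs[p].append(op)
    step = {}
    def visit(op):
        if op not in step:
            step[op] = min((visit(s) for s in succs[op]), default=latency + 1) - 1
        return step[op]
    for op in dfg:
        visit(op)
    return step

def list_schedule(dfg, resources):
    """Resource-constrained list scheduling; lower mobility = higher priority."""
    # Every operation type needs at least one unit, otherwise no schedule exists.
    assert all(resources.get(kind, 0) > 0 for kind, _ in dfg.values())
    early = asap(dfg)
    late = alap(dfg, max(early.values()))
    mobility = {op: late[op] - early[op] for op in dfg}
    schedule, cstep = {}, 0
    while len(schedule) < len(dfg):
        cstep += 1
        free = dict(resources)  # units of each type still free in this c-step
        # Ready = unscheduled and all predecessors placed in an earlier c-step.
        ready = [op for op in dfg if op not in schedule
                 and all(schedule.get(p, cstep) < cstep for p in dfg[op][1])]
        for op in sorted(ready, key=lambda o: (mobility[o], o)):
            kind = dfg[op][0]
            if free[kind] > 0:
                free[kind] -= 1
                schedule[op] = cstep
    return schedule

if __name__ == "__main__":
    for res in ({"*": 1, "+": 1}, {"*": 2, "+": 1}, {"*": 3, "+": 1}):
        sched = list_schedule(DFG, res)
        print(res, "latency:", max(sched.values()), sched)
```

For this small graph, one multiplier gives a latency of 4 c-steps, while two multipliers already reach the ASAP latency of 3 because the operation with non-zero mobility (m3) is deferred; a third multiplier adds area without improving the latency.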
For example, ASAP/ALAP schedules are used to define the upper bound of the design cost in the design space exploration process, point A in Figure 2, while a list-based schedule with one module for each operation type produces a lower bound for the design cost, point B in Figure 2. The output of this exploration step is a set of tables comparing the results of distinct allocated resource sets with different scheduling algorithms. Hence, designers can select those schedules which satisfy the design objective function for further module selection or pipelining exploration.

Pipelining phase

Pipelining algorithms are developed as extensions to the scheduling algorithms described above in such a way that all the algorithms support the design process with or without pipelining, as described in section 5.2. For every algorithm, forward and backward pipelining strategies are incorporated, and each is applicable with time-constrained and resource-constrained pipelining. Resource sharing is supported in order to allow designers to reduce the design cost, while the pipelining algorithms allow execution overlap. Moreover, the module selection process is still applicable with pipelining, either by using the largest stage as a design input to the module selection algorithms or by doing a simple pre-selection phase in which a function lists all modules from CL that can work correctly with the defined clock cycle and the corresponding specified pipe stage delay. A local exploration scheme at this phase is provided by varying the modules configuration set, clock cycle, pipe stage delay, resource sharing and the DII value. The result of this exploration step is a set of schedules, each with a different pipe stage delay, a different modules configuration set and a distinct DII value.

Module selection phase

Evolutionary-based algorithms, as described in section 5.3, are used in the proposed methodology to perform module selection with or without resource sharing. This phase uses initial schedules as inputs, which were selected by designers from the pipelined or non-pipelined schedules produced by the previous two phases. Then, a local design exploration scheme is employed to evaluate a large number of implementations by varying the required design delay and using the module selection process with or without resource sharing to find the best modules configuration set that satisfies the specified design delay. The result of this exploration step is a set of implementations, each with a distinct design delay and each corresponding to a different modules configuration set. The clock cycle, which will drive the selected modules configuration set, can be selected using the technique described in [Chaudh97] or using our clock cycle exploration scheme, which is guided automatically by the latencies of the elements of the selected modules configuration set.
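To make the idea of evolutionary module selection under a delay constraint concrete, the following Python sketch shows a generic genetic algorithm. It is only an illustration under stated assumptions and not the algorithm of section 5.3: there is no resource sharing (each operation gets its own module), the library contents, the DFG, the penalty value and all function names are hypothetical, and the initial population is seeded with the all-fastest configuration so that a feasible solution is kept whenever one exists.

```python
import random

# Hypothetical component library CL: type -> [(name, area in gates, delay in ns)],
# fastest variant first.
CL = {
    "*": [("mul_fast", 2400, 20), ("mul_mid", 1600, 35), ("mul_slow", 900, 60)],
    "+": [("add_fast", 300, 5), ("add_slow", 120, 12)],
}
# Hypothetical DFG: operation -> (type, predecessors), listed in topological order.
DFG = {
    "m1": ("*", []), "m2": ("*", []), "m3": ("*", []),
    "a1": ("+", ["m1", "m2"]), "a2": ("+", ["a1", "m3"]),
}
OPS = list(DFG)

def evaluate(genes):
    """Return (area, delay) of a configuration; genes[i] indexes CL for OPS[i]."""
    area, finish = 0, {}
    for op, g in zip(OPS, genes):
        kind, preds = DFG[op]
        _, mod_area, mod_delay = CL[kind][g]
        area += mod_area
        finish[op] = mod_delay + max((finish[p] for p in preds), default=0)
    return area, max(finish.values())

def fitness(genes, max_delay, penalty=10**6):
    """Area, plus a penalty large enough to outweigh any area saving."""
    area, delay = evaluate(genes)
    return area + penalty * max(0, delay - max_delay)

def select_modules(max_delay, pop_size=20, generations=40, seed=0):
    rng = random.Random(seed)

    def random_genes():
        return [rng.randrange(len(CL[DFG[op][0]])) for op in OPS]

    # Seed with the all-fastest configuration (index 0 for every operation).
    pop = [[0] * len(OPS)] + [random_genes() for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, max_delay))
        parents = pop[: pop_size // 2]              # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            cut = rng.randrange(1, len(OPS))        # one-point crossover
            child = p[:cut] + q[cut:]
            i = rng.randrange(len(OPS))             # mutate a single gene
            child[i] = rng.randrange(len(CL[DFG[OPS[i]][0]]))
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda g: fitness(g, max_delay))
    return best, evaluate(best)

if __name__ == "__main__":
    for limit in (30, 60, 90):
        genes, (area, delay) = select_modules(limit)
        names = [CL[DFG[op][0]][g][0] for op, g in zip(OPS, genes)]
        print(f"delay <= {limit} ns: area={area}, delay={delay}, modules={names}")
```

Sweeping the delay limit, as in the loop at the bottom, mirrors the local exploration scheme described above: each limit yields one design point of an area-delay curve.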
The benefit of postponing clock selection until after the scheduling and module selection processes is that more modules are available during the module selection phase, so that module selection is not constrained to only the few candidates which agree with an a priori selected clock cycle. In other words, selecting the clock cycle before the module selection process restricts the design space to a small subset of modules, which in turn may lead to inefficient designs. If we select the clock cycle after the module selection phase, we can find an efficient clock cycle that is able to utilize the chosen subset of modules that already satisfies the resource constraints and time constraints, which are the main design goals in the design process.
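One common way to pick a clock cycle once the module delays are known is to minimize the total slack, i.e. the time wasted when each module delay is rounded up to a whole number of clock cycles. The Python sketch below illustrates that idea only; it is an assumption made for illustration and is neither the scheme of [Chaudh97] nor the thesis's own exploration scheme. The function name, the integer-nanosecond candidate periods and the example delays are hypothetical.

```python
def explore_clock(delays, min_clk):
    """Rank integer clock periods (ns) by the total slack of the selected modules.

    delays  -- delays (ns) of the selected modules, one entry per module
    min_clk -- smallest clock period (ns) the designer is willing to consider
    Returns a list of (clock period, total slack), best candidate first.
    """
    ranked = []
    for clk in range(min_clk, max(delays) + 1):
        # A module with delay d occupies ceil(d / clk) cycles (multicycling).
        cycles = [-(-d // clk) for d in delays]
        slack = sum(c * clk - d for c, d in zip(cycles, delays))
        ranked.append((clk, slack))
    # Least slack first; on ties prefer the longer period (fewer cycles).
    ranked.sort(key=lambda item: (item[1], -item[0]))
    return ranked

if __name__ == "__main__":
    # Hypothetical selection: one 35 ns multiplier and one 12 ns adder.
    for clk, slack in explore_clock([35, 12], min_clk=10)[:3]:
        print(f"clock = {clk} ns, total slack = {slack} ns")
```

With the hypothetical delays above, a 12 ns clock wastes only 1 ns in total (the multiplier takes three cycles, the adder one), whereas a clock fixed a priori at, say, 20 ns would waste 13 ns with the same two modules.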

Some illustrative results

Figure 4 describes an experiment carried out to demonstrate the efficiency and quality of the designs produced by the proposed methodology for the DCT benchmark. In this experiment, the design space was explored by using three different architectures. Then, for every architecture, a module selection exploration was performed with and without resource sharing. Figure 5 describes an experiment carried out with the finite impulse response filter (FIR) benchmark to point out the capability of combining module selection based on an evolutionary approach with a pipelining exploration process. This experiment was executed by first producing a pipelined schedule with the pipelining process and then running the module selection process using only the modules of the largest pipe stage. The design space was explored for three architectures, and for each architecture the pipe stage delay was set to the delay of one multiplier or of one multiplier plus one adder. Figure 6 shows an experiment performed with the second order differential equation solver (Diffeq) benchmark to show the process of combining different exploration paths in one exploration figure for comparison. In this experiment, the design space was explored by using three different architectures. Then, for every architecture, the design was explored by performing module selection without resource sharing, module selection with resource sharing, and pipelining plus module selection with resource sharing. Figure 6 demonstrates that the high-performance and cost-efficient designs are those produced by pipelining plus module selection with resource sharing.

Exploration time

Design exploration time is a very important factor in any design process. In our presented methodology, each schedule for each architecture, with or without pipelining, is produced in less than a second using list-based scheduling algorithms.
Therefore, the designer can explore any architecture with pipelining in a few seconds. Exploration times of the module selection process for different benchmarks are reported in Table 2. Consider the exploration time (see Table 2) of the module selection process for the largest benchmark used in the presented experiments, i.e., the DCT benchmark, neglecting the tuning process of the evolutionary algorithms. The worst CPU time (Pentium III 700 MHz machine with 128 MB RAM) for the ten runs for each design point of the DCT (+2, *3) design (without resource sharing) was about 110 seconds (i.e., each design point could be obtained within 11 seconds), while it was 101 seconds for the DCT (+3, *2) design and 79 seconds for the DCT (+4, *3) design. If we consider the DCT (+2, *3) design with a single run, exploring a design space curve with 16 implementations will take less than 3 minutes to complete. Consequently, if we consider that a design point can be obtained within 11 seconds on average for the DCT benchmark, the designer can explore the design space shown in Figure 4 within 19 minutes, which is a reasonable time for such a large benchmark. Other benchmark exploration times are reported in Table 2.

Table 2: Exploration time for the module selection process with/without resource sharing (Pop. size: population size; Exe. time: execution time in seconds; EWF: fifth order elliptic wave filter benchmark). No. of runs = 10.
Columns: Design name | Module selection without resource sharing algorithm: Pop. size, Maximum number of generations, Exe. time (worst case) | Module selection with resource sharing algorithm: Pop. size, Maximum number of generations, Exe. time (worst case)
Designs: DCT (+2, *3); DCT (+3, *2); DCT (+4, *3); EWF (+2, *1); EWF (+2, *2); EWF (+3, *2); FIR (+2, *1); FIR (+3, *2); FIR (+5, *3); Diffeq (+1, -1, 1<>, *1); Diffeq (+1, -1, 1<>, *2); Diffeq (+1, -1, 1<>, *3)
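As a quick check of the single-run estimate for the DCT (+2, *3) design, using only the figures quoted in the text (about 110 seconds for ten runs and a curve of 16 implementations):

```latex
\[
t_{\text{point}} \approx \frac{110\ \text{s}}{10\ \text{runs}} = 11\ \text{s},
\qquad
t_{\text{curve}} \approx 16 \times 11\ \text{s} = 176\ \text{s} \approx 2.9\ \text{min} < 3\ \text{min}.
\]
```

At the same rate of 11 seconds per design point, the quoted 19 minutes (1140 s) would cover roughly 1140 / 11 ≈ 104 single-run design points; this is a derived figure, not one stated in the text.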

[Figure 4 plot: design area (gates) versus design delay (ns). Curves: DCT (+3, *2), DCT (+2, *3) and DCT (+4, *3), each with module selection and resource sharing and with module selection without resource sharing.]

Figure 4: Module selection design space exploration process for DCT benchmark.

[Figure 5 plot: design area (gates) versus pipe stage delay (ns) (1/throughput). Curves: FIR (+5, *3), FIR (+3, *2) and FIR (+2, *1), each with pipe stage delay = multiplier latency and with pipe stage delay = multiplier + adder latency.]

Figure 5: Pipelining + module selection design space exploration process for FIR benchmark.
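Once a set of explored implementations is available as (delay, area) pairs, the characteristic points A to D of Figure 2 and the non-dominated curve can be extracted mechanically. The Python sketch below shows one way to do this; the design points, the constraint values and the function names are hypothetical and are not taken from the experiments.

```python
def pareto_front(points):
    """Keep the (delay, area) points that no other point beats in both axes."""
    front = []
    for p in sorted(points):                  # by delay, then by area
        if not front or p[1] < front[-1][1]:  # strictly cheaper than all faster ones
            front.append(p)
    return front

def characteristic_points(points, max_area, max_delay):
    """Return points A, B, C, D of Figure 2 (None if a constraint cannot be met)."""
    def fastest(p):
        return (p[0], p[1])

    def cheapest(p):
        return (p[1], p[0])

    within_cost = [p for p in points if p[1] <= max_area]
    fast_enough = [p for p in points if p[0] <= max_delay]
    return {
        "A": min(points, key=fastest),                     # fastest design
        "B": min(points, key=cheapest),                    # cheapest design
        "C": min(within_cost, key=fastest, default=None),  # fastest within cost
        "D": min(fast_enough, key=cheapest, default=None), # cheapest meeting delay
    }

if __name__ == "__main__":
    # Hypothetical explored implementations: (design delay in ns, area in gates).
    designs = [(100, 9000), (140, 7000), (180, 5200),
               (200, 6000), (240, 4100), (300, 3800)]
    print(pareto_front(designs))
    print(characteristic_points(designs, max_area=7000, max_delay=250))
```

In this made-up data set the point (200, 6000) is dominated by (180, 5200) and drops out of the front, which is the kind of implementation a designer would discard when comparing exploration curves.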

[Figure 6 plot: design area (gates) versus design delay (ns). Curves: Diffeq (+1, -1, 1<>, *3) with DII=2; Diffeq (+1, -1, 1<>, *2) with DII=3; Diffeq (+1, -1, 1<>, *1) with DII=6; Diffeq (+1, -1, 1<>, *1), Diffeq (+1, -1, 1<>, *2) and Diffeq (+1, -1, 1<>, *3), each with module selection and resource sharing and with module selection without resource sharing.]

Figure 6: Diffeq benchmark design space exploration: pipelining + module selection with resource sharing, module selection only with resource sharing, module selection only without resource sharing.

DISCUSSION: APPLICABILITY TO HANDLE SYSTEM LEVEL DESIGNS

We have presented a component-based design space exploration technique. The proposed methodology can, however, be generalized to handle system level design processes as well. Usually, the system level specification is given in terms of interacting concurrent processes from a behavioral point of view, since the current trend in digital design is that the initial specification specifies the system level functionality without any details of how it is to be implemented. Hence, in the simplest case the partitioning process divides the system level specification into an SW part and an HW part. We assume that the goal of the partitioning process is to satisfy the design timing constraints while reducing the HW cost of the design. As a result, the HW design procedure presented in the previous section is applied only to a system's critical parts which have no available HW cores to implement them, i.e., a behavior will be executed in HW only if a processor is unable to satisfy the timing constraints of that behavior.
Assuming that the concurrent processes which have been divided into SW and HW represent system level tasks [Eles98], the following assumptions are needed for such tasks: (i) every task is non-preemptive; (ii) a task may be scheduled only on one processor; (iii) a processor can execute only one task at a given time; and (iv) a task may begin its execution only after all its data inputs are available. Remember that the system contains a set of components and each component contains a set of modules. Pipelining is a general technique which can be applied hierarchically to any system design by partitioning the system into concurrently running stages, using pipelined components to perform some system tasks and using pipelined modules inside the pipelined components.

In order to allow component selection at system level for SW parts, we need to use a system level SW component library. Such a library contains different components, such as processors and DSPs, each with different implementations which are able to execute SW tasks. The elements of the SW component library are characterized by speed, power consumption and dollar cost. To allow component selection at system level for HW parts, the component library (which is used by our methodology) is extended with different system level components with different implementations, such as HW cores for MPEG and DSP filters, memories and buses. HW components are characterized by area, latency and pipe stages. Hence, the design space exploration methodology for the system level can be seen as: (A) perform hierarchical pipelining; (B) create any needed custom HW cores using our proposed methodology; (C) schedule and perform component selection for SW parts; (D) schedule and perform component selection for HW parts; and (E) perform communication synthesis to integrate the whole system.

5.5 REUSABLE COMPONENT MODEL

The model is designed based on the knowledge gained by the author while using VHDL to model, simulate and design different systems, as well as on the design experience of using Synopsys Behavioral Compiler. The aim is to allow the reuse of a HW component in as many applications as possible.
Therefore, we specify the component at the behavioral level using the VHDL language, because a component that is specified at the behavioral level has a wider reusability domain than a component which is specified at the RTL level. The former can be reused in different applications with different constraints. The use of the VHDL language allows parameterized designs, easy management of large designs and enhanced readability, and permits the writing of designs independently of the technologies used for their final realization. In addition, test designers, when using cores specified in VHDL, have sufficient knowledge about the internal structure of the core, which enables them to develop a correct test strategy which smoothly admits core insertion. Furthermore, the design space exploration process using HLS tools will enlarge the reusability domain of the component, since it permits obtaining different implementations from the same specification.

We assume that the designer is the person who will create the reusable component, while the user (system integrator) is the person who will reuse it. In addition, we assume that different communication units are available in the design library that is used by the user. Design-for-reuse requirements for HW IPs are provided in [Keating99]. These principles are adopted for our proposed reusable component, which we will call a behavioral component (or a component for simplicity). The design-for-reuse requirements are listed below:

The behavioral component has to be of sufficiently general use, such as a DCT component.
The behavioral component has to be fully documented and its function (what it does) properly characterized to ease system integration.
The specification of the behavioral component has to be easily configurable, easy to modify and independent of the implementation technology.
The behavioral component has to be implementable on multiple technologies.
The behavioral component specification has to be executable on a variety of platforms and simulatable with a variety of tools.
The behavioral component has to be verified independently of the application in which it will be used.
The behavioral component needs to be provided with a standard interface.
The behavioral component has to be specified using a uniform design methodology to ensure proper synthesizability of the component.

To satisfy the listed requirements we have organized the component model in such a way that the component's main characteristics are provided to the user in the first level of the component structure.
In addition, we have separated the computation core of the component from the communication part, in such a way that different interfacing circuits may be inserted by the user according to the technology and design requirements. Furthermore, we intend to use the design methodology based on the design space exploration techniques that were described in the previous sections. Figure 7 illustrates the structure of our proposed reusable component model. As seen in Figure 7, generators are used to create final implementations according to the user constraints and the libraries which are available with the synthesis tools to be employed by the user. The complexity of a generator differs from one implementation to another, since a generator can include different design exploration steps such as the definition of the clock cycle, the structure style (e.g. with/without pipelining), pipelined functional units, the use of RAM as a communication unit, etc. However, each optimized behavioral code of any
