INTEGRATION, the VLSI journal



Improved predictability, timing yield and power consumption using hierarchical highways-on-chip planning methodology

A. Jahanian a, M. Saheb Zamani b, H. Safizadeh c

a Department of Electrical and Computer Engineering, Shahid Beheshti University, G. C., Evin, Tehran, Iran
b Department of Information Technology and Computer Engineering, Amirkabir University of Technology, Hafez Street, Tehran, Iran
c Department of Electrical and Computer Engineering, University of Minnesota, 200 Union Street S.E., Minneapolis, MN 4, USA

Article info
Article history: Received 6 June 200; received in revised form October 200; accepted October 200
Keywords: Highway-on-chip; Predictability; Timing closure

Abstract
Interconnect mis-prediction is a major problem in nano-scale design that may diminish the quality of physical design algorithms or may even result in design divergence. In this paper, a new interconnect-planning methodology based on an assume-and-enforce strategy is presented. In this methodology, some regions of the chip are planned to provide auxiliary routing resources and improve the interconnect delay of critical nets during the floor-placement process. Each of these wealthy regions is called a highway-on-chip. The location of highways and their resources are gradually determined during the hierarchical floor-placement process. Experimental results show that the performance, timing yield, predictability and power consumption of the attempted benchmarks are improved by .66%, 0.02%, 20.%, and 6.8% on average, respectively. These improvements are obtained at the cost of about .82% runtime overhead and less than 0.8% wirelength growth.
© 200 Published by Elsevier B.V.

1. Introduction

As technology continues to scale down, the impact of interconnects on design performance increases, because a significant portion of the delay results from interconnects.
Moreover, correct estimation of the final interconnect delay at higher levels of the design flow becomes more and more difficult due to nano-scale problems such as crosstalk noise and leakage power. Therefore, efficient interconnect estimation and design can considerably improve the performance of the design. However, interconnect design should not be deferred until after cell placement, because cell positions are fixed after placement and interconnect optimization is then limited by the fixed cell locations. As a result, interconnect design and optimization should be applied prior to or during cell placement [].

Some methodologies have been developed to plan interconnects at early levels of physical design. Chen et al. [2] presented a design flow in which some low-level parameters of interconnects, such as the width and place of wires and their routing order, are planned at the floorplanning level. Cong et al. [] proposed a multilayer gridless detailed routing system, called DUNE, for deep submicron physical design, which uses the routing plans made at the floorplanning level to perform the detailed routing of wires. It features a coarse grid-based wire planning algorithm that uses exact gridless design rules to accurately estimate routing resources and distribute the nets throughout the routing regions based on the estimated resources. Cong et al. [4] proposed an interconnect-centric design flow consisting of interconnect planning, synthesis, and layout stages to allow interconnect design and optimization at early stages of the design process. In this method, some efficient interconnect estimation models and algorithms were proposed to model the behavior of succeeding tools and estimate interconnects at the floorplanning level.

Corresponding author. E-mail addresses: jahanian@sbu.ac.ir (A. Jahanian), szamani@aut.ac.ir (M. Saheb Zamani), hamid@umn.edu (H. Safizadeh).
However, based on high-level estimations, the routing order, width and layers of wires are fixed at the floorplanning level. In [], two polynomial-time algorithms were presented for wire planning with bounded over-the-block wires. Both algorithms are guaranteed to find an optimal routing solution for a two-pin net as long as one exists. Reported results show that the runtime grows linearly but is still considerably long for real designs. Lu et al. [6] developed a planning algorithm that uses retiming in an integrated physical synthesis process. In Lu's algorithm, the status of global routing is estimated, and the final location of buffers and flip-flops is predicted at the floorplanning level.

Since buffer insertion is an effective interconnect optimization technique, several buffer planning methodologies, which can be regarded as special types of interconnect planning, have aimed to decrease the interconnect delay. The works in [ ] have focused on buffer planning. In [], Cong et al. proposed a method in which critical wires were estimated at the floorplanning level, and the required buffers were planned along the critical wires to improve their performance. In [8], an integrated nonlinear placement framework with porosity and congestion-aware buffer planning was presented.

$ - see front matter © 200 Published by Elsevier B.V. doi:0.06/j.vlsi

The authors of [8] integrated gradually refined cell porosity and a routing congestion-aware buffer planning and insertion methodology into a high-quality nonlinear placer. Their experiments show considerable improvement in routing overflow, wirelength and buffer count. Jiang et al. [] also presented a floorplanning algorithm with buffer block planning. They used simulated annealing to find the best routing tree considering buffer blocks. In each loop of this algorithm, a routing tree is constructed, buffers are allocated, and then a Lagrangian relaxation is invoked to improve delay and area. In [0], Ma et al. proposed a design flow for early buffer planning at the floorplanning level. In Ma's method, an initial solution is suggested by a simulated annealing process and is iteratively refined during floorplanning. In this method, design congestion is regarded as a constraint on the buffer budget analysis in the process of buffer planning. In [], the authors of this paper presented a probabilistic and hierarchical approach to plan buffers during floorplanning. They used the concept of a buffer requirement map (BRM), an estimated map of buffer requirements in various regions of the chip, to plan buffers at early levels of the design flow so that buffer resources can be distributed over the chip according to the buffer requirements of each region. Despite the partial improvement achieved compared to the prior buffer planning methodologies, Jahanian et al. only planned the wires outside the floorplanning macros and did not consider the nets inside the macros.

The mentioned methodologies have three drawbacks. The first and foremost is mis-prediction, which can result in poor optimization results in terms of performance and predictability. This is because all the proposed methodologies have two distinct phases.
In the first phase, the characteristics of wires are estimated at early stages of physical design, and in the second phase, some parameters of interconnects, such as the location and size of buffers and the location, width, and routing order of wires, are fixed based on the early estimations. In other words, planning in the cited approaches means fixing the parameters and detailed features of interconnects at higher levels of physical design. However, estimation of final parameters at higher levels of physical design (such as floorplanning) is very rough and should not be used to fix the final values of design parameters; fixing the detailed features of interconnects at early stages based on high-level estimations can be very erroneous [2]. The second limitation of the above methodologies is that they only plan those interconnects that connect floorplanning macros together, and all internal wires of the macros are neglected. This may reduce the number of planned nets and degrade the accuracy of estimations and planning []. The last drawback of the proposed methods is that they may create congested areas in the channels between the macros [2], because many buffers are inserted in the spaces between the macros.

Alpert et al. [2] proposed a methodology for early buffer planning that considers the wires inside the macros. They reserved some internal areas of the macros (buffer sites) for early buffer planning. This solution is not constrained to the channels between the macros for buffer insertion. However, using this methodology requires changing current EDA tools and libraries to support buffer sites in IP-macro design and utilization. Furthermore, the precise location of buffer sites cannot be easily determined prior to the placement of cells in soft-IP macros. Therefore, the mis-estimation of buffer planning, especially in large IPs, can be considerable.
In this paper, a novel hierarchical interconnect planning methodology is proposed that is broadly similar to highway design in masterplan-based urban design. Highways are used in large cities, especially along congested paths, to reduce transportation delay and traffic. In the masterplan-based design of urban areas, the global characteristics of the needed highways are planned in the masterplan, yet without fixing the width or the exact location of any of the highways. Then, the highways are gradually positioned and configured during the urban construction with respect to the urban masterplan [4].

In this methodology, some routing regions of the design are planned as special routing paths that provide auxiliary routing resources, e.g. more buffers and routing tracks, for nets at hierarchical floorplan/placement levels. The location of these wealthy regions, named highways-on-chip (HoCs), and their required amount of resources gradually evolve in a hierarchical process. In other words, the characteristics of the highways are gradually determined based on the estimated critical and near-critical nets as well as the estimated congestion of various regions of the chip. The location of each highway is refined during a hierarchical design flow so that a detailed and precise map of the highways can be produced at the end of global placement. As a result, critical and near-critical nets that are assigned to the highways are routed with less delay because they have access to the reserved resources of the highways. Because this idea is analogous to highway design in large urban areas, we call this methodology the highway master planning approach (HAMPA). Gradual refinement of design parameters, which is the contribution of HAMPA, results in more predictability, which tends toward better performance and lower sensitivity to timing variation.
In recent years, some new methodologies such as platform-based design, architecture-based design, and backbone planning have been proposed to improve the complexity management of large embedded systems. A platform is a partial design for a particular type of system which includes embedded processor(s), may comprise embedded software, and is customizable to a customer's requirements []. In these methodologies, there are some fixed assumptions on the global architecture of the design before it is implemented. Moreover, the design may even use some pre-implemented parts. These methodologies are significantly different from highway planning because there is no fixed assumption about the architecture of the design in highway planning, and routing resources are planned during the placement process.

The rest of the paper is organized as follows. In Section 2, we describe the terminology used throughout the paper, the overall design flow and the contributions of the paper. In Section 3, our algorithm is described in detail. Experimental results are presented in Section 4, and finally the paper is concluded in Section 5 by outlining our main contributions and discussing future research directions.

2. Preliminaries

2.1. Terminology

We introduce the following definitions, which are used in the rest of the paper:

Definition 1. In a hierarchical floor-placement flow, the design is bi-partitioned iteratively. Each partitioned region is called a bin or cluster. It should be noted that the cells of each bin are assumed to be placed at the center of that bin at each level.

Definition 2. A bin is a sibling of another bin if they are two smaller partitions of the same bin, called their parent. An example of sibling bins is provided in Fig. 1, in which bin A1 is a sibling of bin A2, and A is the parent of both A1 and A2. In this paper, we represent the sibling of a given bin by adding .sibling at the end of the bin's name, e.g. A.sibling.
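Definitions 1 and 2 can be made concrete with a small data structure. The following is only a minimal sketch: the `Bin` class, its fields, and the midpoint cut in `bipartition` are illustrative assumptions and are not taken from the paper or from any placer (a real placer chooses the cut by cut size and other metrics).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Bin:
    """A rectangular placement region (hypothetical model, for illustration only)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float
    parent: Optional[Bin] = field(default=None, repr=False)
    children: list[Bin] = field(default_factory=list, repr=False)

    @property
    def center(self) -> tuple[float, float]:
        # Definition 1: all cells of a bin are assumed to sit at its center.
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    @property
    def sibling(self) -> Optional[Bin]:
        # Definition 2: the other partition of the same parent bin.
        if self.parent is None:
            return None
        return next(c for c in self.parent.children if c is not self)

    def bipartition(self, vertical: bool = True) -> tuple[Bin, Bin]:
        # Assumption: cut at the geometric midpoint of the bin.
        if vertical:
            xm = (self.x0 + self.x1) / 2
            a = Bin(self.name + "1", self.x0, self.y0, xm, self.y1, self)
            b = Bin(self.name + "2", xm, self.y0, self.x1, self.y1, self)
        else:
            ym = (self.y0 + self.y1) / 2
            a = Bin(self.name + "1", self.x0, self.y0, self.x1, ym, self)
            b = Bin(self.name + "2", self.x0, ym, self.x1, self.y1, self)
        self.children = [a, b]
        return a, b
```

With this sketch, `A1, A2 = Bin("A", 0, 0, 4, 2).bipartition()` yields two bins for which `A1.sibling is A2` holds, mirroring the `.sibling` notation used in the text.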

Fig. 1. An example of sibling bins.

Suppose that bin A is a sibling of bin B as shown in Fig. 1. Obviously, each child of A (A1 or A2) is a neighbor of at least one child of B (B1 or B2) after a new level of bi-partitioning is performed. In this figure, each pair of bins (A1, B1), (A2, B1) or (A2, B2) shows two neighbors, whereas bins A1 and B2 are not neighbors.

Definition 3. A net segment is defined as a segment of a net connecting two nodes in the Steiner tree of the net.

Definition 4. A near-critical net segment is a net segment belonging to at least one path whose delay is larger than a delay threshold T_d. T_d is calculated by

$$T_d = T_w \times \alpha \qquad (1)$$

where T_w is the delay of the longest path in the design (timing wall), and α is a factor between 0 and 1 representing the range of criticality. In fact, α controls the percentage of long-delayed paths that are to be regarded as near-critical. We used a value between 0.8 and 0. for α in our experiments.

Fig. 2. HAMPA-based floor-placement design flow.

2.2. HAMPA design flow

Highways are used in large cities, particularly along congested paths, to reduce transportation delay and traffic. In the masterplan-based urban design process, the global characteristics of the highways are planned in the masterplan (without fixing the detailed picture), and highways are gradually built during the urban construction according to the urban masterplan [4]. Gradually characterizing the highways can reduce the risk of mis-estimations in the initial urban masterplan. In this paper, a hierarchical interconnect-planning methodology, which is very similar to highway design in masterplan-based urban design, is presented. The main feature of this methodology, called HAMPA, is that it does not fix the interconnect characteristics (e.g. the location and width of wires and the buffer distribution) all at once at the floorplanning level.
Instead, these characteristics are gradually determined during the hierarchical floor-placement process. At any given floor-placement level i, the planning information of level i is utilized as well as the floor-placement status of level i to refine the ith level of floor-placement. As shown in Fig. 2, a conventional floor-placement of level i is carried out at the beginning of level i. The estimations of level i are made using the floor-placement information and the highway planning of level i simultaneously. Then, the highways of level i (the current level) are planned based on the estimations and the floor-placement of level i. Finally, the floor-placement of level i is further refined to meet the constraints of the planned highways. In short, the planning and the floor-placement proceed concurrently in a hierarchical manner and cooperate both to plan the highways and to perform the placement.

Since hierarchical placement is the strategy of modern placers to overcome the complexity of current industrial circuits [6], using hierarchical placers does not limit the generality of the proposed approach. The hierarchical placement process is divided into two stages [6]: global placement and detailed placement. In global placement, each bin of the design is partitioned into two or four bins. The most important objective in the partitioning process is cut size, but other metrics such as routing congestion may also be considered. The partitioning process continues iteratively until the number of cells inside each bin becomes lower than a specified threshold. During global placement, all cells inside a bin are located at the center of the bin. In detailed placement, however, the exact location of each cell inside a bin is determined to produce a legal placement.

Fig. 3 shows an example of highways planned on a chip. In this figure, bins are shown as rectangles and highways are represented by bold lines.
For each highway, the assigned width indicates the required amount of routing resources, which directly depends on the number of critical nets assigned to the given highway.

Fig. 3. Highway regions at a floor-placement level.

2.3. Problem definition

In this subsection, a mathematical model for highways is described and then the highway planning problem is formally defined. The floorplanning neighborhood graph shows the neighboring relations between the bins of a floorplan. Each vertex of the neighborhood graph represents a bin of global placement, and each edge between two vertices shows that their corresponding bins are neighbors []. We assign a weight to each edge of the neighborhood graph to show the amount of resources that should be planned for the path between the two adjacent bins. The weighted neighborhood graph is called the highway graph. An example of the highway graph at one of the intermediate levels of global placement is shown in Fig. 4. In this figure, rectangles represent the bins of the design at that level and weights indicate the amount of resources that have been planned. These weights are computed from the estimated amount of resources of each net during the proposed methodology. It should be noted that at the end of the floor-placement process, the edges with zero or very small weights can be removed from the highway graph. In Fig. 4, dotted lines represent the edges whose planned resources are zero or very low.

Let G_i(V_i, E_i) be the highway graph at the ith stage of the floor-placement hierarchy, constructed as described above. At each level of the hierarchy, the weights are updated according to the number and criticality of the nets that are assigned to utilize the highway resources. The weights of E_i are updated when a new net is planned to use the highway resources. Therefore, the problem of highway planning is defined as finding the best weights for the edges in E_i such that (i) the delays of near-critical paths are minimized by assigning the highway resources to them, and (ii) the availability constraints for highway resources are met. In other words, highway resources are controlled to avoid unroutability due to over-consuming the resources.
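The highway graph described above can be sketched as a small weighted-graph container. This is a hedged illustration, not the authors' implementation: the class name `HighwayGraph`, the single uniform per-edge `capacity` used as the availability constraint, and the pruning threshold `eps` are all assumptions introduced here.

```python
from collections import defaultdict


class HighwayGraph:
    """Weighted neighborhood graph: vertices are bins, weights are planned resources."""

    def __init__(self, capacity: float):
        # Assumption: one uniform resource limit per edge models availability.
        self.capacity = capacity
        self.adj = defaultdict(set)
        self.weight = {}  # frozenset({u, v}) -> planned amount of resources

    def add_neighbors(self, u: str, v: str) -> None:
        """Record that bins u and v are neighbors (an edge with zero weight)."""
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.weight.setdefault(frozenset((u, v)), 0.0)

    def plan(self, u: str, v: str, amount: float) -> bool:
        """Reserve resources on edge (u, v) unless the capacity would be exceeded."""
        key = frozenset((u, v))
        if key not in self.weight:
            raise ValueError(f"{u} and {v} are not neighbors")
        if self.weight[key] + amount > self.capacity:
            return False  # would over-consume resources and risk unroutability
        self.weight[key] += amount
        return True

    def prune(self, eps: float = 1e-9) -> dict:
        """Edges kept at the end of floor-placement (zero or tiny weights dropped)."""
        return {tuple(sorted(k)): w for k, w in self.weight.items() if w > eps}
```

Refusing a reservation in `plan` is one simple way to express constraint (ii); the paper does not state how a rejected request is handled, so that behavior is left to the caller here.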
In the following subsections, the proposed methodology for finding a good mapping of critical nets to highway graph edges is described.

2.4. HAMPA contributions

As described in Section 1, gradual refinement of interconnect characteristics, which is the main feature of HAMPA, can reduce the risk of high-level mis-estimations because the estimations are refined at each level of the floor-placement process considering the current status of critical and near-critical paths and the congestion of the design. Our contributions can be briefly described as follows:

Design performance improvement: Since some of the critical and near-critical nets are selected to utilize the highway resources, the routing delay of these nets can be reduced, and consequently the performance of the design is improved. The reason is that routing resources are allocated according to the distribution of critical nets, so that critical and near-critical nets can be routed with more resources than others. The design performance is gradually improved during floor-placement as the location of the highways and their required amount of resources are refined at each level.

Design predictability improvement: In each stage of HAMPA, some properties of wires are planned and the possible range of design parameters is bounded at each level of the design process. As a result, design predictability is improved at the intermediate levels of floor-placement. In other words, each plan makes assumptions that are strictly enforced at the succeeding levels of the design process. Therefore, HAMPA falls into the category of assume-and-enforce methodologies [8].

Timing yield improvement: Critical and near-critical nets are gradually assigned to the highways at each level of design. It is worth noting that delay improvement can be achieved not only by eliminating the most critical paths but also by reducing the number of near-critical nets that are granted the permission for using the highways.
Therefore, the mathematical expectation of the delay of nets, and consequently the timing sensitivity of the design to variation, is reduced. On the whole, a higher timing yield can be achieved in the presence of delay variation.

When the highways are located at higher levels of the design hierarchy, a sufficiently wide range of areas should be considered. Therefore, the detailed location and resource amount of the highways are not fixed at these levels. Instead, the exact location of each highway and its required resources are smoothly determined as the design process continues. By doing so, the risk of mis-estimations at higher levels of physical design is greatly reduced because at the higher levels, where interconnect estimations are very erroneous, only the boundaries of the parameters of each highway are planned. They are then gradually refined to draw the precise map of highways at lower levels, where critical paths can be estimated with less error. In other words, in the HAMPA methodology, planning is not fixing the design parameters but defining their boundaries level by level. All in all, the drawbacks of the traditional wire planning approaches can be considerably mitigated by the gradual determination of highway parameters.

Fig. 4. Highway graph at an intermediate level of global placement.

3. Implementation of HAMPA framework

Since highways are planned hierarchically by the HAMPA design flow during floor-placement, the flow can be implemented in any hierarchical floor-placement environment such as CapoPlacer [] and DragonPlacer [20]. We used CapoPlacer, a well-known bi-sectioning based floor-placement tool that is widely used in academic research, as the hierarchical floor-placement framework for the following reasons:

(1) CapoPlacer is renowned for using one of the top five placement algorithms in the ISPD placement contest [2].
(2) The hierarchical and iterative behavior of CapoPlacer provides a framework within which the characteristics of the highways can be gradually refined throughout the design flow, so that the full details of the highways can be obtained at the end of floor-placement.
(3) Bi-sectioning based partitioning, which is used in CapoPlacer, simplifies the white space distribution during floor-placement. The succeeding sections show that sliding the boundary of bins in a 2-way partitioning floor-planner is far simpler than in a k-way partitioning floor-planner [22].

The flow of CapoPlacer consists of global placement and detailed placement stages, as in many other multi-level placers. At each level of the global placement stage, the design is recursively partitioned into two smaller bins considering cut size, congestion and terminal propagation rules of bins. This process continues until the number of cells of each bin falls below a threshold. In CapoPlacer, this threshold is about 0 cells. It is assumed that all the cells in each bin are gathered at its center throughout global placement. The precise locations of the cells are determined at the end of detailed placement.

We adapted the global placement stage of CapoPlacer to support the HAMPA design flow. Since the bins are large and plenty of cells fall at the center of bins at the early levels of CapoPlacer global placement, delay estimation at these levels is very rough and erroneous. Therefore, highway planning should not be started from the early levels of the global placement hierarchy. Thus, we divided the CapoPlacer global placement stage into two distinct sub-stages: global placement without planning and global placement with planning. Global placement without planning is exactly the same as the CapoPlacer global placement stage without any modification, whereas global placement with planning is the adapted CapoPlacer global placement in which the HAMPA planning policies are applied. These stages are shown in Fig. 5. It should be noted that the feedback and white space distribution options are turned ON in CapoPlacer to utilize its maximum capabilities.

Fig. 5. Modified CapoPlacer placement flow: CapoPlacer global placement (global placement without planning, followed by global placement with planning), then CapoPlacer detailed placement (detailed placement with respect to highways).

The initial planning level (IPL) is the level of global placement with planning at which the planning of highways starts. It is the first level of floor-placement at which the average size of bins becomes smaller than a certain threshold. This threshold, namely D_buf, is defined as the minimum effective distance between two successive buffers []. This level is selected as the initial planning level because local wires of bins do not require any buffer insertion if the size of bins falls below D_buf. Fig. 6 shows the overall picture of highway planning used at each level of global placement with planning.

Fig. 6. Level i of global placement with planning: bi-partition the design and create new bins; plan the highways; adjust the current placement to the plan; if global placement is not finished, go to level i+1.

As shown in this figure, each level of global placement with planning consists of three main phases:

CapoPlacer bi-partitioning: The fundamental operations of CapoPlacer floor-placement (e.g. bi-partitioning and creating new bins) are performed during this phase. These operations are described in [22].

Highway planning: This phase plays an important role in the HAMPA methodology. In this phase, critical and near-critical nets are detected and assigned to the highways, and then the features of the resources are updated. These features consist of the location and the amount of resources of the highways, which are estimated based on the distribution of critical nets, their degree of criticality, and the congestion map of the design.

Placement refinement: Finally, the placement is updated to reserve the planned resources for the highways.

The following subsections discuss the three phases of global placement with planning in detail.

3.1. CapoPlacer bi-partitioning

In this stage, each bin of the design is recursively partitioned (vertically or horizontally) into two bins. CapoPlacer uses either hMetis or MLPart for partitioning []. The main partitioning metric is cut size. However, other criteria such as congestion and terminal locations are taken into account, too.

3.2. Highway planning

In this phase, highways are planned, and the edge weights of the highway graph are updated.
In other words, planning the highways at a particular level of floor-placement refines the structure of the highways at that level. Our algorithm for highway planning at each level of the floor-placement hierarchy, called the HAMPA highway planning algorithm (HAMPA-HPA), consists of several steps, each of which is discussed below. Fig. 7 shows the main steps of HAMPA-HPA.

Fig. 7. HAMPA highway planning algorithm (HAMPA-HPA).
  Step 1: Construct planning structures in the current level
  Step 2: Update the current highway weights considering the preceding ones
  Step 3: Carry out a static timing analysis (STA) considering the previously planned highways
  Step 4: Update highway planning features

Step 1: This step constructs the required planning structures, consisting of the bins and the highway graph. Some decisions at each level are made based on the information at the preceding levels. As a result, data structures should be updated and stored separately at each level of the hierarchy.

Step 2: After CapoPlacer partitions the bins once again to create smaller ones, the number of bins increases. Thus the highway graph needs updating. Our algorithm for updating the highways, called the HAMPA highway refinement algorithm (HAMPA-HRA), is shown in Fig. 8.

Fig. 8. HAMPA-HRA for applying the highway edge weights of level i-1 to the highway edges of level i.
  Apply previous highway edges to current highway edges
  1: FOR each edge NS_ij of a previously planned net, DO steps 2 to 8
  2:   IF (source (SRC) and sink (SNK) of NS_ij are neighbors after partitioning in level i)
  3:     Add the weights of connected nets between SRC and SNK at level i-1 calculated for NS_ij
  4:   ELSE
  5:     Assign the weight of the edge connecting the SRC and SNK nodes at level i-1 to the edge connecting SRC and SNK.sibling in the highway graph at level i.
  6:     Assign the weight of the edge connecting the SRC and SNK nodes at level i-1 to the edge connecting SNK.sibling and SNK in the highway graph at level i.
  7:   END
  8: END

Fig. 9. Static timing analysis graph after level i-1.

Fig. 10. Static timing analysis graph after level i.

Fig. 11. HAMPA highway detailed planning algorithm (HAMPA-HDPA).
  1: FOR each net n_i DO
  2:   FOR each net segment ns_ij of the net n_i DO
  3:     IF ((ns_ij is near-critical) AND (ns_ij has not been planned yet)) THEN
  4:       FIND the shortest path on the highway graph
  5:       Compute the required resources for ns_ij
  6:       IF (delay of ns_ij can be improved by highway resources) THEN
  7:         Add the resources to the highway; update the highway weights
  8:       END
  9:       Mark ns_ij as a planned net segment.
  10:     END
  11:   END
  12: END

HAMPA-HRA considers the prior weights of highway edges to maintain the consistency of HAMPA and to distribute the resources in a fairer and more balanced manner. This algorithm runs for each net segment of the design and maps its weight onto the new highway graph (the highway graph of the new floor-placement level). Let NS_i^(b1,b2) be a planned net segment connecting two adjacent bins b1 and b2 at level i. After a new level of bi-partitioning is done, HAMPA-HRA checks whether the source SRC and the sink SNK of NS_i^(b1,b2) are still neighbors. If they are, the previous weight that had been calculated for NS_i^(b1,b2) is assigned to the edge corresponding to the bin pair (SRC, SNK) on the highway graph at level i. Otherwise, at level i, the following steps must be followed: (a) the weight of NS_i^(b1,b2) is assigned to the edge that connects SRC and SNK.sibling on the highway graph at level i; (b) the weight of NS_i^(b1,b2) is also assigned to the edge connecting SNK.sibling and SNK on the highway graph at level i.
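The refinement rule of HAMPA-HRA (Fig. 8) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `refine_highways`, the tuple and dictionary layouts, and the premise that each segment's endpoints have already been mapped to their level-i bins are illustrative choices, not details given in the paper.

```python
from collections import defaultdict


def refine_highways(segments, neighbors, sibling):
    """Map planned segment weights from the previous level onto the level-i graph.

    segments:  iterable of (src, snk, weight); src/snk are level-i bin names
               (assumption: the endpoints were already mapped to the new bins).
    neighbors: set of frozenset({u, v}) pairs that are adjacent at level i.
    sibling:   dict mapping each level-i bin name to its sibling's name.
    Returns a dict frozenset({u, v}) -> accumulated weight.
    """
    new_weight = defaultdict(float)
    for src, snk, w in segments:
        if frozenset((src, snk)) in neighbors:
            # Still neighbors: carry the previous weight over unchanged.
            new_weight[frozenset((src, snk))] += w
        else:
            # No longer adjacent: detour through the sink's sibling bin.
            via = sibling[snk]
            new_weight[frozenset((src, via))] += w
            new_weight[frozenset((via, snk))] += w
    return dict(new_weight)
```

For the bins of Fig. 1, a segment from A1 to B2 (which are not neighbors) would be routed via B2.sibling = B1, so its weight is added to both (A1, B1) and (B1, B2).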
Step 3: This step identifies the critical and non-critical net segments using a hierarchical static timing analysis (STA). The key point of the revised STA is that planned nets have lower delay than other nets because they use fast highway routing resources. We modified the conventional STA so that the timing calculations account for the previously planned nets. In other words, in the proposed hierarchical STA, the delay of a net at the ith level depends not only on its length at that level but also on the highway resources planned for the net at all previous levels. Figs. 9 and 10 show the STA graphs of a design at levels i-1 and i. In these figures, nodes represent the pins of the design and edges represent connections (internal connections of gates as well as interconnects between gates). It is worth noting that the topology of a design remains unchanged during highway planning (except for the buffers inserted along some planned nets), but the edge weights of the STA graph may change after any planning step. As can be seen in Fig. 10, some nets are planned during planning level i (the changed edges are enclosed by a dashed curve). Therefore, the STA after level i should use smaller delays for the planned nets. The new delay values are calculated from the planned buffers or the planned width of each net [], and the STA graph data structure is updated accordingly.

Step 4: This step updates the highway weights at each level of floor-placement based on the criticality of the net segments and the congestion map of the design. The proposed algorithm for this step, called the HAMPA highway detailed planning algorithm (HAMPA-HDPA), is shown in Fig. 11. The algorithm is run at each level of floor-placement to update the highways at that level.
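As a rough illustration of the path search at the core of HAMPA-HDPA, the following Python sketch runs Dijkstra's algorithm on a small highway graph. The edge cost used here, length scaled by (1 + congestion) with edges that have no free tracks skipped, is only an assumed stand-in for the paper's cost measure, and the names `edge_cost`, `shortest_highway_path` and the adjacency-list format are hypothetical.

```python
import heapq

INF = float("inf")


def edge_cost(length, free_tracks, congestion):
    """Assumed cost model: unusable without free tracks, else congestion-scaled length."""
    if free_tracks <= 0:
        return INF
    return length * (1.0 + congestion)


def shortest_highway_path(graph, src, snk):
    """Dijkstra search on a highway graph.

    graph: {bin: [(neighbor_bin, length, free_tracks, congestion), ...]}
    Returns (path, cost), or (None, INF) if the sink is unreachable.
    """
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == snk:
            break
        for v, length, free_tracks, congestion in graph.get(u, []):
            cost = edge_cost(length, free_tracks, congestion)
            if cost == INF:
                continue
            nd = d + cost
            if nd < dist.get(v, INF):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if snk not in dist:
        return None, INF
    path = [snk]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[snk]


# Hypothetical four-bin highway graph; the b-d edge is short but congested.
example_graph = {
    "a": [("b", 1.0, 2, 0.0), ("c", 1.5, 1, 0.0)],
    "b": [("d", 1.0, 2, 2.0)],
    "c": [("d", 1.0, 1, 0.0)],
}
example_path, example_cost = shortest_highway_path(example_graph, "a", "d")
```

A net segment would then be assigned to the edges of the returned path, after which the buffer and track requirements for those edges are estimated.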
For each critical or near-critical net segment that has not been planned at the previous levels, the shortest path between its source and its sink is found using Dijkstra's algorithm []. Note that the shortest path is measured considering the lengths of the highway edges, the amount of resources available on each edge, and the congestion level of the bins. After the shortest path is found, the net segment is assigned to the highway so that it is routed along that path. In this way, the algorithm tries to distribute the resources so that the critical and near-critical nets are routed along the shortest wires as far as possible, which improves their performance. Fig. 12 shows an example highway graph of a design with a critical net segment $ns_{ij}$ on it. The net segment $ns_{ij}$ connects two bins i and j (its source and sink) through the corresponding bins on the shortest path of the highway graph. In this figure, the bins, the highway graph and the net segment under investigation are shown by dotted rectangles, gray solid lines and a bold path, respectively. As seen in Fig. 12, $ns_{ij}$ can be routed through various paths on the highway graph; HAMPA-HDPA assigns the shortest one to $ns_{ij}$. Then, the algorithm checks whether the delay of the net segment can be improved by buffer insertion and/or wire sizing along the shortest path. If so, the required buffers and additional tracks, which are estimated using the method described in [], are planned for the highway edges assigned to $ns_{ij}$. Note that no specific buffer or track is firmly assigned to the net; rather, some buffer and track resources are reserved on the highway edges that are assigned to a critical or near-critical net segment (such as $ns_{ij}$). In other words, only a global estimation of the required resources for each highway edge is made.

Fig. 12. The shortest path on the highway graph for a net segment.

The feasible region (FR) of each buffer on a net segment is defined as the smallest region in which inserting the buffer results in meeting the timing constraint of that net segment []. The FR of a buffer may overlap with more than one bin. The most suitable bin for each buffer is chosen based on the congestion level of all the overlapping bins. After the congestion of each bin is estimated using the models presented in [2], each buffer is planned to be inserted inside the least congested bin. A sample of the congestion-driven buffer assignment is shown in Fig. 13, in which the shortest path of a critical net segment on the highway graph is depicted by bold lines and the feasible regions of the critical net segment ($FR_1$ and $FR_2$) are represented by rectangles (a feasible region may be a trapezoid or a general polygon; rectangles are used in this figure for simplicity). Fig. 13 emphasizes that the congestion levels of the bins that overlap $FR_1$ (among them bins 2 and 6) should be carefully considered to determine the location of the buffer(s) in $FR_1$. Similarly, the congestion levels of the bins that overlap $FR_2$ (among them bin 8) should be examined to locate the buffer(s) in $FR_2$. At the end of this step, some of the resources needed by the highways have been estimated according to the visible critical and near-critical net segments. This information should be stored in the highway database to be used in the following iterations.

Fig. 13. Overlapped areas within feasible regions of a net segment.

Placement refinement

After the highways are planned at a given level, the placement of that level should be refined to meet the availability constraints of the resources (i.e., buffers and auxiliary routing tracks) required to implement the HoCs. The availability of area, which is used for either routing tracks or buffers, may bound the planning process. In fact, the resources most needed for planning are routing area and routing tracks, both of which can be controlled by white space distribution. We propose a white space distribution algorithm based on the highway database in floor-placement. Before describing this algorithm, we define white space excess as follows:

Definition. The white space excess of bin j at floor-placement level i ($WSE_{ij}$) is the auxiliary white space of bin j at level i and is calculated by

$$WSE_{ij} = TWS_{ij} - TWR_{ij} \qquad (2)$$

where $TWS_{ij}$ is the total white space of bin j at level i, and $TWR_{ij}$ is the total white space required for inserting buffers, sizing wires and reserving tracks to relieve congestion and preserve routability at floor-placement level i.

Our previous experience in timing optimization using buffer insertion shows that a uniform distribution of white space may not be a good solution to resource management, since the white space requirements of the various regions of a chip are not necessarily uniform [,]. However, a uniform distribution of white space excess can be acceptable. The main difficulty with WSE is that it should be estimated from the buffer requirements of the various regions of the chip, but these requirements are not known before the end of placement. In HAMPA, we can estimate the buffer resource requirement of each net from the weights of the highway edges. Therefore, the WSE of each net can be estimated gradually. Fig. 14 shows the proposed algorithm for the distribution of WSEs, named the HAMPA white space distribution algorithm (HAMPA-WSDA).
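To make the definition of white space excess and the recursive search for a donor bin concrete, here is a simplified Python sketch. It is an illustration under stated assumptions rather than the HAMPA-WSDA implementation: the `Bin` class, `make_siblings` and `distribute_wse` are hypothetical names, a donor is accepted only if its WSE covers the whole request, and the space obtained at a parent level is handed entirely to the requesting child instead of being split between both children.

```python
class Bin:
    """A floor-placement bin with white space bookkeeping (hypothetical model)."""

    def __init__(self, name, tws, twr, parent=None):
        self.name = name
        self.tws = tws        # total white space (TWS)
        self.twr = twr        # total white space required (TWR)
        self.parent = parent  # bin at the previous level, None at level 0
        self.sibling = None

    @property
    def wse(self):
        # White space excess: WSE = TWS - TWR
        return self.tws - self.twr


def make_siblings(a, b):
    a.sibling, b.sibling = b, a


def distribute_wse(bin_, need):
    """Try to obtain `need` units of white space for bin_.

    Returns True on success and False when the top level is reached
    without a donor, i.e. the die area is too small in this model.
    """
    donor = bin_.sibling
    if donor is not None and donor.wse >= need:
        # The sibling can spare the space: slide the shared boundary.
        donor.tws -= need
        bin_.tws += need
        return True
    if bin_.parent is None:
        return False
    # Backtrack: ask the previous level to find the space for the parent,
    # then pass it down to the requesting bin.
    if distribute_wse(bin_.parent, need):
        bin_.tws += need
        return True
    return False


# Hypothetical two-level example: A1 lacks 3 units, only B can donate.
A = Bin("A", tws=6, twr=9)
B = Bin("B", tws=10, twr=4)
make_siblings(A, B)
A1 = Bin("A1", tws=2, twr=5, parent=A)
A2 = Bin("A2", tws=4, twr=4, parent=A)
make_siblings(A1, A2)
ok = distribute_wse(A1, -A1.wse)
```

A return value of `False` corresponds to the case in which the floor-placement has to be repeated with a larger die area.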
In the proposed algorithm, for each bin with zero or negative WSE, the required white space is provided by transferring white space from other bins that have positive WSE, which is done by sliding the bin boundaries. The white space redistribution algorithm of Fig. 14 refines the white space distribution in the layout according to the white space requirements of the highways at each level of floorplanning. Fig. 15 shows a graphical flow of WSE transfer from the parent of $A_1$ to $A_1$ via HAMPA-WSDA. As shown in this figure, if there is not enough white space for a given bin (e.g. $A_1$), the algorithm tries to obtain the required WSE from its sibling (i.e. $A_2$). If neither $A_1$ nor $A_2$ has positive WSE, the algorithm backtracks to the previous level of global placement in order to find space from their parent's sibling (i.e. B). If B is able to give the required white space excess to its sibling (i.e. A), the space is transferred to A and divided into two parts, which are granted to $A_1$ and $A_2$ according to their WSE requirements. If B does not have enough positive WSE, HAMPA-WSDA returns to the previous level recursively. The worst case occurs when the algorithm returns to level 0 and finds no bin with enough WSE. In this case, the required WSE cannot be provided as a whole, which indicates that the estimated die area is not sufficient to meet the planning constraints. Thus, the floor-placement process should be repeated with a larger die area.

WSE distribution algorithm at level i
  Function DistributeWSE(B_ij) // Bin j at level i
    IF (B_ij.sibling has positive WSE) THEN
      Transfer the WSE from B_ij.sibling to B_ij.
      Finish.
    ELSE
      CurrentBin = A.
      Return to the previous level of floor-placement.
      IF (B_ij.parent.sibling has enough WSE) THEN
        Transfer the WSE from B_ij.parent.sibling to B_ij.parent.
        Finish.
      ELSE
        CurrentBin = B_ij.parent;
        DistributeWSE(CurrentBin).
      END
    END
  END
Fig. 14. WSE redistribution algorithm (HAMPA-WSDA).

Fig. 15. WSE distribution example (panels: back track, WSE transfer, WSE distribution).

4. Experimental results and discussion

We implemented our algorithms in C++ on an Intel Dual Core.8 GHz with 2 GB of memory. Ten circuits were selected randomly from the IWLS-200 benchmark suite [24]. These circuits range in size from about 000 to 2,000 standard cells. The benchmarks were synthesized in 0 nm technology with eight layers of metal. All of the benchmarks were floorplanned and placed by CapoPlacer [6] with 0% row utilization. Table 1 shows the characteristics of these circuits. In order to evaluate the HAMPA methodology developed in this paper, the results produced by HAMPA were compared with those achieved by CapoPlacer after detailed placement in terms of performance, predictability, timing yield and runtime. In addition, we compared some of the key results of HAMPA with those of another placement algorithm, DragonPlacer, which is a widely used and well-known placer for large and congested circuits [20]. DragonPlacer iteratively partitions the design into four new sub-partitions (quad-partitioning) and then shuffles the new partitions through a simulated annealing process to find the best location for each partition [20]. In this section, the HAMPA, CapoPlacer and DragonPlacer flows are referred to as HAMPA, CAPO and Dragon, respectively.

Table 1. The characteristics of the representative benchmarks. Columns: Index, Circuit, Cells. Circuits: pci_bridge, dma, b20_, b22, b22_, wb_conmax, b, b_, ethernet, b.

4.1. Performance improvement

Table 2 shows the performance improvement of the HAMPA design flow compared with CAPO. It should be noted that comparing the final results of HAMPA with those of the original CAPO may not be fair because HAMPA uses buffer insertion during highway planning but CAPO does not. Therefore, the HAMPA results should be compared with the CAPO results after the latter have been improved by buffer insertion, so that both design flows can utilize buffer resources. We used Magma BlastFusion [2] to improve the results of CapoPlacer by inserting buffers. In Table 2, the minimum clock period of the design using the original CAPO (column CAPO) is compared with that obtained when CAPO is followed by Magma buffer insertion (column BCAPO). In this table, column HAMPA gives the minimum clock period of the design produced by HAMPA, and the fifth and sixth columns show the performance improvement of HAMPA compared with CAPO without buffer insertion and with Magma buffer insertion, respectively. As seen in Table 2, the clock period of the design using HAMPA is improved by 6.8% and .66% on average compared with the CAPO and the BCAPO results, respectively. It is important to note that the improvement achieved by HAMPA increases significantly for larger circuits because the number of hierarchy levels is much higher in larger circuits, so HAMPA has more opportunity to improve the performance of the design. Besides, the ratio of bin area to design area is smaller in the larger circuits, so the net delay estimations made and used by HAMPA are more accurate. As stated before, the delays of the critical nets are improved by assigning such nets to the highways at each level of the floor-placement process.
Therefore, the performance of the design is improved at each level of the hierarchy, which results in an overall improvement in the final performance. However, Table 2 shows only the final performance improvement, not the improvement at the intermediate levels. Table 3 shows the performance improvement of the HAMPA design flow compared with Dragon. In this table, the minimum clock period of the design after the original Dragon (column Dragon) is compared with that obtained when Dragon is followed by Magma buffer insertion (column BDragon). As seen in Table 3, the clock period of the design using HAMPA is improved by 8.% and 6.42% on average compared with the Dragon and the BDragon results, respectively.

Table 2. Experimental results of HAMPA and CAPO in terms of design performance. Columns: Benchmark; Minimum clock period (ps) for CAPO, BCAPO and HAMPA; Performance improvement (%) for HAMPA vs. CAPO and HAMPA vs. BCAPO. Rows: the ten benchmarks of Table 1. Average improvement: 6.8% (vs. CAPO), .66% (vs. BCAPO).

Table 3. Experimental results of HAMPA and Dragon in terms of design performance. Columns: Benchmark; Minimum clock period (ps) for Dragon, BDragon and HAMPA; Performance improvement (%) for HAMPA vs. Dragon and HAMPA vs. BDragon. Rows: the ten benchmarks of Table 1. Average improvement: 8.% (vs. Dragon), 6.42% (vs. BDragon).

In this subsection, we describe an experiment to analyze the impact of the partial improvements produced at the inner levels of HAMPA on the final design performance. As stated in the preceding sections, each level of HAMPA is divided into two parts: Capo partitioning and highway planning. Considering these two parts, three delay measures are defined below to show the partial improvements at a given level i of HAMPA.

Initial clock period at level i (ICP(i)): the clock period of the design at level i before Capo starts partitioning.

Capo clock period at level i (CCP(i)): the clock period of the design at level i immediately after partitioning by Capo.

Clock period at level i (HCP(i)): the clock period of the design after planning the highways.

In fact, ICP(i) is the clock period of the design at the beginning of level i. After partitioning, the delay of the critical path changes. Normally, the delay increases because the cells that are located at the center of each bin are spread over two newly created bins. Therefore, CCP(i) is expected to be greater than ICP(i) for any floor-placement level i. Finally, highway planning is applied and the delays of the critical nets are improved. Fig. 16 shows an example of the delay fluctuations that may occur at any given level i of global placement. Note that the delay improvement at the later levels of global placement is not the same as at the earlier levels. Table 4 shows the partial clock period improvements for one of the attempted benchmarks (benchmark no. ) at the inner levels of the hierarchy. The corresponding curve is shown in Fig. 17. In fact, the positive effect of planning is more significant at the early levels of floor-placement than at the later levels. The reason is that the number of critical nets not yet assigned to the highways is larger at the early levels of floor-placement.

Fig. 16. ICP, CCP and HCP relationship at the ith level of global placement (horizontal axis: global placement level; vertical axis: design clock period).

Timing yield improvement

The progressive improvement of delay using HAMPA not only increases the design performance but also reduces timing variation and consequently increases the timing yield. The reason is that with HAMPA the delay improvement is not limited to the most critical paths but extends to near-critical paths too. As a result, the average delay of the near-critical paths is reduced, which leads to more robustness against timing variation and consequently improves the timing yield of the design. We performed an experiment to evaluate the HAMPA results in terms of timing yield. Timing yield (Y) can be defined as the cumulative distribution function (CDF) of the probability distribution function (PDF) of the critical path delay ($f_{critical}$) [26] and is calculated by

$$Y = \int_{-\infty}^{T_{target}} f_{critical}(t)\,dt \qquad (3)$$

where $T_{target}$ is the target delay needed for the design (timing wall). Fig. 18 shows the relationship between the timing yield and the distribution of the critical path delay in the presence of delay variation. If the CDF exceeds 00%, the timing constraint will be violated. Therefore, the accumulation of the PDF values before the critical path delay exceeds $T_{target}$ gives the yield of the design. Shifting the PDF toward lower delays increases the value of the CDF below the timing violation point, and consequently the timing yield increases. Therefore, reducing the average delay of near-critical nets by using HAMPA improves the timing yield of the design.
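The dependence of timing yield on the delay distribution can be checked numerically with a short Python sketch. It assumes, purely for illustration, that the critical path delay is normally distributed with mean `mean` and standard deviation `sigma`; the paper does not state which distribution is used, and the function name `timing_yield` and all numeric values below are hypothetical.

```python
import math


def timing_yield(t_target, mean, sigma):
    """Yield as the normal CDF evaluated at the timing wall t_target.

    Assumes delay ~ Normal(mean, sigma**2), which is an illustrative choice.
    """
    z = (t_target - mean) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


# Hypothetical numbers in picoseconds: same timing wall and spread,
# but a lower mean delay after planning.
yield_before = timing_yield(t_target=1000.0, mean=1000.0, sigma=50.0)
yield_after = timing_yield(t_target=1000.0, mean=950.0, sigma=50.0)
```

With the spread held fixed, lowering the mean delay moves more of the distribution below the timing wall, so `yield_after` is larger than `yield_before`.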
