Layout Conscious Approach and Bus Architecture Synthesis for Hardware-Software Co-Design of Systems on Chip Optimized for Speed


Nattawut Thepayasuwan, Member, IEEE and Alex Doboli, Member, IEEE

Abstract: This paper presents a layout conscious approach for hardware-software co-design of systems on chip optimized for latency, including an original algorithm for bus architecture synthesis. Compared to similar work, the method addresses layout related issues that affect system optimization, such as the dependency of task communication speed on interconnect parasitics. The co-design flow executes three consecutive steps. (i) Combined partitioning and scheduling: besides partitioning and scheduling, this step also identifies the minimum speed constraints for each data link. (ii) IP core placement, bus architecture synthesis and routing: IP cores are placed using a hierarchical cluster growth algorithm. Bus architecture synthesis identifies a set of possible building blocks and then assembles them together for minimizing bus length and complexity. Poor solutions are pruned using a special table structure and a select-eliminate method. (iii) Re-scheduling for the best bus architecture. The paper offers extensive experiments for the proposed co-design method, including bus architecture synthesis for a network processor and a JPEG SoC.

Index Terms: Bus Architecture Synthesis, Hardware/Software Co-design, Systems-on-Chip.

I. INTRODUCTION

Many embedded systems must meet stringent cost, timing, and energy consumption constraints [0] [8] []. In addition, embedded architectures are very thrifty in employing hardware resources: they include general purpose processors running at low/medium frequencies (like ARM, 80C88EB, Philips 80C etc), have a reduced amount of memory (the memory capacity can be as low as 8k of RAM and 6k of flash memory), and incorporate customized co-processors and I/O peripherals (including RF and analog circuits).
Typical examples include embedded systems for telecommunication and multimedia, like cell phones, digital cameras, and personal communicators. Systems-on-Chip (SoC) are single-chip implementations of embedded systems. Compared to printed circuit board designs, SoC offer higher performance and reliability at cheaper costs [0]. It is foreseen that advances in device manufacturing technology, including present deep submicron technologies and future nanotechnologies, will continuously reduce the minimum feature size, and thus increase the functional complexity of SoCs []. For SoC realized in deep submicron technologies (DSM), physical level attributes, such as interconnect parasitics, substrate coupling, and substrate noise, significantly influence system performance, e.g. data communication speed, system latency, power consumption, and signal integrity [6] [] [0].

(The authors are with the Department of Electrical and Computer Engineering, State University of New York at Stony Brook, NY.)

[Figure: Impact of layout on data communication speed and system design. (a) Task graph. (b) Floorplan of Power PC IP cores with the routed buses, and a table of bus length (mm) versus speed (MHz).]

Figure illustrates the impact of layout parasitics on data communication speed and system design. Figure (a) presents a task graph with five tasks. Each task is labeled by its execution time on a Power PC processor core. Without considering layout information, the co-design step decides to allocate a single 66MHz system bus for all core communications. This would meet the timing constraints, while keeping the system architecture simple. However, considering the physical distances between cores, shown in Figure (b), it is difficult to implement a bus with the requested speed. The same latency can be obtained with three buses of lower speed, like those in Figure (b), because the system concurrency improves.
The bus speeds of MHz, MHz, and MHz were found based on the physical locations of cores, and the RLC parasitics of the routed buses [0]. This example argues that the communication sub-system of an SoC needs to be designed while contemplating layout-related criteria. In general, it is difficult to postulate a unique bus architecture as being optimal for various applications and performance requirements. Instead, bus architectures need to be customized depending on the application specifics and design needs. New synthesis algorithms are needed, such as for bus architecture design, as well as novel modeling methods, like predicting interconnect length at the system level [8] [7]. System design, including task and communication partitioning and scheduling, must be integrated with relevant knowledge about core placement, bus topology design, and bus routing to guarantee that allocated data communication speeds are realistic. Figure (a) depicts a partitioning solution in which tasks and are mapped to the same core, tasks and are on another core, and is bound to a third core. To minimize system latency, the speed of data communications and has to be higher than that of

communications and (assuming that the same amount of data is sent between cores). This speed requirement enforces that the core running will have to be placed close to the other two cores, whereas the core executing tasks and might be placed further away from the core for tasks and. This set of communication speed constraints is feasible, and Figure (b) presents a possible floorplan. However, it is infeasible to impose the additional requirement that the speed of is much higher than the speed of because the corresponding floorplan is hard to build. In conclusion, speed constraints for the communication sub-system need to be tackled while contemplating layout-related aspects, e.g., possible core floorplans and achievable bus speeds. This task is obviously challenging and requires new design approaches, in which the top-down co-design process is aware of certain low level aspects, like core placement and bus routing.

This paper describes a hardware-software co-design method for developing SoC implementations subject to latency minimization. The novelty is in proposing a systematic, layout-conscious approach for tackling the SoC communication subsystem, including an original bus architecture synthesis algorithm. System-level design attempts to minimize latency and maximize the feasibility of constraints imposed to the bus architecture. Applications are task graphs [0] with data dependencies and a reduced number of control dependencies. The set of available hardware resources and the SoC area are known. The co-design method includes three subsequent parts: (1) combined partitioning and static non-preemptive scheduling, (2) bus architecture synthesis, and (3) re-scheduling for the best bus architecture. The first step is an exploration process based on a simulated annealing algorithm (SA) []. The cost function expresses the minimization of system latency and maximization of the feasibility of bus architecture constraints, like required speed, number of links and amount of resulting connectivity between cores.
We propose Performance Models (PM), a graph-based description that symbolically captures the relationships between performance, graph characteristics, and design decisions. PM are general, flexible, and can be easily extended to new design activities without requiring cumbersome validation. The second step synthesizes and routes the bus architecture for an SoC. IP cores are placed using a hierarchical cluster growth algorithm. Using the proposed PBS bitwise generation algorithm, bus architecture synthesis first identifies a set of possible building blocks, and then assembles them together, such that bus length, bus topology, communication conflicts, and unnecessary core connectivity are minimized. We propose a special table structure (named bus architecture synthesis table) and a select-eliminate method to prune poor solutions, such as buses with complex and redundant connectivity. The algorithm was successfully used to automatically synthesize bus architectures for realistic SoC, including a network processor and a JPEG SoC.

The paper is organized as follows. Section presents related work. Section discusses system modeling. Section introduces the proposed co-design approach. Bus architecture synthesis is presented next. Experimental results are given in Section 6. Section 7 enumerates plans for future work. Finally, conclusions are offered.

II. RELATED WORK

Over the last ten years or so, a variety of hardware/software co-design methodologies were proposed for designing embedded systems optimized for cost, speed, and power consumption [] [9] []. A typical co-design flow includes the following activities: selection of architectures and architectural resources (processors, memories, buses, I/O modules), functionality partitioning, task mapping to resources and scheduling, and communication synthesis.
Depending on the targeted applications, co-design approaches can be classified into three groups: for data dominated systems [] [9] [] [8], for control intensive systems [], and for applications with substantial data processing and a reduced amount of control [0] [7]. Balarin et al. [] present POLIS, an approach for control dominated real-time embedded applications. For data dominated systems, Prakash and Parker [] and Bender [7] formulate the co-design problem as a mixed-integer linear programming (MILP) problem. A linear equation solver finds the optimal implementation. The disadvantage of MILP-based co-design is its limitation to small size applications. The alternative is to employ heuristic algorithms, such as greedy priority-driven clustering [9], list scheduling methods [] [] [0] [8], iterative improvement heuristics, like simulated annealing and tabu search [0], and genetic algorithms [8] [] [7]. Heuristic methods can be used for large task graphs [0]. The disadvantage is that the solution optimality is difficult to characterize. For example, greedy priority-driven algorithms offer good average results, but they might give poor solutions for situations not captured by the priority function [].

Henkel [] suggests a hardware-software partitioning method for low-power systems. After scheduling, instruction clusters with a high utilization rate (thus, with less wasted energy) are moved to hardware. Dave, Lakshminarayana and Jha [9], and Dick and Jha [] propose co-synthesis methods for the design of heterogeneous systems under a large variety of optimization goals including cost, latency, and average, quiescent and peak power consumption. The methods perform task allocation, scheduling and performance estimation while contemplating inter-processor concurrency, preemptive and non-preemptive scheduling, and memory constraints. Givargis and Vahid [] describe the Platune environment for tuning parameterized uniprocessor SoC architectures to optimize timing and power consumption.
Parameters, like processor speed, cache organization, and certain periphery attributes, are decided using the Pareto optimality criterion.

Bus design is critical for SoC. Early work on bus and communication synthesis [] [] [] [6] focuses on multiprocessor embedded systems on a printed circuit board. Research addresses interface design [] [], communication packeting [0], mapping and scheduling [6]. This work does not tackle the hardware and layout details of the SoC communication sub-systems. Sgroi et al. [9] suggest communication centric system design motivated by the increasing importance of communication attributes. Communication is layered similar to the OSI Reference Model. Adapters increase the reusability of components by matching different protocols. Lahiri et al. [0] focus on communication protocol selection for

a communication architecture template including shared and dedicated buses. Recently, Drinic et al. [7] present a method for SoC bus network design to maximize overall processing throughput. The communication architecture includes shared buses connected through bridges. The design flow includes two steps: one produces the communication topology, and the other finds the core floorplanning. Hu et al. [6] introduce point-to-point communication synthesis to optimize energy consumption and area. Their work concentrates on bus width synthesis to meet timing constraints on the communication links, and floorplanning to minimize energy consumption and SoC area.

Existing approaches use limited layout knowledge to guide system design. In many approaches, bus topology is assumed given [] [6] [0]. This is reasonable for small SoC for which the designer manually designs the buses. However, it is not effective for SoC with a large number of cores. Compared to similar work, this paper proposes a new hardware-software co-design approach that integrates system design with bus architecture synthesis and routing. The suggested bus architecture synthesis method does not require knowing the bus topology, is more sensitive to layout parasitics, and prunes poor solutions early. The co-design algorithm performs combined task partitioning and scheduling using the well-known SA for exploration, but employing a new method for expressing system performance and requirements. The combined method offers shorter system latency, is more flexible towards new design requirements, and scales reasonably well with the application size.

[Figure: Hierarchical Data and Control Dependency Graph (HDCG). (a) HDCG with cluster nodes, communication cluster nodes, and conditions cond. (b) Operation nodes inside a cluster node. (c) Communication cluster node as a sequence of handshaking, optional handshaking, and data packet nodes.]

III. SYSTEM REPRESENTATION FOR CO-DESIGN

A. Embedded System Modeling

A quadruple describes an embedded system: its first element represents the system functionality, the second is the set of IP cores of the implementation, the third is the set of all possible floorplans for the IP cores in that set, and the fourth denotes the performance attributes of the implementation, like latency.

A. HDCG (Hierarchical Data and Control Dependency Graph)

Definition: A Hierarchical Data and Control Dependency Graph is a triplet formed by the set of cluster nodes, the set of communication cluster nodes, and the set of arcs. HDCGs are acyclic polar graphs having one start node and one end node.

Figure shows an HDCG example. Cluster nodes (CN) represent tasks, functions, loops, and if-then-else constructs in the system specification. At the fine grain level, each cluster node $CN_i$ is described as an acyclic polar graph formed by the set of operation nodes of cluster node $i$ and the set of arcs connecting these operation nodes. Figure (b) shows the fine grain structure of one cluster node. Operation nodes (ON) denote an atomic data processing, such as addition, multiplication, division, comparison etc. Operation nodes are mapped to small/medium size IP cores, like multipliers and arithmetic and logic units (ALU). Each arc of a cluster node is a pair $(ON_j, ON_k)$ of operation nodes of that cluster node. Arcs express data dependencies between ONs: node $ON_k$ can start only after node $ON_j$ was performed. During co-synthesis, ONs are used for exploring hardware resource sharing across tasks. Each CN and ON has a triplet $(T^{s}, T^{ex}, T^{end})$ representing symbolic variables for the node's start time, execution time, and end time. These variables are used to describe the performance models of the embedded system.

Communication cluster nodes (CCN) represent data communications between CNs mapped to different processing units. CCNs are shown as black bubbles in Figure (a). At the fine grain level, each CCN has a linear structure, as shown in Figure (c).
A CCN is an alternating sequence of nodes corresponding to transmissions of data packets of a fixed size, and nodes for synchronization. The number of data packet nodes depends on the data quantity specific to a CCN, as well as the fixed size of the data packet. Synchronization nodes express the time overhead for synchronizing two cores through handshaking. The optional synchronization nodes allow packets from different communication links to be interleaved on the same bus. This facilitates the suspension of an ongoing communication in favor of a higher priority data transmission. If successive packets pertain to the same communication link, then the optional synchronization nodes have zero time length.

Arcs describe the data and control dependencies of an HDCG. An arc is a triplet formed by a source node, a target node, and a condition, where the source and target nodes are cluster nodes or communication cluster nodes, and the condition is a boolean variable or the constant true. For data dependencies, the condition is the constant true. Data dependencies impose that the target node starts its execution only after the source node was completed. Similar to conditional process graphs [0], control dependencies are arcs annotated with a boolean variable. For control dependencies, the target node is executed only if the boolean variable is true. In Figure (a), boolean variables are depicted in italics. One cluster node computes variable cond. If variable cond is true then the communication cluster node following that cluster node is performed. Another cluster node is executed for a false value, indicated as the negated cond in the figure.

Definition: System latency is the end time of the HDCG end node. For HDCG with conditional dependencies, system latency is the worst case latency for all possible values of the boolean variables.

Node execution is non-preemptive. Due to the acyclic nature of an HDCG, each CN, ON, and CCN is executed at most once for a traversal of the graph.
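The latency definition above can be illustrated with a small sketch. It assumes an HDCG restricted to data dependencies, with numeric execution times already assigned to every node; the function name `hdcg_latency`, the dictionary layout, and the example node names are hypothetical and not taken from the paper. For graphs with control dependencies, one would repeat a computation of this kind for each assignment of the boolean variables and keep the worst case.

```python
from collections import defaultdict, deque

def hdcg_latency(exec_time, arcs, end_node):
    """End time of `end_node` for an acyclic graph with data dependencies only.

    exec_time: {node: execution time}; arcs: iterable of (source, target).
    """
    successors = defaultdict(list)
    pending = {node: 0 for node in exec_time}   # predecessors not yet finished
    for src, dst in arcs:
        successors[src].append(dst)
        pending[dst] += 1

    start = {node: 0 for node in exec_time}
    end = {}
    ready = deque(node for node, count in pending.items() if count == 0)
    while ready:
        node = ready.popleft()
        end[node] = start[node] + exec_time[node]   # non-preemptive execution
        for nxt in successors[node]:
            # A node starts only after all of its predecessors have ended.
            start[nxt] = max(start[nxt], end[node])
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(end) != len(exec_time):
        raise ValueError("graph is not acyclic")
    return end[end_node]

# Hypothetical polar graph with three cluster nodes between start and end.
example_times = {"start": 0, "cn_a": 3, "cn_b": 2, "cn_c": 4, "end": 0}
example_arcs = [("start", "cn_a"), ("start", "cn_b"),
                ("cn_a", "cn_c"), ("cn_b", "cn_c"), ("cn_c", "end")]
example_latency = hdcg_latency(example_times, example_arcs, "end")
```

Because every node is visited once in topological order, the sketch also mirrors the remark that each node is executed at most once per traversal.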

For an HDCG with control variables, finding system latency requires the analysis of cases. This is still feasible for HDCG with a reduced number of control dependencies.

HDCGs offer a dual perspective on the system functionality: a task-level description (for partitioning and scheduling) and an operation-level representation (for exploring hardware sharing across tasks). HDCG are similar to control data flow graphs [] and conditional process graphs [0]. Even though system functionality could be expressed using operation nodes only, cluster nodes prevent the unnecessary growth of the design space, and hence, a very lengthy co-design process. If cluster nodes are executed on a general purpose processor as software then there is no need to explore hardware sharing at the operation level. Besides, for each CN, the execution in software can be accurately estimated using data profiling and performance models for CPU, cache, memory, and communication units []. The effect of various compiler optimizations can be tackled more effectively for CNs than for a system expressed using ONs.

[Figure: Core Graph and PBS examples. (a) A core graph. (b) A directed core graph.]

B. Resources

Definition: The resource set is the set of IP cores available for the SoC implementation. For each node $i$, where $i$ is a CN or an ON, a subset of the resource set contains the cores to which node $i$ can be mapped. A mapping function from CNs and ONs to the resource set defines the actual hardware resource on which a CN or ON is executed.

As presented in Section, the proposed co-design method assumes that the number and type of available hardware resources is known. This set includes GPP cores, FU cores, multiplier cores, and so on. Hence, the resource set and the per-node subsets are given. Through exploration, co-design identifies the mapping function that optimizes the design constraints.

The considered bus model assumes a single transaction phase protocol and no data buffering. The single transaction phase incorporates all activities related to the address and data phases.
Definition: A core graph (CG) is the graph (V,E), where $v_i \in V$ represents core $i$ in the architecture, and $e_{ij} \in E$ is the communication link between cores $i$ and $j$. The weight $w_{ij}$ is the Communication Load between core $i$ and core $j$. It expresses the amount of data exchanged between the two cores. The core size is described along with node $v_i$.

This concept is illustrated in Figure (a). The core graph representation of a system architecture is used for bus architecture synthesis. For simplicity of modeling, CG do not distinguish between unidirectional and bi-directional dataflow. Communication direction depends on whether an operation is a read or a write, and is not specified directly in a CG. However, the core graph can be modified in order to address the direction of data. Figure (b) presents the core graph for bi-directional communications. In case there is more than one communication channel between two cores, the communication load is split across the channels.

Industrial bus standards can be expressed using the CG formalism. The superposition principle can be applied, if a bus standard is used at the transaction level. This can be done by classifying links according to their bandwidth (low, medium, and high), and generating CGs for each category. This is consistent with practice, as most bus standards for SoC, e.g., AMBA [] and IBM CoreConnect [], have different buses to support data communication at different bandwidths.

[Figure: Floorplan Tree. (a) IP cores with the amounts of data they exchange. (b) Corresponding Floorplan Tree, organized in levels.]

C. Floorplan

Definition: Floorplan Trees (FT) are binary tree structures having the following two properties: (1) Leaf nodes correspond to IP cores. (2) Each internal node links the two nodes that exchange the maximum amount of data with each other.
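The two FT properties suggest a bottom-up construction, which the following minimal sketch illustrates. It is an assumption about how such a tree could be built, not the paper's hierarchical cluster growth algorithm: clusters are merged greedily by the largest exchanged data quantity, and the traffic between two clusters is taken as the sum of the loads between their leaf cores. The function name `build_floorplan_tree`, the `frozenset`-keyed load table, and the core names are hypothetical.

```python
from itertools import combinations

def build_floorplan_tree(cores, load):
    """Greedy bottom-up sketch of a Floorplan Tree.

    cores: list of core names (the leaves).
    load:  {frozenset({a, b}): amount of data exchanged by cores a and b}.
    Returns a nested tuple; each tuple is an internal node with two children.
    """
    # Each cluster is (subtree, set of leaf cores in that subtree).
    clusters = [(core, frozenset([core])) for core in cores]

    def traffic(x, y):
        # Assumption: cluster-to-cluster traffic is the sum over leaf pairs.
        return sum(load.get(frozenset((a, b)), 0) for a in x[1] for b in y[1])

    while len(clusters) > 1:
        # Pick the pair of clusters that exchange the most data.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: traffic(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical four-core example.
example_load = {
    frozenset(("c1", "c2")): 80,
    frozenset(("c3", "c4")): 60,
    frozenset(("c2", "c3")): 10,
}
example_tree = build_floorplan_tree(["c1", "c2", "c3", "c4"], example_load)
```

In this example the heavily communicating pairs are clustered first, and the root joins the two clusters last, so loosely connected cores meet only at a higher tree level.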
By definition, an internal node $IN_i$ exchanges with a leaf node $k$ outside its subtree a data quantity equal to the sum of all data communications between node $k$ and all IP cores in the subtree originating at node $IN_i$, that is, $\sum_{m \in IN_i} w_{mk}$, where $w_{mk}$ denotes the communication load between cores $m$ and $k$. The amount of data communicated between two internal nodes $IN_i$ and $IN_j$ is equal to the sum of all communications between node $IN_i$ and all leaf nodes of the subtree originating at node $IN_j$.

Figure (a) presents a set of six IP cores, and Figure (b) shows the corresponding FT representation. Arc labels express the amount of data exchanged between cores. Cores and, cores and, and cores and 6 are heavily communicating. Hence, internal nodes,, and represent their clustering. The quantity of data communicated between nodes and is 90 (0 for the communication between cores and, and 0 for the communication between cores and ). The bottom-up process continues by considering nodes,, and, until the root node is reached (node in the figure).

An FT models core floorplanning at the system-level. It helps to qualitatively approximate the bus delays in an SoC implementation. Subsection. explains that the speed of the link between two cores decreases as the level of their first common internal node increases. For example, it is likely that the link for cores and will be faster than that for cores and. The qualitative approximation is needed, because it is too cumbersome to integrate floorplanning and bus routing with the already complex co-design process. Instead, FTs abstract away the horizontal and vertical cutlines in the slicing trees [6] for floorplanning, and replace precise bus speed evaluation with finding a lower bound of the bus speed. This avoids co-design solutions in which links for loosely connected cores are required to operate at high speeds, because after floorplanning those cores will communicate through long buses. Obviously, the actual bus speed after detailed floorplanning and routing might be higher than the lower

bound predicted by FTs. However, this gap is not a problem, because FT were introduced to aid finding constraint satisfying designs.

[Figure: Performance Model for latency. (a) HDCG; the three tasks are mapped to the same processor. (b) PM for latency, built from max and addition nodes over the start, execution, and end times.]

D. PM (Performance Model)

Performance Models describe symbolically the semantics of performance attributes, such as latency, with respect to the invariant HDCG characteristics, like CN, CCN, ON, and dependencies, as well as the design decisions contemplated during co-design, such as partitioning and scheduling.

Definition: A Performance Model (PM) is a graph that contains the following elements: (1) The starting node 0 for setting the modeled performance attributes to their initial value. (2) The constant part consists of linked symbolic variables and operational nodes, like addition nodes, multiplication nodes, max nodes, and min nodes. (3) The variable part includes additional directed arcs between the operational nodes.

The numeric values of performance attributes result by evaluating the operational nodes for the operands described by symbolic variables and arcs. Figure shows an HDCG, and its corresponding PM for latency. The figure assumes that the three nodes are executed in this order on the shared processor. The constant part of the PM includes all nodes and solid edges in Figure (b). Max and addition nodes express constraints between start and end times of each cluster node. For example, the outputs of max nodes define the start time of the corresponding cluster nodes. The start time of a CN cannot be smaller than the maximum of the end times of all its predecessors. Addition nodes express that the end time $T^{end}_i$ of node $i$ is the sum of its start time $T^{s}_i$ and its execution time $T^{ex}_i$. The variable part presents the relationship between latency and the design decisions taken during co-design, like task partitioning and scheduling.
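A numeric evaluation of a latency PM can be sketched as follows, under the assumption that the constant part is given as data-dependency arcs and the variable part as extra scheduling arcs between nodes sharing a resource. The function name `evaluate_latency_pm`, the argument layout, and the task names are hypothetical. Each node's start time is the output of a max node over its inputs (with 0 standing in for the starting node), and its end time is the output of an addition node.

```python
def evaluate_latency_pm(exec_time, data_arcs, schedule_arcs=()):
    """Evaluate latency from max and addition nodes.

    exec_time:     {node: execution time T_ex}
    data_arcs:     constant part, (source, target) data dependencies
    schedule_arcs: variable part, (first, second) execution order on a
                   shared resource
    """
    predecessors = {node: [] for node in exec_time}
    for src, dst in list(data_arcs) + list(schedule_arcs):
        predecessors[dst].append(src)

    end_time = {}

    def end_of(node, active=frozenset()):
        if node in active:
            raise ValueError("arcs form a cycle")
        if node not in end_time:
            inputs = [end_of(p, active | {node}) for p in predecessors[node]]
            start = max(inputs, default=0)              # max node: T_s
            end_time[node] = start + exec_time[node]    # addition node: T_end
        return end_time[node]

    return max(end_of(node) for node in exec_time)

# Hypothetical example: three tasks, one data dependency t1 -> t3.
pm_times = {"t1": 2, "t2": 3, "t3": 4}
pm_data = [("t1", "t3")]
latency_unshared = evaluate_latency_pm(pm_times, pm_data)
# Same tasks on one shared processor, executed in the order t1, t2, t3.
latency_shared = evaluate_latency_pm(pm_times, pm_data,
                                     [("t1", "t2"), ("t2", "t3")])
```

Changing only the `schedule_arcs` argument models a different scheduling decision while the constant part stays untouched, which is the separation the definition describes.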
In Figure (b), the variable part includes dashed arcs between the addition nodes for the end times of ON, and the max nodes for the start times. Other ON scheduling orders are easily captured in the PM by changing the orientation of certain arcs.

PM is a general description, which can express different performance attributes and denote various co-design activities. PMs are very flexible, as they allow easy definition of new performance attributes, or description of additional relationships between performance attributes and co-design activities. For example, the attribute of communication speed flexibility, defined in Subsection., was added without affecting the already existing rules for latency. There is no validation effort for new attributes. Finally, rules can be identified to prune infeasible or dominated solution points. For example, the rules for CSF calculation avoid generating designs which are difficult to realize. This helps faster closure by improving the feasibility of system design.

Maestro et al. [] suggest Timing Graphs for symbolically expressing the system execution time. PMs differ from Timing Graphs by not being limited to timing attributes or coarse-grained descriptions. Timing Graphs are employed to avoid overlapped execution of tasks with similar operations. This is not the case for PMs, which are used for characterizing finer grained functionality too.

The remaining part of this section presents the rules for building the PMs used in the proposed co-design methodology.

[Fig. 6: Modeling of data and control dependencies. (a) Modeling of data dependencies. (b) Modeling of control dependencies.]

B. Modeling of Co-Design Activities

A. Modeling of Data and Control Dependencies

Figure 6(a) shows the general rule for expressing data dependencies in a PM. A node is executed only after all its predecessor nodes are performed.
A max node was introduced to express that the start time of the node is greater than or equal to the end times of all its predecessors. The addition node for the node symbolically relates the node's end time $T^{end}$ to its start time $T^{s}$ and its execution time $T^{ex}$. Similar constructs are introduced for all data dependencies. The rightmost addition node of the resulting PM denotes the system latency.

Figure 6(b) presents the general rule for representing control dependencies in a PM. According to the HDCG definition, if the condition is true then the nodes of one branch are performed, otherwise the nodes of the other branch are executed. Max and addition nodes are introduced using the same rules as for data dependencies. The conditional execution of the first nodes of the two branches was represented in the PM by annotating the input arcs to their max nodes with the corresponding condition value (true condition for the node on the true branch and false condition for the node on the false branch). The node which unifies the two branches has a min node instead of the max node. The following rule is applied for numerically evaluating PMs with conditional dependencies: for a certain condition value, the arcs labeled with that condition will propagate the numerical values that result from the PM evaluation. Arcs annotated with the opposite condition value will propagate the value infinity. For example in Figure 6(b), for a true condition, the input arc to the max node for the node on the true branch propagates the output

of the addition node for the preceding node. The input arc to the max node for the node on the false branch transmits infinity. The min node of node r eliminates the infinite values propagated through the non-selected branch.

[Fig. 7: Modeling of scheduling. (a) Scheduling with data dependencies. (b), (c), (d) Scheduling with control dependencies (three cases). In each panel the tasks are mapped to the same processor.]

B. Modeling of Cluster Node Partitioning and Operation Binding

From the modeling point of view, cluster node partitioning and operation binding find the definition of the mapping function that optimizes the design performance. Obviously, each node $i$ must be mapped to a resource from the subset of resources to which it can be mapped. The numerical values of the resource dependent attributes of a node become well defined after partitioning and binding. In our case, the execution time $T^{ex}_i$ of node $i$ changes for each new resource type, and its numerical value is updated in PMs.

C. Modeling of Scheduling

For a given HDCG and a node partitioning/binding to hardware resources, scheduling decides the node execution order on the shared resources. Static non-preemptive scheduling was used in our approach. Depending on the scheduling decisions, different execution sequences and timing attributes (such as start time and end time) result for the nodes.

In the presence of data dependencies only, a certain execution order is modeled by introducing in the PM a dashed arc from the addition node for the end time of the node to be executed first to the max node for the start time of the node to be executed second. For example, in Figure 7(a) two nodes share the same resource, and one is executed before the other. Accordingly, the PM is updated by introducing a dashed arc that forces the second node to start only after the first node ends.
This arc pertains to the variable part. Different scheduling decisions can be easily captured by changing the orientation of dashed arcs.

In the presence of control dependencies, scheduling is more difficult due to the uncertainty of control dependencies [0]. If scheduling is performed across control constructs, node schedules must satisfy the following three requirements [0]: (1) respecting the execution order defined by data and control dependencies, (2) maintaining the conditional node execution as defined by an HDCG (e.g., if a condition value is true then only the nodes from the true branch will be executed), and (3) executing at most once each CN and CCN in the HDCG. The first two requirements are already captured by the PM modeling of data and control dependencies. To illustrate the third constraint, let's assume a schedule for the graph in Figure 7(b), so that Node would be executed before Node if condition is true, and after Node if condition is false. Nodes,, and share the same resource. This situation could occur if the branch for true condition is long, and the branch for false condition is short as compared to the path that Node pertains to. This schedule is incorrect because for a false condition, Node is executed twice, both before and after Node.

[Fig. 8: PM modeling of communication speed flexibility. The PM chains max and addition nodes, with one symbolic variable D per FT level, to produce the CSF values.]

input: FT (Floorplan Tree)
output: PM for CSFs
for all levels j in FT, starting from the lowest level upwards do
  for all nodes p in FT placed on level j do
    identify all communications (m,n), such that node p is the first common parent in FT for cores m and n;
    create a max node and an addition node;
    link the output of the max node to the input of the addition node;
    create symbolic variable D_j as the second input to the addition node;
    label the output of the addition node as CSF(m,n);
    for all existing CSF(l,k), k<>n or l<>m do
      if FT level of link (l,k) < FT level of link (m,n) then
        insert an edge from CSF(l,k) to the input of max node for CSF(m,n);
Three cases are possible for satisfying the third correctness requirement:

1) Before the control structure: Node is executed before Node that calculates condition. In this case, the scheduling of Node does not depend on condition, and Node is executed only once. Figure 7(b) depicts this situation, and the dashed arc enforces that Node starts after Node terminates.

2) During the control structure: For a given condition value (e.g., true condition ), Node is scheduled to execute after Node but before Node. To maintain the scheduling correctness, for the opposite value of condition, it is required that Node executes only after Node ends. Figure 7(d) depicts this case. Two new dashed arcs are introduced, so that Node starts after Node if condition is true, and after Node if condition is false.

3) After the control structure: Node executes after Node. Its scheduling time depends on the value of condition. However, as the value of condition is already known by the time Node starts, it is trivial to guarantee single execution for Node. In Figure 7(c), this case is reflected by having the dashed arc between the min node of Node r and the max node of Node.

D. Modeling of Communication Speed Flexibility

The execution time of communication cluster nodes (CCN) cannot be accurately estimated at the system level. This is because the bus speed depends on the bus length and thus on the placement of IP cores, the bus architecture, and the bus routing. As shown in Figure 10, this information is not available during task partitioning and scheduling.

Definition: For each data link, the communication speed flexibility (CSF) indicates the amount of delay that can be tolerated on that link without violating the required system latency.

To address the unknown communication speed, the co-design methodology in Figure 10 first identifies feasible CSF requirements for each data link by using a system-level modeling of the bus architecture. CSF requirements are feasible if the bus speed can be achieved in the presence of RLC parasitics. Then, the found CSF values become constraints for the bus architecture synthesis step discussed in Section V.

Lemma: Consider two CG edges that represent the data communications between two pairs of cores. In the corresponding FT, consider the level of the first common parent of each pair. If the first common parent of the first pair is on a lower level than the first common parent of the second pair, then the speed of the bus for the first communication is higher than the speed of the bus for the second communication.

Proof: Considering the construction rules for the binary FT, it results that the cores of the first pair are placed closer to each other than the cores of the second pair. Thus, the bus speed will be higher for the first link than for the second link.

In the final SoC layout, it is very likely that cores that are close to each other will use faster buses than cores placed far apart. This observation is summarized by the above lemma. To find feasible delay constraints, a naive solution would assign random values to CSFs, and then check whether these values meet the constraints imposed by the FT. In reality, this solution does not work, as most of the CSF values will violate the constraints. Instead, PMs for CSF were built to explicitly incorporate all FT constraints. Figure 8 shows the corresponding algorithm. The algorithm traverses the floorplan tree bottom-up, and for each internal FT node it generates a pair of linked max and addition nodes. The output of the max node is input to the addition node.
The output of the addition node represents the CSF constraint for the communication link between two cores, such that the internal node is the first common parent of both cores in the FT. The CSF constraints for all cores connected through a lower level link are inputs to the newly created max node. This models all requirements expressed by the above lemma. Figure 8 shows an example for the PM expressing the constraints between CSF values. CSF values for nodes CSF(,), CSF(,), and CSF(,6) (which are all on the first level of the FT) are inputs to the PM. According to the floorplanning, the speed for communications (,) and (,) will be slower than the slowest of the communications (,) and (,). The max nodes and the addition nodes in the PM formulate these constraints. The values $D_j$ express the amount of time by which the two communications are slower. Similarly, communication (,6) will be slower than communications (,6) and (,). Finally, communication (,) will be slower than communications (,) and (,).

[Fig. 9. Communication speed flexibility: panels (a) and (b).]

For each CCN $i$, a max node and two addition nodes are introduced into the PM for latency. The max node describes the starting time of data communication. The first addition node has the max node output and variable $T_i^{min}$ as inputs. The output of the first addition node is input to the second addition node, which has variable $T_i^{csf}$ as second input. The first addition node models the minimum communication time, which depends on the amount of communicated data, as well as the maximum speed of a given fabrication technology. This value is a lower bound for the CCN execution time. The second addition node expresses the extra bus delay due to floorplanning constraints. Its output is the end time of communication. Variable $T_i^{csf}$ depends on the CSF value of the communication link used for CCN $i$ and the amount of data. Figure 9 shows an example. Two CNs are mapped to different cores, and a CCN is their data communication.
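The composition of the max node and the two addition nodes for one CCN can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ccn_end_time`, the argument names, and the numbers are hypothetical, and the max node is assumed to collect the end times of the nodes that the transfer has to wait for.

```python
def ccn_end_time(pred_end_times, t_min, t_csf):
    """End time of one communication cluster node (CCN).

    pred_end_times: end times of the nodes the transfer must wait for
                    (assumed inputs of the max node).
    t_min: minimum communication time (lower bound on the CCN execution time).
    t_csf: extra bus delay that depends on the CSF value of the link.
    """
    if t_min < 0 or t_csf < 0:
        raise ValueError("delays must be non-negative")
    start = max(pred_end_times)   # max node: start time of the transfer
    after_min = start + t_min     # first addition node
    return after_min + t_csf      # second addition node: end time


# Hypothetical numbers: two predecessors end at t = 12 and t = 9.
print(ccn_end_time([12.0, 9.0], t_min=3.0, t_csf=1.5))  # 16.5
```

Keeping the two delays as separate arguments mirrors the split between the technology-dependent lower bound and the floorplan-dependent extra delay, so the second term can be varied on its own during exploration.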
Figure 9(b) presents the PM for latency, including the two components of the communication time.

IV. CO-DESIGN METHODOLOGY

Figure 10 presents the proposed hardware-software co-design methodology. Inputs are the HDCG of an application, the maximum system latency, the overall silicon area of the SoC, and the set of available cores, including the number and types of general purpose processors, functional units, etc. The goal is to partition the HDCG nodes to cores, to decide the scheduling of nodes, to synthesize the bus architecture, and to map and schedule data communications on buses. The overall system latency must be minimized. As a byproduct of bus architecture synthesis, the core floorplanning is found, such that the total area constraint is met.

The co-design methodology includes three consecutive steps. The first step partitions cluster nodes to processor cores, binds operation nodes to functional unit cores, schedules cluster nodes, communication cluster nodes and operation nodes, and finds the speed requirements for communication cluster nodes. The second step decides the IP core floorplanning, synthesizes the bus architecture, routes the buses, and characterizes the speed achievable on each bus. Finally, the third step re-schedules cluster nodes, communication cluster nodes, and operation nodes while keeping the partitioning and the bus architecture unchanged.

The proposed co-design methodology is sub-optimal for the given co-design problem. The optimal solution requires simultaneous partitioning, scheduling, and bus architecture synthesis. The experimental section shows that this is difficult, because the three activities are computationally fairly complex. Hence, the proposed methodology sequences these activities while accommodating the circular reasoning [] inherent to the co-design problem: partitioning and scheduling are solved for a certain data communication speed, even though the bus architecture and speed are known only at the later design steps. For handling circular reasoning, Wolf suggests a methodology that serializes the co-design activities depending on their importance []. We adopted a similar strategy with the modification that critical information about later steps is incorporated into the earlier co-design activities. For partitioning and scheduling, we used floorplan trees to qualitatively predict the structure and lengths of buses, and thus their achievable speed.

[Fig. 10. Hardware-software co-design methodology. Inputs: HDCG, maximum system latency, set of cores, silicon area. Step 1: partitioning and scheduling (performance model generation and evaluation, update of Core Graph and Floorplan Tree). Step 2: bus architecture synthesis (IP core placement with the hierarchical cluster growth placement algorithm, generation of the set of primary bus structures with the bitwise PBS generating algorithm, bus architecture synthesis table, Select-eliminate method, bus routing, bus speed estimation through parasitic extraction for the routed buses). Step 3: re-scheduling for the speed of the best bus architecture.]

Step 1. Partitioning and scheduling: First, Performance Models (PM) are generated for an HDCG using the rules presented in Section III. Next, a simulated annealing (SA) [] based exploration loop conducts simultaneous partitioning and scheduling. For each CN (ON) $i$, attributes $R_i$ (the hardware resource that executes the node), $T_i^{ex}$ (the execution time on that resource), and $T_i^{s}$ (the start time) are the unknowns for co-design. Cluster node partitioning to processors and operation binding to FUs are modeled by the unknowns $R_i$ and $T_i^{ex}$.
The scheduling of cluster nodes, communication cluster nodes, and operation nodes is described by the unknowns $T_i^{s}$. Possible numerical values for the unknowns are searched during exploration. SA iteratively selects a new point from the neighborhood of the current solution. The neighborhood was defined as the set of points that differ from the current solution by (1) the execution order of one pair of nodes that share a hardware resource, or (2) the resource binding of one node. PMs, Floorplan Trees (FT), and Core Graphs (CG) are updated for each newly selected solution. For each co-design solution, the system latency and communication speed flexibility (CSF) are calculated by evaluating their PMs with all node attributes $T_i^{s}$ and $T_i^{ex}$ instantiated to their numerical values. Starting solutions were obtained by uniformly distributing nodes to resources, and then scheduling nodes using list scheduling with critical path as the priority function [].

Partitioning, binding, and scheduling steps were executed with different probabilities. The reason is that multiple valid schedules are possible for each partitioning and binding decision. A small probability was used to select a partitioning step that moves a cluster node from a processor core to another processor core or to hardware. A larger probability binds an operation node to another FU core. The reason for the binding probability being greater than the partitioning probability is that multiple hardware designs are possible for each partitioning of clusters to FU cores. Finally, the remaining probability decides a scheduling action. This emulates a hierarchical exploration process, in which several schedules are analyzed for each new partition or binding. For example, the probability values can be chosen so that, on average, eight schedules are examined for each partition. If the execution order of a node pair is modified, then the algorithm verifies that the new ordering does not create cycles in the PMs.
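The probabilistic choice between partitioning, binding, and scheduling moves can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the probability values 0.01 and 0.10, and the reading of "schedules per partition" as the ratio of scheduling moves to partitioning-or-binding moves are all assumptions made here for the example.

```python
import random


def pick_move(p_partition, p_binding, rng):
    """Pick the type of the next SA move.

    p_partition: probability of moving a cluster node to another core.
    p_binding:   probability of rebinding an operation node to another FU.
    The remaining probability, 1 - (p_partition + p_binding), selects a
    scheduling move (changing the order of two nodes on one resource).
    """
    if p_partition < 0 or p_binding < 0 or p_partition + p_binding > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    u = rng.random()  # uniform in [0, 1)
    if u < p_partition:
        return "partition"
    if u < p_partition + p_binding:
        return "bind"
    return "schedule"


def schedules_per_structural_move(p_partition, p_binding):
    """Expected number of scheduling moves per partitioning-or-binding move."""
    p_structural = p_partition + p_binding
    if p_structural <= 0:
        raise ValueError("at least one structural probability must be positive")
    return (1.0 - p_structural) / p_structural


# Illustrative values only (assumed, not taken from the paper).
P_PARTITION, P_BINDING = 0.01, 0.10
rng = random.Random(0)
counts = {"partition": 0, "bind": 0, "schedule": 0}
for _ in range(10_000):
    counts[pick_move(P_PARTITION, P_BINDING, rng)] += 1
print(counts)
print(round(schedules_per_structural_move(P_PARTITION, P_BINDING), 2))  # about 8.09
```

With these assumed values, roughly eight scheduling moves are tried for every structural change, which is the kind of hierarchical exploration the text describes.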
The cost function for SA combines the system latency, the product of the $T^{csf}$ values of all links, the estimated number of buses, and the amount of unnecessary core connections, with three weights balancing the terms. The cost function to be minimized models the system latency and the feasibility of the bus architecture constraints. To maximize feasibility, communication speed flexibility (CSF) requirements for each link need to be maximized. Large CSF values relax the constraints for bus architecture synthesis, as slower buses would be acceptable. Subsection III.D explains that CSF values are maximized if their corresponding $T^{csf}$ values are also maximized. To encourage equal distribution of the tolerable slack time to all links, the product of the $T^{csf}$ values was used in the cost function instead of their sum. Using the sum could result in some very relaxed values, but very tight values for others. Such bus architectures would still be difficult to implement. The last two terms in the cost function further express the quality of a bus architecture, as the number of buses and the amount of unnecessary core connections. The number of buses was estimated depending on the likelihood of different communication links to share the same bus. Links are likely to share a bus if they involve the same cores, have the same bus speed requirements, there is little overlap between communications, and there is little unnecessary core connectivity. A more detailed modeling of these attributes is used for the bus architecture synthesis discussed in Section V.

Step 2. Bus architecture synthesis: The Core Graph (CG) description is updated based on the information on task partitioning and scheduling. First, the floorplan for the SoC cores is found using the hierarchical cluster growth placement algorithm described in Subsection.. Core placement is needed to accurately estimate bus lengths, and to find the correct rates at which data can be communicated on buses. The introductory section explained that DSM effects are important for characterizing the speed possible on a link.
Core placement is communication driven, so that two heavily communicating cores are placed close to each other, the aspect ratio of their rectangular bounding box is close to one, and the total area of the box is minimized. Also from the CG, the set of possible primary bus structures (PBS) is created using the bitwise PBS generating algorithm (presented in Subsection.). PBS are the building blocks for creating bus architectures. Then, a bus architecture synthesis table is produced to characterize the satisfaction of connectivity requirements by individual PBS structures. The actual bus architecture synthesis algorithm (called the Select-eliminate method) uses SA. Using BA synthesis tables, the method builds bus architectures, which are PBS sets that meet all the connectivity requirements in a CG. Topological attributes are evaluated for each bus architecture, e.g., number of PBSs in an architecture, bus utilization, communication conflicts, and maximum data losses. The total bus length is estimated using the actual core placement. The best found bus architecture is characterized for speed in the presence of RLC parasitics.

[Fig. 11. Bitwise decoder and bitwise PBS generating algorithms. The bitwise decoder translates a base-10 index into a basis set through a look-up table.]

Bitwise PBS Generating Algorithm:
Let L = {2^m - 1, 2^m - 2, ..., 2, 1, 0}
for (i = 0; i < DIM(L); i++) {
    Output basis set = decoder(i);
    PBS = OR(Output basis set);
    result = COMPARATOR(PBS, PBS TABLE);
    if (result == 0) table update(PBS);
}

Step 3. Re-scheduling: Using SA and PM, the third step binds CCN to buses and re-schedules CN, CCN and ON for the best found bus architecture and the CN (ON) partitioning identified at Step 1. This step may use the fine grain structure of CCN nodes shown in Figure (c).

V. BUS ARCHITECTURE SYNTHESIS

A. Modeling for Bus Architecture Synthesis

Definition: A primary bus structure (PBS) is defined as a potential cluster of connected cores. A PBS is valid if all its node connectivity exists in the original CG. Otherwise, it is invalid.

PBS are the building blocks for bus architecture synthesis. Figures (c) and (d) show eight PBS for the CG in Figure (a). The PBS in Figure (c) are valid. PBS are characterized by the following physical and topological properties:

1) PBS utilization percentage: Utilization is defined as the communication spread in a structure. For example, consider a PBS that corresponds to two links in the CG. This PBS can also contain a third connection between two of its cores. There might, however, be no communication between these cores. Therefore, the PBS under-uses its structure. We consider the unused element as a redundant link of the PBS.
The PBS utilization percentage was defined as the ratio, expressed as a percentage, between the number of links in a PBS and the number of core pairs $n(n-1)/2$, where $n$ is the number of associated cores in a PBS. The maximum PBS utilization occurs when all associated cores communicate with each other, and the PBS corresponds to a clique in the CG.

2) Communication conflict: A PBS is implemented as a shared bus in the system architecture. The performance of a bus architecture can be evaluated by its contention. For a static time scheduling of tasks, it is important to evaluate whether there is a communication conflict in a PBS. The communication conflict of a PBS is the amount of time overlap between communications mapped to the same link.

3) PBS bus length: PBS bus length is a vital attribute for evaluating the bus speed in the presence of deep submicron effects. Longer buses require more silicon area and additional circuitry like bus drivers [6] [0]. Also, the larger cross coupling and parasitic capacitances of longer buses increase interconnect latency [0]. Longer buses also cause larger power dissipation for interconnect and drivers. It is, however, difficult to accurately estimate the PBS bus length without contemplating the SoC layout. As explained in Subsection., hierarchical cluster growth placement is used for placing IP cores and estimating PBS bus lengths.

Identifying the set of valid PBS has exponential complexity if a brute-force algorithm is applied. The upper bound of the total number of PBS, i.e., the maximum possible number of PBSs, is $2^m$, where $m$ is the number of links in the CG. We suggest a more efficient, bitwise algorithm to generate the set of valid PBSs. The algorithm is presented in Figure 11. First, using the bitwise decoder algorithm, each link label is translated into binary, and stored as a set of basis elements (a basis element is a link in the CG). Then, in a loop, the bitwise PBS generating algorithm performs a bitwise OR operation on the basis elements to generate new PBS structures.
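The decode, OR, and compare loop described above can be sketched in Python. This is a hedged illustration rather than the authors' code: the function names, the two small core graphs, and the connectivity test (a group of links is accepted when the links form one connected group of cores) are assumptions introduced for the example.

```python
def link_mask(a, b):
    """Encode a link between cores a and b (1-based) as a bit mask.

    The first core maps to the rightmost bit.
    """
    return (1 << (a - 1)) | (1 << (b - 1))


def links_connected(masks):
    """True if the links (two-bit masks) form one connected group of cores."""
    reached = masks[0]
    pending = list(masks[1:])
    progress = True
    while pending and progress:
        progress = False
        for m in pending[:]:
            if m & reached:          # link touches a core already reached
                reached |= m
                pending.remove(m)
                progress = True
    return not pending


def generate_pbs(links):
    """Enumerate candidate PBS masks by OR-ing subsets of basis elements."""
    basis = [link_mask(a, b) for a, b in links]
    table, seen = [], set()          # PBS storage, in discovery order
    for index in range(1, 1 << len(basis)):
        # Decoder: bit k of the index selects basis element k.
        chosen = [basis[k] for k in range(len(basis)) if (index >> k) & 1]
        if not links_connected(chosen):
            continue                 # assumed validity test
        pbs = 0
        for m in chosen:
            pbs |= m                 # bitwise OR of the basis elements
        if pbs not in seen:          # comparator against the PBS table
            seen.add(pbs)
            table.append(pbs)
    return table


# Hypothetical 4-core chain: 1-2, 2-3, 3-4.
for pbs in generate_pbs([(1, 2), (2, 3), (3, 4)]):
    print(format(pbs, "04b"))
```

For the hypothetical chain, the sketch keeps six structures (the three single links, the two adjacent pairs, and the full chain) and skips the combination of links 1-2 and 3-4, which share no core. The duplicate check matters for graphs with cycles, where different link subsets cover the same set of cores.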
A produced PBS is valid if and only if all its basis elements are connected. Otherwise, the PBS includes redundant links. If the PBS is valid, the PBS storage is checked to avoid duplications of the same PBS.

Example: The core graph in Figure (a) has cores. Binary numbers are used to represent links, i.e., is described as 00. The number of bits is equal to the number of cores (the first core has the rightmost digit, while the last core has the leftmost digit). In this case, there are basis elements in the PBS set, namely,,,, and labeled in order. Therefore the basis set is B = (, 00 ),(, 00 ),(, 00 ),(, 00 ), where the first coordinate is the label of the basis element. The bitwise PBS generating algorithm starts with index 0 and an empty PBS storage. All basis elements are added as separate PBS into the PBS storage. Considering index, this is decoded into 00. Therefore, PBS has two basis elements, namely, (, 00 ) and (, 00 ). This PBS is valid because all the basis elements are connected. PBS is then checked against the PBS storage. The storage is updated if there is no such PBS.

Definition: The bus architecture synthesis table describes the relationship between a set of PBS and the connectivity requirements in a CG. The number of rows is equal to the number of basis link elements in the CG. The number of columns is the dimension of the PBS set. An entry in the table has value 1 if the PBS corresponding to the column includes the basis link element specific to the row.

Examples of BA synthesis tables are shown in Figure. The tables are for the CG in Figure (a). Connectivity requirements are expressed as the complete set of basis link elements
