Data Flow Graph Partitioning Schemes


Avanti Nadgir and Harshal Haridas
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania

Abstract: The ordering of operations in a data flow program is not specified by the programmer but is implied by the data dependencies. This property can be exploited further on multiprocessor architectures by grouping the nodes of the corresponding data flow graph and allocating these groups efficiently to processors. This paper presents and compares some of the numerous heuristic approaches that have been proposed to partition data flow graphs and assign the partitions to different processors. The processor allocation algorithms, which do not consider the communication cost between processors, are presented first. A region analysis algorithm that enables better load balancing, but again does not consider communication cost, is presented next. Schauser et al. developed an intermediate data structure, the dual graph, to translate a data flow graph into threads that are easily compiled and mapped onto the Threaded Abstract Machine. Lastly, algorithms that do consider communication costs are studied. We distill from these algorithms a design pattern for efficiently partitioning data flow graphs, and we list the key issues that need to be addressed during data flow partitioning.

Contents:
I. Introduction
II. Processor Allocation Strategies
III. Region Analysis: A Parallel Elimination Method for Data Flow Analysis
IV. Data Flow Graph Partitioning to Reduce Communication Cost
V. A Vertically Layered Allocation Scheme for Data Flow Systems
VI. Dual Graph Partitioning for the Threaded Abstract Machine
VII. Conclusion
References

I. Introduction

Data flow analysis is the compile-time collection of semantic information from a program. A program is represented by a flow graph G = (N, E, ρ), a rooted directed graph with a unique root ρ such that for any node v ∈ N there is a path from ρ to v; E is the set of edges of the flow graph.
The operands conveyed from one node to another along these edges are called tokens. A node is active when tokens are available at all of its inputs. On a single processor, such nodes are executed serially after they become active; a multiprocessor system can exploit the inherent parallelism of data flow graphs through asynchronous execution. Parallel data flow analysis methods offer the promise of computing detailed semantic information about a program at compile time more efficiently than sequential techniques. A common question that arises on a multiprocessor system is: given a set of data flow problems to be solved for program Q on parallel machine P, what is the best parallel execution time achievable? In this paper we summarize a pattern for addressing this question, based on a study of some of the numerous heuristic approaches proposed in the literature. Initially, the data flow graph is partitioned into intervals or regions of a particular size. These regions are then combined into larger regions such that the cost of moving tokens across regions is minimized for the entire data flow graph.

Paper overview. Section II reports the processor allocation strategies; two algorithms are developed to allocate paths to processors in a near-optimal manner. Section III presents the Region Analysis method, a parallel elimination method for data flow analysis.

Section IV discusses an advanced partitioning method that reduces the communication cost between regions. Section V describes a vertically layered scheme for data flow systems. Section VI discusses an intermediate dual graph scheme that Schauser et al. have put forth to convert a data flow graph into threads that compile and run on a Threaded Abstract Machine. Finally, we devise a pattern common to these algorithms for efficiently partitioning data flow graphs.

II. Processor Allocation Strategies

Any two operators connected by an arc in a data flow program exhibit a data dependency, so the execution of one must necessarily precede the execution of the other in time. Since they cannot be executed in parallel, there is no reason to map them onto two different Processing Elements (PEs); when they are mapped onto the same PE, the operand produced by the first operator can be stored directly into the second. Lubomir Bic converted a data flow representation into sequential code segments (SCS) based on this argument. Applying this scheme yields the following advantages:
- Reduction in matching overhead
- Increase in the efficiency of individual processing units

Two processor allocation algorithms for allocating SCSs were developed. The first does not always use the minimum number of processors required; the second ensures that the minimum number of processors is used.

Fig 1: Data flow graph

    For each operation o, at Ie(o):
        If there is a processor that executed an operation q such that q → o,
            allocate o to that processor
        Else if there is a processor that executed an operation q such that q →* o,
            allocate o to that processor
        Else allocate o to any idle processor

Algorithm 1

(1) Ie (earliest initial time): the time at which an operation o can start execution.
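As a rough illustration, Algorithm 1's greedy strategy can be sketched in Python. This is our own reconstruction under stated assumptions (a simple time model with explicit execution times, and "a processor that executed a predecessor" as the only placement preference); all names are ours, not from the paper:

```python
def allocate(ops, preds, ie, t):
    """Greedy allocation in the spirit of Algorithm 1.
    ops: operation ids; preds[o]: immediate predecessors of o;
    ie[o]: earliest initial time; t[o]: execution time.
    Returns {operation: processor index}."""
    assignment = {}   # operation -> processor
    busy = {}         # processor -> time it becomes free
    ran = {}          # processor -> set of operations it has executed
    for o in sorted(ops, key=lambda x: ie[x]):
        chosen = None
        # prefer a free processor that executed a predecessor q with q -> o
        for p in ran:
            if busy[p] <= ie[o] and ran[p] & preds[o]:
                chosen = p
                break
        if chosen is None:  # otherwise any processor idle at Ie(o)
            for p in busy:
                if busy[p] <= ie[o]:
                    chosen = p
                    break
        if chosen is None:  # otherwise bring up a new processor
            chosen = len(busy)
        assignment[o] = chosen
        busy[chosen] = ie[o] + t[o]
        ran.setdefault(chosen, set()).add(o)
    return assignment
```

On a diamond-shaped graph (a feeding b and c, which feed d), this places d on the processor that executed its predecessor b, using two processors in total.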

Let F(o) be the scheduled finish time of operation o.

    Allocate critical path(s) to processor(s)
    Let R denote the set of allocated operations
    For all other processors:
        While there are operations to allocate:
            While there is an operation o with Il(o) < min_{r∈R} F(r):
                Select the operation o with minimum Il(o)
                If there is a processor that executed an operation q such that q → o,
                    allocate o to that processor at Ie(o) and add o to R
                Else if there is a processor that executed an operation q such that q →* o,
                    allocate o to that processor at Ie(o) and add o to R
                Else allocate o to any idle processor at Ie(o) and add o to R
            While there is an operation o with Il(o) >= min_{r∈R} F(r):
                Select the operation o with minimum Il(o)
                If there is a processor that executed an operation q such that q → o,
                    allocate o to that processor at max(Ie(o), F(q)) and add o to R
                Else if there is a processor that executed an operation q such that q →* o,
                    allocate o to that processor at max(Ie(o), F(q)) and add o to R
                Else allocate o to any idle processor at max(Ie(o), min_{r∈R} F(r)) and add o to R

Algorithm 2

Fig 2: Processor allocation algorithms

Algorithm 1 allocates nodes as soon as possible to idle processors. Algorithm 2 ensures that the minimum number of processors is used. The shortcomings of these algorithms are:
- The communication cost of transmitting tokens between processors is not considered
- The number of nodes allocated to a processor is not limited by an upper bound

(2) Il (latest initial time): the time at which an operation o must start execution to maintain the minimum execution time.

- The algorithms assume that the entire set of nodes will be executed during the execution of the program and do not take conditionals and loops into consideration

The focus in designing these algorithms was to minimize the number of processors. However, minimizing the number of processors may not always result in optimal processing time, and hence in efficient utilization.

III. Region Analysis: A Parallel Elimination Method for Data Flow Analysis

Elimination algorithms are a type of graph partitioning algorithm that partitions a graph into single-entry regions(3). Previous work on parallel elimination methods has been hampered by the lack of control over the size of the regions, which can prohibit effective parallel execution of these methods. A new elimination method, Region Analysis, was designed to overcome this problem. Region analysis emphasizes flow graph partitioning to enable better load balancing in a more effective parallel algorithm. Considering a forward data flow problem, we can intuitively describe elimination methods as having two phases: elimination and propagation. During elimination, the algorithm summarizes the data flow within an interval in terms of the data flow solution at the entry node. The propagation phase then accounts for the data flow solution within a region.

The Region Partition Problem. Given a size limit S ∈ Z+ and a reducible flow graph G = (N, E, ρ), partition G into r (disjoint) regions R_hi = (N_i, E_i, h_i) with region size |N_i| <= S, 1 <= i <= r, such that r is minimized.

Two approximation algorithms have been presented for region partitioning: the Forward algorithm and the Bottom-Up algorithm.

The Forward algorithm: The name stems from the way it forms regions by proceeding along the direction of the flow graph edges. The algorithm begins forming a region with a single node, which becomes the head node.
A node is included in the region if and only if all of its immediate predecessors are already in the region, satisfying the entry constraint, and the resulting size is no greater than S, satisfying the size constraint. A back-targeted node(4) is made a head node.

(3) A region is a connected sub-graph such that all incoming edges from other parts of the flow graph enter the region at its head node.
(4) A node v that is the target of a back edge u → v is a back-targeted node.
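The Forward algorithm's two constraints can be sketched as follows. This is a minimal illustration under our own assumptions: the flow graph is acyclic and processed in topological order, and back-edge handling is omitted; the function name is ours:

```python
def forward_partition(order, preds, S):
    """Sketch of the Forward algorithm's entry and size constraints.
    order: nodes in topological order; preds[v]: immediate predecessors;
    S: region size limit. Returns {node: region id}."""
    region, size = {}, {}
    for v in order:
        ps = preds[v]
        # entry constraint: all immediate predecessors already sit in
        # one and the same region
        if ps and len({region[p] for p in ps}) == 1:
            r = region[next(iter(ps))]
            if size[r] < S:            # size constraint
                region[v] = r
                size[r] += 1
                continue
        r = len(size)                  # otherwise v heads a new region
        region[v] = r
        size[r] = 1
    return region
```

With S = 2, a four-node chain splits into two regions of two nodes each; with a generous S, a diamond collapses into a single single-entry region.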

The Forward Algorithm

The Bottom-Up algorithm: It uses the dominator tree(5) of a flow graph and a topological order on the flow graph nodes. It visits a node only after having visited all of its children, and is thus a bottom-up algorithm. In an acyclic graph, each node v can be labeled with a number ts(v), 1 <= ts(v) <= n, by a topological sort; that is, ts(u) < ts(v) whenever u is a predecessor of v. We can thus visit the dominator tree nodes in a bottom-up fashion by visiting them in reverse topological order. If we visit the dominator tree nodes bottom-up and visit sibling nodes in topological order, we have all the information about the entry constraint needed to decide whether to merge R_v into R_par(v).

(5) A node u dominates a node v if every path from the root to v passes through u. The tree induced by the immediate-dominance relation is the dominator tree.

The Bottom-Up Algorithm

Comparing the Forward and Bottom-Up algorithms:
- The advantages of the Forward algorithm are efficiency and ease of implementation.
- The Forward algorithm has no prior knowledge of the graph structure and proceeds in an oblivious manner. This can result in poor partitioning for a graph with many leaf nodes.
- The Bottom-Up algorithm visits a node only after having visited all of its children.
- The Bottom-Up algorithm is expected to produce better partitions than the Forward algorithm, since it uses some knowledge of the flow graph structure.
- Fewer regions are formed by the Bottom-Up technique than by the Forward technique.
- The average region size under Bottom-Up is correspondingly larger than under the Forward technique.

The processor allocation algorithms and the Region Analysis partitioning above did not consider the communication costs between regions or processors. In the next section we look at heuristic approaches for partitioning a data flow graph that do take the communication cost into consideration.

IV. Data Flow Graph Partitioning to Reduce Communication Cost

The objective is to reduce the overhead due to token transfer through the communication network of the machine. The load distribution on the rings(6) is improved when this scheme is employed on large graphs. There are two good reasons to partition a graph so that two nodes which communicate are allocated to different rings, even when the inter-processor communication costs are very high: (i) the original graph may be too large to have all of its nodes stored in the node store of a single ring; (ii) assigning several graphs to the machine's rings without partitioning some of them may result in an unbalanced load, which would not take full advantage of the machine's capability.

The partitioning cannot be based on the exact token-transfer load factors(7) of each arc, since conditional nodes execute paths that depend on the input data set values. However, the load factors are not distributed randomly relative to each other, but in clusters of arcs with exactly the same number of tokens. This observation constitutes the first heuristic rule and forms the basis for a data flow graph subcontraction (i.e., a data flow graph whose cycles are each shrunk to a single dense node). The graph is converted into a canonical flow graph (CFG) [3]. The CFG is acyclic and layers the nodes horizontally; each node in the same layer is initially assigned the same level label. Consider a directed path x from node α to node β, and let T(x) denote the time needed for processing the elementary functions associated with the nodes on path x. A node at level n becomes executable no earlier than T * L(n), where L(n) is its level label. A lower estimate on T(x) is

    [L(β) - L(α) + 1] * T

However, this estimate does not take the communication costs into consideration.
To account for the communication overhead, we consider concurrent(8) transfer of tokens over the network and conclude that the overhead can be expressed as D * q(x, P), where q(x, P) is the total number of non-concurrent arrivals of tokens at the nodes of path x and D is the transmission delay. Hence the lower estimate of T(x) becomes

    [L(β) - L(α) + 1] * T + q(x, P) * D

This analysis leads to a second heuristic rule: the number of different levels in the cut-set of the partition should be kept minimal, i.e., the corresponding partitions should extend horizontally rather than vertically. The load on the communication network (token transfers per unit time) is a factor affecting D, which suggests that the size of the cut-set should also be kept minimal. We conclude that a good partitioning method must strike a compromise among the two heuristic rules and the minimality of the cut-set.

(6) A ring has multiple functional units, each of which is capable of performing a number of elementary functions.
(7) The load factor of an arc is the total number of tokens transported over the arc during an execution of the data flow graph.
(8) Two or more tokens are transported concurrently through the network when they emanate from nodes of the data flow graph that have the same level labels in the CFG.
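Numerically, the lower estimate combines a level-span term with a token-arrival term. The tiny helper below is our own illustration, under the assumption that level labels increase along the direction of flow:

```python
def path_time_lower_bound(level_src, level_dst, q, T, D):
    """Lower estimate on the processing time of a path from a node at
    level level_src to a node at level level_dst: one elementary
    execution time T per level spanned, plus transmission delay D for
    each of the q non-concurrent token arrivals along the path."""
    return (level_dst - level_src + 1) * T + q * D
```

For example, a path spanning levels 1 through 4 with three non-concurrent arrivals, T = 2 and D = 1, yields a bound of 4 * 2 + 3 * 1 = 11; with q = 0 the estimate reduces to the pure level-span term.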

The algorithm initially partitions the graph into a number of clusters and then merges these clusters to obtain load-balanced components with reduced inter-component communication:

(i) Locate the strongly connected components (SCCs) of the data flow graph and include them in the initial set of clusters.
(ii) Employ a depth-first search to assign level labels, partitioning the graph into horizontal layers to form the CFG. The remaining members of the initial cluster set are determined as follows.
(iii) Among the nodes that do not belong to the current cluster and are adjacent to it, select a node such that the difference between the maximum and minimum arc levels in the current cluster changes the least (i.e., the partition is restricted to expand horizontally). If more than one node satisfies this condition, choose the node that increments the smallest of the X_i values, where X_i is the number of nodes in the current cluster belonging to level i. This keeps the cluster's diameter as small as possible.
(iv) After the clusters are formed, assign each to a specific ring, indicating the assignment with a RING-MARK label on each cluster. The marking starts with the clusters containing SCCs.
(v) If the final partition is to be composed of R components and the number of SCCs is greater than R, perform a preliminary merging procedure to determine the SCCs that have to be assigned to rings in common. The combined cost in a common ring must not exceed the ring capacity. If the final set of clusters still exceeds R and no two SCCs can be merged under the constraints of the preliminary merging procedure, the following steps are followed.
(vi)
Returning to step (iv), RING-MARKs are assigned to each of the clusters containing SCCs, with the restriction that any two SCCs that were merged in the preliminary merging procedure are assigned the same RING-MARK.
(vii) Pairs of clusters are then selected such that at most one of them has been assigned a RING-MARK and their total size does not exceed the ring capacity. Such pairs are merged.

V. A Vertically Layered Allocation Scheme for Data Flow Systems

The proposed allocation scheme is based on two general allocation philosophies: (i) assign concurrently executable nodes to separate PEs, and (ii) assign serially connected nodes to the same PE. That is, total execution time and contention are minimized by distributing the nodes of a data flow graph over all available processors, while total communication time is minimized by clustering nodes on as few PEs as possible. To find a compromise between computation and communication cost, the proposed scheme analyzes the Critical Path(9) (CP) and the Communication-to-Execution cost Ratio(10) (CTR) of data flow graphs. The set of nodes that lie on the critical path is given the highest priority and assigned to a PE. All other serially connected nodes in the graph are found recursively by determining the longest directed path (LDP) emanating from the nodes that have already been assigned to PEs. The CP and the LDPs thus minimize contention and inter-processor communication time by assigning each serially connected set of nodes to a single PE.

(9) The critical path of a data flow graph is the longest path from the root node to the exit node.
(10) The communication-to-execution cost ratio indicates whether the inter-PE communication overhead offsets the advantage gained by overlapping the execution of two subsets of nodes on separate processing elements.

Allocation scheme: A directed data flow graph is first converted into an acyclic graph by traversing the graph in depth-first search (DFS) order and marking all backward-pointing arcs that form closed loops. A modified topological sort, as seen earlier, is then performed to partition the graph into disjoint horizontal layers such that the nodes in each layer can be executed in parallel and the layers are linearly ordered.

Fig 3: Dividing the graph into horizontal layers

The layering of the data flow graph is followed by the allocation of the nodes in two phases: the separation phase and the optimization phase. In the separation phase, the data flow graph is first partitioned into distinct program modules on the basis of the execution times T alone. Each program module consists of a serially connected set of nodes. This is done by rearranging the nodes into vertical layers, where the nodes constituting a single vertical layer can be allocated to a single PE. Conditional nodes and loops are handled first. A conditional node is implemented as a SWITCH operator, which sends a token to one of its successor functions based on the logical result of a predicate P; a MERGE operator then accepts the result token on one of its inputs and directs it to its output. Loops are also implemented with SWITCH and MERGE operators. For a deterministic loop(11), the execution time can be determined from the number of iterations. For random loops, the number of iterations is based on the probability assigned to the conditional node, which indicates whether the execution of the loop continues. The expected number of iterations of a random loop whose conditional node is assigned probability p is

    E(I) = p / (1 - p)

(11) The number of iterations of a loop is either deterministic or random.
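The horizontal layering step (after the back edges have been marked and set aside) amounts to a longest-path level assignment. The code below is our own illustration of that idea, not the paper's implementation, and assumes the graph passed in is already acyclic:

```python
def horizontal_layers(nodes, succs):
    """Assign each node of an acyclic graph its earliest horizontal layer:
    sources get layer 0, and every other node gets
    1 + max(layer of its predecessors)."""
    preds = {v: set() for v in nodes}
    for u in nodes:
        for v in succs[u]:
            preds[v].add(u)
    layer_of, remaining = {}, set(nodes)
    while remaining:
        # a node is ready once all of its predecessors have a layer
        ready = [v for v in remaining if preds[v] <= set(layer_of)]
        for v in ready:
            layer_of[v] = max((layer_of[p] + 1 for p in preds[v]), default=0)
        remaining -= set(ready)
    return layer_of
```

Nodes sharing a layer have no dependence path between them, so each layer is a set of candidates for parallel execution, exactly the property the vertical-layering scheme relies on.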

Fig 4: A conditional node implemented with SWITCH and MERGE operators

A node that is not conditional is assigned probability 1. After the probabilities are assigned, the approximate CP is determined by evaluating the earliest time (e) and the latest time (l) at which a node can finish its execution. The set of critical nodes is then found by determining the degree of criticality (i.e., l - e) of each node; a random CP is selected if there is no unique one. The nodes on the critical path are placed in a FIFO queue according to their precedence relationship(12).

Fig 5: Vertically layered graph

The LDP for each of these queued nodes is determined iteratively by removing a node n from the queue and following a procedure similar to finding a critical path from node n. Nodes that have already been arranged into vertical layers or that lie on the CP are not included in the LDP. Each set of nodes thus obtained is then assigned to the first available vertical layer. The separation phase is complete when the queue is empty and all the nodes of the graph have been rearranged into vertical layers.

(12) If node N_i precedes node N_j in execution, N_i has precedence over N_j; this relation is termed the precedence relationship.
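The earliest/latest-time computation behind the degree of criticality l - e can be sketched with one forward and one backward pass over a topological order. This is our illustration under stated assumptions (acyclic graph, `order` already topological; names are ours):

```python
def criticality(order, preds, succs, t):
    """e[v]: earliest finish time of node v; l[v]: latest finish time
    that still keeps the overall schedule length minimal.
    Nodes with degree of criticality l - e == 0 lie on a critical path."""
    e, l = {}, {}
    for v in order:                      # forward pass: earliest finish
        e[v] = t[v] + max((e[p] for p in preds[v]), default=0)
    total = max(e.values())              # length of the critical path
    for v in reversed(order):            # backward pass: latest finish
        l[v] = min((l[s] - t[s] for s in succs[v]), default=total)
    critical = [v for v in order if l[v] == e[v]]
    return e, l, critical
```

On a diamond graph where the branch through b is slower than the branch through c, the pass reports a, b, d as critical and gives c a slack of one time unit.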

In the optimization phase, the inter-PE communication delays(13) are minimized. Two types of inter-PE communication behavior are identified for optimization: Type A inter-PE communication exists when two subsets of nodes exhibit a precedence relationship and are arranged in two distinct vertical layers; Type B inter-PE communication exists when three subsets of nodes exhibit precedence relationships and are arranged in three different vertical layers.

Fig 6: Type A and Type B inter-PE communication

There are three possible cases of Type A inter-PE communication behavior:

    (i)   T_β + C_αβ < T_α
    (ii)  T_α < T_β + C_αβ < T_α + T_β
    (iii) T_α + T_β < T_β + C_αβ

where T_α and T_β represent the execution times on PEs α and β respectively, and C_αβ is the total communication cost (c_αβ + c_βα)(14). In case (i), the execution time on β plus the communication cost is less than the execution time on α; hence the initial assignment of the vertical layers to two distinct PEs does not affect the overall execution time. In case (ii), the execution time T_β and the communication cost are significant enough to affect the execution time T_α and therefore the total execution time. The sets of nodes assigned to PEs α and β can be combined into a single vertical layer to eliminate the communication delay; this is done only if the single-layer assignment does not increase the total execution time. In case (iii), executing the nodes on a single PE yields superior performance: the ratio of communication to execution cost is greater than 1, and combining the two subsets of nodes into one vertical layer executed on a single PE improves the total execution time.

(13) Communication between two nodes that are in successive horizontal layers and not assigned to the same processing element.
(14) c_αβ represents the communication delay of transporting a token from PE α to PE β.
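Under a simplified model, where the separate-PE completion time is max(T_α, T_β + C_αβ) and the merged single-PE time is T_α + T_β, the three Type A cases reduce to a single comparison. This is our sketch of the decision, not the paper's exact procedure:

```python
def should_merge(t_alpha, t_beta, c_ab):
    """Decide whether two vertical layers alpha and beta exhibiting
    Type A communication (c_ab = total communication cost C_ab) should
    be combined onto one PE. Merging is worthwhile only when the serial,
    communication-free schedule is no longer than the parallel schedule
    that pays the communication cost."""
    separate = max(t_alpha, t_beta + c_ab)  # two PEs, with communication
    merged = t_alpha + t_beta               # one PE, no communication
    return merged <= separate
```

Case (i) always keeps the layers separate, case (iii) always merges them, and case (ii) merges only when doing so does not lengthen the schedule, matching the qualitative discussion above.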

In the case of Type B communication behavior, if the communication costs are greater than the execution costs on a single PE, the subsets of nodes are combined into a single vertical layer that can be executed on a single PE. Combining nodes into vertical layers may create a new critical path; hence, this process is repeated iteratively until no improvement in performance can be obtained by combining two subsets of nodes associated with the critical path.

VI. Dual Graph Partitioning for the Threaded Abstract Machine

Schauser et al. proposed an intermediate graph (dual graph) representation to partition a data flow graph, generate threads, and compile them for a Threaded Abstract Machine (TAM). Synchronization, thread scheduling, and storage management in a TAM are explicit in the machine language and exposed to the compiler; hence multithreaded execution is addressed as a compilation problem. Compiler-controlled multithreading is examined through the compilation of Id90, a lenient parallel language(15), for the TAM. The intermediate dual graph representation attempts to minimize thread switching for parallel languages, minimize the total cost of synchronization, and make effective use of critical processor resources such as registers and cache bandwidth.

A dual graph is a directed graph with three types of arcs: data, control, and dependence. A data arc specifies that the value produced at the output of one node is used as an input operand by another node. A control arc u → v specifies that instruction u executes before instruction v and has direct responsibility for scheduling v. A dependence arc specifies that an instruction will be scheduled as an indirect consequence of executing another instruction. Control is represented by tokens traveling along the control arcs; a node fires when control tokens are present on all its control inputs.
Upon firing, a node computes a result based on the data values bound to its data inputs, binds the result to its data outputs, and propagates control tokens to its control outputs. The types of nodes used in dual graphs are as follows:

Fig 7: Dual graph nodes

(15) A language in which functions, expressions, and data structures are non-strict (e.g., returning results before all operands are computed, or accessing and passing data structures around while their components are still being computed).

A simple node describes an arithmetic or logic operation. A join synchronizes control paths. A switch conditionally steers control. A merge steers control from many control inputs to a single control output. A label indicates a separation constraint: the adjacent nodes must be in distinct threads. An outlet sends a message or initiates a request. An inlet receives a message or split-phase response. A const node represents a manifest constant.

Dual graphs are generated by expanding data flow program graph instructions. This local transformation is described by expansion rules for the individual program graph nodes; a program graph arc expands into a data arc and a control arc in the dual graph. A TAM partition is a subset of dual graph nodes together with their incident control and dependence edges. A partition consists of an input region, containing only inlet, merge, and label nodes, and a body, containing simple nodes, outlets, switches, and joins. The outputs of a partition are its outlet nodes and all leaving control arcs. A partition is safe if (i) no output of the partition needs to be produced before all inputs to the body are available, (ii) when the inputs to the body are available, all nodes in the body are executed (no conditional execution within a partition), and (iii) no arc connects a body node to an input node of the same partition (the partition is acyclic).

TAM partitions are generated in one of the following ways.

Dataflow partitioning: A unary operation never needs dynamic synchronization; thus joins, inlets, merges, and labels each start a new partition. Simple, switch, and outlet nodes are placed into the partition of their control predecessor.

Dependence set partitioning: It finds safe partitions by grouping together nodes that depend on the same set of input nodes. This guarantees that there are no cyclic dependencies within a partition and is hence more powerful than dataflow partitioning.
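Dependence set partitioning can be illustrated by propagating, in topological order, the set of input nodes each node transitively depends on and grouping nodes with identical sets. This is our own sketch; the real TAM analysis also distinguishes control from dependence arcs, which we collapse into a single predecessor relation here:

```python
def dependence_set_partition(order, preds, inputs):
    """order: nodes in topological order; preds[v]: predecessors of v;
    inputs: the graph's input (inlet-like) nodes.
    Returns a list of partitions (lists of nodes)."""
    depset = {}
    for v in order:
        if v in inputs:
            depset[v] = frozenset([v])      # an input depends on itself
        else:                               # union of predecessors' sets
            depset[v] = frozenset().union(*(depset[p] for p in preds[v]))
    parts = {}
    for v in order:                         # group identical sets
        parts.setdefault(depset[v], []).append(v)
    return list(parts.values())
```

Because every node in a group becomes executable exactly when the same inputs arrive, no group member waits on another group member's output, which is why such partitions are safe and acyclic.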
Dominance set partitioning: It finds safe partitions by grouping together nodes that dominate the same set of output nodes (outlet nodes and nodes that directly feed a control input of a merge or label).

These partitions are then merged into larger safe partitions by iteratively applying two merge rules.

Merge-up rule: Two partitions alpha and beta can be merged into a larger partition if (i) all input arcs to beta come from alpha, (ii) beta contains no inlet nodes, and (iii) no output arc from the body of alpha goes to an input node of beta.

Merge-down rule: Two partitions alpha and beta can be merged into a larger partition if (i) all output arcs from alpha go to beta, (ii) alpha contains no outlet nodes, and (iii) no output arc from the body of alpha goes to an input node of beta.

After merging, the synchronization costs can be reduced further by redundant-arc elimination and by combining switch and merge nodes.

VII. Conclusion

We can conclude that a balanced graph partitioning is necessary for the following reasons: (i) the original graph may be too large to have all of its nodes processed on a single processor when multiple processors are available;

(ii) assigning several graphs to the same processor without partitioning some of them may result in an unbalanced load, which would not take full advantage of the machine's capability.

Common Design Pattern: We can summarize a common design pattern seen in all the partitioning algorithms. The general steps that could be followed to form efficient partitions are as follows:

Initial processor allocation algorithms allocated nodes directly to processors with the following goals in mind: reduce the matching overhead, increase the efficiency of individual processing units, and reduce the number of processors. However, the nodes were allocated to the processors more or less arbitrarily, and these algorithms at times produced bad allocations or required more processors than the optimal number. Hence the first step is to partition the data flow graph instead of allocating nodes directly.

Partition the graph into regions/sub-graphs with an initial size. Numerous algorithms, such as the Forward algorithm, the Bottom-Up algorithm, the reduced-communication-cost technique, the vertically layered scheme, and Schauser's dual graph technique, were studied for partitioning graphs.
o The Forward and Bottom-Up algorithms were simple implementations of partitioning schemes but did not consider the communication cost. The initial size of a partition was one, and a node was included in a partition only if all its predecessors were present in that partition and including the node would not make the partition exceed the maximum partition size.
o The reduced-communication technique introduced an intermediate graph (the CFG) before partitioning. The translation of an input graph to a CFG involves removing cycles in the graph; the resulting acyclic graph is horizontally layered, and the horizontally layered nodes are allocated concurrently to different partitions.
o The vertically layered scheme partitioned acyclic graphs into horizontal layers and allocated the critical path to the central vertical layer.
The LDP from each node on the critical path was allocated to other vertical layers based on heuristics.
o Schauser et al. implemented partitioning for a specific machine, the Threaded Abstract Machine. They introduced an intermediate form, the dual graph, which is mapped into safe partitions. A partition of a dual graph consists of an input region (inlet, merge, and label nodes), a body (simple nodes, outlets, switches, and joins), and outputs (the outlet nodes and all leaving control arcs).

These regions/sub-graphs/partitions are then combined into larger regions to minimize the communication costs between them, and sequential sub-graphs are allocated to the same regions.
o The processor allocation algorithms, the Forward algorithm, and the Bottom-Up algorithm did not combine partitions, since they do not consider communication costs.

o The reduced communication cost algorithm takes two things into consideration: to minimize the communication cost, it reduces the number of levels in the cut-set and minimizes the size of the cut-set.
o The vertically layered scheme separated nodes into vertical layers and then optimized those layers based on two kinds of communication.
o Optimization in Schauser et al. is done by merging partitions into larger safe partitions using the Merge Up and Merge Down rules.

Other Factors: Some other factors that need to be considered during the partitioning and merging process are as follows:
o Communication cost between regions is a very important factor.
o Initial cluster/region size is also relevant. The larger the initial size, the smaller the number of clusters and the smaller the computation effort. However, a small initial size may yield a better partition in some cases, because a larger number of arcs is considered in deriving the final cut-set.
o The partition should be formed so that the number of different levels in its cut-set is kept minimal.
o Each region should be a single-entry region in order to reduce the complexity of the partitioning algorithm.
o For strongly connected components (cycles in the graph), a compromise must be made between parallelism and communication cost: increasing parallelism increases the communication cost and vice versa.
o Loops in dataflow graphs can also be implemented with SWITCH and MERGE operators, as in the vertically layered scheme.

REFERENCES
1. Y.-F. Lee, B. Ryder, and M. Fiuczynski. Region analysis: a parallel elimination method for data flow analysis. IEEE Transactions on Software Engineering, 21(11), November 1995.
2. Y.-F. Lee and B. Ryder. A comprehensive approach to parallel data flow analysis. In Proceedings of the 1992 International Conference on Supercomputing.
3. C. Koutsougeras, C. A. Papachristou, and R. R. Vemuri. Data flow graph partitioning to reduce communication cost. In Proceedings of the 19th Annual Workshop on Microprogramming (International Symposium on Microarchitecture), 1986.
4. B. Lee, A. R. Hurson, and T. Y. Feng. A Vertically Layered Allocation Scheme for Data Flow Systems.
5. K. E. Schauser, D. E. Culler, and T. von Eicken. Compiler-controlled multithreading for lenient parallel languages. In FPCA '91, Springer-Verlag, LNCS 523, 1991.
6. B. Lee and A. R. Hurson. Dataflow Architectures and Multithreading.
7. J. A. Sharp. Data Flow Computing.
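The merging step that the surveyed schemes share (combine partitions whenever doing so removes arcs from the cut-set, subject to a size limit) can be sketched in code. This is an illustrative greedy sketch under our own assumptions, not a reproduction of any of the published algorithms; the names `cut_size` and `greedy_merge` and the size cap are ours, invented for illustration.

```python
from collections import defaultdict

def cut_size(edges, part_of):
    """Number of arcs whose endpoints lie in different partitions (the cut-set size)."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])

def greedy_merge(edges, part_of, max_size):
    """Repeatedly merge the pair of partitions joined by the most arcs,
    subject to a partition-size cap, until no feasible merge remains.
    Each merge internalizes arcs, shrinking the cut-set."""
    while True:
        # Count arcs crossing each pair of distinct partitions.
        cross = defaultdict(int)
        for u, v in edges:
            a, b = part_of[u], part_of[v]
            if a != b:
                cross[frozenset((a, b))] += 1
        # Current partition sizes (number of nodes per partition).
        sizes = defaultdict(int)
        for p in part_of.values():
            sizes[p] += 1
        # Pick the feasible pair joined by the most crossing arcs.
        best = None
        for pair, n in cross.items():
            a, b = tuple(pair)
            if sizes[a] + sizes[b] <= max_size and (best is None or n > cross[best]):
                best = pair
        if best is None:
            return part_of
        a, b = tuple(best)
        # Merge partition b into partition a.
        for node, p in part_of.items():
            if p == b:
                part_of[node] = a

# A small diamond-shaped dataflow graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4.
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
part_of = {1: "A", 2: "B", 3: "C", 4: "D"}   # start with one node per partition
print(cut_size(edges, part_of))               # 4: every arc crosses partitions
greedy_merge(edges, part_of, max_size=2)
print(cut_size(edges, part_of))               # 2: two arcs internalized
```

On this diamond graph the greedy pass first merges the partitions of nodes 1 and 2, then those of nodes 3 and 4, reducing the cut-set from four arcs to two; a final merge of the two remaining partitions is rejected by the size cap, which stands in for the load-balancing constraint the surveyed schemes enforce.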


More information

& ( D. " mnp ' ( ) n 3. n 2. ( ) C. " n

& ( D.  mnp ' ( ) n 3. n 2. ( ) C.  n CSE Name Test Summer Last Digits of Mav ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. The time to multiply two n " n matrices is: A. " n C. "% n B. " max( m,n, p). The

More information

Software Synthesis from Dataflow Models for G and LabVIEW

Software Synthesis from Dataflow Models for G and LabVIEW Software Synthesis from Dataflow Models for G and LabVIEW Hugo A. Andrade Scott Kovner Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 andrade@mail.utexas.edu

More information

Compiler Optimization and Code Generation

Compiler Optimization and Code Generation Compiler Optimization and Code Generation Professor: Sc.D., Professor Vazgen Melikyan 1 Course Overview Introduction: Overview of Optimizations 1 lecture Intermediate-Code Generation 2 lectures Machine-Independent

More information