Data Flow Graph Partitioning Schemes


Avanti Nadgir and Harshal Haridas
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania

Abstract: The ordering of operations in a data flow program is not specified by the programmer but is implied by the data dependencies. This property can be exploited further on multiprocessor architectures by grouping the nodes of the corresponding data flow graph and allocating these groups efficiently to processors. This paper presents and compares some of the numerous heuristic approaches that have been proposed to partition data flow graphs and assign the partitions to different processors. The processor allocation algorithms, which do not consider the communication cost between processors, are presented first. A region analysis algorithm that enables better load balancing, but again does not consider communication cost, is presented next. Schauser et al. developed an intermediate data structure, the dual graph, to translate a data flow graph into threads that are easily compiled and mapped onto the Threaded Abstract Machine. Lastly, algorithms that do consider communication costs are studied. We distill from these algorithms a design pattern for efficiently partitioning data flow graphs, and we list the key issues that need to be addressed during data flow partitioning.

Contents:
I. Introduction
II. Processor Allocation Strategies
III. Region Analysis: A Parallel Elimination Method for Data Flow Analysis
IV. Data Flow Graph Partitioning to Reduce Communication Cost
V. A Vertically Layered Allocation Scheme for Data Flow Systems
VI. Dual Graph Partitioning for the Threaded Abstract Machine
VII. Conclusion
References

I. Introduction

Data flow analysis is the compile-time collection of semantic information from a program. A program is represented by a flow graph G = (N, E, ρ), a rooted directed graph with a unique root ρ such that for any node v ∈ N there is a path from ρ to v; E is the set of edges of the flow graph.
The operands conveyed from one node to another along these edges are called tokens. A node is active when tokens are available at all of its inputs. On a single processor, such nodes are executed serially after they become active; a multiprocessor system can exploit the inherent parallelism of data flow graphs through asynchronous execution. Parallel data flow analysis methods offer the promise of computing detailed semantic information about a program at compile time more efficiently than sequential techniques. A common question that arises on a multiprocessor system is: given a set of data flow problems to be solved for program Q on parallel machine P, what is the best parallel execution time achievable? In this paper we summarize a pattern for addressing this question, based on a study of some of the numerous heuristic approaches proposed in the literature. Initially, the data flow graph is partitioned into intervals or regions of a particular size. These regions are then combined into larger regions such that the cost of moving tokens across regions is minimized for the entire data flow graph.

Paper overview. Section II reports the processor allocation strategies; two algorithms are developed to allocate paths to processors in a near-optimal manner. Section III presents the Region Analysis method, a parallel elimination method for data flow analysis.

Section IV discusses an advanced partitioning method that reduces the communication cost between regions. Section V describes a vertically layered scheme for data flow systems. Section VI discusses an intermediate dual graph scheme that Schauser et al. have put forth to convert a data flow graph into threads that compile and run on a Threaded Abstract Machine. Finally, we devise a pattern common to these algorithms for efficiently partitioning data flow graphs.

II. Processor Allocation Strategies

Any two operators connected by an arc in a data flow program exhibit a data dependency, so the execution of one must necessarily precede the execution of the other in time. Since they cannot be executed in parallel, there is no reason to map them onto two different Processing Elements (PEs); when they are mapped onto the same PE, the operand produced by the first operator can be stored directly into the second. Lubomir Bic converted a data flow representation into sequential code segments (SCS) based on this argument. Applying this scheme yields the following advantages:
- Reduction in matching overhead
- Increase in the efficiency of individual processing units

Two processor allocation algorithms for allocating SCSs were developed. The first does not always use the minimum number of processors required; the second ensures that the minimum number of processors is used.

Fig 1: Data flow graph

    For each operation o, at Ie(o):
        If there is a processor that executed an operation q such that q → o,
            allocate o to that processor
        Else if there is a processor that executed an operation q such that q →* o,
            allocate o to that processor
        Else allocate o to any idle processor

Algorithm 1

(1) Ie (earliest initial time): the time at which an operation o can start execution.
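As a rough illustration, Algorithm 1's greedy strategy can be sketched in Python. This is our own reconstruction under stated assumptions (a simple time model with explicit execution times, and "a processor that executed a predecessor" as the only placement preference); all names are ours, not from the paper:

```python
def allocate(ops, preds, ie, t):
    """Greedy allocation in the spirit of Algorithm 1.
    ops: operation ids; preds[o]: immediate predecessors of o;
    ie[o]: earliest initial time; t[o]: execution time.
    Returns {operation: processor index}."""
    assignment = {}   # operation -> processor
    busy = {}         # processor -> time it becomes free
    ran = {}          # processor -> set of operations it has executed
    for o in sorted(ops, key=lambda x: ie[x]):
        chosen = None
        # prefer a free processor that executed a predecessor q with q -> o
        for p in ran:
            if busy[p] <= ie[o] and ran[p] & preds[o]:
                chosen = p
                break
        if chosen is None:  # otherwise any processor idle at Ie(o)
            for p in busy:
                if busy[p] <= ie[o]:
                    chosen = p
                    break
        if chosen is None:  # otherwise bring up a new processor
            chosen = len(busy)
        assignment[o] = chosen
        busy[chosen] = ie[o] + t[o]
        ran.setdefault(chosen, set()).add(o)
    return assignment
```

On a diamond-shaped graph (a feeding b and c, which feed d), this places d on the processor that executed its predecessor b, using two processors in total.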

Let F(o) be the scheduled finish time of operation o.

    Allocate critical path(s) to processor(s)
    Let R denote the set of allocated operations
    For all other processors:
        While there are operations to allocate:
            While there is an operation o with Il(o) < min_{r∈R} F(r):
                Select the operation o with minimum Il(o)
                If there is a processor that executed an operation q such that q → o,
                    allocate o to that processor at Ie(o) and add o to R
                Else if there is a processor that executed an operation q such that q →* o,
                    allocate o to that processor at Ie(o) and add o to R
                Else allocate o to any idle processor at Ie(o) and add o to R
            While there is an operation o with Il(o) >= min_{r∈R} F(r):
                Select the operation o with minimum Il(o)
                If there is a processor that executed an operation q such that q → o,
                    allocate o to that processor at max(Ie(o), F(q)) and add o to R
                Else if there is a processor that executed an operation q such that q →* o,
                    allocate o to that processor at max(Ie(o), F(q)) and add o to R
                Else allocate o to any idle processor at max(Ie(o), min_{r∈R} F(r)) and add o to R

Algorithm 2

Fig 2: Processor allocation algorithms

Algorithm 1 allocates nodes as soon as possible to idle processors. Algorithm 2 ensures that the minimum number of processors is used. The shortcomings of these algorithms are:
- The communication cost of transmitting tokens between processors is not considered
- The number of nodes allocated to a processor is not limited by an upper bound

(2) Il (latest initial time): the time at which an operation o must start execution to maintain the minimum execution time.

- The algorithms assume that the entire set of nodes will be executed during the execution of the program and do not take conditionals and loops into consideration

The focus in designing these algorithms was to minimize the number of processors. However, minimizing the number of processors may not always result in optimal processing time, and hence in efficient utilization.

III. Region Analysis: A Parallel Elimination Method for Data Flow Analysis

Elimination algorithms are a type of graph partitioning algorithm that partitions a graph into single-entry regions(3). Previous work on parallel elimination methods has been hampered by the lack of control over the size of the regions, which can prohibit effective parallel execution of these methods. A new elimination method, Region Analysis, was designed to overcome this problem. Region analysis emphasizes flow graph partitioning to enable better load balancing in a more effective parallel algorithm. Considering a forward data flow problem, we can intuitively describe elimination methods as having two phases: elimination and propagation. During elimination, the algorithm summarizes the data flow within an interval in terms of the data flow solution at the entry node. The propagation phase then accounts for the data flow solution within a region.

The Region Partition Problem. Given a size limit S ∈ Z+ and a reducible flow graph G = (N, E, ρ), partition G into r (disjoint) regions R_hi = (N_i, E_i, h_i) with region size |N_i| <= S, 1 <= i <= r, such that r is minimized.

Two approximation algorithms have been presented for region partitioning: the Forward algorithm and the Bottom-Up algorithm.

The Forward algorithm: The name stems from the way it forms regions by proceeding along the direction of the flow graph edges. The algorithm begins forming a region with a single node, which becomes the head node.
A node is included in the region if and only if all of its immediate predecessors are already in the region, satisfying the entry constraint, and the resulting size is no greater than S, satisfying the size constraint. A back-targeted node(4) is made a head node.

(3) A region is a connected sub-graph such that all incoming edges from other parts of the flow graph enter the region at its head node.
(4) A node v that is the target of a back edge u → v is a back-targeted node.
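The Forward algorithm's two constraints can be sketched as follows. This is a minimal illustration under our own assumptions: the flow graph is acyclic and processed in topological order, and back-edge handling is omitted; the function name is ours:

```python
def forward_partition(order, preds, S):
    """Sketch of the Forward algorithm's entry and size constraints.
    order: nodes in topological order; preds[v]: immediate predecessors;
    S: region size limit. Returns {node: region id}."""
    region, size = {}, {}
    for v in order:
        ps = preds[v]
        # entry constraint: all immediate predecessors already sit in
        # one and the same region
        if ps and len({region[p] for p in ps}) == 1:
            r = region[next(iter(ps))]
            if size[r] < S:            # size constraint
                region[v] = r
                size[r] += 1
                continue
        r = len(size)                  # otherwise v heads a new region
        region[v] = r
        size[r] = 1
    return region
```

With S = 2, a four-node chain splits into two regions of two nodes each; with a generous S, a diamond collapses into a single single-entry region.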

The Forward Algorithm

The Bottom-Up algorithm: It uses the dominator tree(5) of a flow graph and a topological order on the flow graph nodes. It visits a node only after having visited all of its children, and is thus a bottom-up algorithm. In an acyclic graph, each node v can be labeled with a number ts(v), 1 <= ts(v) <= n, by a topological sort; that is, ts(u) < ts(v) whenever u is a predecessor of v. We can thus visit the dominator tree nodes in a bottom-up fashion by visiting them in reverse topological order. If we visit the dominator tree nodes bottom-up and visit sibling nodes in topological order, we have all the information about the entry constraint needed to decide whether to merge R_v into R_par(v).

(5) A node u dominates a node v if every path from the root to v passes through u. The tree induced by the immediate-dominance relation is the dominator tree.

The Bottom-Up Algorithm

Comparing the Forward and Bottom-Up algorithms:
- The advantages of the Forward algorithm are efficiency and ease of implementation.
- The Forward algorithm has no prior knowledge of the graph structure and proceeds in an oblivious manner. This can result in poor partitioning for a graph with many leaf nodes.
- The Bottom-Up algorithm visits a node only after having visited all of its children.
- The Bottom-Up algorithm is expected to produce better partitions than the Forward algorithm, since it uses some knowledge of the flow graph structure.
- Fewer regions are formed by the Bottom-Up technique than by the Forward technique.
- The average region size under Bottom-Up is correspondingly larger than under the Forward technique.

The processor allocation algorithms and the Region Analysis partitioning above did not consider the communication costs between regions or processors. In the next section we look at heuristic approaches for partitioning a data flow graph that do take the communication cost into consideration.

IV. Data Flow Graph Partitioning to Reduce Communication Cost

The objective is to reduce the overhead due to token transfer through the communication network of the machine. The load distribution on the rings(6) is improved when this scheme is employed on large graphs. There are two good reasons to partition a graph so that two nodes which communicate are allocated to different rings, even when the inter-processor communication costs are very high: (i) the original graph may be too large to have all of its nodes stored in the node store of a single ring; (ii) assigning several graphs to the machine's rings without partitioning some of them may result in an unbalanced load, which would not take full advantage of the machine's capability.

The partitioning cannot be based on the exact token-transfer load factors(7) of each arc, since conditional nodes execute paths that depend on the input data set values. However, the load factors are not distributed randomly relative to each other, but in clusters of arcs with exactly the same number of tokens. This observation constitutes the first heuristic rule and forms the basis for a data flow graph subcontraction (i.e., a data flow graph whose cycles are each shrunk to a single dense node). The graph is converted into a canonical flow graph (CFG) [3]. The CFG is acyclic and layers the nodes horizontally; each node in the same layer is initially assigned the same level label. Consider a directed path x from node α to node β, and let T(x) denote the time needed for processing the elementary functions associated with the nodes on path x. A node at level n becomes executable no earlier than T * L(n), where L(n) is its level label. A lower estimate on T(x) is

    [L(β) - L(α) + 1] * T

However, this estimate does not take the communication costs into consideration.
To account for the communication overhead, we consider concurrent(8) transfer of tokens over the network and conclude that the overhead can be expressed as D * q(x, P), where q(x, P) is the total number of non-concurrent arrivals of tokens at the nodes of path x and D is the transmission delay. Hence the lower estimate of T(x) becomes

    [L(β) - L(α) + 1] * T + q(x, P) * D

This analysis leads to a second heuristic rule: the number of different levels in the cut-set of the partition should be kept minimal, i.e., the corresponding partitions should extend horizontally rather than vertically. The load on the communication network (token transfers per unit time) is a factor affecting D, which suggests that the size of the cut-set should also be kept minimal. We conclude that a good partitioning method must strike a compromise among the two heuristic rules and the minimality of the cut-set.

(6) A ring has multiple functional units, each of which is capable of performing a number of elementary functions.
(7) The load factor of an arc is the total number of tokens transported over the arc during an execution of the data flow graph.
(8) Two or more tokens are transported concurrently through the network when they emanate from nodes of the data flow graph that have the same level labels in the CFG.
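Numerically, the lower estimate combines a level-span term with a token-arrival term. The tiny helper below is our own illustration, under the assumption that level labels increase along the direction of flow:

```python
def path_time_lower_bound(level_src, level_dst, q, T, D):
    """Lower estimate on the processing time of a path from a node at
    level level_src to a node at level level_dst: one elementary
    execution time T per level spanned, plus transmission delay D for
    each of the q non-concurrent token arrivals along the path."""
    return (level_dst - level_src + 1) * T + q * D
```

For example, a path spanning levels 1 through 4 with three non-concurrent arrivals, T = 2 and D = 1, yields a bound of 4 * 2 + 3 * 1 = 11; with q = 0 the estimate reduces to the pure level-span term.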

The algorithm initially partitions the graph into a number of clusters and then merges these clusters to obtain load-balanced components with reduced inter-component communication:

(i) Locate the strongly connected components (SCCs) of the data flow graph and include them in the initial set of clusters.
(ii) Employ a depth-first search to assign level labels, partitioning the graph into horizontal layers to form the CFG. The remaining members of the initial cluster set are determined as follows.
(iii) Among the nodes that do not belong to the current cluster and are adjacent to it, select a node such that the difference between the maximum and minimum arc levels in the current cluster changes the least (i.e., the partition is restricted to expand horizontally). If more than one node satisfies this condition, choose the node that increments the smallest of the X_i values, where X_i is the number of nodes in the current cluster belonging to level i. This keeps the cluster's diameter as small as possible.
(iv) After the clusters are formed, assign each to a specific ring, indicating the assignment with a RING-MARK label on each cluster. The marking starts with the clusters containing SCCs.
(v) If the final partition is to be composed of R components and the number of SCCs is greater than R, perform a preliminary merging procedure to determine the SCCs that have to be assigned to rings in common. The combined cost in a common ring must not exceed the ring capacity. If the final set of clusters still exceeds R and no two SCCs can be merged under the constraints of the preliminary merging procedure, the following steps are followed.
(vi)
Returning to step (iv), RING-MARKs are assigned to each of the clusters containing SCCs, with the restriction that any two SCCs that were merged in the preliminary merging procedure are assigned the same RING-MARK.
(vii) Pairs of clusters are then selected such that at most one of them has been assigned a RING-MARK and their total size does not exceed the ring capacity. Such pairs are merged.

V. A Vertically Layered Allocation Scheme for Data Flow Systems

The proposed allocation scheme is based on two general allocation philosophies: (i) assign concurrently executable nodes to separate PEs, and (ii) assign serially connected nodes to the same PE. That is, total execution time and contention are minimized by distributing the nodes of a data flow graph over all available processors, while total communication time is minimized by clustering nodes on as few PEs as possible. To find a compromise between computation and communication cost, the proposed scheme analyzes the Critical Path(9) (CP) and the Communication-to-Execution cost Ratio(10) (CTR) of data flow graphs. The set of nodes that lie on the critical path is given the highest priority and assigned to a PE. All other serially connected nodes in the graph are found recursively by determining the longest directed path (LDP) emanating from the nodes that have already been assigned to PEs. The CP and the LDPs thus minimize contention and inter-processor communication time by assigning each serially connected set of nodes to a single PE.

(9) The critical path of a data flow graph is the longest path from the root node to the exit node.
(10) The communication-to-execution cost ratio indicates whether the inter-PE communication overhead offsets the advantage gained by overlapping the execution of two subsets of nodes on separate processing elements.

Allocation scheme: A directed data flow graph is first converted into an acyclic graph by traversing the graph in depth-first search (DFS) order and marking all backward-pointing arcs that form closed loops. A modified topological sort, as seen earlier, is then performed to partition the graph into disjoint horizontal layers such that the nodes in each layer can be executed in parallel and the layers are linearly ordered.

Fig 3: Dividing the graph into horizontal layers

The layering of the data flow graph is followed by the allocation of the nodes in two phases: the separation phase and the optimization phase. In the separation phase, the data flow graph is first partitioned into distinct program modules on the basis of the execution times T alone. Each program module consists of a serially connected set of nodes. This is done by rearranging the nodes into vertical layers, where the nodes constituting a single vertical layer can be allocated to a single PE. Conditional nodes and loops are handled first. A conditional node is implemented as a SWITCH operator, which sends a token to one of its successor functions based on the logical result of a predicate P; a MERGE operator then accepts the result token on one of its inputs and directs it to its output. Loops are also implemented with SWITCH and MERGE operators. For a deterministic loop(11), the execution time can be determined from the number of iterations. For random loops, the number of iterations is based on the probability assigned to the conditional node, which indicates whether the execution of the loop continues. The expected number of iterations of a random loop whose conditional node is assigned probability p is

    E(I) = p / (1 - p)

(11) The number of iterations of a loop is either deterministic or random.
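The horizontal layering step (after the back edges have been marked and set aside) amounts to a longest-path level assignment. The code below is our own illustration of that idea, not the paper's implementation, and assumes the graph passed in is already acyclic:

```python
def horizontal_layers(nodes, succs):
    """Assign each node of an acyclic graph its earliest horizontal layer:
    sources get layer 0, and every other node gets
    1 + max(layer of its predecessors)."""
    preds = {v: set() for v in nodes}
    for u in nodes:
        for v in succs[u]:
            preds[v].add(u)
    layer_of, remaining = {}, set(nodes)
    while remaining:
        # a node is ready once all of its predecessors have a layer
        ready = [v for v in remaining if preds[v] <= set(layer_of)]
        for v in ready:
            layer_of[v] = max((layer_of[p] + 1 for p in preds[v]), default=0)
        remaining -= set(ready)
    return layer_of
```

Nodes sharing a layer have no dependence path between them, so each layer is a set of candidates for parallel execution, exactly the property the vertical-layering scheme relies on.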

Fig 4: A conditional node implemented with SWITCH and MERGE operators

A node that is not conditional is assigned probability 1. After the probabilities are assigned, the approximate CP is determined by evaluating the earliest time (e) and the latest time (l) at which a node can finish its execution. The set of critical nodes is then found by determining the degree of criticality (i.e., l - e) of each node; a random CP is selected if there is no unique one. The nodes on the critical path are placed in a FIFO queue according to their precedence relationship(12).

Fig 5: Vertically layered graph

The LDP for each of these queued nodes is determined iteratively by removing a node n from the queue and following a procedure similar to finding a critical path from node n. Nodes that have already been arranged into vertical layers or that lie on the CP are not included in the LDP. Each set of nodes thus obtained is then assigned to the first available vertical layer. The separation phase is complete when the queue is empty and all the nodes of the graph have been rearranged into vertical layers.

(12) If node N_i precedes node N_j in execution, N_i has precedence over N_j; this relation is termed the precedence relationship.
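The earliest/latest-time computation behind the degree of criticality l - e can be sketched with one forward and one backward pass over a topological order. This is our illustration under stated assumptions (acyclic graph, `order` already topological; names are ours):

```python
def criticality(order, preds, succs, t):
    """e[v]: earliest finish time of node v; l[v]: latest finish time
    that still keeps the overall schedule length minimal.
    Nodes with degree of criticality l - e == 0 lie on a critical path."""
    e, l = {}, {}
    for v in order:                      # forward pass: earliest finish
        e[v] = t[v] + max((e[p] for p in preds[v]), default=0)
    total = max(e.values())              # length of the critical path
    for v in reversed(order):            # backward pass: latest finish
        l[v] = min((l[s] - t[s] for s in succs[v]), default=total)
    critical = [v for v in order if l[v] == e[v]]
    return e, l, critical
```

On a diamond graph where the branch through b is slower than the branch through c, the pass reports a, b, d as critical and gives c a slack of one time unit.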

In the optimization phase, the inter-PE communication delays(13) are minimized. Two types of inter-PE communication behavior are identified for optimization: Type A inter-PE communication exists when two subsets of nodes exhibit a precedence relationship and are arranged in two distinct vertical layers; Type B inter-PE communication exists when three subsets of nodes exhibit precedence relationships and are arranged in three different vertical layers.

Fig 6: Type A and Type B inter-PE communication

There are three possible cases of Type A inter-PE communication behavior:

    (i)   T_β + C_αβ < T_α
    (ii)  T_α < T_β + C_αβ < T_α + T_β
    (iii) T_α + T_β < T_β + C_αβ

where T_α and T_β represent the execution times on PEs α and β respectively, and C_αβ is the total communication cost (c_αβ + c_βα)(14). In case (i), the execution time on β plus the communication cost is less than the execution time on α; hence the initial assignment of the vertical layers to two distinct PEs does not affect the overall execution time. In case (ii), the execution time T_β and the communication cost are significant enough to affect the execution time T_α and therefore the total execution time. The sets of nodes assigned to PEs α and β can be combined into a single vertical layer to eliminate the communication delay; this is done only if the single-layer assignment does not increase the total execution time. In case (iii), executing the nodes on a single PE yields superior performance: the ratio of communication to execution cost is greater than 1, and combining the two subsets of nodes into one vertical layer executed on a single PE improves the total execution time.

(13) Communication between two nodes that are in successive horizontal layers and not assigned to the same processing element.
(14) c_αβ represents the communication delay of transporting a token from PE α to PE β.
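Under a simplified model, where the separate-PE completion time is max(T_α, T_β + C_αβ) and the merged single-PE time is T_α + T_β, the three Type A cases reduce to a single comparison. This is our sketch of the decision, not the paper's exact procedure:

```python
def should_merge(t_alpha, t_beta, c_ab):
    """Decide whether two vertical layers alpha and beta exhibiting
    Type A communication (c_ab = total communication cost C_ab) should
    be combined onto one PE. Merging is worthwhile only when the serial,
    communication-free schedule is no longer than the parallel schedule
    that pays the communication cost."""
    separate = max(t_alpha, t_beta + c_ab)  # two PEs, with communication
    merged = t_alpha + t_beta               # one PE, no communication
    return merged <= separate
```

Case (i) always keeps the layers separate, case (iii) always merges them, and case (ii) merges only when doing so does not lengthen the schedule, matching the qualitative discussion above.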

In the case of Type B communication behavior, if the communication costs are greater than the execution costs on a single PE, the subsets of nodes are combined into a single vertical layer that can be executed on a single PE. Combining nodes into vertical layers may create a new critical path; hence, this process is repeated iteratively until no improvement in performance can be obtained by combining two subsets of nodes associated with the critical path.

VI. Dual Graph Partitioning for the Threaded Abstract Machine

Schauser et al. proposed an intermediate graph (dual graph) representation to partition a data flow graph, generate threads, and compile them for a Threaded Abstract Machine (TAM). Synchronization, thread scheduling, and storage management in a TAM are explicit in the machine language and exposed to the compiler; hence multithreaded execution is addressed as a compilation problem. Compiler-controlled multithreading is examined through the compilation of Id90, a lenient parallel language(15), for the TAM. The intermediate dual graph representation attempts to minimize thread switching for parallel languages, minimize the total cost of synchronization, and make effective use of critical processor resources such as registers and cache bandwidth.

A dual graph is a directed graph with three types of arcs: data, control, and dependence. A data arc specifies that the value produced at the output of one node is used as an input operand by another node. A control arc u → v specifies that instruction u executes before instruction v and has direct responsibility for scheduling v. A dependence arc specifies that an instruction will be scheduled as an indirect consequence of executing another instruction. Control is represented by tokens traveling along the control arcs; a node fires when control tokens are present on all its control inputs.
Upon firing, a node computes a result based on the data values bound to its data inputs, binds the result to its data outputs, and propagates control tokens to its control outputs. The types of nodes used in dual graphs are as follows:

Fig 7: Dual graph nodes

(15) A language in which functions, expressions, and data structures are non-strict (e.g., returning results before all operands are computed, or accessing and passing data structures around while their components are still being computed).

A simple node describes an arithmetic or logic operation. A join synchronizes control paths. A switch conditionally steers control. A merge steers control from many control inputs to a single control output. A label indicates a separation constraint: the adjacent nodes must be in distinct threads. An outlet sends a message or initiates a request. An inlet receives a message or split-phase response. A const node represents a manifest constant.

Dual graphs are generated by expanding data flow program graph instructions. This local transformation is described by expansion rules for the individual program graph nodes; a program graph arc expands into a data arc and a control arc in the dual graph. A TAM partition is a subset of dual graph nodes together with their incident control and dependence edges. A partition consists of an input region, containing only inlet, merge, and label nodes, and a body, containing simple nodes, outlets, switches, and joins. The outputs of a partition are its outlet nodes and all leaving control arcs. A partition is safe if (i) no output of the partition needs to be produced before all inputs to the body are available, (ii) when the inputs to the body are available, all nodes in the body are executed (no conditional execution within a partition), and (iii) no arc connects a body node to an input node of the same partition (the partition is acyclic).

TAM partitions are generated in one of the following ways.

Dataflow partitioning: A unary operation never needs dynamic synchronization; thus joins, inlets, merges, and labels each start a new partition. Simple, switch, and outlet nodes are placed into the partition of their control predecessor.

Dependence set partitioning: It finds safe partitions by grouping together nodes that depend on the same set of input nodes. This guarantees that there are no cyclic dependencies within a partition and is hence more powerful than dataflow partitioning.
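Dependence set partitioning can be illustrated by propagating, in topological order, the set of input nodes each node transitively depends on and grouping nodes with identical sets. This is our own sketch; the real TAM analysis also distinguishes control from dependence arcs, which we collapse into a single predecessor relation here:

```python
def dependence_set_partition(order, preds, inputs):
    """order: nodes in topological order; preds[v]: predecessors of v;
    inputs: the graph's input (inlet-like) nodes.
    Returns a list of partitions (lists of nodes)."""
    depset = {}
    for v in order:
        if v in inputs:
            depset[v] = frozenset([v])      # an input depends on itself
        else:                               # union of predecessors' sets
            depset[v] = frozenset().union(*(depset[p] for p in preds[v]))
    parts = {}
    for v in order:                         # group identical sets
        parts.setdefault(depset[v], []).append(v)
    return list(parts.values())
```

Because every node in a group becomes executable exactly when the same inputs arrive, no group member waits on another group member's output, which is why such partitions are safe and acyclic.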
Dominance set partitioning: It finds safe partitions by grouping together nodes that dominate the same set of output nodes (outlet nodes and nodes that directly feed a control input of a merge or label).

These partitions are then merged into larger safe partitions by iteratively applying two merge rules.

Merge-up rule: Two partitions alpha and beta can be merged into a larger partition if (i) all input arcs to beta come from alpha, (ii) beta contains no inlet nodes, and (iii) no output arc from the body of alpha goes to an input node of beta.

Merge-down rule: Two partitions alpha and beta can be merged into a larger partition if (i) all output arcs from alpha go to beta, (ii) alpha contains no outlet nodes, and (iii) no output arc from the body of alpha goes to an input node of beta.

After merging, the synchronization costs can be reduced further by redundant-arc elimination and by combining switch and merge nodes.

VII. Conclusion

We can conclude that a balanced graph partitioning is necessary for the following reasons: (i) the original graph may be too large to have all of its nodes processed on a single processor when multiple processors are available;

(ii) assigning several graphs to the same processor without partitioning some of them may result in an unbalanced load, which would not take full advantage of the machine's capability.

Common Design Pattern: We can summarize a common design pattern seen in all the partitioning algorithms. The general steps that could be followed to form efficient partitions are as follows:

Initial processor allocation algorithms allocated nodes directly to processors with the following goals in mind: reduce the matching overhead, increase the efficiency of individual processing units, and reduce the number of processors. However, the nodes were allocated to the processors more or less arbitrarily, and these algorithms at times produced bad allocations or required more processors than the optimal number. Hence the first step is to partition the data flow graph instead of allocating nodes directly.

Partition the graph into regions/sub-graphs with an initial size. Numerous algorithms, such as the Forward algorithm, the Bottom-Up algorithm, the reduced-communication-cost technique, the vertically layered scheme, and Schauser's dual graph technique, were studied for partitioning graphs.
o The Forward and Bottom-Up algorithms were simple implementations of partitioning schemes but did not consider the communication cost. The initial size of a partition was one, and a node was included in a partition only if all its predecessors were present in that partition and including the node would not make the partition exceed the maximum partition size.
o The reduced-communication technique introduced an intermediate graph (the CFG) before partitioning. The translation of an input graph to a CFG involves removing cycles in the graph; the resulting acyclic graph is horizontally layered, and the horizontally layered nodes are allocated concurrently to different partitions.
o The vertically layered scheme partitioned acyclic graphs into horizontal layers and allocated the critical path to the central vertical layer.
The LDP from each node on the critical path was allocated to other vertical layers based on heuristics.
o Schauser et al. implemented partitioning for a specific machine, the Threaded Abstract Machine. They introduced an intermediate form, the dual graph, which is mapped into safe partitions. A partition of a dual graph consists of an input region (inlet, merge, and label nodes), a body (simple nodes, outlets, switches, and joins), and outputs (the outlet nodes and all leaving control arcs).

These regions/sub-graphs/partitions are then combined into larger regions to minimize the communication costs between them, and sequential sub-graphs are allocated to the same regions.
o The processor allocation algorithms, the Forward algorithm, and the Bottom-Up algorithm did not combine partitions, since they do not consider communication costs.

o The reduced communication cost algorithm takes two things into consideration: to minimize the communication cost, it reduces the number of levels in the cut-set and minimizes the size of the cut-set.
o The vertically layered scheme separated nodes into vertical layers and then optimized those layers based on two kinds of communication.
o Optimization in Schauser et al. is done by merging partitions into larger safe partitions using the Merge Up and Merge Down rules.

Other Factors: Some other factors that need to be considered during the partitioning and merging process are as follows:
o Communication cost between regions is a very important factor.
o Initial cluster/region size is also relevant. The larger the initial size, the smaller the number of clusters and the smaller the computation effort. However, a small initial size may yield a better partition in some cases, because a larger number of arcs is considered in deriving the final cut-set.
o The partition should be formed so that the number of different levels in its cut-set is kept minimal.
o Each region should be a single-entry region in order to reduce the complexity of the partitioning algorithm.
o For strongly connected components (cycles in the graph), a compromise must be made between parallelism and communication cost: increasing parallelism increases the communication cost and vice versa.
o Loops in dataflow graphs can also be implemented with SWITCH and MERGE operators, as in the vertically layered scheme.

REFERENCES
1. Y.-F. Lee, B. Ryder, and M. Fiuczynski. Region analysis: a parallel elimination method for data flow analysis. IEEE Transactions on Software Engineering, 21(11), November 1995.
2. Y.-F. Lee and B. Ryder. A comprehensive approach to parallel data flow analysis. In Proceedings of the 1992 International Conference on Supercomputing.
3. C. Koutsougeras, C. A. Papachristou, and R. R. Vemuri. Data flow graph partitioning to reduce communication cost. In Proceedings of the 19th Annual Workshop on Microprogramming (International Symposium on Microarchitecture), 1986.
4. B. Lee, A. R. Hurson, and T. Y. Feng. A Vertically Layered Allocation Scheme for Data Flow Systems.
5. K. E. Schauser, D. E. Culler, and T. von Eicken. Compiler-controlled multithreading for lenient parallel languages. In FPCA '91, Springer-Verlag, LNCS 523, 1991.
6. B. Lee and A. R. Hurson. Dataflow Architectures and Multithreading.
7. J. A. Sharp. Data Flow Computing.
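The merging step that the surveyed schemes share (combine partitions whenever doing so removes arcs from the cut-set, subject to a size limit) can be sketched in code. This is an illustrative greedy sketch under our own assumptions, not a reproduction of any of the published algorithms; the names `cut_size` and `greedy_merge` and the size cap are ours, invented for illustration.

```python
from collections import defaultdict

def cut_size(edges, part_of):
    """Number of arcs whose endpoints lie in different partitions (the cut-set size)."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])

def greedy_merge(edges, part_of, max_size):
    """Repeatedly merge the pair of partitions joined by the most arcs,
    subject to a partition-size cap, until no feasible merge remains.
    Each merge internalizes arcs, shrinking the cut-set."""
    while True:
        # Count arcs crossing each pair of distinct partitions.
        cross = defaultdict(int)
        for u, v in edges:
            a, b = part_of[u], part_of[v]
            if a != b:
                cross[frozenset((a, b))] += 1
        # Current partition sizes (number of nodes per partition).
        sizes = defaultdict(int)
        for p in part_of.values():
            sizes[p] += 1
        # Pick the feasible pair joined by the most crossing arcs.
        best = None
        for pair, n in cross.items():
            a, b = tuple(pair)
            if sizes[a] + sizes[b] <= max_size and (best is None or n > cross[best]):
                best = pair
        if best is None:
            return part_of
        a, b = tuple(best)
        # Merge partition b into partition a.
        for node, p in part_of.items():
            if p == b:
                part_of[node] = a

# A small diamond-shaped dataflow graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4.
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
part_of = {1: "A", 2: "B", 3: "C", 4: "D"}   # start with one node per partition
print(cut_size(edges, part_of))               # 4: every arc crosses partitions
greedy_merge(edges, part_of, max_size=2)
print(cut_size(edges, part_of))               # 2: two arcs internalized
```

On this diamond graph the greedy pass first merges the partitions of nodes 1 and 2, then those of nodes 3 and 4, reducing the cut-set from four arcs to two; a final merge of the two remaining partitions is rejected by the size cap, which stands in for the load-balancing constraint the surveyed schemes enforce.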


More information

& ( D. " mnp ' ( ) n 3. n 2. ( ) C. " n

& ( D.  mnp ' ( ) n 3. n 2. ( ) C.  n CSE Name Test Summer Last Digits of Mav ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. The time to multiply two n " n matrices is: A. " n C. "% n B. " max( m,n, p). The

More information

Software Synthesis from Dataflow Models for G and LabVIEW

Software Synthesis from Dataflow Models for G and LabVIEW Software Synthesis from Dataflow Models for G and LabVIEW Hugo A. Andrade Scott Kovner Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 andrade@mail.utexas.edu

More information

Compiler Optimization and Code Generation

Compiler Optimization and Code Generation Compiler Optimization and Code Generation Professor: Sc.D., Professor Vazgen Melikyan 1 Course Overview Introduction: Overview of Optimizations 1 lecture Intermediate-Code Generation 2 lectures Machine-Independent

More information