CODE GENERATION FOR GENERAL LOOPS USING METHODS FROM COMPUTATIONAL GEOMETRY
T. Andronikos, F. M. Ciorba, P. Theodoropoulos, D. Kamenopoulos and G. Papakonstantinou
Computing Systems Laboratory
National Technical University of Athens
9 Heroon Polytechniou, Zographou Campus, 157 80, Athens, Greece

ABSTRACT
This paper deals with general nested loops and proposes a novel dynamic scheduling technique. General loops contain complex loop bodies (consisting of arbitrary program statements, such as assignments, conditions and repetitions) that exhibit uniform loop-carried dependencies. Therefore it is now possible to achieve efficient parallelization for a vast class of loops, mostly found in DSP, PDEs, and signal and video coding. At the core of this technique lies a simple and efficient dynamic rule (SDS - Successive Dynamic Scheduling) for determining the next ready-to-be-executed iteration at runtime. The central idea is to schedule the iterations on-the-fly by using SDS, along the optimal hyperplane (determined using the QuickHull algorithm). Furthermore, a tool (CRONUS/1) that implements this theory and automatically produces the SPMD parallel code for message passing architectures is presented. As a case study, the FSBM motion estimation algorithm (used in video coding standards, e.g., MPEG-2, H.261) was used. The tool was also tested on a suite of randomly generated loops. The experimental results validate the presented theory and corroborate the efficiency of the generated parallel code.

KEY WORDS
General loops, dynamic scheduling, automatic SPMD code generation, message passing architectures.

1 Introduction
Parallelizing computationally intensive programs has led to dramatic performance improvements. Usually these programs contain repetitive iterations, the majority in the form of nested for loops. The iterations within a loop nest can be either independent iterations or precedence constrained iterations.
The latter can be uniform (constant) or nonuniform throughout the execution of the program. This paper tackles the uniform precedence constraints case by presenting a novel scheduling policy and a code generation technique for general nested loops. When parallelizing nested loops, the following tasks need to be carried out: detection of the inherent parallelism, applying (when necessary) any program transformation that may enable it; computation scheduling (specifying when the different computations are performed); computation mapping (specifying where the different computations are performed); explicit management of memory and communication (an additional task for distributed memory systems); and generation of the code so that each processor executes its apportioned computation and communication explicitly.
Related Work. Scheduling nested loops with uniform dependencies was studied by Lamport, who partitioned the index space into hyperplanes [1]. The idea behind this is that all points that belong to the same hyperplane can be executed in parallel. Darte proved that this method is nearly optimal [2], while the problem of finding the hyperplane that results in the minimum makespan was solved in [3]. Moldovan, Shang, Darte and others applied the hyperplane method to find a linear optimal execution schedule, using diophantine equations [4], linear programming in subspaces [5], or integer programming [2]. All these approaches assume unit execution time for each iteration and zero cost for each communication step (UET model). Today the most prominent technique for the parallelization and code generation of nested loops is tiling [6], [7], [8], [9]. Nevertheless, parallelizing nested loops for distributed memory machines incurs communication overhead.
One way to tackle the scheduling problem in the presence of communication delay is to model the algorithm as a directed acyclic task graph (DAG), where the computation and communication times appear as node and edge weights, respectively. Since the general scheduling problem is well known to be NP-complete [10], [11], researchers have turned to other methods, such as heuristics, approximation algorithms, etc. In order to minimize the inter-processor communication cost, many researchers focus on partitioning the index space into groups of computations that are as independent as possible. Shang and Fortes presented in [12] a method for dividing the index space into independent sets of computations that are assigned to different processors, such that the communication cost is zeroed. Kshemkalyani and Singhal [13] identified different communication patterns in distributed computations in order to reduce the overall communication cost. Feautrier [14] explored the possibility of having the parallelizing compiler determine the distribution of data and computations, using only information available in the source program, provided that it contains only for loops and arrays with affine subscripts. Engels et al. presented in [15] a polynomial-time algorithm for scheduling unit-length jobs on a fixed number of identical parallel machines for the case where the precedence constraints graph is a forest of in/out-trees and the delays are bounded by a constant, such that the makespan of the resulting schedule is minimized. However, the asserted constraints restrict the applicability of their algorithm. In [16] T. Yang et al. present a tool for statically scheduling DAGs and generating the appropriate code for specific message passing MIMD architectures. The static scheduling of the DAGs followed by parallel code generation is a drawback in terms of makespan. On top of that, the parallel code is generated for specific architectures (nCUBE-2 and INTEL-2), which limits portability. Calland et al. present in [17] a method for scheduling and mapping (columnwise/rowwise) tiles on limited computational resources, assuming communication/computation overlap, and in [18] they present a static technique for heterogeneous networks. Generally, there are two basic computation scheduling methods: static and dynamic.
Static scheduling implies the creation, at compilation time, of a timetable for the iterations of the loop nest (by traversing the index space once) that preserves the precedence constraints, followed by computation mapping. Dynamic scheduling implies finding a rule that instructs every processor at run time which loop iteration to compute, rather than an explicit specification at compile time. Here, the computation mapping is embedded in the dynamic rule. Code generation is a complex problem, and most methods exploit parallelism across loop iterations (task level parallelism), with some effort concerning parallelism within an iteration (instruction level parallelism). A generic approach to code generation is not very practical due to the complexity of the problem. Instead, efficient code generation for classes of specific problems with high applicability leads to enhanced performance over the generic approach. Generally, most of the scheduling complexity stems from the fact that precedence constraints (i.e., the computations must be executed in a specific order so as to preserve the meaning of the original sequential program) and resource constraints (i.e., the number of concurrent operations at any time step is bounded by the available resources of the multicomputer at hand) must be tackled simultaneously.
Problem Description. Given a sequential general loop structure, automatically produce an equivalent high speed-up parallel program using a bounded (but arbitrary) number of processors.
Our Approach. In this paper we tackle the computation scheduling, computation mapping and code generation for general loops (GLs) that have arbitrary execution and communication times. GLs are those nested loops for which the loop body consists of generic program statements (such as assignments, conditions and repetitions) and which exhibit uniform loop-carried dependencies (or even no loop-carried dependencies at all).
The loop iterations are considered as points in a Cartesian n-dimensional index space, and the loop-carried dependence vectors are considered as directed arcs between the corresponding index points. We focus on iteration level parallelism under the assumption that the number of processors used is finite. Finally, we automatically produce efficient parallel code for these processors. Our approach to the above problem consists of three steps: 1. We determine the optimal family of hyperplanes 1 using the well known QuickHull algorithm [19]. 2. We define the lexicographic ordering on these hyperplanes, and by applying SDS we schedule the iterations on-the-fly. 3. The presented tool automatically produces the parallel code, for a given number of processors. Step 1 is performed at compile time, whereas steps 2 and 3 are performed at run time. The reason for performing step 2 at run time is the ability to exploit the uniformity of the index space in order to find an efficient adaptive rule. Also, when dealing with large index spaces, performing step 2 at compile time would be very tedious. In our approach, the overhead associated with step 2 is greatly reduced by performing it at run time. The run time scheduling (SDS) introduced in this paper produces a very efficient time schedule while meeting the precedence constraints. Also, it scales with the available number of processors, in that for any given number of processors it produces an efficient time schedule. The tool CRONUS/1 produces the parallel code automatically, using SDS routines and MPI primitives for the dynamic scheduling and execution of the sequential program. CRONUS/1 takes as input a sequential program, which is parsed, and the essential parameters for code generation are extracted: the depth of the loop nest (n), the size of the index space (|J|) and the set of dependence vectors (DS) (if they exist).
Once the crucial parameters are available, the tool calls the QuickHull algorithm, which returns the optimal hyperplane. At this point, the available number of processors (NP) is required as input. With the help of an automatic code generator the appropriate parallel code is generated for the given number of processors. This parallel code contains run time routines for SDS and MPI primitives for data communication (if communication is necessary at all); it is eligible for compilation and execution on the multicomputer at hand.
1 A hyperplane is optimal if no other hyperplane leads to a smaller makespan.
DS = {d_1, . . . , d_m}, m ≥ n, is the set of the m dependence vectors, which must be uniform, i.e., constant throughout the index space, and must have nonnegative coordinates.
Contribution. The efficiency of the scheduling rule used by SDS. SDS uses this rule to instruct each processor which index point to execute next, as well as which processors to send data to, and receive data from (if necessary). The efficiency of the rule ensures that during the execution of the parallel code no artificial communication or computation delays occur that are not present in the initial sequential program. Moreover, each processor spends most of the time executing the appropriate iteration rather than communicating data. The validity and optimality of SDS. The validity is ensured because the points on any subsequent hyperplane depend (eventually) on the points of the current hyperplane (and perhaps on points of previous hyperplanes), whereas the optimality follows from the fact that all inherent parallelism is exploited by following the optimal hyperplane. Perfect load balancing of SDS under the assumption that all processors are homogeneous. This is ensured by the dynamic rule because it distributes the iterations evenly among the processors (in a round robin fashion). Automatic parallel code generation. The automatic code generator facilitates the writing of the parallel code without any help from the user. Therefore it greatly reduces the overhead associated with handwritten coding, especially when conducting experiments for performance comparisons. The rationale behind the automatic code generator is independent of the dynamic scheduling methodology employed. Moreover, generating parallel code for the MPI platform yields high portability.
2 Terminology and Definitions
2.1 General Loops
General Loops (GLs) have the form shown in Fig. 1. In this model the loop body LB(j) contains general program statements that include assignment statements, conditional if statements and repetitions such as for or while. The lower and upper bounds for the loop indices are l_i ∈ Z and u_i ∈ Z, respectively. The depth of the loop nest, n, determines the dimension of the index space J = {j = (i_1, . . . , i_n) ∈ N^n | l_r ≤ i_r ≤ u_r, 1 ≤ r ≤ n}. Each point of the n-dimensional index space is a distinct iteration of the loop body. L = (l_1, . . . , l_n) and U = (u_1, . . . , u_n) are the initial and terminal points of the index space.
Figure 1. Computational model
The only requirement is that any loop instance LB(j) depends on previous loop instance(s) as follows:
LB(j) = f(LB(j − d_1), . . . , LB(j − d_m)) (1)
In the above equation f must be a computable function. If this condition does not hold, the proposed methodology is not applicable to the particular loop. Deciding whether condition (1) holds for an arbitrary loop (containing if, while, exit, continue statements) is a very difficult problem, and we do not claim to have a solution for it. However, we claim that if one verifies condition (1), our methodology is valid. In practice this is usually very easy. To demonstrate that for many real life examples it is trivial to establish that condition (1) holds, we have chosen FSBM as a case study. The nested loop for FSBM is shown in Figure 2.
2.2 Finding the optimal scheduling hyperplane using Convex Hulls
The hyperplane (or wavefront) method [1] was one of the first methods for parallelizing uniform dependence loops. Consequently, it formed the basis of many heuristic algorithms developed in recent years. The choice of the optimal hyperplane for a given index space was one of the challenges that many researchers tackled using diophantine equations, linear programming in subspaces and integer programming.
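To make condition (1) concrete, here is a toy C sketch (ours, not taken from the paper) of a 2D GL whose body satisfies (1) with the assumed uniform dependence vectors d_1 = (1, 0) and d_2 = (0, 1), and with f a simple sum:

```c
enum { N1 = 4, N2 = 5 };

/* A toy 2D general loop satisfying condition (1): every instance
   LB(j) is a computable function f of the instances at j - d1 and
   j - d2, with d1 = (1,0) and d2 = (0,1). Here f is a sum; points
   outside the index space contribute the constant 1. */
int gl_corner(void) {
    int A[N1][N2];
    for (int i1 = 0; i1 < N1; i1++)
        for (int i2 = 0; i2 < N2; i2++) {
            int left = (i1 > 0) ? A[i1 - 1][i2] : 1;  /* LB(j - d1) */
            int down = (i2 > 0) ? A[i1][i2 - 1] : 1;  /* LB(j - d2) */
            A[i1][i2] = left + down;                  /* LB(j) = f(...) */
        }
    return A[N1 - 1][N2 - 1];  /* value at the terminal point U */
}
```

Every instance reads only lexicographically earlier instances, so condition (1) is immediate to verify for such a body.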
In [3] and [20] the problem of finding the optimal hyperplane for uniform dependence loops was reduced to the problem of computing the convex hull 2 of the dependence vectors and the terminal point. For instance, given the dependence vectors d_1 = (1, 8), d_2 = (2, 5), d_3 = (3, 3), d_4 = (6, 2) and d_5 = (8, 1), consider two index spaces: in the first the terminal point is U_1 = (75, 90) (see Figure 3(a)), and in the second
2 The convex hull formed from the index points j_1, . . . , j_m is defined as: CH = {j ∈ N^n | j = λ_1 j_1 + · · · + λ_m j_m, where λ_1, . . . , λ_m ≥ 0 and λ_1 + · · · + λ_m = 1}.
U_2 = (105, 90) (see Figure 3(b)). As shown in Figure 3, the convex hulls are the index points contained by the polygons A, B, C, E, U_1 and A, B, C, E, U_2, respectively. The optimal hyperplane in the first case is defined by the dependence vectors (d_2, d_3), whereas in the second case it is defined by (d_3, d_5). Hence, the equation of the optimal family of hyperplanes in the first case is 2x_1 + x_2 = k, k ∈ N, and in the second case 2x_1 + 5x_2 = k, k ∈ N. Scheduling the loop nest of Figure 3(a) along the hyperplane 2x_1 + x_2 = k leads to a parallel time of ⌈240/9⌉ + 1 = 28 time steps, assuming the initial point of the index space is L_1 = (0, 0).
Figure 2. Six-level nested loop full search block matching motion estimation algorithm
Figure 3. Optimal hyperplane for two different index spaces
3 Lexicographic Ordering on Hyperplanes
The central issue in dynamic scheduling is finding an adaptive rule for instructing every processor what to do at run time, rather than specifying it explicitly at compile time. The adaptive rule is based on the ability to determine the next-to-be-executed point or the required-already-executed point for any loop instance. In order to efficiently define such a rule, a partial ordering over the index points is necessary. This section describes the lexicographic ordering on hyperplanes, by which the index space is traversed lexicographically along hyperplanes, yielding a zigzag traversal in a 2D index space, or a spiral traversal in a 3D (or higher) index space. In order to achieve this, the concepts of successor and predecessor of a point are introduced. The successor provides: the means by which each processor can determine what iteration point to execute next, and a fast and reliable way for the currently running processor to determine the processors requiring the locally computed data. The converse holds for the predecessor.
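The hull-based choice of hyperplane in the example of Section 2.2 can be sketched in C. A monotone-chain lower hull is used below as a simple stand-in for the QuickHull routine the tool actually calls, and the partly illegible coordinates of d1, d4 and d5 in the scanned text are assumed here to be (1, 8), (6, 2) and (8, 1):

```c
typedef struct { long x, y; } Pt;

/* z-component of (b - a) x (c - a): positive iff a->b->c turns left. */
long cross(Pt a, Pt b, Pt c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* Lower-left hull of points sorted by x (monotone-chain sketch,
   standing in for QuickHull). Writes the hull vertices to out and
   returns their number. */
int lower_hull(const Pt *p, int n, Pt *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        /* pop while the chain fails to turn left at the new point */
        while (m >= 2 && cross(out[m - 2], out[m - 1], p[i]) <= 0)
            m--;
        out[m++] = p[i];
    }
    return m;
}

/* Coefficients (a1, a2, k) of the line a1*x + a2*y = k through two
   hull vertices: a candidate optimal hyperplane. */
void hyperplane(Pt u, Pt v, long *a1, long *a2, long *k) {
    *a1 = u.y - v.y;
    *a2 = v.x - u.x;
    *k  = *a1 * u.x + *a2 * u.y;
}
```

For these vectors the hull drops d4, and the vertices d2 = (2, 5), d3 = (3, 3) give the line 2x + y = 9, i.e., the optimal family 2x1 + x2 = k of the first case.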
The algorithm used to compute the convex hulls is the well known QuickHull algorithm [19]. All index points on a hyperplane can be lexicographically ordered. Therefore, given a family of hyperplanes of the general form
Π_k : a_1 x_1 + · · · + a_n x_n = k, (2)
where a_i, x_i, k ∈ N, there exists a lexicographically minimum and a lexicographically maximum index point on every hyperplane that contains index points 3. The algorithms that produce the minimum and maximum points of hyperplane Π_k, denoted min_{k,n} and max_{k,n}, are given in the Appendix. Both min_{k,n} and max_{k,n} depend on the current hyperplane's value k and the number of coordinates n. Let i and j be two index points that belong to the same hyperplane Π_k. j is the successor of i, denoted j = Succ(i), if i < j (i is lexicographically smaller than j) and for no other index point j' of the same hyperplane does it hold that i < j' < j. In the special case where i is the maximum index point of Π_k, Succ(i) is the minimum index point of Π_{k+1}. In a similar fashion, i is the predecessor of j, denoted i = Pred(j), if j is the successor of i. Finally, we define: Succ^r(j) = j if r = 0; Succ(j) if r = 1; Succ(Succ^{r−1}(j)) if r > 1.
Continuing the example of the previous section, we depict in Figure 4 the minimum and the maximum points for
3 It is possible for the intersection of the index space with a particular hyperplane to be the empty set, in which case the hyperplane has no minimum and no maximum index point.
hyperplanes 2x_1 + x_2 = 9 (Figure 4(a)) and 2x_1 + 5x_2 = 21 (Figure 4(b)).
Figure 4. Minimum and maximum points on hyperplanes
In this paper we advocate the use of successor index points along the optimal hyperplane as an adaptive dynamic rule. This rule is efficient in the sense that the overhead induced by its computation is negligible. Moreover, with regard to distributed memory platforms, it does not incur any additional communication cost. The index space is traversed hyperplane by hyperplane, and each hyperplane lexicographically, yielding a zigzag/spiral traversal. This ensures its validity and its optimality. The former because the points of any subsequent hyperplane depend (eventually) on the points of the current hyperplane (and perhaps on points of previous hyperplanes), and the latter because by following the optimal hyperplane all inherent parallelism is exploited.
4 Overview of CRONUS/1
This section gives an overview of the tool and details the SDS method 1.
1 Cronus/1 is an existing semi-automatic parallelization tool; a detailed description of the tool is contained in the Cronus User's Guide, which can be found at cflorina/research/ongoing research.html. At this site one can also find the generated parallel code for the FSBM example given in the following section.
The organization of CRONUS/1 is given in Figure 5. In the first stage (User Input), the user inputs a serial program. The next stage (Compile Time) consists of the loop nest detection phase (Parallelism Detection). If no loop nest can be found in the sequential program, the tool stops. During the second phase (Parameters Extraction), the program is parsed and the following essential parameters are extracted: the depth of the loop nest (n), the size of the (iteration) index space (|J|) and the set of dependence vectors (DS). Once the crucial parameters are available, the tool calls the QuickHull algorithm, which returns the optimal hyperplane. At this point, the available number of processors (NP) is required as input. With the help of an automatic code generator the appropriate parallel code is generated (in the Automatic Code Generation phase) for the given number of processors. This is achieved with the help of a Perl script that operates on a configuration file, which contains all the required information. Because we tackle complex loop nests, the implementation of a general parser for such loops is beyond the scope of our research. Hence, the user must manually define the loop body and the index space boundaries. In the configuration file the user must also define a startup function in C (automatically called by the generated parallel code) to perform data initialization on every processor, right before the actual parallel computation starts. The parallel code is written in C and contains run time routines for SDS and MPI primitives for data communication (if communication is necessary at all); it is eligible for compilation and execution on the multicomputer at hand (in the Run Time stage).
Figure 5. Organization of CRONUS/1
4.1 Successive Dynamic Scheduling (SDS)
The scheduling method introduced in this paper was developed under the following assumptions: the multiprocessor system is uniform/homogeneous (i.e., the processors are identical) and non-preemptive (a processor completes the current task before executing a new one). The most prominent features of SDS are: It is a dynamic scheduling strategy (both scheduling and task execution are performed at runtime, based on the availability of processors and on the release of iteration points from their dependencies, if any exist). It is a distributed scheduling strategy (the scheduling task and/or the scheduling information are distributed among processors and their memories).
It is a self-scheduling technique (an idle processor determines its next iteration of the loop nest by incrementing the loop indices in a synchronized way). The scheduling policy is the following: assuming there are NP available processors (P_1, . . . , P_NP), P_1 executes the initial index point L, P_2 executes Succ(L), P_3 executes Succ^2(L), and so on, until all processors are employed in execution for the first time. Upon completion of L, P_1 executes the next ready-to-be-executed point, found by skipping NP points in the index space (in the zigzag/spiral manner described in Section 3). The coordinates of this point are obtained by applying the Succ function NP times
to the point currently executed by P_1: Succ^NP(L). Similarly, upon completion of its current point (call it j), P_2 executes the point given by Succ^NP(j), and so on until all index points are exhausted. SDS ends when the terminal point U has been executed. Thus, SDS uses a perfect load distribution policy because it assigns to a processor only one loop iteration at a time. In other words, the way iterations are assigned to processors follows the round-robin scheduling method to achieve this perfect load balancing.
5 Experimental Validation
CRONUS/1 was coded in C, except for the Automatic Code Generator, which is written in Perl. The parallel code produced by CRONUS/1 uses point-to-point, synchronous send and receive MPI calls when required. The experiments were conducted on a cluster of 16 identical 500MHz Pentium III nodes. Each node has 256MB of RAM and a 10GB hard drive, and runs Linux. We used MPI (MPICH) to run the experiments over the FastEthernet interconnection network.
5.1 Case Study: FSBM ME algorithm
Block motion estimation in video coding standards such as MPEG-1, 2, 4 and H.261 is perhaps one of the most computation-intensive multimedia operations. Hence, it is also among the most frequently implemented algorithms. The block matching algorithm is an essential element in video compression, used to remove the temporal redundancy among adjacent frames. The motion compensated frame is reconstructed from motion estimated blocks of pixels. Every pixel in each block is assumed to displace with the same 2D displacement, called the motion vector, obtained with the Block ME algorithm. The Full-Search Block-Matching Motion Estimation Algorithm (FSBM ME) [21] is a block matching method in which every pixel in the search area is tested in order to find the best matching block. Therefore, this algorithm offers the best match, at an extremely high computation cost.
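The SDS policy of Section 4.1 can be emulated in a few lines of C by materializing the zigzag order (sort by hyperplane value, then lexicographically) and dealing points round-robin; the real tool computes Succ on the fly instead of materializing the index space, so the sketch below is purely illustrative (the bounds and hyperplane are ours):

```c
#include <stdlib.h>

/* One iteration point tagged with its hyperplane value
   k = a1*x1 + a2*x2. */
typedef struct { int x1, x2, k; } Iter;

static int zigzag_cmp(const void *pa, const void *pb) {
    const Iter *a = pa, *b = pb;
    if (a->k != b->k) return a->k - b->k;      /* hyperplane by hyperplane */
    if (a->x1 != b->x1) return a->x1 - b->x1;  /* then lexicographically */
    return a->x2 - b->x2;
}

/* Materialize the zigzag traversal of the 2D index space
   [0,u1] x [0,u2] for the hyperplane family a1*x1 + a2*x2 = k.
   Under SDS with NP processors, processor Pr then executes the
   points at positions r-1, r-1+NP, r-1+2*NP, ... of this order. */
int sds_order(int u1, int u2, int a1, int a2, Iter *pts) {
    int n = 0;
    for (int x1 = 0; x1 <= u1; x1++)
        for (int x2 = 0; x2 <= u2; x2++)
            pts[n++] = (Iter){ x1, x2, a1 * x1 + a2 * x2 };
    qsort(pts, (size_t)n, sizeof *pts, zigzag_cmp);
    return n;
}
```

On the 6 x 6 space [0,5] x [0,5] with the hyperplane 2x1 + x2 = k and NP = 3, processor P1 starts at L = (0, 0), P2 at Succ(L) = (0, 1), and each processor receives exactly 12 of the 36 points, matching the perfect load balance claimed for homogeneous processors.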
Assuming a current video frame is divided into N_h × N_v blocks in the horizontal and vertical directions, respectively, with each block containing N × N pixels, the most popular similarity criterion is the mean absolute distortion (MAD), defined as
MAD(m, n) = (1/N^2) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} |x(i, j) − y(i + m, j + n)| (3)
where x(i, j) and y(i + m, j + n) are the pixels of the current and previous frames. The motion vector (MV) corresponding to the minimum MAD within the search area is given by
MV = arg{min MAD(m, n)}, −p ≤ m, n ≤ p, (4)
where p is the search range parameter. The algorithm focuses on the situation where the search area is a region in the reference frame consisting of (2p + 1)^2 pixels. In FSBM, the MAD differences between the current block and all (2p + 1)^2 candidate blocks are computed. The displacement that yields the minimum MAD among these (2p + 1)^2 positions is chosen as the motion vector corresponding to the present block. For the entire video frame, this highly regular FSBM can be described as a six-level nested loop algorithm, as shown in Fig. 2. As can be seen from the figure, the general loop nest is designated by the two outer loops, whereas the four inner loops represent the loop body. Unfortunately, this algorithm does not have any loop-carried dependencies involving the two outer loop indices, i.e., the iterations of the FSBM loop body are completely independent of each other. This makes the FSBM ME algorithm a particular case study for CRONUS/1, due to its lack of loop-carried dependencies. We have also tested the algorithm with artificial dependencies involving the two outer loops, and the results are comparable with the ones presented in Section 5.2. We do not present these results here due to space limitations. However, CRONUS/1 performs very well for the FSBM as it is (as can be seen later in this section), thus preserving the advantage of perfect load balancing provided by SDS.
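Equations (3) and (4) translate directly into C. The sketch below uses our own naming, adds an explicit block origin (bx, by) for concreteness, assumes row-major frames W pixels wide, and leaves bounds checking to the caller:

```c
/* Mean absolute distortion (3) between the N x N block of the
   current frame cur at (bx, by) and the block of the previous
   frame prev displaced by (m, n). */
double mad(const unsigned char *cur, const unsigned char *prev,
           int W, int bx, int by, int N, int m, int n) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int d = cur[(by + i) * W + (bx + j)]
                  - prev[(by + i + n) * W + (bx + j + m)];
            sum += d < 0 ? -d : d;
        }
    return (double)sum / ((double)N * N);
}

/* Full search (4): test all (2p+1)^2 displacements -p <= m,n <= p
   and return the motion vector minimizing the MAD. */
void fsbm_mv(const unsigned char *cur, const unsigned char *prev,
             int W, int bx, int by, int N, int p, int *mv_m, int *mv_n) {
    double best = -1.0;
    for (int m = -p; m <= p; m++)
        for (int n = -p; n <= p; n++) {
            double d = mad(cur, prev, W, bx, by, N, m, n);
            if (best < 0.0 || d < best) { best = d; *mv_m = m; *mv_n = n; }
        }
}
```

fsbm_mv simply scans every candidate displacement, which is exactly the exhaustive cost that makes FSBM attractive to parallelize.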
5.2 Testing with FSBM
In this section we present the performance of CRONUS/1 for our case study, the FSBM algorithm. The parallel execution time is the sum of the communication time, the busy time (SDS overhead + loop body computation) and the idle time (time spent by a processor waiting for data to become available, or for points to become eligible for execution). The obtained speedup (defined as the sequential execution time over the parallel execution time) is reported in comparison with the ideal speedup for different numbers of processors. The efficiency (measured in percent) is defined as the speedup over the number of processors used to achieve the respective speedup, i.e., efficiency = speedup / (# of processors), and the optimal efficiency is considered 100%. We experimented on the FSBM algorithm with different frame sizes and search ranges and produced the serial and parallel code. Our results prove to be very close to the ideal speedup. The results in Fig. 6 were produced for a frame of 1024 × 768 pixels, a search range of 10 blocks, and a block size of 16 × 16 pixels. The results in Fig. 7 are for a 1280 × 1024 pixels frame, a search range of 15 blocks and the same block size as previously. Finally, Fig. 8 shows the performance of our tool for frames of 1600 × 1024 pixels, a search range of 15 blocks, and the same block size. For all FSBM tests, CRONUS/1 used the available 16-node cluster. This case study does not exhibit loop carried dependencies (with respect to the two outer loops). However, we chose to present the FSBM algorithm because of its high practical importance in video coding and its extremely high computational cost (due to its complex loop body).
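The two metrics used in this and the next subsection can be written out as trivial helpers (ours, for clarity only):

```c
/* Speedup = T_sequential / T_parallel; efficiency (in percent) =
   speedup / number-of-processors * 100, so the ideal value is 100. */
double speedup(double t_seq, double t_par) {
    return t_seq / t_par;
}

double efficiency_pct(double t_seq, double t_par, int np) {
    return speedup(t_seq, t_par) / np * 100.0;
}
```

For example, a 100 s sequential run finishing in 12.5 s on 8 processors gives a speedup of 8 and an efficiency of 100%.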
Figure 6. Speedup comparison for 1024 x 768 frame size
Figure 7. Speedup comparison for 1280 x 1024 frame size
Figure 8. Speedup comparison for 1600 x 1024 frame size
5.3 Experiments on random examples
In this section we present the performance of CRONUS/1 for a suite of randomly generated GLs with uniform loop carried dependencies. The parallel execution time is taken as above (the sum of the communication time, the busy time: SDS overhead + loop body computation, and the idle time: the time spent by a processor waiting for data to become available, or for points to become eligible for execution). The obtained speedup is defined as above, and is reported in comparison with the ideal speedup for different numbers of processors. The efficiency (measured in percent) is defined as the speedup over the number of processors used to achieve the respective speedup, i.e., efficiency = speedup / (# of processors), and the optimal efficiency is considered 100%. The prerequisite for obtaining the most favorable speedups from parallelization is that the problem size must be sufficiently large. The two sets of randomly generated GLs have index space sizes of 600 × 600 and 800 × 800, respectively, with 2-6 uniform dependence vectors of sizes 1-8. The results for these two sets are given in Fig. 9 and 10. The experimental results validate the presented theory and corroborate the efficiency of the generated parallel code.
Figure 9. Speedup comparison for random examples with 600 x 600 index space size
Figure 10. Speedup comparison for random examples with 800 x 800 index space size
6 Conclusion
In this paper we propose a dynamic scheduling policy and present a tool that automatically generates parallel code for GLs based on this theory. Our philosophy is that simplicity and efficiency are the key factors for minimizing the
8 runtime of the parallel program. In our approach the compilation time is kept to a minimum because the index space is not traversed and the issue of what iteration to compute next is solved at runtime. We have decided to make this trade-off because the successor concept is very simple and efficient, i.e., it does not incur a significant penalty especially considering heavy loop bodies. We strongly believe that an efficient parallelizing tool should strive to minimize the sum of compilation and run time compared to cost of the original sequential algorithm. Further work will focus on porting the tool onto different interconnection networks such as SCI and Myrinet, both intended for reducing the network latency of the currently used FastEthernet, and on reducing the communication costs. 7 Appendix 7.1 Defining the successor on hyperplanes Consider the hyperplane Π k : a 1 x a n x n = k, where a i, x i, k N. min k,n and max k,n can be defined recursively as follows: { ( k min k,1 = max k,1 = a 1 ) if k = a 1 k a 1 ; undef ined, otherwise k for (i= a n+1 ; i >= 0; i- -) { x n+1 = i; if min (k an+1 i),n is defined then return (min (k an+1 i),n, x n+1); } Figure 11. Computing min k,n+1 recursively Now let j = (j 1,..., j n ) be an index point of hyperplane Π k. Succ(j) is defined as follows: for (l = n 1; l > 0; l- -) { p = j l j n ; for (i=1; i < p a l +1; i++) { q = p a l i; if min q,n l is defined then return (j 1,..., j l + a l i, min q,n l ); } } Figure 12. Computing the successor 7.2 Implementation and availability The tool described here is available by request from the authors. More information and other related papers can be found at cflorina/research/ongoing research.html. References [1] L. Lamport. The parallel execution of DO loops. Comm. of the ACM, 37(2):3 93, February [2] A. Darte, L. Khachiyan, and Y. Robert. Linear scheduling is nearly optimal. Par. Proc. Letters, 1.2:73 1, [3] G. Papakonstantinou, T. Andronikos, and I. Drositis. 
On the parallelization of UET/UET-UCT loops. NPSC Journal on Computing.

[4] D. I. Moldovan and J. Fortes. Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Transactions on Computers, C-35(1):1-11, 1986.

[5] W. Shang and J.A.B. Fortes. Time optimal linear schedules for algorithms with uniform dependencies. IEEE Transactions on Computers, 40(6):723-742, 1991.

[6] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 319-329, January 1988.

[7] Jingling Xue. On tiling as a loop transformation. Parallel Processing Letters, 7(4):409-424, 1997.

[8] G. Goumas, M. Athanasaki, and N. Koziris. Automatic code generation for executing tiled nested loops onto parallel architectures. In Proceedings of the 2002 ACM Symposium on Applied Computing, pages 876-881. ACM Press, 2002.

[9] G. Goumas, A. Sotiropoulos, and N. Koziris. Minimizing completion time for loop tiling with computation and communication overlapping. In 15th International Parallel and Distributed Processing Symposium, California, April 2001. IEEE Press.

[10] J. Ullman. NP-complete scheduling problems. J. of Comp. and System Sciences, 10:384-393, 1975.

[11] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.

[12] W. Shang and J.A.B. Fortes. Independent partitioning of algorithms with uniform dependencies. IEEE Transactions on Computers, 41:190-206, 1992.

[13] A.D. Kshemkalyani and M. Singhal. Communication patterns in distributed computations. Journal of Parallel and Distributed Computing, 62, 2002.

[14] P. Feautrier. Automatic distribution of data and computations. Technical Report 2000/3, March 2000.
[15] D.W. Engels, J. Feldman, D.R. Karger, and M. Ruhl. Parallel processor scheduling with delay constraints. In 12th Annual Symposium on Discrete Algorithms, NY, USA, 2001. ACM Press.

[16] T. Yang and A. Gerasoulis. PYRROS: Static task scheduling and code generation for message passing multiprocessors. In Proceedings of the 1992 ACM International Conference on Supercomputing, Washington, DC, 1992.

[17] Pierre-Yves Calland, Jack Dongarra, and Yves Robert. Tiling on systems with communication/computation overlap. Concurrency - Practice and Experience, 11(3), 1999.

[18] Pierre Boulet, Jack Dongarra, Yves Robert, and Frédéric Vivien. Static tiling for heterogeneous computing platforms. Parallel Computing, 25(5):547-568, 1999.

[19] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4):469-483, 1996.

[20] I. Drositis, T. Andronikos, M. Kalathas, G. Papakonstantinou, and N. Koziris. Optimal loop parallelization in n-dimensional index spaces.

[21] H. Yee and Yu Hen Hu. A novel modular systolic array architecture for full-search block matching motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 5(5):407-416, October 1995.
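The recursive definitions of min_{k,n} and Succ(j) in the Appendix can be sketched in Python as follows. This is a sketch under our reading of Figures 11 and 12 (the pseudocode is partly garbled in the transcription): the successor increments coordinate l by the smallest i for which the remaining weight q = p - a_l (j_l + i) admits a point in the last n-l coordinates, yielding a lexicographic scan of the hyperplane. The function names min_point and successor are ours, not from the paper, and 0-based indexing replaces the paper's 1-based subscripts.

```python
def min_point(k, a):
    """First point of the hyperplane a[0]*x_0 + ... + a[n-1]*x_(n-1) = k
    over the natural numbers, or None if no such point exists.
    (Sketch of Figure 11; 0-based counterpart of min_{k,n}.)"""
    if len(a) == 1:
        # Base case: a single coordinate works only if a[0] divides k.
        return (k // a[0],) if k % a[0] == 0 else None
    a_n = a[-1]
    for i in range(k // a_n, -1, -1):          # try the largest last coordinate first
        head = min_point(k - a_n * i, a[:-1])  # solve the remaining (n-1)-dim problem
        if head is not None:
            return head + (i,)
    return None

def successor(j, a):
    """Next point after j on the hyperplane sum(a[m]*x_m) = const, or None
    if j is the last point. (Sketch of Figure 12, as reconstructed above.)"""
    n = len(j)
    for l in range(n - 2, -1, -1):                 # positions n-2 .. 0 (0-based)
        p = sum(a[m] * j[m] for m in range(l, n))  # weight carried by positions l..n-1
        for i in range(1, p // a[l] + 1):
            q = p - a[l] * (j[l] + i)              # weight left for the trailing coords
            if q < 0:
                break
            tail = min_point(q, a[l + 1:])
            if tail is not None:
                return j[:l] + (j[l] + i,) + tail
    return None
```

Starting from min_point and repeatedly applying successor enumerates the hyperplane: for a = (1, 1, 1) and k = 2 the scan visits (0,0,2), (0,1,1), (0,2,0), (1,0,1), (1,1,0), (2,0,0) in lexicographic order, which is how SDS would walk the points of one hyperplane at runtime.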
More information