CODE GENERATION FOR GENERAL LOOPS USING METHODS FROM COMPUTATIONAL GEOMETRY


T. Andronikos, F. M. Ciorba, P. Theodoropoulos, D. Kamenopoulos and G. Papakonstantinou
Computing Systems Laboratory, National Technical University of Athens
9 Heroon Polytechniou, Zographou Campus, 157 80 Athens, Greece

ABSTRACT

This paper deals with general nested loops and proposes a novel dynamic scheduling technique. General loops contain complex loop bodies (consisting of arbitrary program statements, such as assignments, conditions and repetitions) that exhibit uniform loop-carried dependencies. It is therefore now possible to achieve efficient parallelization for a vast class of loops, mostly found in DSP, PDEs, and signal and video coding. At the core of this technique lies a simple and efficient dynamic rule (SDS, Successive Dynamic Scheduling) for determining the next ready-to-be-executed iteration at runtime. The central idea is to schedule the iterations on the fly by using SDS, along the optimal hyperplane (determined using the QuickHull algorithm). Furthermore, a tool (CRONUS/1) that implements this theory and automatically produces SPMD parallel code for message passing architectures is presented. As a case study, the FSBM motion estimation algorithm (used in video coding standards, e.g., MPEG-2 and H.261) was used. The tool was also tested on a suite of randomly generated loops. The experimental results validate the presented theory and corroborate the efficiency of the generated parallel code.

KEY WORDS
General loops, dynamic scheduling, automatic SPMD code generation, message passing architectures.

1 Introduction

Parallelizing computationally intensive programs has led to dramatic performance improvements. Usually these programs contain repetitive computations, the majority in the form of nested for-loops. The iterations within a loop nest can be either independent or precedence constrained. The latter constraints can be uniform (constant) or non-uniform throughout the execution of the program. This paper tackles the uniform precedence constraints case by presenting a novel scheduling policy and a code generation technique for general nested loops. When parallelizing nested loops, the following tasks need to be carried out:

- Detection of the inherent parallelism, applying (when necessary) any program transformation that may enable it.
- Computation scheduling (specifying when the different computations are performed).
- Computation mapping (specifying where the different computations are performed).
- Explicit management of memory and communication (an additional task for distributed memory systems).
- Code generation, so that each processor executes its apportioned computation and communication explicitly.

Related Work. Scheduling nested loops with uniform dependencies was studied by Lamport, who partitioned the index space into hyperplanes [1]. The idea behind this is that all points that belong to the same hyperplane can be executed in parallel. Darte proved that this method is nearly optimal [2], while the problem of finding the hyperplane that results in the minimum makespan was solved in [3]. Moldovan, Shang, Darte and others applied the hyperplane method to find a linear optimal execution schedule, using diophantine equations [4], linear programming in subspaces [5], or integer programming [2]. All these approaches assume unit execution time for each iteration and zero communication for each communication step (the UET model).
Today the most prominent technique for the parallelization and code generation of nested loops is tiling [6], [7], [8], [9]. Nevertheless, parallelizing nested loops for distributed memory machines incurs communication overhead. One way to tackle the scheduling problem in the presence of communication delays is to model the algorithm by a directed acyclic task graph (DAG), in which the computation and communication times are represented as node and edge weights, respectively. Since the general scheduling problem is well known to be NP-complete [10], [11], researchers have turned their attention to other methods, such as heuristics and approximation algorithms. In order to minimize the inter-processor communication cost, many researchers focus on partitioning the index space

into groups of computations that are as independent as possible. Shang and Fortes presented in [12] a method for dividing the index space into independent sets of computations that are assigned to different processors, such that the communication cost is zeroed. Kshemkalyani and Singhal [13] identified different communication patterns in distributed computations in order to reduce the overall communication cost. Feautrier [14] explored the possibility of having the parallelizing compiler determine the distribution of data and computations, using only information available in the source program, provided that it contains only for-loops and arrays with affine subscripts. Engels et al. presented in [15] a polynomial-time algorithm for scheduling unit-length jobs on a fixed number of identical parallel machines, for the case where the precedence constraints graph is a forest of in/out-trees and the delays are bounded by a constant, such that the makespan of the resulting schedule is minimized. However, the asserted constraints restrict the applicability of their algorithm. In [16], T. Yang et al. present a tool for statically scheduling DAGs and generating the appropriate code for specific message passing MIMD architectures. Statically scheduling the DAG before generating the parallel code is a drawback in terms of makespan. On top of that, the parallel code is generated for specific architectures (nCUBE-2 and Intel-2), which shows a lack of portability. Calland et al. present in [17] a method for scheduling and mapping (columnwise/rowwise) tiles on limited computational resources, assuming communication/computation overlap, and in [18] they present a static technique for heterogeneous networks.

Generally, there are two basic computation scheduling methods: static and dynamic. Static scheduling implies the creation, at compilation time, of a timetable for the iterations of the loop nest (by traversing the index space once) that preserves the precedence constraints, followed by computation mapping. Dynamic scheduling implies finding a rule that instructs every processor at run time which loop iteration to compute, rather than an explicit specification at compile time. Herein, the computation mapping is embedded in the dynamic rule. Code generation is a complex problem, and most methods exploit parallelism across loop iterations (task-level parallelism), with some effort concerning parallelism within an iteration (instruction-level parallelism). A generic approach to code generation is not very practical due to the complexity of the problem. Instead, efficient code generation for classes of specific problems with high applicability leads to better performance than the generic approach. Generally, most of the scheduling complexity stems from the fact that precedence constraints (i.e., the computations must be executed in a specific order so as to preserve the meaning of the original sequential program) and resource constraints (i.e., the number of concurrent operations at any time step is bounded by the available resources of the multicomputer at hand) must be tackled simultaneously.

Problem Description. Given a sequential general loop structure, automatically produce an equivalent high-speedup parallel program using a bounded (but arbitrary) number of processors.

Our Approach. In this paper we tackle the computation scheduling, computation mapping and code generation for general loops (GLs) that have arbitrary execution and communication times.
GLs are nested loops whose loop body consists of generic program statements (such as assignments, conditions and repetitions) and which exhibit uniform loop-carried dependencies (or even no loop-carried dependencies at all). The loop iterations are considered as points in a Cartesian n-dimensional index space, and the loop-carried dependence vectors are considered as directed arcs between the corresponding index points. We focus on iteration-level parallelism under the assumption that the number of processors used is finite. Finally, we automatically produce efficient parallel code for these processors. Our approach to the above problem consists of three steps:

1. We determine the optimal family of hyperplanes (a hyperplane is optimal if no other hyperplane leads to a smaller makespan) using the well-known QuickHull algorithm [19].
2. We define the lexicographic ordering on these hyperplanes and, by applying SDS, we schedule the iterations on the fly.
3. The presented tool automatically produces the parallel code for a given number of processors.

Step 1 is performed at compile time, whereas steps 2 and 3 are performed at run time. The reason for performing step 2 at run time is the ability to exploit the uniformity of the index space in order to find an efficient adaptive rule. Moreover, when dealing with large index spaces, performing step 2 at compile time would be very tedious; in our approach, the overhead associated with step 2 is greatly reduced by performing it at run time. The run-time scheduling (SDS) introduced in this paper produces a very efficient time schedule while meeting the precedence constraints. It also scales with the available number of processors, in that for any given number of processors it produces an efficient time schedule. The tool, CRONUS/1, produces the parallel code automatically, using SDS routines and MPI primitives for the dynamic scheduling and execution of the sequential program. CRONUS/1 takes as input a sequential program, which is parsed, and the essential parameters for code generation are extracted: the depth of the loop nest (n), the size of the index space (|J|) and the set of dependence vectors (DS), if they exist. Once these crucial parameters are available, the tool calls the QuickHull algorithm, which returns the optimal hyperplane. At this point, the available number of processors (NP) is required as input. With the help of an automatic code generator, the appropriate parallel code is generated for the given number of processors.

The generated parallel code contains run-time routines for SDS and MPI primitives for data communication (if communication is necessary at all); it is eligible for compilation and execution on the multicomputer at hand. DS = {d_1, ..., d_m}, m ≥ n, is the set of the m dependence vectors, which must be uniform, i.e., constant throughout the index space, and must have non-negative coordinates.

Contribution.

- The efficiency of the scheduling rule used by SDS. SDS uses this rule to instruct each processor which index point to execute next, as well as which processors to send data to and receive data from (if necessary). The efficiency of the rule ensures that, during the execution of the parallel code, no artificial communication or computation delays occur that are not present in the initial sequential program. Moreover, each processor spends most of its time executing the appropriate iteration rather than communicating data.
- The validity and optimality of SDS. The validity is ensured because the points on any subsequent hyperplane depend (eventually) on the points of the current hyperplane (and perhaps on points of previous hyperplanes), whereas the optimality follows from the fact that all inherent parallelism is exploited by following the optimal hyperplane.
- Perfect load balancing of SDS, under the assumption that all processors are homogeneous. This is ensured by the dynamic rule, because it distributes the iterations evenly among processors (in a round-robin fashion).
- Automatic parallel code generation. The automatic code generator produces the parallel code without any help from the user. It therefore greatly reduces the overhead associated with hand-written coding, especially when conducting experiments for performance comparisons. The rationale behind the automatic code generator is independent of the dynamic scheduling methodology employed. Moreover, generating parallel code for the MPI platform yields high portability.

2 Terminology and Definitions

2.1 General Loops

General Loops (GLs) have the form shown in Fig. 1. In this model, the loop body LB(j) contains general program statements, including assignment statements, conditional if statements and repetitions such as for or while. The lower and upper bounds of the loop indices are l_i and u_i ∈ Z, respectively. The depth of the loop nest, n, determines the dimension of the index space J = {j = (i_1, ..., i_n) ∈ N^n | l_r ≤ i_r ≤ u_r, 1 ≤ r ≤ n}. Each point of the n-dimensional index space is a distinct iteration of the loop body. L = (l_1, ..., l_n) and U = (u_1, ..., u_n) are the initial and terminal points of the index space.

Figure 1. Computational model

The only requirement is that any loop instance LB(j) depends on previous loop instance(s) as follows:

LB(j) = f(LB(j − d_1), ..., LB(j − d_m))   (1)

where f must be a computable function. If this condition does not hold, the proposed methodology is not applicable to the particular loop. Deciding whether condition (1) holds for an arbitrary loop (containing if, while, exit, or continue statements) is a very difficult problem, and we do not claim to have a solution for it. However, we claim that if one verifies condition (1), our methodology is valid. In practice this is usually very easy. To demonstrate that for many real-life examples it is trivial to establish that condition (1) holds, we have chosen FSBM as a case study. A small illustrative GL is given below.
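As an illustration (our own example, not taken from the paper's benchmarks; the function f, the array size and the inner repetition are chosen arbitrarily), a minimal GL in the sense of Fig. 1 with the uniform dependence set DS = {(2, 5), (3, 3)} looks as follows; every instance LB(i, j) depends only on the instances at (i, j) − (2, 5) and (i, j) − (3, 3), so condition (1) holds:

/* gl_example.c: a minimal general loop (GL). The body mixes an assignment,
 * a condition and an inner while-repetition, yet each instance depends on
 * earlier instances only through the uniform vectors (2,5) and (3,3). */
#include <math.h>

#define U1 75
#define U2 90
static double A[U1 + 1][U2 + 1];

void gl(void)
{
    for (int i = 0; i <= U1; i++)
        for (int j = 0; j <= U2; j++) {      /* one index point (i, j)    */
            double t = 1.0;
            if (i >= 3 && j >= 5)            /* interior point: apply (1) */
                t = A[i - 2][j - 5] + A[i - 3][j - 3];
            int r = 0;                       /* inner repetition          */
            while (r++ < 4)
                t = 0.5 * (t + sqrt(fabs(t)));
            A[i][j] = t;                     /* LB(i, j) = f(...)         */
        }
}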
The nested loop for FSBM is shown in Figure 2.

Figure 2. Six-level nested loop full-search block-matching motion estimation algorithm

2.2 Finding the optimal scheduling hyperplane using convex hulls

The hyperplane (or wavefront) method [1] was one of the first methods for parallelizing uniform dependence loops. Consequently, it formed the basis of many heuristic algorithms developed in recent years. The choice of the optimal hyperplane for a given index space is one of the challenges that many researchers tackled, using diophantine equations, linear programming in subspaces and integer programming. In [3] and [20], the problem of finding the optimal hyperplane for uniform dependence loops was reduced to the problem of computing the convex hull of the dependence vectors and the terminal point. (The convex hull formed from the index points j_1, ..., j_m is defined as CH = {j ∈ N^n | j = λ_1 j_1 + ⋯ + λ_m j_m, where λ_1, ..., λ_m ≥ 0 and λ_1 + ⋯ + λ_m = 1}.) For instance, given the dependence vectors d_1 = (1, 8), d_2 = (2, 5), d_3 = (3, 3), d_4 = (6, 2) and d_5 = (8, 1), consider two index spaces: in the first the terminal point is U_1 = (75, 90) (see Figure 3(a)), and in the second U_2 = (105, 90) (see Figure 3(b)). A small sketch of this construction is given below.
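To make the last step concrete, here is a small sketch (ours, under the example values above; the QuickHull call itself is omitted) of how the coefficients of the optimal hyperplane family a_1 x_1 + a_2 x_2 = k are read off a 2-D hull edge once QuickHull has identified the edge facing the terminal point:

/* hull_edge_sketch.c: derive hyperplane coefficients from a 2-D hull edge
 * (illustrative; the convex hull computation itself is omitted).          */
#include <stdio.h>

/* hyperplane coefficients from the hull edge through points p and q */
static void edge_to_hyperplane(const int p[2], const int q[2], int a[2])
{
    a[0] = p[1] - q[1];      /* normal of the segment p-q ...              */
    a[1] = q[0] - p[0];      /* ... oriented to have non-negative entries  */
}

int main(void)
{
    const int d2[2] = {2, 5}, d3[2] = {3, 3}, d5[2] = {8, 1};
    int a[2];
    edge_to_hyperplane(d2, d3, a);                /* hull edge facing U_1  */
    printf("%d*x1 + %d*x2 = k\n", a[0], a[1]);    /* prints 2*x1 + 1*x2 = k */
    edge_to_hyperplane(d3, d5, a);                /* hull edge facing U_2  */
    printf("%d*x1 + %d*x2 = k\n", a[0], a[1]);    /* prints 2*x1 + 5*x2 = k */
    return 0;
}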

As shown in Figure 3, the convex hulls are the index points contained by the polygons A, B, C, E, U_1 and A, B, C, E, U_2, respectively. The optimal hyperplane in the first case is defined by the dependence vectors (d_2, d_3), whereas in the second case it is defined by (d_3, d_5). Hence, the equation of the optimal family of hyperplanes in the first case is 2x_1 + x_2 = k, k ∈ N, and in the second case 2x_1 + 5x_2 = k, k ∈ N. Scheduling the loop nest of Figure 3(a) along the hyperplane 2x_1 + x_2 = k leads to a parallel time of ⌈240/9⌉ + 1 = 28 time steps, assuming the initial point of the index space is L_1 = (0, 0). The algorithm used to compute the convex hulls is the well-known QuickHull algorithm [19].

Figure 3. Optimal hyperplane for two different index spaces

3 Lexicographic Ordering on Hyperplanes

The central issue in dynamic scheduling is finding an adaptive rule that instructs every processor what to do at run time, rather than specifying it explicitly at compile time. The adaptive rule is based on the ability to determine the next-to-be-executed point, or the required already-executed point, for any loop instance. In order to efficiently define such a rule, a partial ordering of the index points is necessary. This section describes the lexicographic ordering on hyperplanes, by which the index space is traversed lexicographically along hyperplanes, yielding a zigzag traversal in a 2D index space, or a spiral traversal in a 3D (or higher-dimensional) index space. To achieve this, the concepts of successor and predecessor of a point are introduced. The successor provides the means by which each processor can determine which iteration point to execute next, as well as a fast and reliable way for the currently running processor to determine the processors requiring the locally computed data. The converse holds for the predecessor.

All index points on a hyperplane can be lexicographically ordered. Therefore, given a family of hyperplanes of the general form

Π_k : a_1 x_1 + ⋯ + a_n x_n = k,   (2)

where a_i, x_i, k ∈ N, there exists a lexicographically minimum and a lexicographically maximum index point on every hyperplane that contains index points. (It is possible for the intersection of the index space with a particular hyperplane to be the empty set, in which case this hyperplane has no minimum and no maximum index point.) The algorithms that produce the minimum and maximum points of hyperplane Π_k, denoted min_{k,n} and max_{k,n}, are given in the Appendix. Both min_{k,n} and max_{k,n} depend on the current hyperplane's value k and the number of coordinates n. Let i and j be two index points that belong to the same hyperplane Π_k. j is the successor of i, denoted j = Succ(i), if i < j (i is lexicographically smaller than j) and for no other index point j′ of the same hyperplane does it hold that i < j′ < j. In the special case where i is the maximum index point of Π_k, Succ(i) is the minimum index point of Π_{k+1}. In a similar fashion, i is the predecessor of j, denoted i = Pred(j), if j is the successor of i. Finally, we define:

Succ^r(j) = j, if r = 0; Succ(j), if r = 1; Succ(Succ^{r−1}(j)), if r > 1.
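As a worked instance of these definitions (our own example, on the hyperplane family of Section 2.2): the index points of Π_9 : 2x_1 + x_2 = 9 with non-negative coordinates are (0, 9), (1, 7), (2, 5), (3, 3) and (4, 1), so min_{9,2} = (0, 9) and max_{9,2} = (4, 1). Consequently, Succ((1, 7)) = (2, 5), and Succ((4, 1)) = min_{10,2} = (0, 10), the minimum point of Π_10.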

Continuing the example of the previous section, we depict in Figure 4 the minimum and the maximum points of the hyperplanes 2x_1 + x_2 = 9 (Figure 4(a)) and 2x_1 + 5x_2 = 21 (Figure 4(b)).

Figure 4. Minimum and maximum points on hyperplanes

In this paper we advocate the use of successor index points along the optimal hyperplane as an adaptive dynamic rule. This rule is efficient in the sense that the overhead induced by its computation is negligible. Moreover, with regard to distributed memory platforms, it does not incur any additional communication cost. The index space is traversed hyperplane by hyperplane, and each hyperplane lexicographically, yielding a zigzag/spiral traversal. This ensures the rule's validity and optimality: the former because the points of any subsequent hyperplane depend (eventually) on the points of the current hyperplane (and perhaps on points of previous hyperplanes), and the latter because, by following the optimal hyperplane, all inherent parallelism is exploited.

4 Overview of CRONUS/1

This section gives an overview of the tool and details the SDS method. CRONUS/1 is an existing semi-automatic parallelization tool; a detailed description is contained in the Cronus User's Guide, which can be found at cflorina/research/ongoing research.html. At this site one can also find the generated parallel code for the FSBM example given in the following section. The organization of CRONUS/1 is given in Figure 5. In the first stage (User Input), the user inputs a serial program. The next stage (Compile Time) starts with the loop nest detection phase (Parallelism Detection); if no loop nest can be found in the sequential program, the tool stops. During the second phase (Parameters Extraction), the program is parsed and the following essential parameters are extracted: the depth of the loop nest (n), the size of the (iteration) index space (|J|) and the set of dependence vectors (DS). Once these crucial parameters are available, the tool calls the QuickHull algorithm, which returns the optimal hyperplane. At this point, the available number of processors (NP) is required as input. With the help of an automatic code generator, the appropriate parallel code is generated (in the Automatic Code Generation phase) for the given number of processors. This is achieved with the help of a Perl script that operates on a configuration file, which contains all the required information. Because we tackle complex loop nests, the implementation of a general parser for such loops is beyond the scope of our research; hence, the user must manually define the loop body and the index space boundaries. In the configuration file the user must also define a startup function in C (automatically called by the generated parallel code) that performs data initialization on every processor, right before the actual parallel computation starts. The parallel code is written in C and contains run-time routines for SDS and MPI primitives for data communication (if communication is necessary at all); it is eligible for compilation and execution on the multicomputer at hand (in the Run Time stage).

Figure 5. Organization of CRONUS/1

4.1 Successive Dynamic Scheduling (SDS)

The scheduling method introduced in this paper was developed under the following assumptions: the multiprocessor system is uniform/homogeneous (i.e., the processors are identical) and non-preemptive (a processor completes its current task before executing a new one).
The most prominent features of SDS are:

- It is a dynamic scheduling strategy: both scheduling and task execution are performed at runtime, based on the availability of processors and on the release of iteration points from their dependencies (if any exist).
- It is a distributed scheduling strategy: the scheduling task and the scheduling information are distributed among the processors and their memories.
- It is a self-scheduling technique: an idle processor determines its next iteration of the loop nest by incrementing the loop indices in a synchronized way.

The scheduling policy is the following: assuming there are NP available processors (P_1, ..., P_NP), P_1 executes the initial index point L, P_2 executes Succ(L), P_3 executes Succ^2(L), and so on, until all processors are employed in execution for the first time. Upon completion of L, P_1 executes the next ready-to-be-executed point, found by skipping NP points in the index space (in the zigzag/spiral manner described in Section 3); the coordinates of this point are obtained by applying the Succ function NP times to the point currently executed by P_1, i.e., Succ^NP(L). Similarly, upon completion of its current point (call it j), P_2 executes the point given by Succ^NP(j), and so on, until all index points are exhausted. SDS ends when the terminal point U has been executed. Thus, SDS uses a perfect load distribution policy, because it assigns to a processor only one loop iteration at a time; in other words, iterations are assigned to processors in a round-robin fashion, which achieves this perfect load balancing. A minimal, self-contained sketch of this policy is given below.
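The following sketch is our illustration of SDS and the successor rule, not the CRONUS/1 source: hmin() computes the lexicographic minimum of a hyperplane (restricted to a suffix of the coordinates) and succ() the successor of an index point, in the spirit of the Appendix, for the running example's hyperplane family 2x_1 + x_2 = k; main() simulates the round-robin assignment of the first few points to NP = 4 processors (an arbitrary choice). Index-space upper bounds and all MPI communication are omitted for brevity.

/* sds_sketch.c: an illustrative sketch of SDS (not the CRONUS/1 source). */
#include <stdio.h>

#define N 2                       /* depth of the loop nest               */
static const int a[N] = {2, 1};   /* optimal hyperplane: 2*x1 + x2 = k    */

/* Fill out[s..N-1] with the lexicographically minimum non-negative point
 * satisfying a[s]*x[s] + ... + a[N-1]*x[N-1] = k; return 1 on success.   */
static int hmin(int s, int k, int out[N])
{
    if (s == N - 1) {
        if (k >= 0 && k % a[s] == 0) { out[s] = k / a[s]; return 1; }
        return 0;
    }
    for (int v = 0; v * a[s] <= k; v++) {  /* smallest leading coordinate */
        out[s] = v;
        if (hmin(s + 1, k - v * a[s], out)) return 1;
    }
    return 0;
}

/* Successor of j: next lexicographic point on the same hyperplane, or the
 * minimum point of the next hyperplane (every hyperplane is non-empty
 * here, since one coefficient of a[] is 1). Returns 1 on success.        */
static int succ(const int j[N], int out[N])
{
    int k = 0;
    for (int i = 0; i < N; i++) k += a[i] * j[i];
    for (int l = N - 2; l >= 0; l--) {
        int tail = 0;                       /* weight carried by x[l+1..] */
        for (int i = l + 1; i < N; i++) tail += a[i] * j[i];
        for (int inc = 1; a[l] * inc <= tail; inc++) {
            for (int i = 0; i < l; i++) out[i] = j[i];
            out[l] = j[l] + inc;            /* raise coordinate l ...     */
            if (hmin(l + 1, tail - a[l] * inc, out)) return 1; /* fix tail */
        }
    }
    return hmin(0, k + 1, out);             /* jump to hyperplane k + 1   */
}

int main(void)
{
    enum { NP = 4 };                        /* available processors       */
    int pt[N] = {0, 0};                     /* initial point L            */
    for (int step = 0; step < 12; step++) { /* first few SDS assignments  */
        printf("P%d executes (%d, %d)\n", step % NP, pt[0], pt[1]);
        int nxt[N];
        if (!succ(pt, nxt)) break;
        for (int i = 0; i < N; i++) pt[i] = nxt[i];
    }
    return 0;
}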

5 Experimental Validation

CRONUS/1 was coded in C, except for the Automatic Code Generator, which is written in Perl. The parallel code produced by CRONUS/1 uses point-to-point, synchronous send and receive MPI calls when required. The experiments were conducted on a cluster of 16 identical 500MHz Pentium III nodes; each node has 256MB of RAM and a 10GB hard drive, and runs Linux. We used MPI (MPICH) to run the experiments over the FastEthernet interconnection network.

5.1 Case Study: the FSBM ME Algorithm

Block motion estimation in video coding standards such as MPEG-1, 2, 4 and H.261 is perhaps one of the most computation-intensive multimedia operations, and hence also one of the most frequently implemented algorithms. The block matching algorithm is an essential element in video compression, used to remove the temporal redundancy among adjacent frames. The motion-compensated frame is reconstructed from motion-estimated blocks of pixels. Every pixel in each block is assumed to be displaced by the same 2D displacement, called the motion vector, which is obtained with the block ME algorithm. The Full-Search Block-Matching Motion Estimation algorithm (FSBM ME) [21] is a block matching method in which every pixel in the search area is tested in order to find the best matching block. Therefore, this algorithm offers the best match, at an extremely high computational cost. Assuming a current video frame is divided into N_h × N_v blocks in the horizontal and vertical directions, respectively, with each block containing N × N pixels, the most popular similarity criterion is the mean absolute distortion (MAD), defined as

MAD(m, n) = (1 / N²) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} |x(i, j) − y(i + m, j + n)|   (3)

where x(i, j) and y(i + m, j + n) are pixels of the current and previous frames, respectively. The motion vector (MV) corresponding to the minimum MAD within the search area is given by

MV = arg min{MAD(m, n)},  −p ≤ m, n ≤ p,   (4)

where p is the search range parameter. The algorithm focuses on the situation where the search area is a region in the reference frame consisting of (2p + 1)² pixels. In FSBM, the MAD differences between the current block and all (2p + 1)² candidate blocks are computed, and the displacement that yields the minimum MAD among these (2p + 1)² positions is chosen as the motion vector of the present block. For the entire video frame, this highly regular FSBM can be described as a six-level nested loop algorithm, as shown in Fig. 2. As can be seen from the figure, the general loop nest is designated by the two outer loops, whereas the four inner loops represent the loop body. This algorithm does not have any loop-carried dependencies involving the two outer loop indices, i.e., the iterations of the FSBM loop body are completely independent of each other. This makes the FSBM ME algorithm a special case study for CRONUS/1, precisely because it has no loop-carried dependencies. We have also tested the algorithm with artificial dependencies involving the two outer loops, and the results are comparable with the ones presented in Section 5.2.
We did not present these results here due to space limitations. However, CRONUS/1 performs very well for FSBM as it is (as can be seen later in this section), thus preserving the advantage of the perfect load balancing provided by SDS.

5.2 Testing with FSBM

In this section we present the performance of CRONUS/1 for our case study, the FSBM algorithm. The parallel execution time is the sum of the communication time, the busy time (SDS overhead plus loop body computation) and the idle time (the time spent by a processor waiting for data to become available, or for points to become eligible for execution). The obtained speedup (defined as the sequential execution time over the parallel execution time) is reported in comparison with the ideal speedup for different numbers of processors. The efficiency (measured in percent) is defined as the speedup over the number of processors used to achieve that speedup, i.e., efficiency = speedup / #processors; the optimal efficiency is considered 100%. We experimented on the FSBM algorithm with different frame sizes and search ranges, and produced the serial and parallel code. Our results prove to be very close to the ideal speedup. The results in Fig. 6 were produced for a frame of 1024 × 768 pixels and a search range of 10; the results in Fig. 7 are for a 1280 × 1024 frame and a search range of 15; finally, Fig. 8 shows the performance of our tool for frames of 1600 × 1024 pixels and a search range of 15. The block size was the same in all three experiments. For all FSBM tests, CRONUS/1 used the available 16-node cluster. This case study does not exhibit loop-carried dependencies (with respect to the two outer loops); we nevertheless chose to present the FSBM algorithm because of its high practical importance in video coding and its extremely high computational cost (due to its complex loop body). A serial sketch of this loop nest is given below.
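As an illustration (our sketch, not the CRONUS/1-generated code), the six-level FSBM loop nest of Fig. 2 can be written as follows; the frame dimensions W and H, the block size B and the search range P are illustrative placeholders, and the sum of absolute differences (SAD) is used for the comparison, since it ranks candidates identically to the MAD of equation (3) (MAD = SAD / N²):

/* fsbm_sketch.c: illustrative serial form of the six-level FSBM loop nest
 * of Fig. 2. The two outer loops over blocks form the GL index space; the
 * four inner loops are the loop body.                                     */
#include <limits.h>
#include <stdlib.h>

#define W 64   /* frame width  (illustrative) */
#define H 64   /* frame height (illustrative) */
#define B 8    /* block size                  */
#define P 4    /* search range                */

static unsigned char cur[H][W], ref[H][W];          /* current/reference  */
static int mv_y[H / B][W / B], mv_x[H / B][W / B];  /* motion vectors     */

void fsbm(void)
{
    for (int bv = 0; bv < H / B; bv++)          /* loops 1-2: the blocks,  */
        for (int bh = 0; bh < W / B; bh++) {    /* i.e., the GL index space */
            long best = LONG_MAX;
            for (int m = -P; m <= P; m++)       /* loops 3-4: the (2p+1)^2 */
                for (int n = -P; n <= P; n++) { /* candidate displacements */
                    /* skip candidates outside the reference frame */
                    if (bv * B + m < 0 || bv * B + m + B > H ||
                        bh * B + n < 0 || bh * B + n + B > W)
                        continue;
                    long sad = 0;               /* loops 5-6: distortion   */
                    for (int i = 0; i < B; i++)
                        for (int j = 0; j < B; j++)
                            sad += labs((long)cur[bv * B + i][bh * B + j]
                                        - ref[bv * B + i + m][bh * B + j + n]);
                    if (sad < best) {           /* keep the best candidate */
                        best = sad;
                        mv_y[bv][bh] = m;
                        mv_x[bv][bh] = n;
                    }
                }
        }
}

int main(void) { fsbm(); return 0; }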

7 Figure. Speedup comparison for 1024 x 7 frame size In this section we present the performance of CRONUS/1 for a suite of randomly generated GLs with uniform loop carried dependencies. The parallel execution time is taken as above (the sum of communication time, busy time: SDS overhead + loop body computation, and idle time: the time spend by a processor on waiting for data to become available, or points to become eligible for execution). The obtained speedup is defined as above, and is reported in comparison with the ideal speedup for different number of processors. The efficiency (measured in percents) is defined as the speedup over the number of processors used to achieve Speedup # of processors, the respective speedup, i.e. efficiency = and the optimal efficiency is considered 100%. The prerequisite for parallelizing to obtain most favorable speedups is that the problem size must be sufficiently large. The two sets of randomly generated GLs have index space sizes of and 00 00, respectively, 2- uniform dependence vectors, with sizes of 1-. The results for these two sets are given in Fig. 9 and 10. The experimental results validate the presented theory and corroborate the efficiency of the generated parallel code. Figure 7. Speedup comparison for 120 x 1024 frame size Figure 9. Speedup comparison for random examples with 00 x 00 index space size Figure. Speedup comparison for 100 x 1024 frame size 5.3 Experiments on random examples Figure 10. Speedup comparison for random examples with 00 x 00 index space size Conclusion In this paper we propose a dynamic scheduling policy and present a tool that automatically generates parallel code for GLs based on this theory. Our philosophy is that simplicity and efficiency are the key factors for minimizing the

runtime of the parallel program. In our approach the compilation time is kept to a minimum, because the index space is not traversed at compile time and the issue of what iteration to compute next is solved at runtime. We have decided to make this trade-off because the successor concept is very simple and efficient, i.e., it does not incur a significant penalty, especially for heavy loop bodies. We strongly believe that an efficient parallelizing tool should strive to minimize the sum of compilation and run time relative to the cost of the original sequential algorithm. Further work will focus on porting the tool onto different interconnection networks, such as SCI and Myrinet, both intended to reduce the network latency of the currently used FastEthernet, and on reducing the communication costs.

7 Appendix

7.1 Defining the successor on hyperplanes

Consider the hyperplane Π_k : a_1 x_1 + ⋯ + a_n x_n = k, where a_i, x_i, k ∈ N. min_{k,n} and max_{k,n} can be defined recursively as follows:

min_{k,1} = max_{k,1} = (k / a_1), if a_1 divides k; undefined, otherwise.

for (i = ⌊k / a_{n+1}⌋; i >= 0; i--) {
    x_{n+1} = i;
    if (min_{(k − a_{n+1}·i), n} is defined)
        return (min_{(k − a_{n+1}·i), n}, x_{n+1});
}

Figure 11. Computing min_{k,n+1} recursively

Now let j = (j_1, ..., j_n) be an index point of hyperplane Π_k. Succ(j) is computed as follows:

for (l = n − 1; l >= 1; l--) {
    p = a_{l+1}·j_{l+1} + ⋯ + a_n·j_n;
    for (i = 1; i <= ⌊p / a_l⌋; i++) {
        q = p − a_l·i;
        if (min_{q, n−l} is defined)
            return (j_1, ..., j_{l−1}, j_l + i, min_{q, n−l});
    }
}

Figure 12. Computing the successor

7.2 Implementation and availability

The tool described here is available by request from the authors. More information and other related papers can be found at cflorina/research/ongoing research.html.

References

[1] L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83–93, February 1974.
[2] A. Darte, L. Khachiyan, and Y. Robert. Linear scheduling is nearly optimal. Parallel Processing Letters, 1(2):73–81, 1991.
[3] G. Papakonstantinou, T. Andronikos, and I. Drositis. On the parallelization of UET/UET-UCT loops. NPSC Journal on Computing.
[4] D. I. Moldovan and J. Fortes. Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Transactions on Computers, C-35(1):1–11, 1986.
[5] W. Shang and J.A.B. Fortes. Time optimal linear schedules for algorithms with uniform dependencies. IEEE Transactions on Computers, 40(6):723–742, 1991.
[6] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 319–329, January 1988.
[7] Jingling Xue. On tiling as a loop transformation. Parallel Processing Letters, 7(4):409–424, 1997.
[8] G. Goumas, M. Athanasaki, and N. Koziris. Automatic code generation for executing tiled nested loops onto parallel architectures. In Proceedings of the 2002 ACM Symposium on Applied Computing, pages 876–881. ACM Press, 2002.
[9] G. Goumas, A. Sotiropoulos, and N. Koziris. Minimizing completion time for loop tiling with computation and communication overlapping. In 15th International Parallel and Distributed Processing Symposium, California, April 2001. IEEE Press.
[10] J. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10:384–393, 1975.
[11] M. R. Garey and D. S. Johnson. Computers and Intractability: a Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.
[12] W. Shang and J.A.B. Fortes. Independent partitioning of algorithms with uniform dependencies. IEEE Transactions on Computers, 41(2):190–206, 1992.
[13] A.D. Kshemkalyani and M. Singhal. Communication patterns in distributed computations. Journal of Parallel and Distributed Computing, 62, 2002.
[14] P. Feautrier. Automatic distribution of data and computations. Technical Report 2000/3, March 2000.

[15] D.W. Engels, J. Feldman, D.R. Karger, and M. Ruhl. Parallel processor scheduling with delay constraints. In 12th Annual ACM-SIAM Symposium on Discrete Algorithms, New York, NY, USA, 2001. ACM Press.
[16] T. Yang and A. Gerasoulis. PYRROS: Static task scheduling and code generation for message passing multiprocessors. In Proceedings of the 1992 ACM International Conference on Supercomputing, Washington, DC, 1992.
[17] Pierre-Yves Calland, Jack Dongarra, and Yves Robert. Tiling on systems with communication/computation overlap. Concurrency: Practice and Experience, 11(3):139–153, 1999.
[18] Pierre Boulet, Jack Dongarra, Yves Robert, and Frédéric Vivien. Static tiling for heterogeneous computing platforms. Parallel Computing, 25(5):547–568, 1999.
[19] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4):469–483, 1996.
[20] I. Drositis, T. Andronikos, M. Kalathas, G. Papakonstantinou, and N. Koziris. Optimal loop parallelization in n-dimensional index spaces.
[21] H. Yeo and Yu Hen Hu. A novel modular systolic array architecture for full-search block matching motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 5(5):407–416, October 1995.


More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Crossing Families. Abstract

Crossing Families. Abstract Crossing Families Boris Aronov 1, Paul Erdős 2, Wayne Goddard 3, Daniel J. Kleitman 3, Michael Klugerman 3, János Pach 2,4, Leonard J. Schulman 3 Abstract Given a set of points in the plane, a crossing

More information

Parallel Algorithm Design. Parallel Algorithm Design p. 1

Parallel Algorithm Design. Parallel Algorithm Design p. 1 Parallel Algorithm Design Parallel Algorithm Design p. 1 Overview Chapter 3 from Michael J. Quinn, Parallel Programming in C with MPI and OpenMP Another resource: http://www.mcs.anl.gov/ itf/dbpp/text/node14.html

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering

An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering T. Ropinski, F. Steinicke, K. Hinrichs Institut für Informatik, Westfälische Wilhelms-Universität Münster

More information

Structural Advantages for Ant Colony Optimisation Inherent in Permutation Scheduling Problems

Structural Advantages for Ant Colony Optimisation Inherent in Permutation Scheduling Problems Structural Advantages for Ant Colony Optimisation Inherent in Permutation Scheduling Problems James Montgomery No Institute Given Abstract. When using a constructive search algorithm, solutions to scheduling

More information

Scheduling on clusters and grids

Scheduling on clusters and grids Some basics on scheduling theory Grégory Mounié, Yves Robert et Denis Trystram ID-IMAG 6 mars 2006 Some basics on scheduling theory 1 Some basics on scheduling theory Notations and Definitions List scheduling

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Parallel Job Scheduling

Parallel Job Scheduling Parallel Job Scheduling Lectured by: Nguyễn Đức Thái Prepared by: Thoại Nam -1- Scheduling on UMA Multiprocessors Schedule: allocation of tasks to processors Dynamic scheduling A single queue of ready

More information

Flash Drive Emulation

Flash Drive Emulation Flash Drive Emulation Eric Aderhold & Blayne Field aderhold@cs.wisc.edu & bfield@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison Abstract Flash drives are becoming increasingly

More information

On Covering a Graph Optimally with Induced Subgraphs

On Covering a Graph Optimally with Induced Subgraphs On Covering a Graph Optimally with Induced Subgraphs Shripad Thite April 1, 006 Abstract We consider the problem of covering a graph with a given number of induced subgraphs so that the maximum number

More information

Optimized energy aware scheduling to minimize makespan in distributed systems.

Optimized energy aware scheduling to minimize makespan in distributed systems. Biomedical Research 2017; 28 (7): 2877-2883 ISSN 0970-938X www.biomedres.info Optimized aware scheduling to minimize makespan in distributed systems. Rajkumar K 1*, Swaminathan P 2 1 Department of Computer

More information

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems Abstract Reconfigurable hardware can be used to build a multitasking system where tasks are assigned to HW resources at run-time

More information

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation Chapter 10 Basic Video Compression Techniques 10.1 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Toward Optimal Pixel Decimation Patterns for Block Matching in Motion Estimation

Toward Optimal Pixel Decimation Patterns for Block Matching in Motion Estimation th International Conference on Advanced Computing and Communications Toward Optimal Pixel Decimation Patterns for Block Matching in Motion Estimation Avishek Saha Department of Computer Science and Engineering,

More information

Parallel Combinatorial Search on Computer Cluster: Sam Loyd s Puzzle

Parallel Combinatorial Search on Computer Cluster: Sam Loyd s Puzzle Parallel Combinatorial Search on Computer Cluster: Sam Loyd s Puzzle Plamenka Borovska Abstract: The paper investigates the efficiency of parallel branch-and-bound search on multicomputer cluster for the

More information

Decoupled Software Pipelining in LLVM

Decoupled Software Pipelining in LLVM Decoupled Software Pipelining in LLVM 15-745 Final Project Fuyao Zhao, Mark Hahnenberg fuyaoz@cs.cmu.edu, mhahnenb@andrew.cmu.edu 1 Introduction 1.1 Problem Decoupled software pipelining [5] presents an

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

Objective. A Finite State Machine Approach to Cluster Identification Using the Hoshen-Kopelman Algorithm. Hoshen-Kopelman Algorithm

Objective. A Finite State Machine Approach to Cluster Identification Using the Hoshen-Kopelman Algorithm. Hoshen-Kopelman Algorithm Objective A Finite State Machine Approach to Cluster Identification Using the Cluster Identification Want to find and identify homogeneous patches in a D matrix, where: Cluster membership defined by adjacency

More information

Geometric Algorithms in Three Dimensions Tutorial. FSP Seminar, Strobl,

Geometric Algorithms in Three Dimensions Tutorial. FSP Seminar, Strobl, Geometric Algorithms in Three Dimensions Tutorial FSP Seminar, Strobl, 22.06.2006 Why Algorithms in Three and Higher Dimensions Which algorithms (convex hulls, triangulations etc.) can be generalized to

More information

Automatic Parallelization of Sequential C Code

Automatic Parallelization of Sequential C Code Automatic Parallelization of Sequential C Code Pete Gasper Department of Mathematics and Computer Science South Dakota School of Mines and Technology peter.gasper@gold.sdsmt.edu Caleb Herbst Department

More information

On the Minimum k-connectivity Repair in Wireless Sensor Networks

On the Minimum k-connectivity Repair in Wireless Sensor Networks On the Minimum k-connectivity epair in Wireless Sensor Networks Hisham M. Almasaeid and Ahmed E. Kamal Dept. of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011 Email:{hisham,kamal}@iastate.edu

More information

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology

More information

arxiv: v2 [cs.cc] 29 Mar 2010

arxiv: v2 [cs.cc] 29 Mar 2010 On a variant of Monotone NAE-3SAT and the Triangle-Free Cut problem. arxiv:1003.3704v2 [cs.cc] 29 Mar 2010 Peiyush Jain, Microsoft Corporation. June 28, 2018 Abstract In this paper we define a restricted

More information

Fast and Simple Algorithms for Weighted Perfect Matching

Fast and Simple Algorithms for Weighted Perfect Matching Fast and Simple Algorithms for Weighted Perfect Matching Mirjam Wattenhofer, Roger Wattenhofer {mirjam.wattenhofer,wattenhofer}@inf.ethz.ch, Department of Computer Science, ETH Zurich, Switzerland Abstract

More information