Feedback Guided Scheduling of Nested Loops


T. L. Freeman¹, D. J. Hancock¹, J. M. Bull², and R. W. Ford¹

¹ Centre for Novel Computing, University of Manchester, Manchester, M13 9PL, U.K.
² Edinburgh Parallel Computing Centre, University of Edinburgh, Edinburgh, EH9 3JZ, U.K.

Abstract. In earlier papers ([2], [3], [6]), feedback guided loop scheduling algorithms have been shown to be very effective for certain loop scheduling problems which involve a sequential outer loop and a parallel inner loop, and for which the workload of the parallel loop changes only slowly from one execution to the next. In this paper the extension of these ideas to the case of nested parallel loops is investigated. We describe four feedback guided algorithms for scheduling nested loops and evaluate the performance of the algorithms on a set of synthetic benchmarks.

1 Introduction

Loops are a rich source of parallelism for many applications. As a result, a variety of loop scheduling algorithms have been suggested that aim to schedule the iterations of a parallel loop to the processors of a shared-memory machine in an almost optimal way. In a number of recent papers, Bull, Ford and co-workers ([2], [3], [6]) have shown that a Feedback Guided Dynamic Loop Scheduling (FGDLS) algorithm performs well for certain parallel single loops. Parallel nested loops are an even richer source of parallelism; in this paper we extend these feedback guided scheduling ideas to the more general case of nested loops, and present numerical results to evaluate the performance of the resulting algorithms.

2 Loop Scheduling

Many existing loop scheduling algorithms are based on either (variants of) guided self-scheduling (Polychronopoulos and Kuck [11], Eager and Zahorjan [5], Hummel et al. [7], Lucco [8], Tzen and Ni [13]) or (variants of) affinity scheduling (Markatos and LeBlanc [9], Subramaniam and Eager [12], Yan et al. [14]). The underlying assumption of these algorithms is that each execution of a parallel loop is independent of previous executions of the same loop and therefore has to be rescheduled from scratch. In a sequence of papers, Bull et al. [3] and Bull [2] (see also Ford et al. [6]) propose a substantially different loop scheduling algorithm, termed Feedback Guided Dynamic Loop Scheduling. The algorithm assumes that the workload changes only slowly from one execution of a loop to the next, so that observed timing information from the current execution of the loop can, and should, be used to guide the scheduling of the next execution of the same loop. In this way it is possible to limit the number of chunks into which the loop iterations are divided to be equal to the number of processors, and thereby to limit the overheads (loss of data locality, synchronisation costs, etc.) associated with guided self-scheduling and affinity scheduling algorithms.

Note that almost all scheduling algorithms are designed to deal with the case of a single loop. The application of these algorithms to nested parallel loops usually proceeds by either (a) treating only the outermost loop as parallel (see Section 4.4), or (b) coalescing the loops into a single loop (see Section 4.2). In Section 4.1 we introduce a scheduling algorithm that treats the multidimensionality of the iteration space corresponding to the nested loops directly.

3 Feedback Guided Loop Scheduling

The feedback guided algorithms described in [3] and [2] are designed to deal with the scheduling, across p processors, of a single loop of the form:

DO SEQUENTIAL J = 1, NSTEPS
  DO PARALLEL K = 1, NPOINTS
    CALL LOOP_BODY(K)

We assume that the scheduling algorithm has defined the lower and upper loop iteration bounds, $l_j^t, u_j^t \in \mathbb{N}$, $j = 1, 2, \ldots, p$, on iteration step $t$, and that the corresponding measured execution times are given by $T_j^t$, $j = 1, 2, \ldots, p$. This execution time data enables a piecewise constant approximation to the observed workload to be defined, and the iteration bounds for iteration $t+1$, $l_j^{t+1}, u_j^{t+1} \in \mathbb{N}$, $j = 1, 2, \ldots, p$, are then chosen to approximately equi-partition this piecewise constant function. This feedback guided algorithm has been shown to be particularly effective for problems where the workload of the parallel loop changes slowly from one execution to the next (see [2], [3]); this is a situation that occurs in many applications. In situations where the workload changes more rapidly, guided self-scheduling or affinity scheduling algorithms are likely to be more efficient.

In this paper we consider the extension of feedback guided loop scheduling algorithms to the case of nested parallel loops:

DO SEQUENTIAL J = 1, NSTEPS
  DO PARALLEL K1 = 1, NPOINTS1
    ...
      DO PARALLEL KM = 1, NPOINTSM
        CALL LOOP_BODY(K1,...,KM)
    ...
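Before turning to the nested case, it is worth making the one-dimensional feedback step concrete. The following is a minimal Python sketch, not the authors' implementation; all names are illustrative. The measured times $T_j^t$ define a piecewise constant workload estimate, which is then approximately equi-partitioned to give the bounds for step $t+1$.

def fgdls_repartition(bounds, times):
    # One feedback step of 1-D feedback guided scheduling (illustrative).
    # bounds: list of (l_j, u_j) inclusive iteration bounds used on step t,
    #         assumed to cover 1..NPOINTS contiguously.
    # times:  measured execution times T_j for those chunks.
    p = len(bounds)
    npoints = bounds[-1][1]
    # Piecewise constant workload estimate: cost per iteration in chunk j.
    cost = []
    for (l, u), t in zip(bounds, times):
        cost.extend([t / (u - l + 1)] * (u - l + 1))
    total = sum(cost)
    new_bounds, acc, lo, k = [], 0.0, 1, 0
    for j in range(1, p):
        target = j * total / p              # j-th equi-partition level
        while k < npoints and acc + cost[k] <= target:
            acc += cost[k]
            k += 1
        new_bounds.append((lo, k))          # chunk j covers iterations lo..k
        lo = k + 1
    new_bounds.append((lo, npoints))
    return new_bounds

# Example: four processors; the first chunk was observed to be expensive,
# so it shrinks on the next step while the cheap middle chunks grow.
print(fgdls_repartition([(1, 25), (26, 50), (51, 75), (76, 100)],
                        [4.0, 1.0, 1.0, 2.0]))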

As in the single loop case, the issues that arise in the scheduling of nested loops are load balancing and communication minimisation. In higher dimensions, the problem of minimising the amount of communication implied by the schedule assumes greater significance; there are two aspects to this problem: temporal locality and spatial locality. Temporal locality is a measure of the communication implied by the scheduling of different parts of the iteration space to different processors on successive iterations of the sequential loop. Spatial locality is a measure of the communication implied by the computation at each point of the iteration space.

4 Scheduling Nested Loops

Of the approaches described in this section, the first, described in Section 4.1, addresses the multi-dimensional problem directly by defining an explicit recursive partition of the M-dimensional iteration space. The next two approaches are based on the transformation of the nested loops into an equivalent single loop, followed by the application of the one-dimensional Feedback Guided Loop Scheduling algorithm to this single loop; of these one-dimensional approaches, the first, described in Section 4.2, is based on coalescing the nested loops into a single loop; the second, described in Section 4.3, uses a space-filling curve technique to define a one-dimensional traversal of the M-dimensional iteration space. The final approach, described in Section 4.4, is based on the application of the one-dimensional Feedback Guided Loop Scheduling algorithm to the outermost parallel loop only.

4.1 Recursive Partitioning

This algorithm generalises the algorithm described in Section 3 to multiple dimensions. We assume that the scheduling algorithm has defined a partitioning of the M-dimensional iteration space into p disjoint subregions on (sequential) iteration $t$,

$S_j^t = [l_{1,j}^t, u_{1,j}^t] \times [l_{2,j}^t, u_{2,j}^t] \times \cdots \times [l_{M,j}^t, u_{M,j}^t], \quad j = 1, 2, \ldots, p,$

so that

$\bigcup_{j=1}^{p} S_j^t = [1, \mathrm{NPOINTS1}] \times [1, \mathrm{NPOINTS2}] \times \cdots \times [1, \mathrm{NPOINTSM}]$

(i.e. so that the union of the disjoint subregions is the complete iteration space). We also assume that the corresponding measured execution times, $T_j^t$, $j = 1, 2, \ldots, p$, are known. Based on this information, an M-dimensional piecewise constant approximation $\hat{W}^t$ to the actual workload at iteration $t$ can be formed:

$\hat{W}^t(x_1, x_2, \ldots, x_M) = \dfrac{T_j^t}{(u_{1,j}^t - l_{1,j}^t + 1)(u_{2,j}^t - l_{2,j}^t + 1) \cdots (u_{M,j}^t - l_{M,j}^t + 1)}$

where $x_i \in [l_{i,j}^t, u_{i,j}^t]$, $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, p$. The recursive partitioning algorithm then defines new disjoint subregions $S_j^{t+1}$, $j = 1, 2, \ldots, p$, for scheduling the next iteration $t+1$, so that $\bigcup_{j=1}^{p} S_j^{t+1}$ is the complete iteration space and so that the piecewise constant workload function $\hat{W}^t$ is approximately equi-partitioned across these subregions. This is achieved by recursively partitioning the dimensions of the iteration space; in our current implementation, the dimensions are treated from outermost loop to innermost loop, although it would be easy to develop other, more sophisticated, strategies (to take account of loop length, for example). To give a specific example, if p is a power of 2, then the first partitioning (splitting) point is such that the piecewise constant workload function $\hat{W}^t$ is approximately bisected in its first dimension; the second partitioning points are in the second dimension and are such that the two piecewise constant workloads resulting from the first partitioning are approximately bisected in the second dimension; subsequent partitioning points further approximately bisect $\hat{W}^t$ in the other dimensions. It is also straightforward to deal with the case when p is not a power of 2; for example, if p is odd, the first partitioning point is such that the piecewise constant workload function $\hat{W}^t$ is approximately divided in the ratio $\lfloor p/2 \rfloor : \lceil p/2 \rceil$ in its first dimension ($\lfloor r \rfloor$ is the largest integer not greater than $r$ and $\lceil r \rceil$ is the smallest integer not less than $r$); subsequent partitioning points divide $\hat{W}^t$ in the appropriate ratios in the other dimensions.
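A minimal sketch of the recursive partitioning step for the two-dimensional case follows; it is illustrative Python rather than the paper's implementation, and assumes the workload estimate $\hat{W}^t$ has already been sampled into a dense array w. The splitting rule is the simple outermost-first strategy described above, dividing p in the ratio $\lfloor p/2 \rfloor : \lceil p/2 \rceil$ at each level.

import numpy as np

def recursive_partition(w, p, region=None, dim=0):
    # Partition a 2-D workload estimate w among p processors.
    # region: (r0, r1, c0, c1), half-open row/column ranges.
    # dim:    dimension split at this level (0 = outermost loop).
    if region is None:
        region = (0, w.shape[0], 0, w.shape[1])
    if p == 1:
        return [region]
    r0, r1, c0, c1 = region
    p_lo, p_hi = p // 2, p - p // 2
    # 1-D profile of the workload along the splitting dimension.
    profile = w[r0:r1, c0:c1].sum(axis=1 - dim)
    cum = np.cumsum(profile)
    # Split index that best matches the target ratio p_lo : p_hi.
    s = int(np.searchsorted(cum, cum[-1] * p_lo / p)) + 1
    s = min(max(s, 1), len(profile) - 1)    # keep both halves non-empty
    if dim == 0:
        lo, hi = (r0, r0 + s, c0, c1), (r0 + s, r1, c0, c1)
    else:
        lo, hi = (r0, r1, c0, c0 + s), (r0, r1, c0 + s, c1)
    # Alternate dimensions on the next level of the recursion.
    return (recursive_partition(w, p_lo, lo, 1 - dim) +
            recursive_partition(w, p_hi, hi, 1 - dim))

# Example: a linearly varying workload on a 16x16 space, 4 processors.
w = np.add.outer(np.arange(1.0, 17.0), np.arange(1.0, 17.0))
print(recursive_partition(w, 4))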

4.2 Loop Coalescing

A standard way of reducing a loop nest to a single loop is loop coalescing (see [1] or [11] for details). The M parallel loops can be coalesced into a single loop of length NPOINTS1*NPOINTS2*...*NPOINTSM. For example, for the case $M = 4$ the coalesced loop is given by:

DO SEQUENTIAL J = 1, NSTEPS
  DO PARALLEL K = 1, NPOINTS1*NPOINTS2*NPOINTS3*NPOINTS4
    K1 = ((K-1)/(NPOINTS4*NPOINTS3*NPOINTS2)) + 1
    K2 = MOD((K-1)/(NPOINTS4*NPOINTS3),NPOINTS2) + 1
    K3 = MOD((K-1)/NPOINTS4,NPOINTS3) + 1
    K4 = MOD(K-1,NPOINTS4) + 1
    CALL LOOP_BODY(K1,K2,K3,K4)

Having coalesced the loops, we can use any one-dimensional algorithm to schedule the resulting single loop. The numerical results of [2], [3] show that the feedback guided scheduling algorithms are very effective for problems where the workload of a (one-dimensional) parallel loop changes slowly from one execution to the next. This is the situation in our case and thus we use the one-dimensional Feedback Guided Loop Scheduling algorithm described in Section 3 to schedule the coalesced loop.

4.3 Space-filling Traversal

This algorithm is based on the use of space-filling curves to traverse the iteration space. Such curves visit each point in the space exactly once, and have the property that any point on the curve is spatially adjacent to its neighbouring points. Again, any one-dimensional scheduling algorithm can be used to partition the resulting one-parameter curve into p pieces (for the results of the next section the Feedback Guided Loop Scheduling algorithm described in Section 3 is used). This algorithm readily extends to M dimensions, although our current implementation is restricted to $M = 2$. A first-order Hilbert curve (see [4]) is used to fill the 2-dimensional iteration space, and the mapping of two-dimensional coordinates to curve distances is performed using a series of bitwise operations. One drawback of the use of such a space-filling curve is that the iteration space must have sides of equal lengths, and the length must be a power of 2.
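The paper's bit-level mapping is not reproduced here; the sketch below uses the standard iterative conversion of 2-D coordinates to a Hilbert-curve distance (in the spirit of Butz [4]), and is illustrative rather than the authors' code. Sorting the points of the iteration space by this distance yields the one-dimensional traversal that is then partitioned.

def xy_to_hilbert(n, x, y):
    # Distance of point (x, y) along the Hilbert curve filling an
    # n-by-n grid, n a power of 2, using only bitwise operations.
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is in standard position.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Traverse an 8x8 iteration space in Hilbert order; any one-dimensional
# scheduler (e.g. that of Section 3) can partition the resulting sequence
# into p contiguous, spatially compact pieces.
order = sorted(((i, j) for i in range(8) for j in range(8)),
               key=lambda pt: xy_to_hilbert(8, *pt))
print(order[:8])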

4.4 Outer Loop Parallelisation

In this algorithm, only the outermost, K1, loop is treated as a parallel loop; the inner, K2, ..., KM, loops are treated as sequential loops. The one-dimensional Feedback Guided Loop Scheduling algorithm of Section 3 is used to schedule this single parallel loop. This approach is often recommended because of its simplicity and because it exploits parallelism at the coarsest level.

5 Numerical Results and Conclusions

This section reports the results of a set of experiments that simulate the performance of the four algorithms described in Section 4 under synthetic load conditions. Simulation has been used for three reasons: for accuracy (reproducible timing results are notoriously hard to achieve on large HPC systems; see Appendix A of Mukhopadhyay [10]), so that the instrumenting of the code does not affect performance, and to allow large values of p to be tested. Statistics were recorded as each of the algorithms attempted to load-balance a number of computations with different load distributions. At the end of each iteration, the load imbalance was computed as $(T_{\max} - T_{\mathrm{mean}})/T_{\max}$, where $T_{\max}$ was the maximum execution time of any processor for the given iteration and $T_{\mathrm{mean}}$ was the corresponding mean execution time over all processors. The results presented in Figures 1-4 show average load imbalance over all iterations versus number of processors; we plot average load imbalance to show more clearly the performance of the algorithms in response to a dynamic load. The degree of data reuse between successive iterations was calculated as follows. For each point in the iteration space, a record was kept of the processor to which it was allocated in the previous iteration. For a given processor, the reuse was calculated as the percentage of points allocated to that processor that were also allocated to it in the previous iteration. As with load imbalance, the results presented in Figures 5-8 show reuse averaged over all iterations plotted against number of processors.
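Both statistics are simple functions of the per-processor times and of the map from iteration-space points to processors; the following minimal sketch (illustrative names, not the authors' instrumentation) computes them for one iteration.

import numpy as np

def load_imbalance(times):
    # (T_max - T_mean) / T_max for one iteration's per-processor times.
    t = np.asarray(times, dtype=float)
    return (t.max() - t.mean()) / t.max()

def mean_reuse(prev_alloc, alloc, p):
    # Average over processors of the percentage of points allocated to a
    # processor that were also allocated to it on the previous iteration.
    reuse = []
    for proc in range(p):
        mine = (alloc == proc)
        if mine.any():
            reuse.append(100.0 * (prev_alloc[mine] == proc).mean())
    return sum(reuse) / len(reuse)

# Example: 4 processors on an 8x8 space; shifting a 2-row-per-processor
# partition by one row between iterations gives 50% reuse.
prev_alloc = np.repeat(np.arange(4), 16).reshape(8, 8)
alloc = np.roll(prev_alloc, 1, axis=0)
print(load_imbalance([1.0, 1.2, 0.9, 1.1]), mean_reuse(prev_alloc, alloc, 4))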

Four different workload configurations have been used in the experiments:

Gauss: a Gaussian workload where the centre of the workload rotates about the centre of the iteration space. Parameters control the width of the Gaussian curve, and the radius and time period of the rotation.

Pond: a constant workload within a circle centred at the centre of the iteration space, where the radius of the circle oscillates sinusoidally; the workload outside the circle is zero. Parameters control the initial radius of the circle and the amplitude and period of the radial oscillations.

Ridge: a workload described by a ridge with Gaussian cross-section that traverses the iteration space. Parameters control the orientation of the ridge, the width of the Gaussian cross-section and the period of transit across the iteration space.

Linear: a workload that varies linearly across the iteration space, but is static.

Only the linear workload is static; the other three workloads are dynamic. The Gauss workload, for example, can be generated as in the sketch below.
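This sketch is illustrative rather than the authors' benchmark code; the function and parameter names are assumptions (width, radius and period play the roles of the parameters described above), and the other three workloads can be generated analogously.

import numpy as np

def gauss_load(n, step, width=8.0, radius=16.0, period=100):
    # Per-point cost on an n-by-n iteration space at sequential step 'step':
    # a Gaussian bump whose centre rotates about the centre of the space.
    angle = 2.0 * np.pi * step / period
    cx = n / 2.0 + radius * np.cos(angle)
    cy = n / 2.0 + radius * np.sin(angle)
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * width ** 2))

# Cost array a quarter of the way through one rotation of a 64x64 space.
w = gauss_load(64, 25)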
The computation is assumed to be executed on a square iteration space. Each of the algorithms described in Section 4 was used to partition the loads for numbers of processors varying between 2 and 256. The results displayed are for 4 iterations of the loop scheduling algorithm (for the dynamic workloads, this corresponds to 2 complete cycles of the workload period). The algorithms were also run for 1 and 2 iterations and very similar results were obtained. The four algorithms are labelled RCB (the recursive partitioning, or recursive co-ordinate bisection, algorithm), Slice (the loop coalescing algorithm), SFC (the space-filling traversal, or space-filling curve, algorithm) and Outer (outer loop parallelisation).

We first consider the load balancing performance summarised in the graphs of Figures 1-4. For two of the dynamic loads (Gauss and Pond, Figures 1 and 2 respectively), there is little to choose between the four algorithms. For the third dynamic load (Ridge, Figure 3), Slice and Outer perform better than RCB and SFC; Slice and Outer traverse the iteration space in a line-by-line fashion, and this is a very effective strategy for this particular workload. For the static load (Linear, Figure 4), Slice and SFC outperform RCB and Outer, particularly for larger numbers of processors; we note that Slice and SFC achieve essentially perfect load balance.

These results can be explained by considering the structures of the four algorithms. Slice and SFC are one-dimensional algorithms adapted to solve the two-dimensional test problems; as such they adopt a relatively fine-grain approach to loop scheduling and this should lead to good load balance properties. In contrast, RCB is a multi-dimensional algorithm and, in its initial stages (the first few partitions), it adopts a relatively coarse-grain approach to loop scheduling. For example, the first partition approximately divides the iteration space into two parts, and any imbalance implied by this partition is inherited by subsequent recursive partitions. Outer, which schedules only the outermost loop, also adopts a coarse-grain approach to the scheduling problem. Given this, the load balance properties of RCB and Outer for the dynamic workload problems are surprisingly good and are often better than those of either Slice or SFC.

We would expect the structure of RCB to be advantageous when we consider the data reuse achieved by the algorithms, and this is borne out by the graphs in Figures 5-8, which show the levels of inter-iteration data reuse achieved. Note that data reuse is a measure of the temporal locality achieved by the different scheduling algorithms. In terms of inter-iteration data reuse, RCB clearly outperforms Slice, SFC and Outer for all the dynamic load types and for nearly all numbers of processors. In particular, for larger numbers of processors RCB substantially outperforms the other algorithms.

Again, the explanation for these results lies in the structure of the algorithms; in particular, the fact that RCB treats the multi-dimensional iteration space directly is advantageous for data reuse.

In broad terms we observe that the load balance properties of the four algorithms are very similar, and it is not possible to identify one of the algorithms as significantly, and uniformly, better than the others. In contrast, RCB is substantially better than the other algorithms in terms of inter-iteration data reuse, particularly for larger numbers of processors. How these observed performance differences manifest themselves on an actual parallel machine will depend on a number of factors, including: the relative performances of computation and communication on the machine; the granularity of the computations in the parallel nested loops; the rate at which the workload changes; and the extent of the imbalance of the work across the iteration space. A comprehensive set of further experiments to undertake this evaluation of the algorithms on a parallel machine will be reported in a forthcoming paper.

Fig. 1. Load Balance for Gauss Load

Acknowledgements

The authors acknowledge the support of the EPSRC research grant GR/K8574, Feedback Guided Affinity Loop Scheduling for Multiprocessors.

References

1. D. F. Bacon, S. L. Graham and O. J. Sharp, (1994) Compiler Transformations for High-Performance Computing, ACM Computing Surveys, vol. 26, no. 4, pp. 345-420.
2. J. M. Bull, (1998) Feedback Guided Loop Scheduling: Algorithms and Experiments, Proceedings of Euro-Par '98, Lecture Notes in Computer Science, Springer-Verlag.

Fig. 2. Load Balance for Pond Load

Fig. 3. Load Balance for Ridge Load

Fig. 4. Load Balance for Linear Load

Fig. 5. Reuse for Gauss Load

Fig. 6. Reuse for Pond Load

Fig. 7. Reuse for Ridge Load

Fig. 8. Reuse for Linear Load

3. J. M. Bull, R. W. Ford and A. Dickinson, (1996) A Feedback Based Load Balance Algorithm for Physics Routines in NWP, Proceedings of the Seventh ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scientific.
4. A. R. Butz, (1971) Alternative Algorithm for Hilbert's Space-Filling Curve, IEEE Transactions on Computers, vol. C-20, pp. 424-426.
5. D. L. Eager and J. Zahorjan, (1992) Adaptive Guided Self-Scheduling, Technical Report, Department of Computer Science and Engineering, University of Washington.
6. R. W. Ford, D. F. Snelling and A. Dickinson, (1994) Controlling Load Balance, Cache Use and Vector Length in the UM, Proceedings of the Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scientific.
7. S. F. Hummel, E. Schonberg and L. E. Flynn, (1992) Factoring: A Practical and Robust Method for Scheduling Parallel Loops, Communications of the ACM, vol. 35, no. 8, pp. 90-101.
8. S. Lucco, (1992) A Dynamic Scheduling Method for Irregular Parallel Programs, Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, San Francisco, CA, June 1992.
9. E. P. Markatos and T. J. LeBlanc, (1994) Using Processor Affinity in Loop Scheduling on Shared Memory Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 4, pp. 379-400.
10. N. Mukhopadhyay, (1999) On the Effectiveness of Feedback-guided Parallelisation, PhD Thesis, Department of Computer Science, University of Manchester.
11. C. D. Polychronopoulos and D. J. Kuck, (1987) Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers, IEEE Transactions on Computers, vol. C-36, no. 12, pp. 1425-1439.
12. S. Subramaniam and D. L. Eager, (1994) Affinity Scheduling of Unbalanced Workloads, Proceedings of Supercomputing '94, IEEE Computer Society Press.
13. T. H. Tzen and L. M. Ni, (1993) Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers, IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 1, pp. 87-98.
14. Y. Yan, C. Jin and X. Zhang, (1997) Adaptively Scheduling Parallel Loops in Distributed Shared-Memory Systems, IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 1, pp. 70-81.
