EXPERIENCES WITH FRACTILING IN N-BODY SIMULATIONS

Ioana Banicescu and Rong Lu
Mississippi State University
Department of Computer Science and NSF Engineering Research Center for Computational Field Simulation
Mailstop 9637, Mississippi State, MS

(This work is supported in part by the NSF Grant ASC and by the MSU Research Initiation Grant. The research was conducted using the resources of the Maui High Performance Computing Center, which is managed under a cooperative agreement between the United States Air Force Phillips Laboratory and the University of New Mexico.)

Keywords: parallel algorithms, dynamic scheduling, load balancing, performance evaluation, scalability

ABSTRACT

N-body simulations pose load balancing problems mainly due to the irregularity of the distribution of particles and to the different processing requirements of particles in the interior and of those near the boundary of the computation space. In the past, most methods to overcome performance degradation due to load imbalance used profiling of work from a previous time step. The overhead of these methods increases with the problem size and the number of processors. Moreover, these methods are not robust to load imbalances caused by systemic variances (data access latency and operating system interference). Recently, Fractiling, a new dynamic scheduling technique based on a probabilistic analysis, has considerably improved the performance of N-body simulations in a distributed memory shared-address space environment. This technique adapts to algorithmic as well as systemic variances. Our goal is to experimentally extend this technique and evaluate its benefits in a message passing environment. Here we present our experiences with scheduling N-body simulations with Fractiling on the IBM SP2 and the SuperMSPARC, where the parallel code execution time was improved by up to 40%.

1 INTRODUCTION

Scientific problems are often irregular, large, and computationally intensive. An interesting class of irregular scientific problems is the N-body problem. N-body simulations arise in many areas of science, ranging from astrophysics to molecular biology. Given the initial positions and velocities of N particles, the problem is to find the positions and velocities of the particles after a number of time steps. The naive sequential N-body algorithm has O(N^2) complexity per time step.
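
For concreteness, the O(N^2) cost per step comes from evaluating all pairwise interactions directly. The following is a minimal sketch of one such force-accumulation step, our own illustration with a generic softened inverse-square kernel rather than code from the paper:

    #include <math.h>

    /* One naive O(N^2) force-accumulation step: every particle i
     * receives a contribution from every other particle j.  A generic
     * softened inverse-square kernel stands in for the real physics. */
    void accumulate_forces(int n, const double (*pos)[3], const double *mass,
                           double (*force)[3])
    {
        const double eps2 = 1e-9;  /* softening to avoid division by zero */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = pos[j][0] - pos[i][0];
                double dy = pos[j][1] - pos[i][1];
                double dz = pos[j][2] - pos[i][2];
                double r2 = dx*dx + dy*dy + dz*dz + eps2;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                force[i][0] += mass[j] * dx * inv_r3;
                force[i][1] += mass[j] * dy * inv_r3;
                force[i][2] += mass[j] * dz * inv_r3;
            }
    }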

Recently, approximation algorithms of O(N log N) and O(N) complexity that compute the interactions between particles within a specified accuracy have been developed (Barnes and Hut 1986; Appel 1985; Greengard 1987; Anderson 1992). N-body algorithms are amenable to parallel execution, since the calculation of forces on each particle during a time step is, for the most part, independent (Greengard and Rokhlin 1987; Leathrum Jr. 1992; Hu and Johnsson 1996; Lu and Okunbor 1997).

Performance gains from parallel execution of N-body simulations are difficult to obtain due to load imbalance. Imbalance can be caused by the irregularity of the distribution of particles and by the different processing requirements of particles in the interior and near the boundary of the computation space. In addition, the distribution of particles varies at each time step. Various methods have previously been employed to balance processor loads and to exploit locality for the next time step. Most of them use profiling, gathering information on the work load from a previous time step in order to estimate the optimal work load distribution at the present time step. The cost of these methods increases with the number of processors and particles (Singh 1993; Warren and Salmon 1993). Moreover, these methods employ a static assignment of work load to processors during a time step, based on the assumption that the distribution of particles changes slowly between time steps. These assumptions are not valid across the entire spectrum of applications that use N-body simulations. Processor load imbalances are induced not only by application features, such as irregular data and conditional statements, but also by system effects, such as data access latency and operating system interference. Adapting to system-induced load imbalances requires dynamic work assignment. Dynamic scheduling schemes attempt to maintain balanced loads by assigning work to idle processors at run time, during a time step. Thus, they accommodate systemic as well as algorithmic variances. There is a tension between exploiting data locality and balancing loads dynamically during a time step, as the re-assignment of work may necessitate access to remote data. In general, the cost of dynamic schemes is overhead and, potentially, loss of locality.

An effective combined scheduling technique that balances processor loads and maintains locality, by exploiting the self-similarity properties of fractals, is Fractiling (Hummel, Schonberg, and Flynn 1992; Banicescu and Hummel 1995a). Fractiling is based on a probabilistic analysis. It thus accommodates load imbalances caused by predictable events (such as irregular data) as well as unpredictable events (such as data access latency). Fractiling adapts to algorithmic and system-induced load imbalances while maximizing data locality. In Fractiling, work and the corresponding data are initially placed on processors in tiles, to maximize locality. Processors that finish early "borrow" decreasing-size subtiles of work units from slower processors to balance loads. The sizes of these subtiles are chosen so that they have a high probability of finishing before the optimal time. The subtile assignments are computed efficiently by exploiting the self-similarity property of fractals.

Previous work on load balancing N-body simulations with Fractiling applied the technique to a parallel implementation of Greengard's 3-d Fast Multipole Algorithm on a distributed memory shared-address space environment, the KSR-1 at the Cornell Theory Center (Banicescu 1996). This paper attempts to experimentally extend the validity and test the benefits of this technique in a message passing environment, on a SuperMSPARC at the NSF Engineering Research Center for Computational Field Simulation and on an IBM SP2. Our approach to load balancing N-body simulations in this environment is to incorporate fractiling into an N-body code that discretizes the space into an oct-tree. We compare implementations of a parallel and a fractiled N-body simulation on uniform and nonuniform distributions of particles of various sizes, using up to 64 processors. Fractiling could be applied to each level in the oct-tree. However, we choose to fractile only the leaf level, since it is computationally the most intensive and imbalanced part of the code. Experimental work confirmed that the fractiled N-body simulation code consistently improved performance for both uniform and nonuniform distributions of particles.

The next section reviews some of the common techniques for scheduling N-body simulations on parallel and distributed machines. Section 3 describes dynamic scheduling with fractiling and outlines our implementation of the fractiled N-body algorithm in a message passing environment. We discuss experimental results and draw a few conclusions in Sections 4 and 5.

2 BACKGROUND AND RELATED WORK

Previous work on load balancing the N-body problem uses information about the distribution of particles to guide the static assignment of particles to processors (Singh, Holt, Totsuka, et al. 1993; Warren and Salmon 1993; Salmon and Warren 1997; Board, Causey, Jr., et al. 1992; Board, Hakura, Elliot, et al. 1995). The assignment is recomputed after each time step as particles move over time. Some of these techniques include the orthogonal recursive bisection (ORB) and the costzones methods (Warren and Salmon 1992; Singh, Holt, Totsuka, et al. 1993). Others use a hash function to build the hashed oct-tree (HOT), which employs Morton order, a space-filling numbering scheme (Warren and Salmon 1993).
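
A Morton key of this kind is typically obtained by interleaving the bits of a cell's integer coordinates, so that nearby cells tend to receive nearby keys. The sketch below is our own illustration of this standard construction, not code from any of the cited systems; the function names and the 10-bits-per-dimension limit are assumptions:

    #include <stdint.h>

    /* Spread the low 10 bits of v so that two zero bits separate
     * consecutive bits: ...b2 b1 b0 -> ...b2 0 0 b1 0 0 b0. */
    static uint32_t spread_bits3(uint32_t v)
    {
        uint32_t r = 0;
        for (int i = 0; i < 10; i++)
            r |= ((v >> i) & 1u) << (3 * i);
        return r;
    }

    /* Morton (Z-order) key for a 3-d cell with integer coordinates
     * (x, y, z), each in [0, 1024): the bits of x, y, z are interleaved. */
    uint32_t morton3(uint32_t x, uint32_t y, uint32_t z)
    {
        return spread_bits3(x) | (spread_bits3(y) << 1) | (spread_bits3(z) << 2);
    }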

Random assignment of subtiles of a certain size to processors has also been considered to counter the load imbalance of N-body simulations (Grama, Kumar, and Sameh 1994). With random assignment, the load imbalances of individual subtiles mute each other out to some extent. Some experimentation with new scheduling schemes applied to scientific problems has been presented in (Hummel, Schonberg, and Flynn 1992; Banicescu 1996). These schemes combine static techniques that exploit data locality with dynamic techniques that improve load balancing. In these schemes, work units and their associated data are initially placed on the same processor. Each processor executes its units in decreasing-size chunks to preserve load balance. After exhausting its local work, each processor acquires decreasing-size chunks of work from other processors. These decreasing chunks are represented by multidimensional subtiles of the same shape, selected to maximize data reuse. The subtiles are combined in Morton order into larger subtiles, thus preserving the self-similarity property (see Figure 2 in Section 3). In this way, a complex history of executed subtiles does not need to be maintained. In scheduling N-body simulations with Fractiling on distributed shared-address space machines, the performance of the parallel code has shown considerable improvement.

3 SCHEDULING N-BODY SIMULATIONS WITH FRACTILING

Initially, work is divided into P tiles, and each tile is assigned to a processor. The tile shape can be any parallelepiped. In general, N-body simulations used tiles in blocked dimensions. In Fractiling, new work is acquired by processors in the form of decreasing-size subtiles, called fractiles.

The fractile sizes are chosen such that there is a high probability of their finishing before the optimal time. To simplify the allocation of decreasing-size fractiles, the size of each consecutive batch of P fractiles is half that of the previous P fractiles. This choice results from an approximation of factoring rules, which select sizes that ensure a high probability of finishing before the optimal time, based on the mean and variance of individual element execution times.

Fractiling exploits both the locality and the self-similarity properties of the Morton ordering that is used to guide the assignment of fractiles (see Figure 2). Previously, Morton ordering had been used only for rapid indexing and efficient addressing. The self-similarity property allows fractiling to be applied efficiently to multidimensional problems. In addition, when a processor needs to execute subtiles from another processor's locality, counters can be used directly as indexes into the tile to indicate the start position for the next subtile execution, since all subtiles are shuffled and linearized (see Figure 1).

Figure 1: 2-dimensional shuffled row-major numbering and its fractal.
Figure 2: Illustration of self-similarity.

The effectiveness of applying fractiling to N-body simulation codes was first revealed by experiments with non-fractiled and fractiled codes on a KSR-1 (Banicescu and Hummel 1995b). The performance of 2-d and 3-d N-body simulation codes based on the parallelization of Greengard's Fast Multipole Algorithm (FMA) (Greengard and Rokhlin 1987) was improved by as much as 53% by fractiling, on both uniform and nonuniform distributions of particles. The Greengard algorithm performs two passes over a quad-tree per time step. During the upward pass, the summary effects of particles in the subtrees are propagated up the tree, and during the downward pass, the local expansions and direct interactions are computed. Although programming this algorithm involves a large amount of work, much progress has been made towards the parallelization of the 3-d FMA in message passing environments in recent years (Lu and Okunbor 1997; Rankin 1995).
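
As a concrete reading of the batch-halving rule described above, the following sketch enumerates a fractile-size schedule: each batch consists of P fractiles, each half the size of the previous batch's, starting from half a tile (cf. the half subtile each processor first works on, described below). This is our own illustration, not the authors' scheduler; the fallback to unit-size work at the end is an assumption:

    #include <stdio.h>

    /* Print a batch-halving fractile schedule: each batch consists of P
     * fractiles, and each batch's fractile size is half of the previous
     * batch's, starting from half a tile.  Illustrative sketch only. */
    void fractile_schedule(long tile_size, int P)
    {
        long remaining = tile_size * P;   /* total work units across all tiles */
        long size = tile_size / 2;        /* first fractile: half a tile */
        int batch = 0;

        while (remaining > 0 && size > 0) {
            printf("batch %d: %d fractiles of %ld work units\n", batch, P, size);
            remaining -= (long)P * size;
            size /= 2;                    /* next batch is half as large */
            batch++;
        }
        if (remaining > 0)
            printf("final %ld units scheduled one at a time\n", remaining);
    }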

Here we present our implementation of a parallel 3-d FMA in a message passing environment using scheduling with fractiling. We implemented a straightforward parallelization of the 3-d FMA (PFMA) code and its fractiled version (Fract), and performed our experiments on both the IBM SP2 at the Maui High Performance Computing Center and the SuperMSPARC at the National Science Foundation Engineering Research Center for Computational Field Simulation at Mississippi State University.

The IBM SP2 has a scalable distributed memory architecture using the RS/6000 family of superscalar workstations and servers. An SP2 system is composed of one or more frames, each frame containing multiple processor nodes configured with various amounts of memory and adapter slots. IBM SP2 networks are bidirectional multistage interconnection networks with high performance crossbar switches that allow all processors to send messages simultaneously. Communication between nodes may cross a variety of network interfaces with, for instance, 500 nanoseconds of latency per switch hop. The SuperMSPARC is a 32-processor multicomputer consisting of eight 4-processor clusters (designed and constructed at the NSF Engineering Research Center for Computational Field Simulation). The clusters are tightly coupled Sun SPARCstation 10s with 90 MHz CPU upgrades. The clusters are connected using ATM or Myrinet networks, characterized by low latency and high bandwidth.

Our implementation using MPI is based on the initial PVM code from the Parallel Multipole Tree Library developed by the Scientific Computing Group at Duke University (Rankin 1995). We exploited Duke's experience and modified most of the communication code to improve efficiency in the MPI environment. The parallel 3-d FMA (PFMA) algorithm was implemented using common strategies for interleaving computation with communication to reduce the overhead of sharing data in a distributed multiprocessor environment. Furthermore, the use of an inverse interaction list mechanism allows processors to have a priori knowledge of which data will be needed by other processors, and to send that specific data to them without being prompted. In this way, the synchronization overhead often incurred in irregular computations is avoided.
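
The inverse interaction list idea can be sketched as follows: since the tree decomposition is known to every processor, each processor can invert the interaction lists to determine which of its own cells other processors will need, and push those cells eagerly with nonblocking sends. The sketch below is our illustration of this pattern, not the authors' code; the data layout (cells_needed_by, the message tag) is hypothetical:

    #include <mpi.h>

    #define TAG_CELL_DATA 100  /* hypothetical message tag */

    /* Eagerly push cell data to every processor whose interaction lists
     * reference cells we own.  cells_needed_by[p] holds the local cell
     * payload that processor p will need; this inversion of the
     * interaction lists is computed locally from the known tree. */
    void push_inverse_interactions(int nprocs, int myrank,
                                   double **cells_needed_by,
                                   int *count_needed_by,
                                   MPI_Request *reqs)
    {
        for (int p = 0; p < nprocs; p++) {
            if (p == myrank || count_needed_by[p] == 0)
                continue;
            /* Nonblocking send: the receiver already expects this data,
             * so no request message or synchronization is required. */
            MPI_Isend(cells_needed_by[p], count_needed_by[p], MPI_DOUBLE,
                      p, TAG_CELL_DATA, MPI_COMM_WORLD, &reqs[p]);
        }
    }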

Even when using the best decomposition strategies, the performance of N-body simulations may suffer due to load imbalances created by nonuniform distributions of particles, boundary effects, and systemic variances. Designing an efficient 3-d fractiled code was a challenging task, since in a distributed message passing environment there is a high potential for increased overhead attributed to communication and synchronization. Fractiling requires consistency of a small set of shared variables. As in the distributed memory shared-address space implementation, space is discretized into P tiles (sub-rectangles or sub-cubes), and each tile is assigned to a processor. The leaf level of the downward pass was selected to be fractiled, since it is computationally the most intensive and imbalanced step. The sparse distribution of particles was represented as a dense cell table, and auxiliary arrays contained the particles in each cell.

There are a few possible implementation methods for fractiling in a distributed message passing environment. Here we concentrate on a centralized management approach, where one processor is selected as a master to manage global variables. In this scheme, only the master has the authority to access and modify shared variables. Thus, this scheme guarantees data consistency and reduces programming complexity. However, as the number of processors increases, a bottleneck may occur as the master tries to process a larger number of requests from slaves. Other possible implementations, presently being pursued by our research group, are beyond the scope of this paper.

Master/slave communication patterns in our implementation are depicted in Figure 3. At the beginning, after dividing the computation space into P tiles, one tile per processor, each processor first works on its half subtile. When a processor finishes its subtile, it sends a FRACT_ASK message to the master. The master looks up its tables, updates the set of shared variables, and then assigns a new subtile size to the requesting processor with a FRACT_REPLY. The requesting processor receives the answer and continues to work. If the requesting processor has completed its own tile and there is work left in another processor's tile, the master assigns a subtile in that processor's tile and sends a FRACT_COMM message to that processor, indicating that it should send its data to the requesting processor. The master also sends a FRACT_REPLY to the requesting processor indicating which processor is to be helped. After receiving the message from the master, the processor being helped packs its data and sends it to the helper using FRACT_ORG_DATA. Upon completion of the work on this data, the helper processor sends a FRACT_ASK to the master to announce completion and request a new assignment. It also sends the results of the computation to the processor that owns the data (the owner). The above steps are repeated until there is no work left in any processor's tile. When assigning subtiles to slaves, the master processor always observes the following rules: (i) a processor must have completed all the work in its own tile before starting to help another processor; (ii) after completing its own tile, a processor always works on the tile with the largest available unfinished subtile. With this combination of features, fractiling improves data locality and reduces load imbalance. Furthermore, we may always choose the least loaded processor to serve as the master, in order to compensate for the management overhead incurred on the master processor.
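
The protocol above reduces to a message loop on the master. The following skeleton is our schematic rendering of that loop, not the authors' code: the message tags follow the names in the paper, but the payload layout (three integers: tile owner, start index, subtile size) and the helper next_subtile are assumptions:

    #include <mpi.h>

    /* Hypothetical tag values, named after the paper's protocol. */
    enum { FRACT_ASK = 1, FRACT_REPLY, FRACT_COMM };

    /* Hypothetical helper: applies rules (i) and (ii) and fills
     * assign = {tile owner, start index, size}; returns 0 when all
     * tiles are exhausted. */
    extern int next_subtile(int requester, int assign[3]);

    /* Schematic master loop: answer FRACT_ASK requests with new
     * subtile assignments until no work remains in any tile. */
    void master_loop(int nslaves)
    {
        int active = nslaves;
        while (active > 0) {
            MPI_Status st;
            int dummy;
            /* Wait for any slave to finish its current subtile. */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, FRACT_ASK,
                     MPI_COMM_WORLD, &st);

            int assign[3];
            if (next_subtile(st.MPI_SOURCE, assign)) {
                /* If the subtile lives in another processor's tile, tell
                 * that owner to ship its data to the requester. */
                if (assign[0] != st.MPI_SOURCE)
                    MPI_Send(&st.MPI_SOURCE, 1, MPI_INT, assign[0],
                             FRACT_COMM, MPI_COMM_WORLD);
                MPI_Send(assign, 3, MPI_INT, st.MPI_SOURCE,
                         FRACT_REPLY, MPI_COMM_WORLD);
            } else {
                int done[3] = { -1, -1, -1 };  /* sentinel: no work left */
                MPI_Send(done, 3, MPI_INT, st.MPI_SOURCE,
                         FRACT_REPLY, MPI_COMM_WORLD);
                active--;
            }
        }
    }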

Figure 3: Master/slave communication in fractiling (messages FRACT_ASK, FRACT_REPLY, FRACT_COMM, FRACT_ORG_DATA, FRACT_FIN_DATA between the master, Proc 1 and Proc 2).

4 EXPERIMENTAL RESULTS

To test the effectiveness of fractiling, we ran the non-fractiled (PFMA) and the fractiled (Fract) versions of the parallel 3-d FMA code on uniform and nonuniform distributions of particles, on both the IBM SP2 (on up to 64 processors) and the SuperMSPARC (on up to 32 processors). There was small variance in the parallel execution times between runs; thus, we ran each program only 5 times and averaged the results. There was considerable variance in processor finishing times in all PFMA runs. We expected the performance gains of the fractiled code to be reflected in more even processor finishing times within a run. The uniform and nonuniform distributions used in our experiments involved between 10K and 100K particles, resulting in 4- and 5-level oct-trees (respectively), with average densities of particles per leaf box ranging from 12 to 39. The nonuniform distribution of particles, called "corner", was created by shifting a Gaussian distribution's center into one of the space octants (see Figure 4). The performance of the fractiled (Fract) versus the nonfractiled (PFMA) code was measured in terms of execution time, speedup, percent improvement, and cost (= time x P).
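
For reference, these metrics reduce to simple formulas over the measured times; the coefficient of variation (c.o.v.) of processor finishing times reported below is the standard deviation divided by the mean. A minimal sketch, ours and purely illustrative:

    #include <math.h>

    /* cost = parallel time x number of processors. */
    double cost(double time, int P) { return time * P; }

    /* Percent improvement of the fractiled time over the nonfractiled one. */
    double percent_improvement(double t_pfma, double t_fract)
    {
        return 100.0 * (t_pfma - t_fract) / t_pfma;
    }

    /* Coefficient of variation of the P processor finishing times:
     * standard deviation divided by mean.  A lower c.o.v. indicates
     * better load balance. */
    double cov(const double *finish, int P)
    {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < P; i++) mean += finish[i];
        mean /= P;
        for (int i = 0; i < P; i++)
            var += (finish[i] - mean) * (finish[i] - mean);
        var /= P;
        return sqrt(var) / mean;
    }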

Figure 4: Nonuniform distribution (corner).

Our experiments indicated that the fractiled code was scalable on both uniform and nonuniform distributions. Detailed information on the results may be found in (Lu 1997). On uniform distributions with a high number of processors and large problem sizes, the fractiled code outperformed the nonfractiled code. For large problems on smaller numbers of processors, improvements were not substantial, due to muting effects. On nonuniform distributions, the fractiled code consistently outperformed the nonfractiled code on both the IBM SP2 and the SuperMSPARC. Figures 5 and 6 illustrate the difference in cost between the fractiled and nonfractiled codes on two nonuniform distributions, on the IBM SP2 and the SuperMSPARC.

Figure 5: Simulation on a nonuniform distribution of 100k particles on a SuperMSPARC (cost in seconds versus number of processors, pfma versus fract).

Experimental work conducted so far on a parallel 3-d Fast Multipole Algorithm shows that the performance of the parallel code was improved by up to 40% by Fractiling (see Figure 6). In addition, the implementation scales well and is work efficient on both uniform and nonuniform distributions of particles. It exhibits impressive load balancing improvements over the nonfractiled code, in terms of considerably smaller variance in processor execution times. The recorded coefficient of variation (c.o.v.) for the fractiled code was always less than 0.39, whereas that of the nonfractiled code was substantially higher.

Figure 6: Simulation on a nonuniform distribution of 50k particles on an IBM SP2 (cost in seconds versus number of processors, non-fractiling versus fractiling).

5 CONCLUSION AND FUTURE WORK

In general, schemes to load balance N-body simulations have primarily addressed algorithmic variability. Moreover, most of these schemes used profiling, gathering data structures and load information from a previous time step. However, experience has shown that these methods incur a large overhead that increases with the problem size and the number of processors. These techniques are not robust for distributions that change rapidly and unpredictably (e.g., the radiosity problem). In addition, these techniques do not address the problem of considerable variance in processor execution times due to unpredictable system interference. Recently, dynamic scheduling techniques based on a probabilistic analysis, such as Fractiling, have been used to address these concerns in N-body simulations on distributed memory shared-address space environments such as the KSR-1. Here we reported on experiments used to experimentally extend the method and evaluate the benefits of this approach in a distributed message-passing environment. In this way, Fractiling becomes a robust, competitive dynamic scheduling technique for irregular computations such as N-body simulations under both distributed memory programming paradigms: shared-address space and message passing. Our experiments with N-body simulation codes on the IBM SP2 and the SuperMSPARC, on uniform and nonuniform distributions of particles, revealed that the fractiled code consistently outperformed the nonfractiled code. Future work involves the use of a hierarchical scheme to reduce overhead when the number of processors increases.

6 ACKNOWLEDGMENTS

We thank John Board and the Scientific Computing Group at Duke University for providing us with the PVM version of the parallel 3-d FMA code, which was the basis of our earlier implementations. We are grateful to Bassem Medawar, Donna Reese and Anthony Skjellum for valuable suggestions and comments. We would like to acknowledge the Maui High Performance Computing Center and the National Science Foundation Engineering Research Center for Computational Field Simulation, where the implementations and all the experiments were conducted. The support of the National Science Foundation through the ASC grant and of Mississippi State University through the MSU Research Initiation Grant is gratefully acknowledged.

REFERENCES

Anderson, C. (1992, July). An Implementation of the Fast Multipole Method Without Multipoles. SIAM J. Sci. Stat. Comput. 13(4), 923-947.
Appel, A. W. (1985). An Efficient Program for Many-Body Simulations. SIAM J. Sci. Stat. Comput. 6.
Banicescu, I. (1996, January). Load Balancing and Data Locality in the Parallelization of the Fast Multipole Algorithm. Ph.D. thesis, Polytechnic University.
Banicescu, I. and S. F. Hummel (1995a, February). Balancing Processor Loads and Exploiting Data Locality in Irregular Computations. Technical Report RC19934, IBM.
Banicescu, I. and S. F. Hummel (1995b, December). Balancing Processor Loads and Exploiting Data Locality in N-Body Simulations. In Proceedings of the Supercomputing'95 Conference.
Barnes, J. and P. Hut (1986). A Hierarchical O(N log N) Force-Calculation Algorithm. Nature 324.
Board, J. A., J. Causey, J. F. Leathrum Jr., et al. (1992). Accelerated molecular dynamics simulation with the parallel fast multipole algorithm. Chemical Physics Letters 198, 23-34.
Board, J. A., Z. S. Hakura, W. D. Elliot, et al. (1995, February). Scalable variants of multipole-based algorithms for molecular dynamics applications. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, Philadelphia, pp. 295-300. SIAM.
Grama, A. Y., V. Kumar, and A. Sameh (1994, November). Scalable Parallel Formulations of the Barnes-Hut Method for N-Body Simulations. In Proc. of Supercomputing'94, pp. 439-448.
Greengard, L. (1987). The Rapid Evaluation of Potential Fields in Particle Systems. ACM Press.
Greengard, L. and V. Rokhlin (1987, May). A fast algorithm for particle simulations. Journal of Computational Physics 73, 325-348.
Hu, Y. and S. L. Johnsson (1996, November). A Data-Parallel Implementation of O(N) Hierarchical N-body Methods. In Supercomputing'96.
Hummel, S. F., E. Schonberg, and L. E. Flynn (1992, August). Factoring: A Practical and Robust Method for Scheduling Parallel Loops. Communications of the ACM 35(8), 90-101.
Leathrum Jr., J. F. (1992). Parallelization of the Fast Multipole Algorithm: Algorithm and Architecture Design. Ph.D. thesis, Duke University.
Lu, E. and D. I. Okunbor (1997, March). An Efficient Load Balancing Technique for Parallel FMA in Message Passing Environment. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing.
Lu, R. (1997). Parallelization of the Fast Multipole Algorithm with Fractiling in Distributed Memory Architectures. Master's thesis, Mississippi State University.
Rankin, W. T. (1995). A Distributed Implementation of the Parallel Multipole Tree Algorithm - Version. Duke University, Department of Electrical Engineering.
Salmon, J. and M. S. Warren (1997). Parallel, Out-of-core Methods for N-body Simulation. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing. SIAM.
Singh, J. (1993). Parallel Hierarchical N-body Methods and their Implications for Multiprocessors. Ph.D. thesis, Stanford University.
Singh, J., C. Holt, T. Totsuka, et al. (1993). A Parallel Adaptive Fast Multipole Algorithm. In Proc. of Supercomputing'93, pp. 54-65.
Warren, M. and J. Salmon (1992). Astrophysical N-Body Simulation using Hierarchical Tree Structures. In Proc. of Supercomputing'92.
Warren, M. and J. Salmon (1993). A Parallel Hashed Oct-Tree N-body Algorithm. In Proceedings of Supercomputing'93, pp. 12-21. IEEE Computer Society.


More information

Progress in Image Analysis and Processing III, pp , World Scientic, Singapore, AUTOMATIC INTERPRETATION OF FLOOR PLANS USING

Progress in Image Analysis and Processing III, pp , World Scientic, Singapore, AUTOMATIC INTERPRETATION OF FLOOR PLANS USING Progress in Image Analysis and Processing III, pp. 233-240, World Scientic, Singapore, 1994. 1 AUTOMATIC INTERPRETATION OF FLOOR PLANS USING SPATIAL INDEXING HANAN SAMET AYA SOFFER Computer Science Department

More information

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University Using the Holey Brick Tree for Spatial Data in General Purpose DBMSs Georgios Evangelidis Betty Salzberg College of Computer Science Northeastern University Boston, MA 02115-5096 1 Introduction There is

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Ewa Deelman and Rajive Bagrodia UCLA Computer Science Department deelman@cs.ucla.edu, rajive@cs.ucla.edu http://pcl.cs.ucla.edu

More information

MOBILE VIDEO COMMUNICATIONS IN WIRELESS ENVIRONMENTS. Jozsef Vass Shelley Zhuang Jia Yao Xinhua Zhuang. University of Missouri-Columbia

MOBILE VIDEO COMMUNICATIONS IN WIRELESS ENVIRONMENTS. Jozsef Vass Shelley Zhuang Jia Yao Xinhua Zhuang. University of Missouri-Columbia MOBILE VIDEO COMMUNICATIONS IN WIRELESS ENVIRONMENTS Jozsef Vass Shelley Zhuang Jia Yao Xinhua Zhuang Multimedia Communications and Visualization Laboratory Department of Computer Engineering & Computer

More information

PROJECTION MODELING SIMPLIFICATION MARKER EXTRACTION DECISION. Image #k Partition #k

PROJECTION MODELING SIMPLIFICATION MARKER EXTRACTION DECISION. Image #k Partition #k TEMPORAL STABILITY IN SEQUENCE SEGMENTATION USING THE WATERSHED ALGORITHM FERRAN MARQU ES Dept. of Signal Theory and Communications Universitat Politecnica de Catalunya Campus Nord - Modulo D5 C/ Gran

More information

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream Agent Roles in Snapshot Assembly Delbert Hart Dept. of Computer Science Washington University in St. Louis St. Louis, MO 63130 hart@cs.wustl.edu Eileen Kraemer Dept. of Computer Science University of Georgia

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

2 Rupert W. Ford and Michael O'Brien Parallelism can be naturally exploited at the level of rays as each ray can be calculated independently. Note, th

2 Rupert W. Ford and Michael O'Brien Parallelism can be naturally exploited at the level of rays as each ray can be calculated independently. Note, th A Load Balancing Routine for the NAG Parallel Library Rupert W. Ford 1 and Michael O'Brien 2 1 Centre for Novel Computing, Department of Computer Science, The University of Manchester, Manchester M13 9PL,

More information

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc Extensions to RTP to support Mobile Networking Kevin Brown Suresh Singh Department of Computer Science Department of Computer Science University of South Carolina Department of South Carolina Columbia,

More information

Between PVM and TreadMarks. Honghui Lu. Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. Rice University

Between PVM and TreadMarks. Honghui Lu. Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. Rice University Quantifying the Performance Dierences Between PVM and TreadMarks Honghui Lu Department of Electrical and Computer Engineering Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel Department of Computer

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

High Performance Synchronization Algorithms for. Multiprogrammed Multiprocessors. (Extended Abstract)

High Performance Synchronization Algorithms for. Multiprogrammed Multiprocessors. (Extended Abstract) High Performance Synchronization Algorithms for Multiprogrammed Multiprocessors (Extended Abstract) Robert W. Wisniewski, Leonidas Kontothanassis, and Michael L. Scott Department of Computer Science University

More information