Parallel Implementation of 3D FMA using MPI


Eric Jui-Lin Lu and Daniel I. Okunbor
Computer Science Department, University of Missouri - Rolla, Rolla, MO
jlu@umr.edu, okunbor@umr.edu

This work was supported in part by an NSF CCR grant. Experiments were conducted using the resources of the Cornell Theory Center.

Abstract

The simulation of N-body systems has been used extensively in biophysics and chemistry to investigate the dynamics of biomolecules and in astrophysics to study the chaotic characteristics of the galactic system. However, the long-range force calculation has a time complexity of O(N^2), where N is the number of particles in the system. The fast multipole algorithm (FMA), proposed by Greengard and Rokhlin, reduces the time complexity to O(N). Our goal is to build a parallel FMA library which is portable, scalable, and efficient. We use the Message Passing Interface as the communication back-end. An effective communication scheme that reduces communication overhead and a partitioning technique that yields good load balancing among processors were also implemented in the library.

1 Introduction

We consider the simulation of N-body systems using the fast multipole algorithm (FMA) on homogeneous and heterogeneous distributed parallel systems. The simulation of N-body systems has a wide range of applications and has been used extensively in biophysics and chemistry to investigate the dynamics of biomolecules [4] and in astrophysics to study the chaotic characteristics of the galactic system [3]. N-body simulations typically require long time integrations, and every time step during the simulation involves force calculations among particles. The execution time of the force calculation contributes ninety percent of the total simulation time during a single time step. This is due mainly to the calculation of long-range forces, such as the Coulombic and Keplerian forces, which have a time complexity of O(N^2), where N is the number of particles in the system.

Many algorithms have been proposed to reduce the time complexity to O(N^{3/2}), O(N log N), and O(N) [3, 9, 18, 2]. The fast multipole algorithm [8, 6], proposed by Greengard and Rokhlin, exploits multipole expansions and local expansions to collectively represent the interaction force between a particle within a cluster and particles that are relatively distant from this cluster. The time complexity of the fast multipole algorithm is O(N). A tremendous amount of work has been devoted to the parallelization of the fast multipole algorithm. Greengard and Gropp [7] presented a parallel version of FMA in two dimensions (2D). Board et al. [11] have done a great deal of work on the parallelization of FMA in three dimensions (3D).

Our goal is to develop a parallel FMA library which is portable, scalable, and efficient. To assure portability, the Message Passing Interface (MPI) is used as the back-end communication library of the parallel FMA library MPIFMA. Since only primitive communication functions such as point-to-point communication and broadcast are used, it is easy to port MPIFMA to other communication libraries. For efficiency, we implemented the optimum communication scheme proposed in [13]. The advantages of the optimum communication scheme are that (1) a minimum number of messages is communicated among processors and (2) computation and communication are overlapped as much as possible. Our preliminary results are remarkable.
The force calculation for one million particles took approximately 2 minutes per time step on an IBM SP2 with 64 nodes. To assure that the workload is equally distributed among processors when the particle system is irregular, we incorporated the weighted subtrees technique proposed in [12] into the library. The weighted subtrees technique is designed for parallel FMA in a message-passing environment. The advantages of this technique are that (1) the communication among processors is simpler than with ORB [16] and costzones [17], and (2) the communication overhead in the upward phase of FMA can be totally eliminated when the roots of the subtrees are at level 2. Our results on the Intel iPSC/860 show that (1) the performance of parallel FMA with the weighted subtrees scheme is approximately 3 times faster than the conventional parallel FMA on the benchmark irregular particle system, and (2) the load is about equally distributed among processors.
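As a baseline for both cost and accuracy (this is also what the -d option of MPIFMA, described in Section 3, computes for comparison), the direct O(N^2) evaluation can be written in a few lines. The sketch below is our own minimal illustration with placeholder positions and charges, not code from PMTA or MPIFMA:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Direct O(N^2) Coulomb force evaluation:
 * F_i = sum_{j != i} q_i q_j (r_i - r_j) / |r_i - r_j|^3.  Every pair of
 * particles is visited, which is what the fast multipole algorithm avoids. */
int main(void) {
    const int N = 2000;
    double *x = malloc(3 * N * sizeof(double));   /* positions */
    double *q = malloc(N * sizeof(double));       /* charges   */
    double *f = calloc(3 * N, sizeof(double));    /* forces    */

    srand(12345);                                 /* placeholder seed */
    for (int i = 0; i < N; i++) {
        for (int d = 0; d < 3; d++)
            x[3 * i + d] = (double)rand() / RAND_MAX - 0.5;   /* [-0.5, 0.5] */
        q[i] = (i % 2) ? 1.0 : -1.0;
    }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = x[3*i]   - x[3*j];
            double dy = x[3*i+1] - x[3*j+1];
            double dz = x[3*i+2] - x[3*j+2];
            double r2 = dx*dx + dy*dy + dz*dz;
            double s  = q[i] * q[j] / (r2 * sqrt(r2));
            f[3*i]   += s * dx;
            f[3*i+1] += s * dy;
            f[3*i+2] += s * dz;
        }

    printf("force on particle 0: (%g, %g, %g)\n", f[0], f[1], f[2]);
    free(x); free(q); free(f);
    return 0;
}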

Figure 1. The 2D tree structure. Figure 2. The 3D tree structure.

This paper is organized as follows. In Section 2, we describe the fast multipole algorithm. For the mathematical foundation of this algorithm, refer to [8, 6, 5, 10]. Section 3 describes the implementation of MPIFMA in detail; partial results are also presented. Possible enhancements of MPIFMA and the goal of this project are listed in the last section.

2 Fast multipole algorithm

The fast multipole algorithm recursively divides a computational domain into 8 subdomains in 3D (4 subdomains in 2D) until either the specified tree level is reached or a domain contains one or no particle. A domain which is divided into subdomains is called the parent domain of the subdomains, and each subdomain is called a child domain of its parent. The 2D and 3D FMA trees of height 3 are depicted in Figures 1 and 2, respectively. The fast multipole algorithm then utilizes the tree structure to compute multipole expansions at the finest level t-1 in step 1 and to shift the multipole expansions of the child domains to form the multipole expansion of their parent domain from level t-2 to level 2 in step 2. In steps 3 and 4 (i.e., the downward phase), local expansions are then computed from (1) the multipole expansions, obtained in the upward phase (i.e., steps 1 and 2), of the boxes in the interaction list and (2) the local expansions shifted from the upper level. The following is the sequential version of the fast multipole algorithm on which our parallel algorithm is based.

Initialization: Assign the number of levels of the tree, denoted t, and the number of terms in the multipole and local expansions, denoted p. Distribute each particle into its corresponding box and construct the interaction list of each box. By Greengard's definition, the interaction list of box i at level l is the set of boxes which are children of the nearest and second nearest neighbors of i's parent and which are not nearest or second nearest neighbors of i (a small code illustration of this definition follows the step listing below).

Step 1: Compute multipole expansion coefficients (MECs) for each box at the finest level t-1. The computational domain at the l-th level of the tree has 8^l boxes.

Step 2: For each level l from t-2 down to 2, form the MECs of each box by shifting the MECs of its child boxes at level l+1.

Step 3: For each level l from 2 to t-2, (1) construct local expansion coefficients (LECs) for each box i from the MECs of the boxes in the interaction list of box i plus the LECs shifted from the parent box of box i (at level 2, the LECs from the parent level are assumed to be zero); and (2) shift the LECs of box i into its child boxes.

Step 4: For each box at the finest level, compute its LECs from the MECs of the boxes in its interaction list and add the LECs shifted from its parent box.

Step 5: Compute the far-field force for each particle by inserting the particle's information into the local expansion.
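The interaction-list definition in the initialization step above is easy to state in terms of box indices on the uniform grid of a given level. The following self-contained C sketch is our own illustration (the index ordering and helper names are placeholders, not MPIFMA code); for an interior box it produces 875 entries, and fewer near the domain boundary:

#include <stdio.h>
#include <stdlib.h>

/* Chebyshev (max-norm) distance between two box indices. */
static int cheb(int ax, int ay, int az, int bx, int by, int bz) {
    int dx = abs(ax - bx), dy = abs(ay - by), dz = abs(az - bz);
    int m = dx > dy ? dx : dy;
    return m > dz ? m : dz;
}

/* Fill list[] with the linear indices of the interaction list of box
 * (ix,iy,iz) on a level with nb boxes per dimension; returns the count.
 * A box j qualifies if its parent is a nearest or second nearest neighbor
 * of i's parent (parent Chebyshev distance <= 2) while j itself is not a
 * nearest or second nearest neighbor of i (Chebyshev distance > 2). */
static int interaction_list(int ix, int iy, int iz, int nb, int *list) {
    int px = ix / 2, py = iy / 2, pz = iz / 2;   /* parent box index */
    int n = 0;
    for (int qx = px - 2; qx <= px + 2; qx++)
    for (int qy = py - 2; qy <= py + 2; qy++)
    for (int qz = pz - 2; qz <= pz + 2; qz++) {              /* candidate parents */
        if (qx < 0 || qy < 0 || qz < 0 ||
            qx >= nb / 2 || qy >= nb / 2 || qz >= nb / 2) continue;
        for (int cx = 0; cx < 2; cx++)                       /* their 8 children */
        for (int cy = 0; cy < 2; cy++)
        for (int cz = 0; cz < 2; cz++) {
            int jx = 2 * qx + cx, jy = 2 * qy + cy, jz = 2 * qz + cz;
            if (cheb(ix, iy, iz, jx, jy, jz) <= 2) continue; /* too close to i */
            list[n++] = jx + nb * (jy + nb * jz);
        }
    }
    return n;
}

int main(void) {
    int list[1000];
    /* Level 4 of the tree has 16 boxes per dimension (8^4 boxes in total). */
    printf("interior box: %d boxes\n", interaction_list(8, 8, 8, 16, list)); /* 875 */
    printf("corner box:   %d boxes\n", interaction_list(0, 0, 0, 16, list)); /* 189 */
    return 0;
}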

Step 6: For each particle j, compute the near-field force due to the particles in the same box as j and the particles in its neighbor boxes. In 2D, the neighbor boxes of box b are the boxes which share a boundary point with box b. In 3D, the neighbor boxes of box b are the boxes which share a boundary point with box b (called nearest neighbors) and those which share a boundary point with the nearest neighbors (called second nearest neighbors).

Step 7: For each particle, sum up the near-field and far-field forces obtained in steps 5 and 6.

3 The MPIFMA library

The MPIFMA library is based on the sequential code of PMTA version 4.0 developed at Duke University and uses MPICH, developed by Argonne National Laboratory and Mississippi State University, as the back-end communication library. Our goal is to develop a parallel FMA library that is portable, scalable, and efficient. The current implementation of MPIFMA includes a sample main program, which provides the user interface and input checking, and a set of function calls that compute the long-range forces using the fast multipole algorithm. The program can be run as

mpifma -n <number of particles> <seed> [-b] [-d] -p <number of terms> -l <length of comp. domain> -t <height of FMA tree>

where mpifma is the name of the executable program, -p denotes the number of terms used in MEs and LEs, -l denotes the length of the computational domain in three dimensions, and -t denotes the height of the FMA tree. The -d argument is optional; when -d is specified, the direct force computation among all particles is also executed. This is used to measure the accuracy of the approximate forces calculated by FMA, and the force and potential errors are printed when -d is specified. The current implementation of MPIFMA randomly generates particle positions and charges based on the seed provided with -n. Reading particle information from a file can easily be incorporated into the library and will be implemented in the future. MPIFMA generates two types of particle systems. By default, mpifma generates particle positions within -0.5 and 0.5. On the other hand, if IRREGULAR is specified in the Makefile, particle positions O are randomly generated between -0.5 and 0.5 with the restriction that all coordinates of a particle are either positive (0 <= O <= 0.5) or negative (-0.5 <= O < 0). The option -b enables the weighted subtrees partitioning technique, providing better load balancing among processors when the particle system is not uniform.

Figure 3. (a) The processor mapping for 4 processors in 2D. (b) The processor mapping with ghost cells for 4 processors in 2D.

3.1 Parallel Domain Decomposition

In the initialization step of FMA, all boxes at the finest level of the FMA tree are mapped to processors. By using parallel domain decomposition, processors receive the same number of boxes at the finest level together with their ancestors at the higher levels. A box-processor mapping for 4 processors in 2D is shown in Figure 3(a). The advantages of the parallel domain decomposition are that (1) the computation load among processors is about equal when the particle system is uniform (or close to uniform) and (2) the computation of MEs in steps 1 and 2 can be done without communication among processors. The latter is due to the fact that all information required for the calculation of MEs exists locally within the processor. A small illustration of the particle binning and box-to-processor assignment is given below.
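The self-contained C sketch below bins randomly generated particles (uniform in [-0.5, 0.5], as in the default particle system above) into finest-level boxes and assigns each box to a processor through the level-2 subtree that contains it. The row-major subtree ordering and the block mapping of subtrees to processors are our own simplifications for illustration, not the actual MPIFMA mapping:

#include <stdio.h>
#include <stdlib.h>

/* Bin N random particles into the boxes of the finest level (level t-1) of a
 * 3D FMA tree of height t, and assign each box to a processor by contiguous
 * blocks of level-2 subtrees (64 subtrees in 3D).  Illustrative only. */
enum { t = 4, P = 4, N = 100000 };

int main(void) {
    const int nb = 1 << (t - 1);     /* boxes per dimension at level t-1      */
    const int shift = t - 3;         /* finest-level index -> level-2 index   */
    long owned[P];
    for (int p = 0; p < P; p++) owned[p] = 0;

    srand(12345);                    /* fixed seed (MPIFMA takes a seed with -n) */
    for (int i = 0; i < N; i++) {
        double x = (double)rand() / RAND_MAX - 0.5;   /* uniform in [-0.5, 0.5] */
        double y = (double)rand() / RAND_MAX - 0.5;
        double z = (double)rand() / RAND_MAX - 0.5;

        int ix = (int)((x + 0.5) * nb);  if (ix == nb) ix = nb - 1;
        int iy = (int)((y + 0.5) * nb);  if (iy == nb) iy = nb - 1;
        int iz = (int)((z + 0.5) * nb);  if (iz == nb) iz = nb - 1;

        /* Level-2 subtree containing this box (4 level-2 boxes per dimension). */
        int subtree = (ix >> shift) + 4 * ((iy >> shift) + 4 * (iz >> shift));
        int proc = subtree * P / 64;     /* contiguous blocks of subtrees */
        owned[proc]++;
    }

    for (int p = 0; p < P; p++)
        printf("processor %d owns %ld particles\n", p, owned[p]);
    return 0;
}

With this uniform distribution the printed per-processor counts are nearly equal. With the IRREGULAR distribution described above, only a small fraction of the level-2 subtrees contain particles, so only a few processors end up with any work; this is the problem addressed by the weighted subtrees technique in Section 3.3.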
As shown in step 6, the computation of the near-field force on particle i requires knowledge of the particles in the box containing particle i and of the particles in the neighbor boxes. In other words, processor 1 in Figure 3(a) requires the information of the shaded boxes from other processors to compute the near-field force, which results in communication among processors. By loading extra ghost boxes (the shaded boxes shown in Figure 3(b)), each processor can compute the near-field force in step 6 without communicating with other processors (a simplified illustration of this exchange is given below).
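Loading the ghost boxes amounts to a one-time halo exchange before step 6. The MPI program below is a deliberately simplified, self-contained illustration of such an exchange for a 1D row of boxes with a halo of width 2 (mirroring the nearest and second nearest neighbors used in 3D); the data layout and decomposition are placeholders, not the MPIFMA data structures:

#include <mpi.h>
#include <stdio.h>

#define LOCAL 8   /* finest-level boxes owned by each rank (1D toy example) */
#define HALO  2   /* ghost width: nearest and second nearest neighbors      */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* box[HALO .. HALO+LOCAL-1] are owned boxes; the ends are ghost boxes.
     * Here each "box" is just one double (e.g., its total charge). */
    double box[LOCAL + 2 * HALO];
    for (int i = 0; i < LOCAL + 2 * HALO; i++) box[i] = 0.0;
    for (int i = 0; i < LOCAL; i++) box[HALO + i] = rank * 100.0 + i;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my leftmost owned boxes to the left neighbor while receiving the
     * right neighbor's leftmost owned boxes into my right ghost region;
     * then the mirror exchange for the other side. */
    MPI_Sendrecv(&box[HALO], HALO, MPI_DOUBLE, left, 0,
                 &box[LOCAL + HALO], HALO, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&box[LOCAL], HALO, MPI_DOUBLE, right, 1,
                 &box[0], HALO, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Near-field work (step 6) can now read box[0 .. LOCAL+2*HALO-1] locally. */
    printf("rank %d: left ghosts %.0f %.0f, right ghosts %.0f %.0f\n",
           rank, box[0], box[1], box[LOCAL + HALO], box[LOCAL + HALO + 1]);

    MPI_Finalize();
    return 0;
}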

Figure 4. Butterfly communication scheme.

Table 1. Running time using the optimal scheme on the iPSC/860 (columns: number of processors, total run time, ME2LE time, near-field force time, communication cost, and efficiency; the numeric entries are not reproduced here).

Table 2. Running time using the broadcast scheme on the iPSC/860 (same columns as Table 1; the numeric entries are not reproduced here).

Parallel domain decomposition has a serious load-balancing problem when particles are not distributed uniformly. The weighted subtrees technique presented later is adopted to overcome this problem.

3.2 Communication and Synchronization Overhead

The conversion of MEs to LEs in steps 3 and 4 requires communication among processors, and this can be accomplished by using either broadcast or a butterfly switch. Broadcast is easy to implement but expensive. The butterfly switch communication scheme [1], shown in Figure 4, is better than broadcast and has a time complexity of O(log_2 P), where P is the number of processors. However, it requires synchronization among processors at each stage, which is expensive when the number of processors is large. To minimize the communication and synchronization overhead, MEs are transmitted asynchronously right after the computation in steps 1 and 2, and then received for the computation of LEs in steps 3 and 4. This communication scheme overlaps computation and communication. MPIFMA also ensures that a minimum number of messages is communicated among processors. This is based on the fact that if box j is in the interaction list of box i, then box i is also in the interaction list of box j. In this case, processor P_i, which contains box i, sends box i's ME directly to processor P_j and vice versa, as long as P_i != P_j. Therefore, for each box i, one can build a list of processors to which box i's MEs will be sent. Since only the required MEs are transmitted, redundancies are eliminated, bringing about minimal transmission. The details can be found in [13]. A sketch of the overlapped, nonblocking exchange is given at the end of this subsection.

Experiments were conducted on an Intel iPSC/860 system with 1, 2, 4, 8, and 16 nodes. The benchmark system is a 3D particle system with particle positions O randomly generated between -0.5 and 0.5. The height of the FMA tree is 4 and the number of terms used in the multipole and local expansions is 8. The running results are listed in Tables 1 and 2, respectively. As shown in the tables, the total force calculation time (excluding initialization time) is dominated by the conversion from multipole expansions to local expansions (ME2LE) and the computation of the near-field force. From the ME2LE column in Tables 1 and 2, the conversion time is reduced by close to a half when the number of processors is doubled. This indicates that the parallel domain decomposition is load balanced in a uniform system. The communication time (see footnote 1) using the optimal communication scheme grows only slightly when the number of processors is increased, an indication that communication and computation were overlapped successfully.

We also ran MPIFMA on an IBM SP2 with 64 nodes. For a system of one million particles, the force calculation took less than 2 minutes per time step using the optimal communication scheme, while it took more than 14 minutes using the broadcast scheme. The communication overhead of the optimal communication scheme is approximately 12 seconds, while that of the broadcast scheme is about 350 seconds. Detailed timing results are summarized in Table 3.
Footnote 1: The communication time includes the time to pack data into a buffer, send, receive, unpack the buffer, and assign it to the proper data structures.
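To make the overlap concrete, the self-contained MPI program below sketches the pattern described above: each rank posts nonblocking receives, sends its multipole coefficients asynchronously to a symmetric list of destination ranks, performs local work while the messages are in flight, and only then waits for completion. The ring-shaped destination list and the flat coefficient buffer are placeholder simplifications; in MPIFMA the destinations are derived per box from the interaction lists, as explained above:

#include <mpi.h>
#include <stdio.h>

#define NTERMS 64   /* stand-in for the packed multipole coefficients of a box */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric destination list (placeholder): my ring neighbors.  Because
     * the list is symmetric, every destination is also a source. */
    int ndest = 0, dest[2];
    if (size > 1) dest[ndest++] = (rank + 1) % size;
    if (size > 2) dest[ndest++] = (rank + size - 1) % size;

    double my_me[NTERMS], recv_me[2][NTERMS];
    for (int k = 0; k < NTERMS; k++) my_me[k] = rank + 0.001 * k; /* "steps 1-2" */

    MPI_Request req[4];
    int nreq = 0;

    /* Post receives first, then send the MEs asynchronously right after the
     * upward pass: exactly one message per (buffer, destination) pair. */
    for (int d = 0; d < ndest; d++)
        MPI_Irecv(recv_me[d], NTERMS, MPI_DOUBLE, dest[d], 0,
                  MPI_COMM_WORLD, &req[nreq++]);
    for (int d = 0; d < ndest; d++)
        MPI_Isend(my_me, NTERMS, MPI_DOUBLE, dest[d], 0,
                  MPI_COMM_WORLD, &req[nreq++]);

    /* Local work that needs no remote data (e.g., ME2LE conversions among
     * locally owned boxes) proceeds while the messages are in flight. */
    double local_sum = 0.0;
    for (int k = 0; k < NTERMS; k++) local_sum += my_me[k];

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    /* Remote MEs are now available for the remaining ME2LE conversions. */
    printf("rank %d: local_sum=%.3f, first remote ME=%.3f\n",
           rank, local_sum, ndest > 0 ? recv_me[0][0] : 0.0);

    MPI_Finalize();
    return 0;
}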

Table 3. One-million-particle simulation on a 64-node SP2 with p = 8 and t = 6 (total run time and communication cost for the broadcast and optimal schemes; the numeric entries are not reproduced here).

Figure 5. Weighted subtrees partitioning among four processors.

Figure 6. Performance comparison between the parallel domain decomposition and the weighted subtrees technique with factors equal to 0.98, 0.95, and 0.90 (execution time of the force calculation, in seconds, versus the number of processors).

3.3 Weighted Subtrees

As shown in the previous sections, the parallel domain decomposition technique performs well if particles are distributed uniformly. Performance degrades when the particles are not distributed uniformly. This is because an equal number of boxes at the finest level together with their ancestors at the higher levels (we call them subtrees) is assigned to each processor, without recognizing that some of the subtrees may be empty. We implemented the weighted subtrees technique, proposed in [12], in MPIFMA to partially alleviate this problem. The weighted subtrees technique keeps track of the number of particles in each box at level 2. It starts from box 1 and stops at box i if the number of particles contained in all boxes from box 1 to box i is approximately equal to N/P, where P is the number of processors used in the simulation. Then, the subtrees with roots at boxes 1, 2, ..., i are assigned to processor 1. The same procedure continues from box i+1 and stops at box j; the subtrees with roots at boxes i+1, ..., j are assigned to processor 2. This procedure continues until all subtrees are assigned to processors. It is assumed that each processor is assigned at least one subtree. A 2D example is shown in Figure 5. Just like the parallel domain decomposition scheme, there is no communication in the upward phase when the weighted subtrees scheme is used, while the load on each processor is more balanced.

In practice, one would like each processor to be responsible for the same number of particles. One obvious, but impractical, way to achieve this is to have at most one particle in each box at the finest level. For non-uniformly distributed systems this is unrealizable, and having at most one particle per box at the finest level increases the number of levels of the hierarchical tree structure, which in turn increases the ME2LE conversion time. For all practical purposes, one would like to have more than one particle per box at the finest level; experiments have shown that better performance is obtained when there are 40 to 50 particles in each box at the finest level [15, 14]. Because each box at the finest level may contain more than one particle, it is not likely that each processor contains exactly N/P particles. Therefore, the subtrees with roots at boxes i+1, ..., j are assigned to a processor once the total number of particles in these boxes exceeds a constant factor times N/P; the factor is suggested to be in the range of 0.80 to 1.0. The values of the factor used in the experiments were chosen arbitrarily. A sketch of this assignment procedure is given below.

The benchmark system is a 3D particle system with particle positions O randomly generated between -0.5 and 0.5 with the restriction that all coordinates of a particle are either positive (0 <= O <= 0.5) or negative (-0.5 <= O < 0). Several particle-system sizes were used, the smallest containing 1600 particles. The height of the FMA tree is 4 and the number of terms used in the multipole and local expansions is 8.
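The following self-contained C sketch shows one way to implement the assignment just described. The level-2 particle counts, the constant factor, and the requirement that every processor receive at least one subtree come from the description above; the skewed test distribution and the tie-breaking details are our own placeholders:

#include <stdio.h>

#define NSUB 64   /* number of level-2 boxes (subtree roots) in 3D */

/* Assign contiguous runs of level-2 subtrees to P processors: a processor
 * keeps taking subtrees until it holds at least factor*N/P particles, and
 * the last processor takes whatever remains.  Assumes P <= NSUB. */
static void weighted_subtrees(const long count[NSUB], int P, double factor,
                              int owner[NSUB]) {
    long N = 0;
    for (int b = 0; b < NSUB; b++) N += count[b];
    double target = factor * (double)N / P;

    int proc = 0;
    long held = 0;
    for (int b = 0; b < NSUB; b++) {
        owner[b] = proc;
        held += count[b];
        int boxes_left = NSUB - b - 1;     /* boxes after b          */
        int procs_left = P - 1 - proc;     /* processors after proc  */
        /* Move to the next processor once the target is met, or as soon as
         * only enough boxes remain to give each later processor one box. */
        if (procs_left > 0 && (held >= target || boxes_left <= procs_left)) {
            proc++;
            held = 0;
        }
    }
}

int main(void) {
    long count[NSUB] = {0};
    int owner[NSUB];
    /* Placeholder irregular system: all particles crowded into 8 subtrees. */
    for (int b = 0; b < 8; b++) count[b] = 800;

    weighted_subtrees(count, 4, 0.95, owner);

    long per_proc[4] = {0};
    for (int b = 0; b < NSUB; b++) per_proc[owner[b]] += count[b];
    for (int p = 0; p < 4; p++)
        printf("processor %d holds %ld particles\n", p, per_proc[p]);
    return 0;
}

On this placeholder distribution each processor ends up with 1600 particles, whereas the plain parallel domain decomposition (16 consecutive subtrees per processor) would place all 6400 particles on one processor and leave the other three idle, which is the kind of imbalance described in the following paragraphs.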
The execution time of the force calculation includes the execution times of all FMA steps except initialization. Figure 6 compares the execution times of the force calculation for a particle system of size 6400 between the parallel domain decomposition and the weighted subtrees parallel FMAs with factors of 0.98, 0.95, and 0.90. The execution time is reduced by approximately a half when the number of processors is increased from one to two, no matter which scheme is employed.

However, for the parallel domain decomposition scheme, the execution time remains almost fixed even when P is increased to 4 or 8. This is because an equal number of subtrees is assigned to each processor regardless of the number of particles in each subtree. At most two processors are assigned subtrees which contain particles, and only those two processors do any computation; the other processors remain idle. When P = 16, there are four processors which were assigned subtrees with particles, and the execution time is close to what the weighted subtrees scheme achieves when P = 4 and the factor is 0.90. For the weighted subtrees scheme, the execution time decreased as the number of processors increased, for all factors used in the experiments. Although not shown here, similar results were observed for the other system sizes, including 1600, 3200, and 12800 particles. The observed speedup of the weighted subtrees scheme is approximately 3 times that of the parallel domain decomposition.

4 Concluding Remarks

The goal of this project is to provide a tool with which users can simulate dynamical systems using their own input. So far, MPIFMA provides the building block for the calculation of long-range forces, such as the Coulomb force, within one time step. We plan to include the calculation of short-range forces in the future.

5 Acknowledgment

We would like to thank John Board at Duke University for the sequential FMA code on which our parallel code is based. We would also like to thank Mark Underwood at the University of Missouri-Rolla for his help in collecting part of the running results shown in this paper.

References

[1] S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall.
[2] C. R. Anderson. An implementation of the fast multipole method without multipoles. SIAM Journal on Scientific and Statistical Computing, 13(4), July 1992.
[3] A. Appel. An efficient program for many-body simulation. SIAM Journal on Scientific and Statistical Computing, 6:85-103, 1985.
[4] J. A. Board, L. Kale, K. Schulten, R. D. Skeel, and T. Schlick. Modeling biomolecules: Larger scales, longer durations. IEEE Computational Science and Engineering, pages 19-30, Winter 1994.
[5] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulations. SIAM Journal on Scientific and Statistical Computing, 9(4), July 1988.
[6] L. Greengard. The Rapid Evaluation of Potential Fields in Particle Systems. The MIT Press, 1988.
[7] L. Greengard and W. D. Gropp. A parallel version of the fast multipole method. Computers and Mathematics with Applications, 20(7):63-71, 1990.
[8] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325-348, 1987.
[9] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446-449, 1986.
[10] J. Katzenelson. Computational structure of the N-body problem. SIAM Journal on Scientific and Statistical Computing, 10(4):787-815, July 1989.
[11] J. F. Leathrum Jr. and J. A. Board Jr. The parallel fast multipole algorithm in three dimensions. Technical report, Electrical Engineering Department, Duke University, April.
[12] E. J.-L. Lu and D. I. Okunbor. An efficient load balancing technique for parallel FMA in message passing environment. Submitted to the Eighth IEEE Symposium on Parallel and Distributed Processing, October 1996.
[13] E. J.-L. Lu and D. I. Okunbor. Massively parallel fast multipole algorithm in three dimensions. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, August 1996 (to appear).
[14] D. Okunbor and E. J. Lu. Parallel fast multipole algorithm using MPI. In Proceedings of the MPI Developers Conference 1995, July 1995.
[15] G. J. Pringle. Numerical Study of Three-Dimensional Flow using Fast Parallel Particle Algorithms. PhD thesis, Napier University, February.
[16] J. K. Salmon. Parallel Hierarchical N-Body Methods. PhD thesis, California Institute of Technology, December.
[17] J. P. Singh. Parallel Hierarchical N-Body Methods and Their Implications for Multiprocessors. PhD thesis, Stanford University, February 1993.
[18] F. Zhao. An O(N) algorithm for three-dimensional N-body simulations. Master's thesis, Department of Electrical Engineering and Computer Science, M.I.T., October 1987.
