
A Load Balancing Routine for the NAG Parallel Library

Rupert W. Ford (1) and Michael O'Brien (2)

(1) Centre for Novel Computing, Department of Computer Science, The University of Manchester, Manchester M13 9PL, U.K. rupert@cs.man.ac.uk
(2) Military Aircraft and Aerostructures, British Aerospace, Warton Aerodrome, Lancashire PR4 1AX, U.K. Michael.OBrien@bae.co.uk

Abstract. This paper describes a load balancing routine which has been developed for the NAG Parallel Library. The routine is designed for load balance problems where each task can be computed independently, and allows the user to choose from a number of different load balancing strategies. The benefits of the routine are discussed in terms of both performance and ease of use, and results are presented for a production RCS prediction code on a Cray T3D and an SGI Origin 2000.

1 Introduction

The load balance routine described in this paper has been developed for inclusion in the NAG Parallel Library [7]. This library is a collection of portable, memory-scalable, parallel Fortran 77 routines for the solution of numerical and statistical problems. This work forms part of the ESPRIT Framework IV project P20018 PINEAPL (Parallel Industrial Numerical Applications and Portable Libraries). The aim of the project is to develop an application-driven, general-purpose library of parallel numerical software that significantly extends the scope of the NAG Parallel Library. The project (coordinated by NAG) is driven by applications from four industrial end users, representing the needs of the numerical scientific and engineering market. In this project parallel library experts are paired with end users; in the University of Manchester's case the end user is British Aerospace (BAe). One of BAe's applications ("System AB3") involves the prediction of the radar cross section (RCS) of an aircraft's air intake duct. The particular technique requires ray tracing to calculate the RCS of an arbitrarily shaped duct. A ray tracer developed at the University of Manchester has been integrated into BAe's "System AB3".

Parallelism can be naturally exploited at the level of rays, as each ray can be calculated independently. Note that the geometry is simple enough to allow its replication on each processor. However, although rays can be calculated independently, their computational cost will vary significantly, depending on the path a ray takes. This means that a static, equal allocation of rays to processors will not necessarily give a load balanced solution.

The load balance routine (called Y01CAFP) was developed to solve the above and similar load balance problems. Y01CAFP is therefore designed to minimise the elapsed time for n independent tasks, where n is fixed and known, running on p processors. The solution to such a problem is often termed task farming, as tasks may be sent (farmed out) to other processors. The routine is primarily designed for problems where n >> p and the time for each task is variable and unknown; however, it can also be of benefit for problems where the time for each task is known but the distribution of tasks is not regular. It is also useful for distributing tasks when all data is held on the root (master) processor.

The next section summarises the design philosophy of the NAG Parallel Library and describes the main features of its implementation. This allows a detailed description of Y01CAFP in the subsequent section. Section 4 discusses the BAe RCS application and the test case used for evaluation. Section 5 presents the results of running the test case on a Cray T3D and an SGI Origin 2000 and, finally, Section 6 gives our conclusions.

2 NAG Parallel Library

The routine described in this paper has been developed for inclusion in the NAG Parallel Library [7]. This library is a collection of parallel Fortran 77 routines for the solution of numerical and statistical problems. The library is divided into chapters, each devoted to a branch of numerical analysis or statistics. The library is primarily intended for distributed memory parallel machines, including networks and clusters, although it can readily be used on shared memory parallel systems that implement PVM [6] or MPI [9]. The library supports parallelism and memory scalability, and has been designed to be portable across a wide range of parallel machines. The library assumes a Single Program Multiple Data (SPMD) model of parallelism, in which a single instance of the user's program executes on each of the logical processors.

The NAG Parallel Library uses the Basic Linear Algebra Communication Subprograms (BLACS) [5] for the majority of the communication within the library. Implementations of the BLACS, available in both PVM and MPI, provide a higher-level communication interface. However, there are a number of facilities that are not available in the BLACS, such as sending multiple data types in one message (multiple messages must be sent) and non-blocking sends and receives. There is, therefore, a clear trade-off between code portability (plus ease of maintenance) and performance. As performance is crucial in load balancing, much of the communication is written in PVM and MPI.
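The SPMD model described above can be made concrete with a minimal sketch. The fragment below is illustrative only: it uses raw MPI from Fortran 77 rather than the library's own Z01AAFP/Z01ABFP interface, and the program name is hypothetical. Every processor executes the same program; only the processor identifier returned by the message-passing layer distinguishes the instances.

      PROGRAM SPMD
C     Minimal SPMD sketch: one instance of this same program runs on
C     every logical processor; the rank alone distinguishes them.
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, NPROC
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
      WRITE (*,*) 'PROCESSOR ', RANK, ' OF ', NPROC
      CALL MPI_FINALIZE(IERR)
      END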

The library is designed to minimise the user's concern with the use of the BLACS, PVM or MPI, and presents a higher-level interface using library calls. Task spawning and the definition of a logical processor grid and its context are handled by the parallel library routine Z01AAFP. On completion, the library routine Z01ABFP is called to undefine the grid and context. The routines Z01AAFP and Z01ABFP can be considered as left and right braces, respectively, around the parallel code.

3 Load Balance Routine (Y01CAFP)

3.1 Code integration

Y01CAFP assumes the problem requiring load balancing is written in the form

      DO I = 1, NLOCAL
         CALL TASK(I)
      END DO

where all data is passed into TASK through COMMON, the index I distinguishes the actual task to be performed, and all tasks are independent. A call to the routine then replaces the above code fragment. The user must therefore modify the program so that it conforms to this specification. The user must also supply a routine (whose specification is defined in the documentation) which will pack or unpack the data required to compute a task for a range of contiguous indices. Y01CAFP will call this routine to pack and unpack data into the appropriate indices. Y01CAFP supplies pack (NAGPACK) and unpack (NAGUNPACK) routines to facilitate this task. These routines are wrappers around the PVM and MPI versions and have a similar syntax.

As well as specifying how many tasks are on a particular processor (NLOCAL), the user must also specify the maximum number of tasks that could be computed on that processor (NMAX). The load balancer will use the space between NLOCAL and NMAX to perform any required remote computation. NMAX must be large enough to allow any required remote computation to take place. The actual amount is defined in the user documentation and the program will give an error if NMAX is not large enough. In the PVM implementation Y01CAFP makes use of the system buffers to buffer data. As MPI does not support system buffers, the user must supply a buffer large enough to send and receive the largest message.
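As an illustration of this calling convention, the sketch below shows a user code already in the required form: the task data lives in COMMON and the loop index alone selects the task. The TASK body and the COMMON layout are hypothetical examples; the loop is the fragment that the call to Y01CAFP would replace.

      PROGRAM USRTSK
C     Task data is passed through COMMON, as Y01CAFP requires.
      INTEGER NLOCAL, I
      DOUBLE PRECISION X(1000), Y(1000)
      COMMON /TDATA/ X, Y
      NLOCAL = 1000
      DO I = 1, NLOCAL
         X(I) = DBLE(I)
      END DO
C     The loop below is the fragment that a call to Y01CAFP replaces.
      DO I = 1, NLOCAL
         CALL TASK(I)
      END DO
      END

      SUBROUTINE TASK(I)
C     Each task touches only index I of the COMMON data, so tasks are
C     independent and may be computed on any processor.
      INTEGER I
      DOUBLE PRECISION X(1000), Y(1000)
      COMMON /TDATA/ X, Y
      Y(I) = X(I)*X(I)
      END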

3.2 Load Balancing options

Y01CAFP allows the user to select one of four different load balancing strategies: 'ASIS', 'BLOCK', 'CYCLIC' and 'GRAB'. In addition, 'CYCLIC' and 'GRAB' have a block size (BSIZE) which is set by the user. If all of the data is initially on the root processor (NLOCAL = 0 on all other processors) then the master/slave (MASLV) option can be set. This option dedicates the root processor to communication. In this case Y01CAFP is effectively parallelising the application.

Note that changing the load balancing options in Y01CAFP will not affect the results, only the load balance and, therefore, the solution time. Y01CAFP accepts any initial data distribution and the final distribution will be the same as the initial distribution. Y01CAFP provides an indication of how it has performed through the TINFO and NINFO arrays. These arrays give timing and counting information respectively.

ASIS: 'ASIS' performs no load balancing. It is useful for testing the correct working of the code when the load balancing routine is first used. It can also be used to determine the load imbalance inherent in the problem, using the NINFO and TINFO output arrays, and gives a non-load-balanced timing result allowing comparison with any load balanced results.

BLOCK: 'BLOCK' should be used when the computational costs of the tasks are the same but their distribution across processors is irregular. Note that if the distribution were regular in this case, the problem would already be load balanced. The implementation of 'BLOCK' takes a given task distribution and redistributes it so that each processor has no more than ⌈n/p⌉ tasks. It attempts to minimise the number of messages sent by a combination of sending tasks from the most loaded processor to the least loaded processor and looking for pairs of equally overloaded and underloaded processors [10]. A sketch of this redistribution is given at the end of this subsection.

CYCLIC: 'CYCLIC' should be used when the computational costs of successive tasks (in iteration space) are similar, but the load varies over many iterations. The distributed task indices are treated as a single global index ordered by processor identifier. The implementation of 'CYCLIC' firstly computes the global iteration space. Secondly, processors send all local tasks which require redistribution. Note that if a processor needs to send more than one block to the same remote processor, it does so in separate messages. Thirdly, processors compute any local tasks and, finally, processors compute any remote tasks and return the results.

GRAB: 'GRAB' should be used when the computational costs of the tasks are unknown. In this case a regular distribution of tasks (as given by the two previous strategies) may result in load imbalance. With this option each processor performs its own computation, then steals 'BSIZE' tasks at a time from any processors which are still computing, until all work has completed. The implementation of 'GRAB' checks for any task requests after computing each local block of 'BSIZE' tasks. If it receives a request and has more than 'BSIZE' tasks remaining, it sends these to the requesting processor; otherwise it sends a negative acknowledgement (NACK). When a processor has finished its own tasks it asks each processor in turn for work. It completes when it has received NACKs from all other processors and has sent NACKs to all other processors.

MASLV: The 'MASLV' option is only relevant when all tasks are on the root processor, i.e. NLOCAL = 0 on all processors except the root. If, in this case, MASLV is .TRUE., the root processor does not take part in any computation; it is used purely for communication. This option is useful when the cost of communication is high enough to significantly slow the root processor. For example, if the communication costs of sending and receiving data were equal to the computation costs, the root processor would take approximately twice as long as the other processors. This effect increases with the number of processors, the amount of data transferred and the speed of the processor; it decreases with the speed of the network. Output from NINFO and TINFO helps the user to gauge these effects.
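The following sketch illustrates the 'BLOCK' target distribution: no processor should hold more than ⌈n/p⌉ tasks. It repeatedly moves surplus tasks from the most loaded to the least loaded processor; the full message-minimising pairing heuristic of [10] is simplified here, and the initial task counts are invented for illustration.

      PROGRAM BLKPLN
C     Sketch of the 'BLOCK' redistribution: cap every processor at
C     CEIL(N/P) tasks.  NT holds illustrative initial counts.
      INTEGER P
      PARAMETER (P = 4)
      INTEGER NT(P), N, CAP, I, IMAX, IMIN, MOVE
      DATA NT /10, 0, 5, 1/
      N = 0
      DO I = 1, P
         N = N + NT(I)
      END DO
      CAP = (N + P - 1)/P
C     Repeatedly move surplus from the most loaded processor to the
C     least loaded one until every processor is at or below the cap.
   10 CONTINUE
      IMAX = 1
      IMIN = 1
      DO I = 2, P
         IF (NT(I) .GT. NT(IMAX)) IMAX = I
         IF (NT(I) .LT. NT(IMIN)) IMIN = I
      END DO
      IF (NT(IMAX) .GT. CAP) THEN
         MOVE = MIN(NT(IMAX) - CAP, CAP - NT(IMIN))
         NT(IMAX) = NT(IMAX) - MOVE
         NT(IMIN) = NT(IMIN) + MOVE
         WRITE (*,*) 'SEND', MOVE, ' TASKS FROM', IMAX, ' TO', IMIN
         GO TO 10
      END IF
      END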

4 RCS Example

4.1 Description

The purpose of this application is to predict the radar cross section (RCS) of an aircraft's air intake duct. Ducts are particularly important as they act as a waveguide, propagating energy (electromagnetic (EM) waves) back in the receiver direction. Therefore a large portion of an aircraft's RCS is due to duct reflection. Ray tracing techniques [2] are useful for RCS analysis as they allow the realistic modelling of physical systems with arbitrarily shaped ducts and different absorption characteristics [1].

Manchester University has developed a ray tracer for inclusion into BAe's "System AB3". This code uses ray tracing to calculate the RCS of an arbitrarily shaped duct. The duct geometry is designed using the CAD package CATIA, whose surfaces are output as parametric bi-cubic patches [2] in PATRAN [3]. A user-generated "ASPECT" file controls the position, direction, frequency, angle and polarisation of the initial EM rays. Rays are exclusively directed inside the duct as this is the area of interest. These rays are then ray traced by AB3. At each ray/surface intersection the EM characteristics of the ray are modified based on the intersected surface characteristics. The rays are terminated either when they emerge from the duct or when their energy falls below an appropriate threshold. AB3 integrates the emerging rays to obtain the RCS.

4.2 Integration into the NAG Parallel Library

In AB3, a set of rays, whose starting points are arranged in a two-dimensional grid, are traced into the duct. This was implemented as a double loop over the rays' initial coordinates. To make the AB3 code conform to the load balance specification this double loop had to be changed to a single index (a sketch of the collapsed loop is given below). The data was already passed into the routine using COMMON. The packing and unpacking routine was simple to implement as all data was dependent on the index.

The initial implementation added the NAG begin-parallel (Z01AAFP) and end-parallel (Z01ABFP) calls around the code. All non-root processors then skipped the initialisation and waited in Y01CAFP while the root processor set up all the data. In this case the load balancer distributes the work from the root to the remaining processors and acts as if it is parallelising the code. Whilst this version was a useful starting point, the memory requirements of the root processor meant that it would not scale to large problem sizes.
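The loop collapse mentioned above can be sketched as follows. The grid dimensions and the traversal order are illustrative (they are not taken from AB3): the double loop over grid coordinates becomes a single loop over one task index, from which the coordinates are recovered.

      PROGRAM COLAPS
C     Collapsing a double loop over a 2-D ray grid into the single
C     task index required by the load balance specification.
      INTEGER NX, NY, I, IX, IY
      PARAMETER (NX = 4, NY = 3)
C     Original form (sketch):
C         DO IY = 1, NY
C            DO IX = 1, NX
C               ... trace ray (IX, IY) ...
C     Collapsed form: one index I distinguishes each ray and the grid
C     coordinates are recovered from it inside the task.
      DO I = 1, NX*NY
         IX = MOD(I - 1, NX) + 1
         IY = (I - 1)/NX + 1
         WRITE (*,*) 'TASK', I, ' -> RAY (', IX, ',', IY, ')'
      END DO
      END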

[Fig. 1. External view of a de-classified BAe duct.]

To remove this limitation the data was pre-distributed amongst the processors. This was done in two ways. The first (termed block) assigned the first ⌈n/p⌉ rays to the first processor, the next ⌈n/p⌉ rays to the second processor, and so on. The second (termed cyclic) assigned the first ray to the first processor, the second ray to the second processor, and so on, wrapping round to the first processor after the last processor. A sketch of the two assignments follows.
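The two pre-distributions can be sketched as ownership functions. The values of n and p below are invented for illustration; processor identifiers run from 0 to p-1.

      PROGRAM OWNERS
C     Block assigns the first CEIL(N/P) rays to processor 0, the next
C     CEIL(N/P) to processor 1, and so on; cyclic assigns successive
C     rays to successive processors, wrapping round after the last.
      INTEGER N, P, I, IBLK, ICYC
      PARAMETER (N = 10, P = 3)
      DO I = 1, N
         IBLK = (I - 1)/((N + P - 1)/P)
         ICYC = MOD(I - 1, P)
         WRITE (*,*) 'RAY', I, ' BLOCK OWNER', IBLK,
     &               ' CYCLIC OWNER', ICYC
      END DO
      END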

The final RCS prediction code section (integrating the emerging rays to obtain the RCS) has not been parallelised. The results from each processor are sent to the root processor, which performs this computation. This section is included in the timing results presented and, for large numbers of processors, becomes an important factor.

4.3 Test case

The test case used in this paper is a duct which has the complexity of real ducts currently in use and/or being developed by BAe, but which has been modified so that it can be de-classified. The external visual ray tracing of this duct has been performed using the ray tracer developed at the University of Manchester (which is a modification of krt [4] to allow bi-cubic patch intersection). This is also the ray tracer which has been modified to form part of "System AB3". The patches have been artificially shaded to highlight them; see Figure 1.

5 Results

In all versions described in this section the grab option uses a block size of l/(5p²), where p is the number of processors and l is the total number of rays. Smaller block sizes were investigated but made no difference to the performance. In this section the problem sizes are given in terms of the ray density, which is the number of rays per wavelength. In the example code the wavelength is 3 cm and the frequency is approximately 10 GHz. The total number of rays is also given for reference. At the time the following results were taken, the TINFO and NINFO performance analysis arrays and the CYCLIC option described in Section 3.2 were not implemented.

[Fig. 2. SGI O2000, initial distribution all on root, ray density 13 (106,912 rays). Plot of 1/t against nprocs for: naive ideal, lb block, lb grab, lb grab master/slave.]

In Figure 2 the initial data is stored on the root processor. The reciprocal of wall-clock time is given on the y-axis, giving the equivalent of a speedup graph without normalising the time taken. The 'naive ideal' line is simply the sequential time divided by the number of processors. The load balance block option suffers from load imbalance, which is improved by the load balance grab option. In these cases the root processor is both computing its own work and sending and receiving work to and from the other processors. To determine this performance penalty the grab option is repeated with the master/slave option set to true. Note that, to show this overhead, the root processor is not included in nprocs, which in this case is the number of computing processors. This shows that much of the remaining difference from the ideal line is due to this overhead.

The data was then pre-distributed in equal-sized blocks across the processors. With the load balance block option set, the result is identical to the block option in Figure 2 and is therefore not presented. This result shows that the load balance block option is efficient when the data is all on one processor. Note that in this pre-distributed case Y01CAFP does not have to perform any data redistribution. The load balance grab option is given in Figure 3 and it performs as well as Y01CAFP with the grab and master/slave options (originally shown in Figure 2), which is also displayed as a reference. This shows that pre-distributing the data removes the data transfer bottleneck from the root processor.

The data was then pre-distributed in a cyclic manner. Figure 3 shows that with the load balance block option the performance is as good as the other two options. The load balance block option actually performs no data redistribution in this case. This means that for this problem, pre-distributing the rays in a cyclic manner gives very good load balance. The load balance grab option gives no further improvement and is therefore not included.

[Fig. 3. SGI O2000, initial distribution block and cyclic, ray density 13 (106,912 rays). Plot of 1/t against nprocs for: naive ideal, lb grab master/slave, lb block grab, lb cyclic block.]

Figure 4 presents results for the same test case, scaling up to a much larger number of processors on a Cray T3D. The initial block distribution with the load balance block option performs worst due to load imbalance. Changing this option to grab brings the performance close to that for the initial cyclic distribution with the load balance block option. The initial cyclic distribution with the load balance grab option performs the best by a small margin. The performance improvement falls off primarily due to the sequential fraction mentioned in Section 4.2.

[Fig. 4. Cray T3D, ray density 13 (106,912 rays). Plot of 1/t against nprocs for: naive ideal, lb block block, lb cyclic block, lb block grab, lb cyclic grab.]

[Fig. 5. Cray T3D, ray density 50 (1,586,356 rays). Plot of 1/t against nprocs for: naive ideal, lb block block, lb cyclic block, lb block grab, lb cyclic grab.]

Figure 5 presents results for the same options as the previous figure, with a much greater ray density of 50 (1,586,356 rays). The trends are very similar; however, the larger problem size is more scalable. At 256 processors the cyclic pre-distribution actually performs slightly worse with the load balance grab option than with the load balance block option.

Figure 6 again presents results for the same options as Figure 4, for a ray density of 200 (25,370,577 rays). At this density the problem will only run on 128 or more processors due to memory limitations. In this example all options scale linearly except the initial block distribution with the load balance block option, which suffers from load imbalance.

[Fig. 6. Cray T3D, ray density 200 (25,370,577 rays). Plot of 1/t against nprocs for: naive ideal, lb block block, lb cyclic block, lb block grab, lb cyclic grab.]

6 Conclusions

Y01CAFP has proven to be useful for distributing work from the root processor to the remaining processors. Note that the load balancer is effectively parallelising the application here. In this case the master/slave option helps reduce the communication bottleneck at the root processor by dedicating it to this task.

For larger problems the data must be pre-distributed (particularly for distributed memory machines), not only for performance but also so that the memory requirements per processor are not too high. In the example problem presented in this paper a cyclic pre-distribution of the data gives a near load balanced solution (as rays close to each other follow similar paths and thus have a similar computational cost). However, for all problem sizes an initial block distribution of data with the grab load balance option gives very similar performance results. This suggests that for a different dataset, or a different problem entirely, where an initial cyclic distribution is not feasible or does not give a load balanced solution, an initial block distribution of data with the grab load balance option will load balance the problem.

The initial integration of the load balancer into the RCS code was relatively simple, and BAe are now using the NAG Parallel Library and the Y01CAFP load balance routine in production runs with much greater ray densities and more realistic geometries than were previously possible. In summary, Y01CAFP has proven to be a very powerful, flexible and useful load balancing routine.

References

1. Ling, H., et al.: Shooting and Bouncing Rays: Calculating the RCS of an Arbitrarily Shaped Cavity. IEEE Transactions on Antennas and Propagation, Vol. 37, No. 2, February 1989.
2. Watt, A.: Fundamentals of Three Dimensional Computer Graphics. Addison Wesley.
3. PATRAN Plus User Manual.
4. Keates, M., Hubbold, J.: Accelerated Ray Tracing on the KSR1. Technical Report UMCS, University of Manchester.
5. Dongarra, J., Whaley, R. C.: A User's Guide to the BLACS v1.1. Technical Report CS, University of Tennessee, Knoxville, Tennessee (1997).
6. Geist, A., Beguelin, A., Dongarra, J., Manchek, R., Jiang, W., Sunderam, V.: PVM: A Users' Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts (1994).
7. N.A.G.: N.A.G. Parallel Library Manual, Release 2. N.A.G. Ltd., Oxford (1997).
8. N.A.G.: N.A.G. Fortran Library Manual, Mark 17. N.A.G. Ltd., Oxford (1997).
9. Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J.: MPI: The Complete Reference. The MIT Press, Cambridge, Massachusetts (1996).
10. Ford, R.: A Message Minimisation Algorithm. CNC Technical Report, Department of Computer Science, The University of Manchester, Manchester, U.K. (1998).
