GRID*p: Interactive Data-Parallel Programming on the Grid with MATLAB

Imran Patel and John R. Gilbert
Department of Computer Science, University of California, Santa Barbara
{imran, gilbert}@cs.ucsb.edu

Abstract

The Computational Grid has emerged as an attractive platform for developing large-scale distributed applications that run on heterogeneous computing resources. This scalability, however, comes at the cost of increased complexity: each application has to handle the details of resource provisioning and of data and task scheduling. To address this problem, we present GRID*p, an interactive parallel system built on top of MATLAB*p and Globus that provides a MATLAB-like problem solving environment for the Grid. Applications under GRID*p achieve automatic data parallelism through the transparent use of distributed data objects, while the GRID*p runtime takes care of data partitioning, task scheduling and inter-node messaging. We evaluate the simplicity and performance of GRID*p with two different types of parallel applications consisting of matrix and graph computations, one of which investigates an open problem in combinatorial scientific computing. Our results show that GRID*p delivers promising performance for highly parallel applications while at the same time greatly simplifying their development.

I. INTRODUCTION

Recent advances in high performance computing have led to the availability of a large number of inexpensive processing and storage resources. This trend, coupled with the availability of high-speed networks, has led to the concept of the Computational Grid [1]. A Computational Grid enables the development of large-scale applications on top of loosely-coupled heterogeneous computing resources using a collection of middleware services. Grid applications utilize these services to access a diverse set of resources, thereby achieving high scalability and performance. But this scalability comes at the price of increased complexity in the application: in order to achieve high performance, each application has to handle complex issues such as scheduling tasks and data on its set of acquired resources.

Current research in Grid computing has mostly focused on middleware for locating, accessing and monitoring resources. A few components exist that broker resource provisioning and contracts based on the scheduling constraints of the application. An application programmer, however, still has to deal with a complex abstraction layer even for a relatively simple parallel computation. As a result of the lack of a higher-level programming model and an associated runtime environment, Grid applications remain hard to develop.

To address this problem, we present GRID*p, an interactive parallel system built on top of MATLAB*p [2] and Globus [3] that provides a MATLAB-like problem solving environment for the Grid. Applications under GRID*p achieve automatic data parallelism while the GRID*p runtime takes care of data partitioning, task scheduling and inter-node messaging. GRID*p relies on the abstract distributed object model of MATLAB*p to provide highly intuitive and almost transparent parallel constructs to the programmer.

This paper makes two key contributions. First, it presents an extremely powerful and simple-to-use interactive parallel programming environment that enables the development of implicitly data-parallel applications. Users interact with GRID*p through distributed matrices under the standard MATLAB client. These distributed objects are syntactically almost identical to standard MATLAB matrices; however, regular MATLAB operations on distributed matrices are automatically parallelized by GRID*p. As a result, even a less sophisticated user or programmer can achieve parallelism from the familiar interactive MATLAB shell with minimal effort. The underlying runtime achieves high performance while hiding the details of data layout and messaging, even on a complex network topology. The second contribution is a demonstration of the feasibility of the system through the experimental evaluation of two different flavors of parallel applications. In particular, we present a highly task-parallel stochastic simulation that has yielded improved results for an open problem in the field of combinatorial scientific computing.

The rest of this paper is organized as follows. In the following section we provide background on the Globus Toolkit, which is currently the de facto software standard for developing Grid applications. We also discuss the associated MPICH-G2 [4] toolkit, an implementation of the MPI standard on top of Globus, followed by a brief discussion of the cluster-based MATLAB*p system. We note some implementation issues with the current prototype of GRID*p in Section 3. Section 4 describes two parallel applications and the results of their experimental evaluation under GRID*p. Sections 5 and 6 discuss related and future work respectively. We conclude in Section 7.

II. BACKGROUND

A. Globus and MPICH-G2

The Globus Toolkit is a set of middleware services that export standardized interfaces to uniformly and securely access a diverse pool of computing resources. Specifically, these services provide facilities such as process management for
remote job execution, directory services for resource discovery, and high-performance data management for data transfer and staging. They are implemented in the form of UNIX daemon processes and as web services. For our purposes, the most important components are the Globus Resource Allocation and Management subsystem (GRAM) and the Globus Security Infrastructure (GSI). The GRAM module is responsible for remote job execution and management in Globus; it also manages the standard input, output and error streams associated with the job. GSI provides authentication and authorization of users and their jobs using facilities such as single sign-on and delegation. It employs the X.509 PKI standard to implement per-host and per-user security certificates for identification.

MPICH-G2 is an implementation of the MPI standard under Globus. It uses GRAM for launching remote jobs and messaging. It also provides unique features such as parallel TCP streams for high throughput and topology-enabled collective communications to counter high network latencies.

B. MATLAB*p

MATLAB*p is a MATLAB extension that provides parallel programming facilities for the MATLAB language. The MATLAB*p extension connects the MATLAB client with an MPI-based parallel backend. MATLAB*p adds only a very minimal set of constructs to the MATLAB language, allowing users to parallelize their existing sequential programs with relative ease.

Parallel constructs in MATLAB*p are designed around the concept of a distributed object, and all parallel operations in MATLAB*p primarily operate on distributed objects. For example, dense matrices are implemented as distributed objects (called ddense) with row, column or block-cyclic distribution of elements. Most of the operations on dense matrices are implemented at the backend using highly optimized parallel numerical libraries (also called packages) such as ScaLAPACK. Similarly, a sparse matrix object (named dsparse) uses a row-distributed layout where each row is stored in compressed row format [5]. This design leads to syntax that is simple yet very powerful and expressive. In most cases, a user just needs to specify a matrix to be of distributed type; subsequent operations on that matrix, if provided by a package, achieve automatic data parallelism.

In cases where the user desires embarrassingly parallel SPMD behavior using a MATLAB function, MATLAB*p provides a multi-MATLAB mode (mm mode). Under this mode, MATLAB*p simply invokes a MATLAB shell on each node executing the specified function. If one of the arguments to the function is a distributed object, MATLAB*p converts the local chunk of the object at each node to a regular MATLAB matrix and passes it to the function. If required, results from the mm mode call are returned as distributed objects.
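As a concrete illustration of this programming style, the following minimal sketch uses the MATLAB*p conventions just described (the *p dimension suffix and the mm call); it assumes a running MATLAB*p backend, and the particular operations and sizes are illustrative only, not a statement of which routines the GRID*p prototype supports:

n = 4000;
A = randn(n, n*p);    % ddense matrix, distributed by columns across the backend nodes
B = randn(n, n*p);
C = A * B;            % ordinary MATLAB expression, executed in parallel by a backend package
s = sum(C);           % column sums computed on the distributed object
m = mm('max', A);     % mm mode: each node applies MATLAB's max to its local block of A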
III. GRID*p: IMPLEMENTATION

We initially envisioned GRID*p as a straightforward port of MATLAB*p to the Grid based on the MPICH-G2 MPI library. However, network-layer complexities arising from firewalling and Network Address Translation (NAT) on the clusters made our prototype implementation much more complex. We discuss these issues later in this section.

Fig. 1. GRID*p architecture: the MATLAB client and the MATLAB*p layer (matrix manager, package manager, ScaLAPACK/BLACS and other packages) on top of MPICH-G2 and the Globus services (GRAM, GSI).

Leaving network-layer issues aside, the high-level architecture of GRID*p is schematically represented in Fig. 1. As can be seen, GRID*p is essentially a MATLAB*p extension that uses MPICH-G2 and Globus as a substrate. The current prototype of GRID*p supports the basic operations on ddense objects and a minimal subset of numerical linear algebra routines using the ScaLAPACK parallel numerical linear algebra package and its associated BLACS communication library. We plan to integrate more packages into GRID*p in the near future.

As mentioned earlier, an important feature of GRID*p is its ability to handle network communications in the presence of firewalls and network address translation (NAT). This feature is incorporated in GRID*p more out of necessity than design. Most of the commodity clusters deployed today allow public IP-layer access to only one machine (called the gateway or head node); the rest of the nodes in the cluster are assigned private IP addresses and are not visible from outside the network. Unfortunately, Globus only supports machines with a public IP address and DNS name. This limitation is due to the design of the X.509 PKI-based Grid security module (GSI). Each node in Globus is identified by an X.509 public key certificate that is tied to its public DNS name. This ensures that a node can be authenticated by comparing the DNS name in its certificate with the result of a reverse DNS lookup on its IP address. For example, when a node requests remote job execution on another node using GRAM, both nodes perform this check. However, this mechanism prevents internal nodes with private IPs in a cluster from offering or receiving services in Globus.

Our workaround for this restriction consists of two techniques. First, we copy the gateway node's certificate (associated with the cluster's sole public DNS name) onto each internal node. Second, Globus services on an internal node are advertised as running on specific TCP ports on the gateway node. Service requests arriving at a particular port number at the gateway node are then relayed to the designated internal node using port-forwarding rules in the firewall. For each request, a private node presents its gateway node's X.509 certificate to the remote node, and since the latter sees the gateway node as the endpoint at the IP layer, the reverse DNS check succeeds.

To better explain the complex message passing, we present an example in Fig. 2. The setup roughly resembles our Grid testbed, consisting of two clusters configured with NAT and firewalling enabled. It is assumed that outbound connections originating from within the cluster are allowed (which is usually the case in practice). Arrows represent network flows, and the attached labels mark their network endpoint addresses in the form <srcip:port, destip:port>.

Fig. 2. GRID*p messaging between the UCSB cluster (gateway G1) and the MIT cluster (gateway G2). The gateway's port-forwarding table maps port ranges to internal nodes (e.g., 1000:2000 to Node-A, 2000:3000 to Node-B), and flows carry labels such as <C:x, G1:1001> and <G1:y, B:2119>.

Now consider the case when a job is launched across clusters (the dotted flow). To launch a job on node A, node C connects to the head node G1 on a specific port, which is then forwarded to the Globus GRAM component (the Gatekeeper) on node A. Node A presents G1's certificate to node C, and since the latter sees G1 as the endpoint of the connection, its reverse DNS check succeeds. Quite surprisingly, the traffic flow is more complex when a job is launched within the same cluster (designated by solid arrows). To see why, consider the case where A tries to launch a job on node B directly. Node A will present itself as G1 using G1's node certificate, but node B's reverse DNS query on node A's IP address (which is private) will not match G1's DNS name and the job will fail. Thus, to launch a job on node B, node A has to use port forwarding through G1. Additionally, G1 alters the source address of the flow from node A, so that node B sees the connection as coming from G1. The reverse DNS check by B on the incoming connection then succeeds.

One drawback of our solution is that it requires the addition of NAT/firewall rules on each gateway node. Moreover, the use of relays can adversely affect communication latencies; this problem is all the more acute when an intra-cluster connection is routed through the gateway node. However, we believe that this solution, although inelegant, is very effective in integrating widely deployed firewalled clusters into a Globus-based Grid. We also note that this is the only solution known to us that allows Globus to be used in such a setting.

IV. EXPERIMENTS AND RESULTS

Our experiments were conducted on a computational Grid testbed consisting of two computer clusters: a 4-node cluster of Intel Xeon 2.0 GHz PCs at UCSB and a 16-node cluster of Intel Xeon 2.4 GHz PCs located at MIT. The UCSB nodes are equipped with 1 GB of RAM and the MIT nodes have 2 GB of RAM each. All nodes run the Linux operating system, and our applications had almost exclusive access to the clusters during the experimental runs. Each cluster is configured with iptables firewall rules to use NAT, with only one publicly visible DNS name assigned at the gateway node.
The wide area network (WAN) that connects the clusters at UCSB and MIT is part of the Abilene Internet2 backbone. We observed ping latencies of about 80-90 ms across the WAN during our experiments.

A. Minimum degree ordering

In order to verify the scalability of GRID*p in the absence of network-layer latencies, we decided to experiment with highly task-parallel applications with negligible inter-node messaging. As a representative of this class of applications, we implemented a simple Monte Carlo-like simulation under GRID*p to investigate an open problem in combinatorial scientific computing. More specifically, we experimentally evaluated the performance of minimum degree ordering [6] on a regular grid graph during the symbolic factorization phase of a sparse matrix. Since finding an ordering that minimizes fill for arbitrary graphs is computationally intractable [7], various heuristics and algorithms for specialized graphs are used. Minimum degree ordering is one such heuristic: it eliminates a vertex of minimum degree at each step. For the regular grid (and the torus), minimum degree ordering with an ideal tie-breaking scheme has an optimal fill of O(n log n). However, Berman and Schnitger [8] devised a tie-breaking strategy that yields Θ(n^(log_3 4)) fill for a k × k torus graph with n = k^2 vertices. But determining the theoretical worst-case and average-case
bound of minimum degree ordering for the regular grid (and the torus) still remains an interesting open problem.

We empirically evaluated the average-case performance of minimum degree ordering for sparse matrices corresponding to the 5-point model problem on a k × k torus. Each trial of our stochastic simulation consists of computing the minimum degree ordering of an initial random permutation of the k^2 × k^2 matrix. Since we use MATLAB's symmetric approximate minimum degree ordering (symamd), the initial permutation ensures randomized tie-breaking. Moreover, each instance of the symbolic factorization on a random permutation of the matrix is independent and hence can be conducted in parallel. A simple one-line mm call under GRID*p allows us to execute the trials in parallel:

f = mean(mm('mdstats', k, ntrials/np, seeds));

Here, np is a built-in variable whose value is bound to the number of processors in the system, and seeds is a distributed vector that holds seed values to initialize the random number generator on the nodes. The serial function mdstats performs ntrials/np symbolic factorizations on each node (so the trials run in parallel across nodes) and returns the average fill from all nodes in a ddense vector of size np. Finally, the overall average fill is computed by the mean operation over the distributed vector. Each iteration is independent of the others and hence the problem is perfectly parallel; the only communication occurs in the final reduction stage, when all the samples are aggregated and their mean is computed. The matrices are generated randomly and hence we do not have to deal with distribution of the input data.
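To make the structure of such a trial concrete, the following is a minimal sketch of what a serial kernel like mdstats might look like in plain MATLAB. This is our illustrative reconstruction, not the authors' code: it builds the 5-point sparsity pattern on a k × k torus and uses symamd and symbfact to measure the fill of each randomly permuted instance.

function avgfill = mdstats(k, ntrials, seed)
  % Average Cholesky fill of symamd orderings over random symmetric
  % permutations of the 5-point matrix on a k-by-k torus (illustrative sketch).
  rng(seed);                                 % per-node random stream
  e = ones(k, 1);
  C = spdiags([e e], [-1 1], k, k);          % path adjacency
  C(1, k) = 1;  C(k, 1) = 1;                 % wrap around: cycle (torus boundary)
  A = kron(speye(k), C) + kron(C, speye(k)) + 4*speye(k^2);   % 5-point pattern
  n = size(A, 1);
  fillcount = zeros(ntrials, 1);
  for t = 1:ntrials
    q = randperm(n);                         % random relabeling gives randomized tie-breaking
    B = A(q, q);
    prm = symamd(B);                         % symmetric approximate minimum degree ordering
    fillcount(t) = sum(symbfact(B(prm, prm)));   % nonzeros in the Cholesky factor
  end
  avgfill = mean(fillcount);
end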
To evaluate the performance of our application, we ran multiple simulations with a varying number of trials for a fixed-size sparse matrix. We determined the size of the matrix by calculating the largest matrix that could fit in the physical memory of a single node. We ran the experiments on a 12-node Grid testbed and a 3-node cluster. Fig. 3 plots the comparative running time of the simulations on the cluster and the Grid against the number of trials for a 2.5 million × 2.5 million matrix. The results show that the highly task-parallel nature of the application is instrumental in achieving an almost linear speedup: the Grid setting has four times as many nodes as the cluster and performs nearly four times faster. The number of results transferred over the network during the final reduction is ksteps × ntrials, and hence the communication is low.

Fig. 3. Performance of minimum degree ordering on a 2.5 million × 2.5 million matrix: running time (seconds) versus number of trials on the Grid (12 nodes) and on the cluster (3 nodes).

To further characterize the fill as a function of the size of the input matrix, we ran simulations on matrices varying in size from n = 10^4 to n = 256 × 10^4. The generated data points were then used to determine the best functional forms using non-linear least-squares regression. Tables I and II list the parameters and statistics of the two best functional fits, a·n^α + b and a·n·(log n)^α respectively.

TABLE I. Fit to a·n^α + b (n = 10^4 : 10^4 : 256 × 10^4)

No. of trials    a                α                R^2    RMSE
12               8.293 ± 0.699    1.154 ± 0.006    1      2.375 × 10^5
60               8.691 ± 0.611    1.150 ± 0.004    1      1.979 × 10^5
120              8.554 ± 0.407    1.151 ± 0.003    1      1.338 × 10^5

TABLE II. Fit to a·n·(log n)^α (n = 10^4 : 10^4 : 256 × 10^4)

No. of trials    a                 α                R^2    RMSE
12               0.1027 ± 0.019    2.177 ± 0.063    1      2.756 × 10^5
60               0.1079 ± 0.018    2.160 ± 0.033    1      1.449 × 10^5
120              0.1062 ± 0.008    2.165 ± 0.025    1      1.121 × 10^5

The values of α and a are listed with 95% confidence bounds along with goodness-of-fit statistics. R^2 values close to one indicate that the fitted curve explains almost all of the variation in the data. Each data point computed by a simulation is a mean over all of its trials, so increasing the number of trials diminishes the effect of outliers; as a result, the quality of the fits improves for simulations with a larger number of trials. Based on the data, we can therefore formulate the asymptotic average-case bound as n^1.15 or n (log n)^2.16.

We make two major observations from our results. First, in practice, the average-case fill generated by minimum degree ordering is worse than the O(n log n) optimal bound. Second, the average-case fill bound is actually better than the O(n^1.26) fill induced by the Berman-Schnitger tie-breaking construction. The experiments conducted by Berman and Schnitger were quite limited and inconclusive due to the choice of very small matrices and trial counts and the lack of a proper statistical methodology. Our parallel application allows us to experiment with a wider range of matrix sizes and simulation trials by utilizing, respectively, the entire memory capacity and the aggregate CPU cycles of the Grid.
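For reference, a fit of this form can be reproduced roughly as follows with MATLAB's Curve Fitting Toolbox. This is a sketch under stated assumptions, not the authors' script: the vectors nvals and fillvals, holding the measured matrix sizes and average fills, are hypothetical names.

% Non-linear least-squares fit of the form a*n^alpha + b (assumes Curve Fitting Toolbox).
ft = fittype('a*x^alpha + b', 'independent', 'x', 'coefficients', {'a', 'alpha', 'b'});
[cf, gof] = fit(nvals(:), fillvals(:), ft, 'StartPoint', [1, 1.2, 0]);
ci = confint(cf, 0.95);        % 95% confidence bounds on a, alpha, b
[gof.rsquare, gof.rmse]        % R^2 and RMSE, as reported in Tables I and II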

B. 2-Dimensional Fast Fourier Transform (2D FFT)

Our next application computes the two-dimensional FFT of a dense square matrix. Calculating the 2D FFT in GRID*p involves writing four lines of simple MATLAB-like code:

A = randn(n, n*p);
B = mm('fft', A);
C = B.';
D = mm('fft', C);
F = D.';

The first line creates an n × n matrix that is distributed column-wise over the nodes (indicated by appending *p to the column dimension). The second and fourth lines invoke mm mode to spawn a MATLAB instance on each node, which computes the FFT of its columnar block. The transpose operations are handled by the ScaLAPACK library. Each transpose involves heavy communication between all the nodes in the system, since O(n^2) elements of the matrix are exchanged; the messaging in this stage can be characterized as lock-step synchronization. The matrix used in this example is generated randomly in a distributed fashion, so there are no data input or staging issues. (A serial sanity check of this transpose-based formulation is sketched at the end of this subsection.)

We first ran the application on our Grid testbed using 8 nodes split equally between UCSB and MIT. For this experiment, the MIT nodes came from an 8-node AMD Athlon 1.7 GHz PC cluster. The rest of the experimental setup was similar to the one used in the minimum degree experiments. Fig. 4 plots the total execution time against the size of the input matrix; the plot also shows the decomposition of the running time between the transpose and the actual FFT operations.

Fig. 4. Performance of the 2D FFT on an 8-node (4 MIT + 4 UCSB) Grid testbed: execution time (seconds) versus matrix size (× 1000), broken down into transpose and FFT phases.

The results clearly show that even though GRID*p achieves scalability, the execution times are very slow. More specifically, the running time is dominated by the transpose operations, which involve tightly coupled communication patterns. High latencies on the WAN link cause the transpose operations to dominate the overall running time, thereby negating the speedup achieved from using extra CPU resources. These results indicate that tightly coupled (or network-bound) applications scale very poorly on the Grid even if they scale in terms of CPU and memory usage.

Fig. 5. Comparative performance of the 2D FFT on the cluster (4 nodes) and the Grid testbed (8 nodes): execution time (seconds) versus matrix size (× 1000).

For comparison, we also ran the same experiments on the 4-node cluster at UCSB. Results from these runs are compared with the results on the Grid in Fig. 5. Since a local cluster computation does not suffer from high network latencies, it not only achieves good scalability but also much faster execution times. However, since the aggregate memory available on the Grid is significantly larger, GRID*p scales better for the memory-bound 2D FFT application. This can be seen for the case of the 11000 × 11000 dense matrix in the plot: there are no results for the cluster because such a large matrix cannot fit in its available aggregate memory. We discuss some possible approaches to improving the performance of network-bound applications under GRID*p in the future work section.
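As a sanity check (ours, added for illustration rather than part of the original evaluation), the transpose-based formulation above can be verified against MATLAB's built-in fft2 on a small local matrix; the variable names are illustrative:

n = 512;
A = randn(n, n);
B = fft(A);                 % FFT down each column
C = B.';                    % non-conjugate transpose
D = fft(C);                 % FFT down each column, i.e. along the original rows
F = D.';
relerr = norm(F - fft2(A), 'fro') / norm(fft2(A), 'fro')   % should be on the order of machine epsilon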
V. RELATED WORK

GRID*p differs from the relatively few other programming environments on the Grid in its use of a high-level programming language and its almost transparent parallelization of applications. GRID*p is also unique in that it allows users to view their parallel computations as operations on distributed data objects, which reside in persistent memory across the Grid. Thus, it differs markedly from a general-purpose problem solving environment like NetSolve [9]. NetSolve users launch remote computations, brokered by an agent, from various scientific computing environments (SCEs) such as MATLAB, Mathematica, etc. Even though the NetSolve agent hides the details of resource acquisition, its RPC-like model does not maintain the inputs and results of computations as distributed data handles, thereby limiting its scalability. The Cactus framework [10] is widely used in the astrophysics community to develop parallel applications (known as "thorns") on top of other Grid services. Cactus thorns are highly domain-specific and their development requires knowledge of a low-level programming language like C or Fortran. Both the Nimrod-G [11] and AppLeS Parameter
Sweep Template (APST) [12] projects target the specific class of parametric simulations. These frameworks enable the optimal execution of multiple instances of the same simulation, each of which independently sweeps through a different part of the parameter space. Although very domain-specific, this class of applications is very similar to our example of the highly task-parallel minimum degree ordering. For the sake of completeness, we also mention the Grid Application Development Software (GrADS) [13] project. GrADS is perhaps the most complete framework for developing Grid applications, integrating techniques such as static and dynamic optimization, grid-enabled libraries, application scheduling, contract-based monitoring, etc. However, the focus of the GrADS system is on the construction of a high-performance application runtime rather than on high-level parallel programming models.

VI. FUTURE WORK

Based on the results of our experiments, we would like to further refine and extend the design of GRID*p in several areas.

Although the results for highly task-parallel applications are promising, we can further optimize performance by making GRID*p more adaptive to runtime variability in resource quality. More specifically, since the concept of implicit data parallelism is integral to GRID*p, it is imperative that data partitioning reflects the heterogeneity of the underlying resources. We would like to investigate the use of irregular data distributions to map data (and thereby computation) based on performance forecasts from a resource monitoring system like the Network Weather Service (NWS) [15].

One of the main results of our tests was that parallel applications with very fine-grained and synchronized messaging schedules perform poorly on the Grid owing to the high packet delays on WANs. Their performance can be greatly improved by coordinating application-specific, network-locality-aware parallel algorithms with the runtime scheduler. However, we think that high-level topology-aware messaging primitives, as described in [14], provide a more general solution for mitigating the effects of WAN latencies. Our immediate future goal is to incorporate these primitives (which are provided by MPICH-G2) in our experiments and verify the improvements.

Another future goal is to extend our current mm mode implementation to facilitate higher task parallelism. The current syntax of mm mode restricts users to an SPMD-like construct, i.e. it allows users to supply their own data-parallel operations. Asynchronous task parallelism, however, would allow GRID*p to target a wider range of loosely-coupled parallel applications on the Grid. We are currently looking at nested data parallelism as a basis for our new design.

VII. CONCLUSION

In this paper, we presented GRID*p, a parallel interactive MATLAB-like programming and runtime environment built on top of MATLAB*p and the Globus Toolkit. GRID*p provides implicit data parallelism and an extremely simple primitive for achieving task parallelism, while encapsulating the underlying messaging and network complexity away from the user. To demonstrate the simplicity and performance of GRID*p, we experimented with two different classes of applications: a highly task-parallel stochastic simulation and a communication-intensive 2D FFT computation. Using the former, we characterized the average fill induced by minimum degree ordering, since determining its average- (and worst-) case asymptotic complexity remains a long-standing open problem.
In terms of performance, our results show that GRID*p achieves almost linear scalability for the highly task-parallel application with minimal effort. On the other hand, the 2D FFT application exhibits poor performance because it is bound by network delays. In other words, a very high computation-to-communication ratio is required for an application to overcome the high latencies imposed by WAN links on the Grid. To conclude, our experience with GRID*p demonstrates the feasibility of the Grid as a very promising platform for building loosely-coupled task-parallel applications.

REFERENCES

[1] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, 2nd ed., ser. Elsevier Series in Grid Computing. Morgan Kaufmann, 2003.
[2] R. Choy and A. Edelman, "Parallel MATLAB: Doing it Right," Proceedings of the IEEE, Special Issue on Program Generation, Optimization, and Adaptation, vol. 93, no. 2, pp. 331-341, Feb. 2005.
[3] I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit," The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, pp. 115-128, 1997.
[4] N. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface," Journal of Parallel and Distributed Computing, vol. 63, no. 5, pp. 551-563, May 2003.
[5] V. Shah and J. R. Gilbert, "Sparse Matrices in Matlab*P: Design and Implementation," in Proc. High Performance Computing (HiPC '04), ser. Lecture Notes in Computer Science, vol. 3296. Springer, Dec. 2004, pp. 144-155.
[6] A. George and J. W. H. Liu, "The Evolution of the Minimum Degree Ordering Algorithm," SIAM Review, vol. 31, no. 1, pp. 1-19, Mar. 1989.
[7] M. Yannakakis, "Computing the minimal fill-in is NP-complete," SIAM J. Alg. Disc. Meth., vol. 2, no. 1, pp. 77-79, Mar. 1981.
[8] P. Berman and G. Schnitger, "On the performance of the minimum degree ordering for Gaussian elimination," SIAM J. Matrix Anal. Appl., vol. 11, no. 1, pp. 83-88, Jan. 1990.
[9] D. Arnold, S. Agrawal, S. Blackford, J. Dongarra, M. Miller, K. Seymour, K. Sagi, Z. Shi, and S. Vadhiyar, "Users' Guide to NetSolve V1.4.1," University of Tennessee, Knoxville, TN, Innovative Computing Dept. Technical Report ICL-UT-02-05, June 2002.
[10] G. Allen, T. Dramlitsch, I. Foster, N. T. Karonis, M. Ripeanu, E. Seidel, and B. Toonen, "Supporting efficient execution in heterogeneous distributed computing environments with Cactus and Globus," in Proc. Supercomputing (SC '01), 2001.
[11] D. Abramson, J. Giddy, and L. Kotler, "High performance parametric modeling with Nimrod/G: killer application for the global Grid?" in Proc. International Parallel and Distributed Processing Symposium (IPDPS '00), May 2000, pp. 520-528.
[12] H. Casanova, G. Obertelli, F. Berman, and R. Wolski, "The AppLeS Parameter Sweep Template: User-level middleware for the grid," in Proc. Supercomputing (SC '00), Nov. 2000, pp. 75-76.
[13] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, D. Reed, L. Torczon, and R. Wolski, "The GrADS Project: Software Support for High-Level Grid Application Development," The International Journal of High Performance Computing Applications, vol. 15, no. 4, pp. 327-344, 2001.

[14] N. T. Karonis, B. R. de Supinski, I. Foster, W. Gropp, E. Lusk, and J. Bresnahan, "Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance," in Proc. International Parallel and Distributed Processing Symposium (IPDPS '00), May 2000, pp. 377-384.
[15] R. Wolski, N. Spring, and J. Hayes, "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing," Journal of Future Generation Computing Systems, vol. 15, no. 5-6, pp. 757-768, Oct. 1999.