GRID*p: Interactive Data-Parallel Programming on the Grid with MATLAB


Imran Patel and John R. Gilbert
Department of Computer Science, University of California, Santa Barbara
{imran,

Abstract— The Computational Grid has emerged as an attractive platform for developing large-scale distributed applications that run on heterogeneous computing resources. This scalability, however, comes at the cost of increased complexity: each application must handle the details of resource provisioning and of data and task scheduling. To address this problem, we present GRID*p, an interactive parallel system built on top of MATLAB*p and Globus that provides a MATLAB-like problem solving environment for the Grid. Applications under GRID*p achieve automatic data parallelism through the transparent use of distributed data objects, while the GRID*p runtime takes care of data partitioning, task scheduling and inter-node messaging. We evaluate the simplicity and performance of GRID*p with two different types of parallel applications consisting of matrix and graph computations, one of which investigates an open problem in combinatorial scientific computing. Our results show that GRID*p delivers promising performance for highly parallel applications while greatly simplifying their development.

I. INTRODUCTION

Recent advances in high performance computing have led to the availability of a large number of inexpensive processing and storage resources. This trend, coupled with the availability of high-speed networks, has led to the concept of the Computational Grid [1]. A Computational Grid enables the development of large-scale applications on top of loosely-coupled heterogeneous computing resources using a collection of middleware services. Grid applications utilize these services to access a diverse set of resources, thereby achieving high scalability and performance. But this scalability comes at the price of increased complexity in the application: in order to achieve high performance, each application has to handle complex issues such as scheduling tasks and data on its set of acquired resources.

Current research in Grid computing has mostly focused on middleware for locating, accessing and monitoring resources. A few components exist that broker resource provisioning and contracts based on the scheduling constraints of the application, but an application programmer still has to deal with a complex abstraction layer even for a relatively simple parallel computation. As a result of the lack of a higher-level programming model and an associated runtime environment, Grid applications remain hard to develop.

To address this problem, we present GRID*p, an interactive parallel system built on top of MATLAB*p [2] and Globus [3] that provides a MATLAB-like problem solving environment for the Grid. Applications under GRID*p achieve automatic data parallelism while the GRID*p runtime takes care of data partitioning, task scheduling and inter-node messaging. GRID*p relies on the abstract distributed object model of MATLAB*p to provide highly intuitive and almost transparent parallel constructs to the programmer.

This paper makes two key contributions. First, it presents an extremely powerful and simple-to-use interactive parallel programming environment that enables the development of implicitly data-parallel applications. Users interact with GRID*p through distributed matrices under the standard MATLAB client. These distributed objects are syntactically almost identical to standard MATLAB matrices.
However, ordinary MATLAB operations on these distributed matrices are automatically parallelized by GRID*p. As a result, even a less sophisticated user or programmer can achieve parallelism from the familiar interactive MATLAB shell with minimal effort. The underlying runtime achieves high performance while hiding the details of data layout and messaging even on a complex network topology. The second contribution is a demonstration of the feasibility of the system by experimentally evaluating two different flavors of parallel applications. In particular, we present a highly task-parallel stochastic simulation that has yielded improved results for an open problem in the field of combinatorial scientific computing.

The rest of this paper is organized as follows. In the following section we provide background on the Globus Toolkit, which is currently the de facto software standard for developing Grid applications. We also discuss the associated MPICH-G2 [4] toolkit, an implementation of the MPI standard on top of Globus. This is followed by a brief discussion of the cluster-based MATLAB*p system. We note some implementation issues with the current prototype of GRID*p in Section 3. Section 4 describes two parallel applications and the results of their experimental evaluation under GRID*p. Sections 5 and 6 discuss related and future work respectively. We conclude in Section 7.

II. BACKGROUND

A. Globus and MPICH-G2

The Globus Toolkit is a set of middleware services that export standardized interfaces to uniformly and securely access a diverse pool of computing resources. Specifically, these services provide facilities such as process management for

remote job execution, directory services for resource discovery, and high-performance data management for data transfer and staging. They are implemented in the form of UNIX daemon processes and as web services. For our purposes, the most important components are the Globus Resource Allocation and Management subsystem (GRAM) and the Globus Security Infrastructure (GSI). The GRAM module is responsible for remote job execution and management in Globus. It also manages the standard input, output and error streams associated with the job. GSI provides authentication and authorization of users and their jobs using facilities such as single sign-on and delegation. It employs the X.509 PKI standard to implement per-host and per-user security certificates for identification.

MPICH-G2 is an implementation of the MPI standard under Globus. It uses GRAM for launching remote jobs and messaging. It also provides unique features such as parallel TCP streams for high throughput and topology-enabled collective communications to counter high network latencies.

B. MATLAB*p

MATLAB*p is a MATLAB extension that provides parallel programming facilities for the MATLAB language. The extension connects the MATLAB client with an MPI-based parallel backend. MATLAB*p adds only a very minimal set of constructs to the MATLAB language, allowing users to parallelize their existing sequential programs with relative ease.

Parallel constructs in MATLAB*p are designed around the concept of a distributed object; all parallel operations primarily operate on distributed objects. For example, dense matrices are implemented as distributed objects (called ddense) with row, column or block-cyclic distribution of elements. Most of the operations on dense matrices are implemented at the backend using highly optimized parallel numerical libraries (also called packages) like ScaLAPACK. Similarly, a sparse matrix object (named dsparse) uses a row-distributed layout where each row is stored in compressed row format [5]. This design leads to syntax that is simple yet very powerful and expressive. In most cases, a user just needs to declare a matrix to be of distributed type; subsequent operations on that matrix, if provided by a package, achieve automatic data parallelism.

In cases where the user desires embarrassingly parallel SPMD behavior using a MATLAB function, MATLAB*p provides a multi-MATLAB mode (mm mode). Under this mode, MATLAB*p simply invokes a MATLAB shell on each node executing the specified function. If one of the arguments to the function is a distributed object, MATLAB*p converts the local chunk of the object at each node to a regular MATLAB matrix and passes it to the function. If required, results from the mm mode call are returned as distributed objects.

III. GRID*p: IMPLEMENTATION

We initially envisioned GRID*p as a straightforward port of MATLAB*p to the Grid based on the MPICH-G2 MPI library. However, network-layer complexities arising from firewalling and Network Address Translation (NAT) on the clusters made our prototype implementation considerably more complex. We discuss these issues later in this section.

Fig. 1. GRID*p architecture: the MATLAB client and MATLAB*p (matrix manager, package manager) sit above packages such as ScaLAPACK/BLACS, which in turn run over MPICH-G2 and the Globus services (GSI, GRAM).

Leaving network-layer issues aside, the high-level architecture of GRID*p is shown schematically in Fig. 1. As can be seen, GRID*p is essentially a MATLAB*p extension that uses MPICH-G2 and Globus as a substrate.
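To give a flavor of the programming model before turning to implementation details, the following is a minimal sketch of an interactive session. The *p distribution suffix and the mm call are as described above; the matrix size and the serial helper function myfun are illustrative assumptions, not part of a shipped package.

    % Minimal sketch of an interactive GRID*p session (size and the serial
    % helper function 'myfun' are illustrative).
    n = 4000;
    A = randn(n, n*p);    % appending *p distributes the column dimension (a ddense object)
    B = A.';              % ordinary MATLAB syntax; executed in parallel by the ScaLAPACK backend
    C = mm('myfun', B);   % mm mode: run myfun on each node's local chunk in a local MATLAB shell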
The current prototype of GRID*p supports the basic operations on ddense objects and a minimal subset of numerical linear algebra routines using the ScaLAPACK parallel numerical linear algebra package and its associated BLACS communication library. We plan to integrate more packages into GRID*p in the near future.

As mentioned earlier, an important feature of GRID*p is its ability to handle network communications in the presence of firewalls and network address translation (NAT). This feature is incorporated in GRID*p more out of necessity than design. Most of the commodity clusters deployed today allow public IP-layer access to only one machine (called the gateway or head node). The rest of the nodes in the cluster are assigned private IP addresses and are not visible from outside the network. Unfortunately, Globus only supports machines with a public IP address and DNS name. This limitation is due to the design of the X.509 PKI-based Grid security module (GSI). Each node in Globus is identified by an X.509 public key certificate that is tied to its public DNS name. This ensures that a node can be authenticated by comparing the DNS name in its certificate with the result of a reverse DNS lookup on its IP address. For example, when a node requests remote job execution on another node using GRAM, both of them perform the above-mentioned check. However, this mechanism prevents internal nodes with private IPs in a cluster from offering or receiving services in Globus.

Fig. 2. GRID*p messaging across two NAT'd clusters: UCSB (gateway G1, internal nodes A and B) and MIT (gateway G2, node C). G1's port-forwarding table maps ports 1000:2000 to Node-A and 2000:3000 to Node-B; flows are labeled with endpoint addresses of the form <srcIP:port, destIP:port>.

Our workaround for this restriction consists of two techniques. First, we copy the gateway node's certificate (associated with the cluster's sole public DNS name) onto each internal node. Second, Globus services on an internal node are advertised as running on specific TCP ports on the gateway node. Service requests arriving at a particular port number at the gateway node are then relayed to the designated internal node using port-forwarding rules in the firewall. For each request, a private node presents its gateway node's X.509 certificate to the remote node, and since the latter sees the gateway node as the endpoint at the IP layer, the reverse DNS check succeeds.

To better explain the complex message passing, we present an example in Fig. 2. The setup roughly resembles our Grid testbed, consisting of two clusters configured with NAT and firewalling enabled. It is assumed that outbound connections originating from within the cluster are allowed (which is usually the case in practice). Arrows represent network flows and the attached labels mark their network endpoint addresses in the form <srcIP:port, destIP:port>.

Now, consider the case when a job is launched across clusters (the dotted flow). To launch a job on node A, node C connects to the head node G1 on a specific port, which is then forwarded to the Globus GRAM component (the Gatekeeper) on node A. Node A presents G1's certificate to node C, and since the latter sees G1 as the endpoint of the connection, its reverse DNS check succeeds.

Quite surprisingly, the traffic flow is more complex when a job is launched within the same cluster (designated by solid arrows). To see why, consider the case where node A tries to launch a job on node B directly. Node A will present itself as G1 using G1's node certificate, but node B's reverse DNS query on node A's (private) IP address will not match G1's DNS name, and the job will fail. Thus, to launch a job on node B, node A has to use port forwarding through G1. Additionally, G1 rewrites the source address of the flow from node A, so that node B sees the connection as coming from G1. The reverse DNS check by B on the incoming connection then succeeds.

One drawback of our solution is that it requires the addition of NAT/firewall rules on each gateway node. Moreover, the use of relays can adversely affect communication latencies; this problem is all the more acute when an intra-cluster connection is routed through the gateway node. However, we believe that this solution, although inelegant, is very effective in integrating widely deployed firewalled clusters into a Globus-based Grid. We also note that this is the only solution known to us that allows Globus to be used in such a setting.

IV. EXPERIMENTS AND RESULTS

Our experiments were conducted on a computational Grid testbed consisting of two clusters: a 4-node cluster of Intel Xeon 2.0 GHz PCs at UCSB and a 16-node Intel Xeon 2.4 GHz PC cluster located at MIT. The UCSB nodes are equipped with 1 GB of RAM and the MIT nodes have 2 GB of RAM each. All nodes run the Linux operating system. Our applications had almost exclusive access to the clusters during the experimental runs. Each cluster is configured with iptables firewall rules to use NAT, with only one publicly visible DNS name assigned at the gateway node.
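For illustration, the port-forwarding relay of Fig. 2 could be expressed with iptables NAT rules of the following form on gateway G1. The internal addresses are hypothetical; the port ranges are those shown in the figure. This is a sketch of the idea, not our exact configuration.

    # Illustrative iptables NAT rules on gateway G1 (internal addresses hypothetical).
    # Relay gateway ports 1000-2000 to Node-A and 2000-3000 to Node-B:
    iptables -t nat -A PREROUTING -p tcp --dport 1000:2000 -j DNAT --to-destination 10.0.0.11
    iptables -t nat -A PREROUTING -p tcp --dport 2000:3000 -j DNAT --to-destination 10.0.0.12
    # Rewrite the source of relayed intra-cluster flows so Node-B sees G1 as the peer:
    iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE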
The wide area network (WAN) that connects the clusters at UCSB and MIT is part of the Abilene Internet2 backbone. We observed ping latencies of about ms across the WAN during our experiments.

A. Minimum degree ordering

In order to verify the scalability of GRID*p in the absence of network-layer latencies, we decided to experiment with highly task-parallel applications with negligible inter-node messaging. As a representative of this class of applications, we implemented a simple Monte Carlo-style simulation under GRID*p to investigate an open problem in combinatorial scientific computing. More specifically, we experimentally evaluated the performance of minimum degree ordering [6] on a regular grid graph during the symbolic factorization phase of a sparse matrix.

Since finding an ordering that minimizes fill for arbitrary graphs is computationally intractable [7], various heuristics and algorithms for specialized graphs are used. Minimum degree ordering is one such heuristic: at each step, it eliminates a vertex of minimum degree. For the regular grid (and the torus), minimum degree ordering with an ideal tie-breaking scheme has an optimal fill of O(n log n). However, Berman and Schnitger [8] devised a tie-breaking strategy that yields Θ(n^(log_3 4)) fill for a k × k torus graph with n = k^2 vertices. But determining the theoretical worst-case and average-case bound of minimum degree ordering for the regular grid (and the torus) remains an interesting open problem.
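For concreteness, a single fill measurement of the kind our simulation performs can be sketched in a few lines of MATLAB. The delsq/numgrid operator below is the 5-point model problem on a square grid rather than the torus used in our experiments, and the size k is illustrative.

    % One trial (sketch): fill produced by minimum degree ordering with
    % randomized tie-breaking on the 5-point model problem (square grid here;
    % the paper's experiments use the k-by-k torus).
    k  = 100;
    A  = delsq(numgrid('S', k+2));   % 5-point Laplacian on the k-by-k interior grid
    p0 = randperm(size(A, 1));       % random relabeling => randomized tie-breaking
    B  = A(p0, p0);
    p  = symamd(B);                  % MATLAB's symmetric approximate minimum degree
    R  = chol(B(p, p));              % Cholesky factor under the computed ordering
    fill = nnz(R) - nnz(triu(B));    % fill-in: nonzeros created by factorization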

We empirically evaluated the average-case performance of minimum degree ordering for sparse matrices corresponding to the 5-point model problem on a k × k torus. Each trial of our stochastic simulation consists of computing the minimum degree ordering of an initial random permutation of the k^2 × k^2 matrix. Since we use MATLAB's symmetric approximate minimum degree ordering (symamd), the initial permutation ensures randomized tie-breaking. Moreover, each instance of the symbolic factorization on a random permutation of the matrix is independent and hence can be conducted in parallel. A simple one-line mm call under GRID*p allows us to execute the trials in parallel:

f = mean(mm('mdstats', k, ntrials/np, seeds));

Here, np is a built-in variable whose value is bound to the number of processors in the system, and seeds is a distributed vector that holds seed values to initialize the random number generator on the nodes. The serial function mdstats performs ntrials/np symbolic factorizations in parallel on each node and returns the average fill from all nodes in a ddense vector of size np. Finally, the average fill is computed by the mean operation over the distributed vector. Each iteration is independent of the others and hence the problem is perfectly parallel. The only communication occurs in the final reduction stage, when all the samples are aggregated and their mean is computed. The matrices are generated randomly, so we do not have to deal with distribution of the input data.

To evaluate the performance of our application, we ran multiple simulations with a varying number of trials for a fixed-size sparse matrix. We determined the size of the matrix by calculating the largest matrix that could fit in the physical memory of a single node. We ran the experiments on a 12-node Grid testbed and a 3-node cluster. Fig. 3 plots the comparative running time of the simulations on the cluster and the Grid against the number of trials for a 2.5 million × 2.5 million matrix. The results show that the highly task-parallel nature of the application is instrumental in achieving an almost linear speedup: the Grid setting has four times as many nodes as the cluster and performs nearly four times faster. The number of results transferred over the network during the final reduction scales only with the number of trials, so the communication cost is low.

Fig. 3. Performance of min-degree ordering on a 2.5 million × 2.5 million matrix: running time (sec) versus number of trials on the Grid (12 nodes) and the cluster (3 nodes).

To further characterize the fill as a function of the size of the input matrix, we ran simulations on matrices varying in size from n = 10^4 upward in steps of 10^4. The generated data points were then used to determine the best functional forms using non-linear least-squares regression. Tables I and II list the parameters and statistics of the two best functional fits, a·n^α + b and a·n log^α n respectively. The values of a and α are listed with 95% confidence bounds along with goodness-of-fit statistics. R^2 values close to one indicate that the fitted curve explains almost all of the variation in the data. Each data point computed by a simulation is a mean over all the trials, so increasing the number of trials diminishes the effect of outliers.

TABLE I. Parameters of the fit a·n^α + b (n = 10^4 : 10^4 : ): columns are number of trials, a, α, R^2, RMSE.

TABLE II. Parameters of the fit a·n log^α n (n = 10^4 : 10^4 : ): columns are number of trials, a, α, R^2, RMSE.
As a result, we observe that the quality of the fits improves for simulations with a larger number of trials. Thus, based on the data, we can formulate the asymptotic average-case bound as n^1.15, or as n log^α n with the fitted α of Table II.

We make two major observations from our results. First, in practice the average-case fill generated by minimum degree ordering is worse than the O(n log n) optimal bound. Second, the average-case fill bound is actually better than the O(n^1.26) fill induced by the Berman-Schnitger tie-breaking construction. The experiments conducted by Berman and Schnitger were quite limited and inconclusive, owing to their choice of very small matrices and trial counts and the lack of a proper statistical methodology. Our parallel application allows us to experiment with a wider range of matrix sizes and simulation trials by utilizing the entire memory capacity and the aggregate CPU cycles of the Grid.
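The fits in Tables I and II can be reproduced with standard tooling. The following is a minimal sketch assuming MATLAB's Curve Fitting Toolbox; the variable names nsizes and fills for the measured data are hypothetical.

    % Sketch of the non-linear least-squares fit a*n^alpha + b (Curve Fitting
    % Toolbox assumed; nsizes and fills hold the measured data).
    ft = fittype('a*x^alpha + b');
    [model, gof] = fit(nsizes(:), fills(:), ft, 'StartPoint', [1, 1.2, 0]);
    % confint(model) gives the 95% confidence bounds on a and alpha;
    % gof.rsquare and gof.rmse correspond to the R^2 and RMSE columns of Table I.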

B. 2-Dimensional Fast Fourier Transform (2d FFT)

Our next application computes the 2-dimensional FFT of a dense square matrix. Calculating the 2d FFT in GRID*p involves writing four lines of simple MATLAB-like code:

A = randn(n, n*p);
B = mm('fft', A);
C = B.';
D = mm('fft', C);
F = D.';

The first line creates an n × n matrix that is distributed column-wise (indicated by appending *p to the column dimension) across the nodes. The second and fourth lines invoke mm mode to spawn a MATLAB instance on each node, which computes the FFT of its columnar block. The transpose operations are handled by the ScaLAPACK library. Each transpose involves heavy communication among all the nodes in the system, since O(n^2) elements of the matrix are exchanged; the messaging in this stage exhibits lock-step synchronization. The matrix used in this example is generated randomly in a distributed fashion, so there are no data input and staging issues.

We first ran the application on our Grid testbed using 8 nodes split equally between UCSB and MIT. For this experiment, the MIT nodes came from an 8-node AMD Athlon 1.7 GHz PC cluster; the rest of the experimental setup was the same as in the minimum degree experiments. Fig. 4 plots the total execution time against the size of the input matrix, decomposed into the time spent in the transpose and the actual FFT operations.

Fig. 4. Performance of the 2d FFT on an 8-node (4 MIT + 4 UCSB) Grid testbed: time versus matrix size (x 1000), split into transpose and FFT components.

The results clearly show that even though GRID*p achieves scalability, the execution times are very slow. More specifically, the running time is dominated by the transpose operations, which involve tightly-coupled communication patterns. High latencies on the WAN link cause the transposes to dominate the overall running time, negating the speedup gained from the extra CPU resources. These results show that tightly-coupled (network-bound) applications scale very poorly on the Grid even if they scale in terms of CPU and memory usage.

Fig. 5. Comparative performance of the 2d FFT on the Grid (8 nodes) and the cluster (4 nodes): time (sec) versus matrix size (x 1000).

For comparison, we also ran the same experiments on the 4-node cluster at UCSB. Results from these runs are compared with the Grid results in Fig. 5. Since a local cluster computation does not suffer from high network latencies, it not only achieves good scalability but also much faster execution times. However, since the aggregate memory available on the Grid is significantly larger, GRID*p scales better for the memory-bound 2d FFT application. This can be seen for the largest dense matrix in the plot: there are no cluster results because a matrix of that size cannot fit in the cluster's aggregate memory. We discuss some possible approaches to improving the performance of network-bound applications under GRID*p in the future work section.

V. RELATED WORK

GRID*p differs from the relatively few other programming environments for the Grid in its use of a high-level programming language and its almost transparent parallelization of applications. GRID*p is also unique in that it allows users to view their parallel computations as operations on distributed data objects, which reside in persistent memory across the Grid.
Thus, it markedly differs from a general-purpose problem solving environment like NetSolve [9]. NetSolve users launch remote computations, brokered by an agent, from various scientific computing environments (SCEs) like MATLAB, Mathematica, etc. Even though the NetSolve agent hides the details of resource acquisition, its RPC-like model does not maintain the inputs and results of computations as distributed data handles, which limits its scalability. The Cactus framework [10] is widely used in the astrophysics community to develop parallel applications (known as "thorns") on top of other Grid services. Cactus thorns are highly domain-specific and their development requires knowledge of a low-level programming language like C or Fortran. Both the Nimrod-G [11] and AppLeS Parameter

Sweep Template (APST) [12] projects target the specific class of parametric simulations. These frameworks enable the optimal execution of multiple instances of the same simulation, each independently sweeping through a different part of the parameter space. Although very domain-specific, this class of applications is quite similar to our example of the highly task-parallel minimum degree ordering. For the sake of completeness, we also mention the Grid Application Development Software (GrADS) [13] project. GrADS is perhaps the most complete framework for developing Grid applications, integrating techniques such as static and dynamic optimization, grid-enabled libraries, application scheduling and contract-based monitoring. However, the focus of the GrADS system is on the construction of a high-performance application runtime rather than on high-level parallel programming models.

VI. FUTURE WORK

Based on the results of our experiments, we would like to refine and extend the design of GRID*p in several areas:

Although the results for highly task-parallel applications are promising, we can further optimize performance by making GRID*p more adaptive to runtime variability in resource quality. More specifically, since implicit data parallelism is integral to GRID*p, it is imperative that data partitioning reflect the heterogeneity of the underlying resources. We would like to investigate the use of irregular data distributions to map data (and thereby computation) based on performance forecasts from a resource monitoring system such as the Network Weather Service (NWS) [15].

One of the main results of our tests was that parallel applications with very fine-grained and synchronized messaging schedules perform poorly on the Grid, owing to the high packet delays on WANs. Their performance can be greatly improved by coordinating network-locality-aware parallel algorithms with the runtime scheduler. However, we think that high-level topology-aware messaging primitives, as described in [14], provide a more general way to mitigate the effects of WAN latencies. Our immediate goal is to incorporate these primitives (which are provided by MPICH-G2) into our experiments and verify the improvements.

Another future goal is to extend our current mm mode implementation to facilitate higher task parallelism. The current syntax of mm mode restricts users to an SPMD-like construct, i.e., it allows users to supply only their own data-parallel operations. Asynchronous task parallelism would allow GRID*p to target a wider range of loosely-coupled parallel applications on the Grid. We are currently looking at nested data parallelism as a basis for our new design.

VII. CONCLUSION

In this paper, we presented GRID*p, a parallel interactive MATLAB-like programming and runtime environment built on top of MATLAB*p and the Globus Toolkit. GRID*p provides implicit data parallelism and an extremely simple primitive for achieving task parallelism, while encapsulating the underlying messaging and network complexity away from the user. To demonstrate the simplicity and performance of GRID*p, we experimented with two different classes of applications: a highly task-parallel stochastic simulation and a communication-intensive 2d FFT computation. Using the former, we characterized the average fill induced by minimum degree ordering, since determining its average-case (and worst-case) asymptotic complexity remains a long-standing open problem.
In terms of performance, our results show that GRID*p achieves almost linear scalability for the highly task-parallel application with minimal effort. On the other hand, the 2d FFT application exhibits poor performance because it is bound by network delays. In other words, a very high computation-to-communication ratio is required for an application to overcome the high latencies imposed by WAN links on the Grid. To conclude, our experience with GRID*p demonstrates the feasibility of the Grid as a very promising platform for building loosely-coupled task-parallel applications.

REFERENCES

[1] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, 2nd ed., ser. Elsevier Series in Grid Computing. Morgan Kaufmann, 2003.
[2] R. Choy and A. Edelman, "Parallel MATLAB: Doing it Right," Proceedings of the IEEE, Special Issue on Program Generation, Optimization, and Adaptation, vol. 93, no. 2, Feb. 2005.
[3] I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit," The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, 1997.
[4] N. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface," Journal of Parallel and Distributed Computing, vol. 63, no. 5, May 2003.
[5] V. Shah and J. R. Gilbert, "Sparse Matrices in Matlab*P: Design and Implementation," in Proc. High Performance Computing (HiPC 2004), ser. Lecture Notes in Computer Science. Springer, Dec. 2004.
[6] A. George and J. W. H. Liu, "The Evolution of the Minimum Degree Ordering Algorithm," SIAM Review, vol. 31, no. 1, pp. 1-19, Mar. 1989.
[7] M. Yannakakis, "Computing the minimal fill-in is NP-complete," SIAM J. Alg. Disc. Meth., vol. 2, no. 1, Mar. 1981.
[8] P. Berman and G. Schnitger, "On the performance of the minimum degree ordering for Gaussian elimination," SIAM J. Matrix Anal. Appl., vol. 11, no. 1, Jan. 1990.
[9] D. Arnold, S. Agrawal, S. Blackford, J. Dongarra, M. Miller, K. Seymour, K. Sagi, Z. Shi, and S. Vadhiyar, "Users' Guide to NetSolve V1.4.1," University of Tennessee, Knoxville, TN, Innovative Computing Dept. Technical Report ICL-UT-02-05, June 2002.
[10] G. Allen, T. Dramlitsch, I. Foster, N. T. Karonis, M. Ripeanu, E. Seidel, and B. Toonen, "Supporting efficient execution in heterogeneous distributed computing environments with Cactus and Globus," in Proc. Supercomputing (SC'01), 2001.
[11] D. Abramson, J. Giddy, and L. Kotler, "High performance parametric modeling with Nimrod/G: killer application for the global Grid?" in Proc. International Parallel and Distributed Processing Symposium (IPDPS'00), May 2000.
[12] H. Casanova, G. Obertelli, F. Berman, and R. Wolski, "The AppLeS Parameter Sweep Template: User-level middleware for the grid," in Proc. Supercomputing (SC'00), Nov. 2000.
[13] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, D. Reed, L. Torczon, and R. Wolski, "The GrADS Project: Software Support for High-Level Grid Application Development," The International Journal of High Performance Computing Applications, vol. 15, no. 4, 2001.

[14] N. T. Karonis, B. R. de Supinski, I. Foster, W. Gropp, E. Lusk, and J. Bresnahan, "Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance," in Proc. International Parallel and Distributed Processing Symposium (IPDPS'00), May 2000.
[15] R. Wolski, N. Spring, and J. Hayes, "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing," Journal of Future Generation Computing Systems, vol. 15, no. 5-6, Oct. 1999.


More information

Highly Latency Tolerant Gaussian Elimination

Highly Latency Tolerant Gaussian Elimination Highly Latency Tolerant Gaussian Elimination Toshio Endo University of Tokyo endo@logos.ic.i.u-tokyo.ac.jp Kenjiro Taura University of Tokyo/PRESTO, JST tau@logos.ic.i.u-tokyo.ac.jp Abstract Large latencies

More information

Improving the Dynamic Creation of Processes in MPI-2

Improving the Dynamic Creation of Processes in MPI-2 Improving the Dynamic Creation of Processes in MPI-2 Márcia C. Cera, Guilherme P. Pezzi, Elton N. Mathias, Nicolas Maillard, and Philippe O. A. Navaux Universidade Federal do Rio Grande do Sul, Instituto

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme Yue Zhang, Yunxia Pei To cite this version: Yue Zhang, Yunxia Pei. A Resource Discovery Algorithm in Mobile Grid Computing

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters

Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Kent Milfeld, Avijit Purkayastha, Chona Guiang Texas Advanced Computing Center The University of Texas Austin, Texas USA Abstract

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

GRB. Grid-JQA : Grid Java based Quality of service management by Active database. L. Mohammad Khanli M. Analoui. Abstract.

GRB. Grid-JQA : Grid Java based Quality of service management by Active database. L. Mohammad Khanli M. Analoui. Abstract. Grid-JQA : Grid Java based Quality of service management by Active database L. Mohammad Khanli M. Analoui Ph.D. student C.E. Dept. IUST Tehran, Iran Khanli@iust.ac.ir Assistant professor C.E. Dept. IUST

More information

C-Meter: A Framework for Performance Analysis of Computing Clouds

C-Meter: A Framework for Performance Analysis of Computing Clouds 9th IEEE/ACM International Symposium on Cluster Computing and the Grid C-Meter: A Framework for Performance Analysis of Computing Clouds Nezih Yigitbasi, Alexandru Iosup, and Dick Epema Delft University

More information

Improving Performance of Sparse Matrix-Vector Multiplication

Improving Performance of Sparse Matrix-Vector Multiplication Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science and Center of Simulation of Advanced Rockets University of Illinois at Urbana-Champaign

More information

Clustering and Reclustering HEP Data in Object Databases

Clustering and Reclustering HEP Data in Object Databases Clustering and Reclustering HEP Data in Object Databases Koen Holtman CERN EP division CH - Geneva 3, Switzerland We formulate principles for the clustering of data, applicable to both sequential HEP applications

More information

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,

More information

Scalability of Heterogeneous Computing

Scalability of Heterogeneous Computing Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor

More information

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Mostafa I. Soliman and Fatma S. Ahmed Computers and Systems Section, Electrical Engineering Department Aswan Faculty of Engineering,

More information