GRID*p: Interactive Data-Parallel Programming on the Grid with MATLAB
Imran Patel and John R. Gilbert
Department of Computer Science, University of California, Santa Barbara
{imran,

Abstract: The Computational Grid has emerged as an attractive platform for developing large-scale distributed applications that run on heterogeneous computing resources. This scalability, however, comes at the cost of increased complexity: each application has to handle the details of resource provisioning and data and task scheduling. To address this problem, we present GRID*p, an interactive parallel system built on top of MATLAB*p and Globus that provides a MATLAB-like problem solving environment for the Grid. Applications under GRID*p achieve automatic data parallelism through the transparent use of distributed data objects, while the GRID*p runtime takes care of data partitioning, task scheduling, and inter-node messaging. We evaluate the simplicity and performance of GRID*p with two different types of parallel applications consisting of matrix and graph computations, one of which investigates an open problem in combinatorial scientific computing. Our results show that GRID*p delivers promising performance for highly parallel applications while greatly simplifying their development.

I. INTRODUCTION

Recent advances in high performance computing have led to the availability of a large number of inexpensive processing and storage resources. This trend, coupled with the availability of high-speed networks, has led to the concept of the Computational Grid [1]. A Computational Grid enables the development of large-scale applications on top of loosely-coupled heterogeneous computing resources using a collection of middleware services. Grid applications utilize these services to access a diverse set of resources, thereby achieving high scalability and performance. But this scalability comes at the price of increased complexity in the application.
In order to achieve high performance, each application has to handle complex issues such as scheduling tasks and data on its set of acquired resources. Current research in Grid computing has mostly focused on middleware for locating, accessing, and monitoring resources. A few components exist that broker resource provisioning and contracts based on the scheduling constraints of the application. But an application programmer still has to deal with a complex abstraction layer, even for a relatively simple parallel computation. As a result of the lack of a higher-level programming model and an associated runtime environment, Grid applications remain hard to develop.

To address this problem, we present GRID*p, an interactive parallel system built on top of MATLAB*p [2] and Globus [3] that provides a MATLAB-like problem solving environment for the Grid. Applications under GRID*p achieve automatic data parallelism while the GRID*p runtime takes care of data partitioning, task scheduling, and inter-node messaging. GRID*p relies on the abstract distributed object model of MATLAB*p to provide highly intuitive and almost transparent parallel constructs to the programmer.

This paper makes two key contributions. First, it presents an extremely powerful and simple-to-use interactive parallel programming environment that enables the development of implicitly data-parallel applications. Users interact with GRID*p by using distributed matrices under the standard MATLAB client. These distributed objects are syntactically almost identical to standard MATLAB matrices; however, regular MATLAB operations on these distributed matrices are automatically parallelized by GRID*p. As a result, even a less sophisticated user can achieve parallelism from the familiar interactive MATLAB shell with minimal effort. The underlying runtime achieves high performance while hiding the details of data layout and messaging, even on a complex network topology.
The second contribution is a demonstration of the feasibility of the system through an experimental evaluation of two different flavors of parallel applications. In particular, we present a highly task-parallel stochastic simulation that has yielded improved results for an open problem in the field of combinatorial scientific computing.

The rest of this paper is organized as follows. In the following section we provide background on the Globus Toolkit, which is currently the de facto software standard for developing Grid applications. We also discuss the associated MPICH-G2 [4] toolkit, an implementation of the MPI standard on top of Globus. This is followed by a brief discussion of the cluster-based MATLAB*p system. We note some implementation issues with the current prototype of GRID*p in Section 3. Section 4 describes two parallel applications and the results of their experimental evaluation under GRID*p. Sections 5 and 6 discuss related and future work, respectively. We conclude in Section 7.

II. BACKGROUND

A. Globus and MPICH-G2

The Globus Toolkit is a set of middleware services that export standardized interfaces to uniformly and securely access a diverse pool of computing resources. Specifically, these services provide facilities such as process management for
remote job execution, directory services for resource discovery, and high-performance data management for data transfer and staging. They are implemented as UNIX daemon processes and as web services. For our purposes, the most important components are the Globus Resource Allocation and Management subsystem (GRAM) and the Globus Security Infrastructure (GSI). The GRAM module is responsible for remote job execution and management in Globus. It also manages the standard input, output, and error streams associated with a job. GSI provides authentication and authorization of users and their jobs using facilities such as single sign-on and delegation. It employs the X.509 PKI standard to implement per-host and per-user security certificates for identification. MPICH-G2 is an implementation of the MPI standard under Globus. It uses GRAM for launching remote jobs and messaging. It also provides unique features such as parallel TCP streams for high throughput and topology-enabled collective communications to counter high network latencies.

B. MATLAB*p

MATLAB*p is a MATLAB extension that provides parallel programming facilities for the MATLAB language. The MATLAB*p extension connects the MATLAB client with an MPI-based parallel backend. MATLAB*p adds only a very minimal set of constructs to the MATLAB language, allowing users to quickly parallelize their existing sequential programs with relative ease. Parallel constructs in MATLAB*p are designed around the concept of a distributed object, and all parallel operations in MATLAB*p primarily operate on distributed objects. For example, dense matrices are implemented as distributed objects (called ddense) with row, column, or block-cyclic distribution of elements. Most of the operations on dense matrices are implemented at the backend using highly optimized parallel numerical libraries such as ScaLAPACK (referred to as packages).
Similarly, a sparse matrix object (named dsparse) uses a row-distributed layout in which each row is stored in compressed row format [5]. This design leads to syntax that is simple yet very powerful and expressive. In most cases, a user just needs to declare a matrix to be of distributed type; subsequent operations on that matrix, if provided by a package, achieve automatic data parallelism. In cases where the user desires embarrassingly parallel SPMD behavior using a MATLAB function, MATLAB*p provides a multi-MATLAB mode (mm mode). Under this mode, MATLAB*p simply invokes a MATLAB shell on each node to execute the specified function. If one of the arguments to the function is a distributed object, MATLAB*p converts the local chunk of the object on each node to a regular MATLAB matrix and passes it to the function. If required, results from the mm mode call are returned as distributed objects.

III. GRID*p: IMPLEMENTATION

We initially envisioned GRID*p as a straightforward port of MATLAB*p to the Grid based on the MPICH-G2 MPI library. However, network-layer complexities arising from firewalling and Network Address Translation (NAT) on the clusters made our prototype implementation much more complex. We discuss these issues later in this section.

Fig. 1. GRID*p architecture: the MATLAB client with its Matrix Manager and Package Manager sits on top of MATLAB*p, which uses ScaLAPACK/BLACS and other packages over MPICH-G2 and the Globus services (GSI, GRAM).

Leaving network-layer issues aside, the high-level architecture of GRID*p is represented schematically in Fig. 1. As can be seen, GRID*p is essentially a MATLAB*p extension that uses MPICH-G2 and Globus as a substrate. The current prototype of GRID*p supports the basic operations on ddense objects and a minimal subset of numerical linear algebra routines using the ScaLAPACK parallel numerical linear algebra package and its associated BLACS communication library. We plan to integrate more packages into GRID*p in the near future.
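To make the mm-mode semantics concrete, the chunk-apply-gather pattern can be sketched in a few lines of ordinary Python. This toy, single-process analogue and its function names are ours, not part of MATLAB*p: a "distributed" object is modeled as a list of per-node chunks, and mm applies a serial function to each node's local chunk.

```python
def distribute(xs, nprocs):
    """Split xs into nprocs roughly equal contiguous chunks, one per 'node'
    (a stand-in for MATLAB*p's row/column distribution of a matrix)."""
    q, r = divmod(len(xs), nprocs)
    chunks, start = [], 0
    for i in range(nprocs):
        size = q + (1 if i < r else 0)
        chunks.append(xs[start:start + size])
        start += size
    return chunks

def mm(func, dist_obj):
    """Apply func independently to each local chunk (SPMD style), returning
    the per-node results as a new 'distributed' object."""
    return [func(chunk) for chunk in dist_obj]

# Usage: per-node partial sums of a distributed vector, then a reduction.
d = distribute(list(range(10)), 3)   # chunks [0..3], [4..6], [7..9]
partial = mm(sum, d)                 # one result per "node"
total = sum(partial)                 # final reduction
```

The real mm mode differs in that each chunk is handed to a separate MATLAB process on a remote node, but the data flow (scatter, independent apply, gather) is the same.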
As mentioned earlier, an important feature of GRID*p is its ability to handle network communications in the presence of firewalls and network address translation (NAT). This feature is incorporated in GRID*p more out of necessity than by design. Most commodity clusters deployed today allow public IP-layer access to only one machine (called the gateway or head node). The rest of the nodes in the cluster are assigned private IP addresses and are not visible from outside the network. Unfortunately, Globus only supports machines with a public IP address and DNS name. This limitation is due to the design of the X.509 PKI-based Grid security module (GSI). Each node in Globus is identified by an X.509 public key certificate that is tied to its public DNS name. This ensures that a node can be authenticated by comparing the DNS name in its certificate with the result of a reverse DNS lookup on its IP address. For example, when a node requests remote job execution on another node using GRAM, both nodes perform the above-mentioned check. However, this mechanism prevents internal nodes with private IPs in a cluster from offering or receiving services in Globus.
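The authentication check that causes this limitation can be sketched as follows. This is an illustrative Python sketch of the comparison GSI performs, not Globus code; the function name is ours, and the resolver is injectable so the logic can be exercised without a real PTR record.

```python
import socket

def reverse_dns_check(peer_ip, cert_dns_name, resolver=socket.gethostbyaddr):
    """Accept a peer only if the DNS name in its X.509 certificate matches
    a reverse DNS lookup on the IP address the connection came from --
    the check performed on both ends of a GRAM request."""
    try:
        hostname, aliases, _ = resolver(peer_ip)
    except OSError:
        return False  # no PTR record: authentication fails
    names = {hostname.lower(), *(a.lower() for a in aliases)}
    return cert_dns_name.lower() in names
```

A node behind NAT fails this check because its private IP has no public PTR record matching the certificate; the workaround described next makes the gateway the visible endpoint so the lookup resolves to the gateway's name.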
Fig. 2. GRID*p messaging between two NAT'd clusters, UCSB (gateway G1) and MIT (gateway G2), with internal nodes A, B, and C. Each gateway maintains a port-forwarding table mapping port ranges to internal nodes (e.g., 1000:2000 to Node-A, 2000:3000 to Node-B), and flow labels of the form <src:port, dest:port> (e.g., <G1:y, B:2119>) mark the endpoints of each connection.

Our workaround for this restriction consists of two techniques. First, we copy the gateway node's certificate (associated with the cluster's sole public DNS name) onto each internal node. Second, Globus services on an internal node are advertised as running on specific TCP ports on the gateway node. Service requests arriving at a particular port number at the gateway node are then relayed to the designated internal node using port-forwarding rules in the firewall. For each request, a private node presents its gateway node's X.509 certificate to the remote node; since the latter sees the gateway node as the endpoint at the IP layer, the reverse DNS check succeeds.

To better explain this message passing, we present an example in Fig. 2. The setup roughly resembles our Grid testbed, consisting of two clusters configured with NAT and firewalling enabled. It is assumed that outbound connections originating from within a cluster are allowed (which is usually the case in practice). Arrows represent network flows, and the attached labels mark their network endpoint addresses in the form <src-ip:port, dest-ip:port>. Now, consider the case when a job is launched across clusters (the dotted flow). To launch a job on node A, node C connects to the head node G1 on a specific port, which is then forwarded to the Globus GRAM component (the Gatekeeper) on node A. Node A presents G1's certificate to node C, and since the latter sees G1 as the endpoint of the connection, its reverse DNS check succeeds. Quite surprisingly, the traffic flow is more complex when a job is launched within the same cluster (designated by solid arrows). To see why, consider the case where node A tries to launch a job on node B directly. Node A will present itself as G1 using G1's node certificate.
But node B's reverse DNS query on node A's (private) IP address will not match G1's DNS name, and the job will fail. Thus, to launch a job on node B, node A has to use port forwarding through G1. Additionally, G1 alters the source address of the flow from node A, so that node B sees the connection as coming from G1. The reverse DNS check by B on the incoming connection then succeeds.

One of the drawbacks of our solution is that it requires the addition of NAT/firewall rules on each gateway node. Moreover, the use of relays can adversely affect communication latencies. This problem is all the more acute when an intra-cluster connection is routed through the gateway node. However, we believe that this solution, although inelegant, is very effective in integrating widely deployed firewalled clusters into a Globus-based Grid. We also note that this is the only solution known to us that allows Globus to be used in such a setting.

IV. EXPERIMENTS AND RESULTS

Our experiments were conducted on a computational Grid testbed consisting of two computer clusters: a 4-node cluster of Intel Xeon 2.0 GHz PCs at UCSB and a 16-node Intel Xeon 2.4 GHz PC cluster located at MIT. The UCSB nodes are equipped with 1 GB of RAM and the MIT nodes have 2 GB of RAM each. All the nodes run the Linux operating system. Our applications had almost exclusive access to the clusters during the experimental runs. Each cluster is configured with iptables firewall rules to use NAT, with only one publicly visible DNS name assigned to the gateway node. The wide-area network (WAN) that connects the clusters at UCSB and MIT is part of the Abilene Internet2 backbone. We observed ping latencies of about ms across the WAN during our experiments.

A. Minimum degree ordering

In order to verify the scalability of GRID*p in the absence of network-layer latencies, we decided to experiment with highly task-parallel applications with negligible inter-node messaging.
As a representative of this class of applications, we implemented a simple Monte Carlo-like simulation under GRID*p to investigate an open problem in combinatorial scientific computing. More specifically, we experimentally evaluated the fill produced by minimum degree ordering [6] on a regular grid graph during the symbolic factorization phase of sparse matrix factorization. Since finding an ordering that minimizes fill for arbitrary graphs is computationally intractable [7], various heuristics and algorithms for specialized graphs are used. Minimum degree ordering is one such heuristic: at each step it eliminates a vertex of minimum degree. For the regular grid (and the torus), minimum degree ordering with an ideal tie-breaking scheme has an optimal fill of O(n log n). However, Berman and Schnitger [8] devised a tie-breaking strategy that yields Θ(n^(log_3 4)) ≈ Θ(n^1.26) fill for a k × k torus graph with n = k^2 vertices. But determining the theoretical worst-case and average-case
bound of minimum degree ordering for the regular grid (and the torus) still remains an interesting open problem.

TABLE I. Least-squares fits of the form a·n^α + b: fitted a and α with 95% confidence bounds, R^2, and RMSE, for each number of trials.

TABLE II. Least-squares fits of the form a·n·log^α(n): fitted a and α with 95% confidence bounds, R^2, and RMSE, for each number of trials.

Fig. 3. Performance of min-degree ordering on a 2.5M × 2.5M matrix: running time (sec) vs. number of trials, on the Grid (12 nodes) and the cluster (3 nodes).

We empirically evaluated the average-case performance of minimum degree ordering for sparse matrices corresponding to the 5-point model problem on a k × k torus. Each trial of our stochastic simulation consists of computing the minimum degree ordering of an initial random permutation of the k^2 × k^2 matrix. Since we use MATLAB's symmetric approximate minimum degree ordering (symamd), the initial permutation ensures randomized tie-breaking. Moreover, each instance of the symbolic factorization on a random permutation of the matrix is independent and hence can be conducted in parallel. A simple one-line mm call under GRID*p allows us to execute the trials in parallel:

f = mean(mm('mdstats', k, ntrials/np, seeds));

Here, np is a built-in variable whose value is bound to the number of processors in the system, and seeds is a distributed vector that holds seed values used to initialize the random number generator on each node. The serial function mdstats performs ntrials/np symbolic factorizations on each node; the per-node average fills are returned in a ddense vector of size np. Finally, the overall average fill is computed by the mean operation over the distributed vector. Each trial is independent of the others and hence the problem is perfectly parallel. The only communication occurs in the final reduction stage, when all the samples are aggregated and their mean is computed. The matrices are generated randomly on the nodes, and hence we do not have to deal with distributing the input data.
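The structure of this computation, independent seeded trials followed by a single mean reduction, can also be sketched in ordinary Python. The following analogue of the one-line mm call is ours, not part of GRID*p, and a toy statistic stands in for the symamd fill count of a randomly permuted model-problem matrix:

```python
import random
from multiprocessing import Pool

def mdstats(args):
    """Stand-in for the paper's serial per-node worker: run this node's
    share of seeded, independent trials and return their local mean."""
    seed, ntrials_local = args
    rng = random.Random(seed)
    # Toy statistic in place of the fill count of one symbolic
    # factorization trial.
    samples = [rng.random() for _ in range(ntrials_local)]
    return sum(samples) / len(samples)

def run_simulation(ntrials, nprocs):
    """Analogue of f = mean(mm('mdstats', k, ntrials/np, seeds)): scatter
    one seed per 'node', run trials independently, reduce once at the end."""
    work = [(seed, ntrials // nprocs) for seed in range(nprocs)]
    with Pool(nprocs) as pool:
        local_means = pool.map(mdstats, work)  # the only communication
    return sum(local_means) / len(local_means)
```

Each worker gets its own RNG seed, mirroring the seeds vector above; the only data crossing process boundaries is the final gather of per-node means, which is why the application scales so well.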
To evaluate the performance of our application, we ran multiple simulations with a varying number of trials for a fixed-size sparse matrix. We chose the largest matrix that could fit in the physical memory of a single node. We ran the experiments on a 12-node Grid testbed and a 3-node cluster. Fig. 3 plots the comparative running time of the simulations on the cluster and the Grid against the number of trials for a 2.5M × 2.5M matrix. The results show that the highly task-parallel nature of the application is instrumental in achieving an almost linear speedup: the Grid setting has four times as many nodes as the cluster and performs nearly four times faster. Only the per-node averages are transferred over the network during the final reduction, and hence the communication cost is low.

To further characterize the fill as a function of the size of the input matrix, we ran simulations on matrices varying in size from n = 10^4 upward in steps of 10^4. The generated data points were then used to determine the best functional forms using nonlinear least-squares regression. Tables I and II list the parameters and statistics of the two best functional fits: a·n^α + b and a·n·log^α(n), respectively. The values of α and a are listed with 95% confidence bounds, along with goodness-of-fit statistics. R^2 values close to one indicate that the fitted curve explains almost all of the variation in the data. Each data point computed by a simulation is a mean over all the trials, so increasing the number of trials diminishes the effect of outliers. As a result, we observe that the quality of the fits improves for simulations with a larger number of trials. Thus, based on the data, we can formulate the asymptotic average-case bound as n^1.15, or as n·log^α(n) with the fitted α. We make two major observations from our results.
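The curve-fitting step can be illustrated with a simplified version of the regression. The sketch below is ours and is a simplification of the paper's methodology: it estimates the exponent of a pure power law a·n^α by ordinary least squares in log-log space, ignoring the additive b term and the confidence-bound machinery of the nonlinear fit.

```python
import math

def fit_power_law(ns, ys):
    """Estimate (a, alpha) for y ≈ a * n**alpha via ordinary least squares
    on log y = log a + alpha * log n."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    m = len(xs)
    xbar, lbar = sum(xs) / m, sum(ls) / m
    alpha = (sum((x - xbar) * (l - lbar) for x, l in zip(xs, ls))
             / sum((x - xbar) ** 2 for x in xs))
    a = math.exp(lbar - alpha * xbar)
    return a, alpha

# Synthetic data with a known exponent of 1.15, mimicking fill-vs-n samples:
ns = [10_000 * k for k in range(1, 11)]
ys = [3.0 * n ** 1.15 for n in ns]
a, alpha = fit_power_law(ns, ys)
```

On noiseless synthetic data the fit recovers the planted exponent exactly; on the real averaged fill data, the scatter left after averaging many trials is what the R^2 and RMSE columns of Tables I and II quantify.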
First, we observe that, in practice, the average-case fill generated by minimum degree ordering is worse than the O(n log n) optimal bound. Second, the average-case fill bound is actually better than the O(n^1.26) fill induced by the Berman-Schnitger tie-breaking construction. The experiments conducted by Berman and Schnitger were quite limited and inconclusive due to their choice of very small matrices and trial counts and the lack of a proper statistical methodology. Our parallel application allows us to experiment with a wider range of matrix sizes and simulation trials by utilizing, respectively, the entire memory capacity and the aggregate CPU cycles of the Grid.
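The quantity being measured in these experiments can be made concrete with a toy implementation of the heuristic itself. The sketch below is ours, not the symamd routine the experiments actually call: it eliminates a minimum-degree vertex at each step and counts the fill edges created, with ties broken by vertex label as a stand-in for the randomized tie-breaking the paper obtains by permuting the matrix.

```python
def min_degree_fill(adj):
    """Count fill edges created by minimum-degree elimination.
    adj maps vertex -> set of neighbours (an undirected graph, i.e. the
    nonzero structure of a symmetric sparse matrix)."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    fill = 0
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))  # min-degree vertex
        nbrs = sorted(adj.pop(v))
        for u in nbrs:
            adj[u].discard(v)
        # Eliminating v turns its remaining neighbours into a clique;
        # each edge added to form that clique is one fill edge.
        for i, u in enumerate(nbrs):
            for w in nbrs[i + 1:]:
                if w not in adj[u]:
                    adj[u].add(w)
                    adj[w].add(u)
                    fill += 1
    return fill
```

For example, a path graph (a tridiagonal matrix) incurs no fill at all, while eliminating any vertex of a 4-cycle forces one fill edge between its two neighbours.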
B. 2-Dimensional Fast Fourier Transform (2d FFT)

Our next application computes the 2-dimensional FFT of a dense square matrix. Calculating the 2d FFT in GRID*p involves writing a few lines of simple MATLAB-like code:

A = randn(n, n*p);
B = mm('fft', A);
C = B.';
D = mm('fft', C);
F = D.';

The first line creates an n × n matrix that is distributed column-wise among the nodes (indicated by appending *p to the column dimension). The second and fourth lines invoke the mm mode to spawn a MATLAB instance on each node, which computes the FFT of its columnar block. The transpose operations are handled by the ScaLAPACK library. Each transpose involves heavy communication among all the nodes in the system, since O(n^2) elements of the matrix are exchanged. The messaging in this stage can be characterized as lock-step synchronization. The matrix used in this example is generated randomly in a distributed fashion, and hence there are no data input and staging issues.

We first ran the application on our Grid testbed using 8 nodes split equally between UCSB and MIT. For this experiment, the MIT nodes came from an 8-node AMD Athlon 1.7 GHz PC cluster. The rest of the experimental setup was similar to the one used in the minimum degree experiments. Fig. 4 plots the total execution time against the size of the input matrix, and also decomposes the running time between the transpose and the actual FFT operations.

Fig. 4. Performance of the 2d FFT on an 8-node (4 MIT + 4 UCSB) Grid testbed: execution time vs. matrix size (x 1000), broken down into transpose and FFT phases.

The results clearly show that even though GRID*p scales, the absolute execution times are very high. More specifically, the running time is dominated by the transpose operations, which involve tightly-coupled communication patterns.
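The program above relies on the standard decomposition of the 2-D transform into 1-D transforms along one dimension, a transpose, 1-D transforms again, and a transpose back, with the transposes carrying all the communication. The identity can be checked in pure Python with a naive O(n^2) DFT standing in for the FFT; this sketch and its function names are ours, for illustration only.

```python
import cmath

def dft(xs):
    """Naive 1-D DFT (O(n^2)); stands in for an FFT for self-containment."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * j / n)
                for j, x in enumerate(xs)) for k in range(n)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def dft2_by_rows(m):
    """2-D DFT as: 1-D DFTs, transpose, 1-D DFTs, transpose -- the same
    decomposition the GRID*p program uses."""
    return transpose([dft(r) for r in transpose([dft(r) for r in m])])

def dft2_direct(m):
    """2-D DFT from its double-sum definition, for comparison."""
    n = len(m)
    return [[sum(m[r][c] * cmath.exp(-2j * cmath.pi * (k * r + l * c) / n)
                 for r in range(n) for c in range(n))
             for l in range(n)] for k in range(n)]
```

Both routines produce the same transform; in GRID*p the per-row (per-block) transforms are the cheap, perfectly parallel part, and the transposes between them are the all-to-all exchanges that dominate the running time on the WAN.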
High latencies on the WAN link cause the transpose operations to dominate the overall running time, thereby negating the speedup achieved from using extra CPU resources. These results show that tightly-coupled (network-bound) applications scale very poorly on the Grid even if they scale in terms of CPU and memory usage.

Fig. 5. Comparative performance of the 2d FFT on a cluster (4 nodes) and a Grid testbed (8 nodes): execution time (sec) vs. matrix size (x 1000).

For comparison, we also ran the same experiments on the 4-node cluster at UCSB; the results are compared with those on the Grid in Fig. 5. Since a local cluster computation doesn't suffer from high network latencies, it not only achieves good scalability but also much faster execution times. However, since the aggregate memory available on the Grid is significantly larger, GRID*p scales better for the memory-bound 2d FFT application. This can be seen for the largest matrix size in the plot: there are no results for the cluster because such a large matrix cannot fit in its available aggregate memory. We discuss some possible approaches to improving the performance of network-bound applications under GRID*p in the future work section.

V. RELATED WORK

GRID*p differs from the relatively few other programming environments for the Grid in its use of a high-level programming language and its almost transparent parallelization of applications. GRID*p is also unique in that it allows users to view their parallel computations as operations on distributed data objects, which reside in persistent memory across the Grid. Thus, it markedly differs from a general-purpose problem solving environment like NetSolve [9]. NetSolve users launch remote computations brokered by an agent from various scientific computing environments (SCEs) like MATLAB, Mathematica, etc.
Even though the NetSolve agent hides the details of resource acquisition, its RPC-like model does not maintain the inputs and results of computations as distributed data handles, which limits its scalability. The Cactus framework [10] is widely used in the astrophysics community to develop parallel applications (known as "thorns") on top of other Grid services. Cactus thorns are highly domain-specific and their development requires knowledge of a low-level programming language like C or Fortran. Both the Nimrod-G [11] and AppLeS Parameter
Sweep Template (APST) [12] projects target the specific class of parametric simulations. These frameworks enable the optimal execution of multiple instances of the same simulation, each of which independently sweeps through a different part of the parameter space. Although very domain-specific, this class of applications is quite similar to our example of the highly task-parallel minimum degree ordering. For the sake of completeness, we also mention the Grid Application Development Software (GrADS) [13] project. GrADS is perhaps the most complete framework for developing Grid applications, integrating various techniques such as static and dynamic optimization, grid-enabled libraries, application scheduling, and contract-based monitoring. However, the focus of the GrADS system is on the construction of a high-performance application runtime rather than on high-level parallel programming models.

VI. FUTURE WORK

Based on the results of our experiments, we would like to further refine and extend the design of GRID*p in several areas. Although the results for highly task-parallel applications are promising, we can further optimize performance by making GRID*p more adaptive to runtime variability in resource quality. More specifically, since implicit data parallelism is integral to GRID*p, it is imperative that data partitioning reflect the heterogeneity of the underlying resources. We would like to investigate the use of irregular data distributions to map data (and thereby computation) based on performance forecasts from a resource monitoring system like the Network Weather Service (NWS) [15]. One of the main results of our tests was that parallel applications with very fine-grained and synchronized messaging schedules perform poorly on the Grid, owing to the high packet delays on WANs. Their performance can be greatly improved by coordinating application-specific, network-locality-aware parallel algorithms with the runtime scheduler.
However, we think that high-level topology-aware messaging primitives, as described in [14], provide a more general solution for mitigating the effects of WAN latencies. Our immediate future goal is to incorporate these primitives (which are provided by MPICH-G2) into our experiments and verify the improvements. Another future goal is to extend our current mm mode implementation to facilitate higher task parallelism. The current syntax of mm mode restricts users to an SPMD-like construct, i.e., users can only supply their own data-parallel operations. However, asynchronous task parallelism would allow GRID*p to target a wider range of loosely-coupled parallel applications on the Grid. We are currently looking at nested data parallelism as a basis for our new design.

VII. CONCLUSION

In this paper, we presented GRID*p, a parallel interactive MATLAB-like programming and runtime environment built on top of MATLAB*p and the Globus Toolkit. GRID*p provides implicit data parallelism and an extremely simple primitive for achieving task parallelism, while hiding the underlying messaging and network complexity from the user. To demonstrate the simplicity and performance of GRID*p, we experimented with two different classes of applications: a highly task-parallel stochastic simulation and a communication-intensive 2d FFT computation. Using the former, we characterized the average fill induced by minimum degree ordering, since determining its average (and worst) case asymptotic complexity remains a long-standing open problem. In terms of performance, our results show that GRID*p achieves almost linear scalability for the highly task-parallel application with minimal effort. On the other hand, the 2d FFT application exhibits poor performance, as it is bounded by network delays. In other words, a very high computation-to-communication ratio is required for an application to overcome the high latencies imposed by WAN links on the Grid.
To conclude, our experience with GRID*p demonstrates the feasibility of the Grid as a very promising platform for building loosely-coupled task-parallel applications.

REFERENCES

[1] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, 2nd ed., ser. Elsevier Series in Grid Computing. Morgan Kaufmann.
[2] R. Choy and A. Edelman, "Parallel MATLAB: Doing it Right," Proceedings of the IEEE, Special Issue on Program Generation, Optimization, and Adaptation, vol. 93, no. 2, Feb.
[3] I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit," The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2.
[4] N. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface," Journal of Parallel and Distributed Computing, vol. 63, no. 5, May.
[5] V. Shah and J. R. Gilbert, "Sparse Matrices in Matlab*P: Design and Implementation," in Proc. High Performance Computing (HiPC 04), ser. Lecture Notes in Computer Science. Springer, Dec. 2004.
[6] A. George and J. W. H. Liu, "The Evolution of the Minimum Degree Ordering Algorithm," SIAM Review, vol. 31, no. 1, pp. 1-19, Mar.
[7] M. Yannakakis, "Computing the minimal fill-in is NP-complete," SIAM J. Alg. Disc. Meth., vol. 2, no. 1, Mar.
[8] P. Berman and G. Schnitger, "On the performance of the minimum degree ordering for Gaussian elimination," SIAM J. Matrix Anal. Appl., vol. 11, no. 1, Jan.
[9] D. Arnold, S. Agrawal, S. Blackford, J. Dongarra, M. Miller, K. Seymour, K. Sagi, Z. Shi, and S. Vadhiyar, "Users' Guide to NetSolve V1.4.1," University of Tennessee, Knoxville, TN, Innovative Computing Dept. Technical Report ICL-UT-02-05, June.
[10] G. Allen, T. Dramlitsch, I. Foster, N. T. Karonis, M. Ripeanu, E. Seidel, and B. Toonen, "Supporting efficient execution in heterogeneous distributed computing environments with Cactus and Globus," in Proc. Supercomputing (SC 01), 2001.
[11] D.
Abramson, J. Giddy, and L. Kotler, "High performance parametric modeling with Nimrod/G: killer application for the global Grid?" in Proc. International Parallel and Distributed Processing Symposium (IPDPS 00), May 2000.
[12] H. Casanova, G. Obertelli, F. Berman, and R. Wolski, "The AppLeS Parameter Sweep Template: User-level middleware for the grid," in Proc. Supercomputing (SC 00), Nov. 2000.
[13] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, D. Reed, L. Torczon, and R. Wolski, "The GrADS Project: Software Support for High-Level Grid Application Development," The International Journal of High Performance Computing Applications, vol. 15, no. 4, 2001.
[14] N. T. Karonis, B. R. de Supinski, I. Foster, W. Gropp, E. Lusk, and J. Bresnahan, "Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance," in Proc. International Parallel and Distributed Processing Symposium (IPDPS 00), May 2000.
[15] R. Wolski, N. Spring, and J. Hayes, "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing," Journal of Future Generation Computing Systems, vol. 15, no. 5-6, Oct. 1999.
More information