Tracing Internal Communication in MPI and MPI-I/O


Julian M. Kunkel, Yuichi Tsujita, Olga Mordvinova, Thomas Ludwig

Ruprecht-Karls-Universität Heidelberg, INF 348, Heidelberg, Germany
Kinki University, 1 Umenobe, Takaya, Higashi-Hiroshima, Hiroshima, Japan
Universität Hamburg, c/o DKRZ, Bundesstrasse 55, Hamburg, Germany

Abstract

MPI implementations can realize MPI operations with any algorithm that fulfills the specified semantics. To provide optimal efficiency, the MPI implementation might choose the algorithm dynamically, depending on the parameters given to the function call. However, the selection is not transparent to the user. While this is appropriate for common users, achieving best performance with fixed parameter sets requires knowledge of the internal processing. Also, for developers of collective operations it might be useful to understand timing issues inside the communication or I/O call. In this paper we extend the PIOviz environment to trace MPI-internal communication. This allows the user to see PVFS server behavior together with the behavior of the MPI application and of MPI itself. We present analysis results demonstrating these capabilities for MPICH2 on a Beowulf cluster.

1 Introduction

The Message Passing Interface (MPI) [1] is state-of-the-art in programming distributed memory architectures. This interface offers an abstraction of the underlying communication infrastructure to the programmer. The MPI specification defines a wide range of operations with their corresponding semantics. Collective operations, for instance, allow a group of processes to exchange information among each other. While there are many possible implementations that provide the semantics defined in MPI, a fast execution on the given hardware is favorable. Therefore, the MPI definition allows vendors of a supercomputer to tune processing by adapting specific algorithms to their architecture. Depending on the operation definition and the parameters given by the participating processes, the implementation might choose the appropriate algorithm dynamically to provide best performance on the architecture. Internally, the selected algorithm can induce complex communication patterns.

As an example, broadcasting data among all processes could be done by sending the data sequentially from a root process to the peer processes. While this might be acceptable for small messages and a small number of processes, the single sender is likely to be the bottleneck of the communication. With a switched interconnect, the information exchange can be performed in a binary-tree fashion to address this issue: the root process sends the data to another process, and then each of them acts as the root for half of the processes.
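As an illustration of this binary-tree idea, the following sketch broadcasts a buffer from rank 0 using only point-to-point messages. It is a minimal sketch for a communicator with rank 0 as root, not the algorithm actually implemented in MPICH2.

/*
 * Binary-tree (binomial) broadcast sketch, root fixed at rank 0.
 * Illustrative only; MPICH2 selects among several algorithms at run time.
 */
#include <mpi.h>

static void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Receive the data from the parent (rank 0 has no parent). */
    int mask = 1;
    while (mask < size) {
        if (rank & mask) {
            MPI_Recv(buf, count, type, rank - mask, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward the data: each process becomes the root for half of the
     * remaining processes. */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size)
            MPI_Send(buf, count, type, rank + mask, 0, comm);
        mask >>= 1;
    }
}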

An extension to the MPI standard offers routines for input/output operations; similar to the communication definitions, MPI-I/O provides independent and collective functions. Collective I/O operations are a candidate for optimization, because I/O is typically at least an order of magnitude slower than communication. Optimization for hard-disk-based subsystems could be done, for example, by bundling small non-contiguous I/O operations into large contiguous accesses. Depending on the underlying file system, the interplay between MPI and the file system can be very simple or, in the case of a parallel file system, quite extensive.

In general the user is interested in reducing the runtime of the application, not in MPI-internal processing. However, in some cases such internal information might be useful, on the one hand for tuning an application, on the other hand for optimizing the MPI library itself. For instance, by knowing the algorithm used for a particular communication call, the programmer might know in advance which processes need more time to execute the call. Then the user could realize static load balancing by assigning less work to these processes. While in most cases it is not favorable to tune the application to the MPI implementation, it might be necessary to get maximum performance for a well-known parameter set. By understanding the interaction between MPI and MPI-I/O better, the provided I/O infrastructure could be optimized.

In the following we present an attempt to visualize internal communication in MPI and the interaction between MPI and the parallel file system. This paper is structured as follows: In section 2 the state-of-the-art in visualizing MPI internals and parallel file systems is provided. Then we briefly discuss our modifications which enable us to visualize MPI processing in section 3. In section 4 we evaluate our work by presenting results obtained for collective communication and for MPI-I/O interactions.

2 State-of-the-art and Related Work

Optimization of collective operations in MPI is a hot research topic and is important for implementing HPC applications on various HPC systems [2, 3, 4]. Visualization of application behavior assists in optimizing MPI programs. Two analysis approaches are used for this purpose: on-line analysis tools and tracing tools for postmortem analysis.

On-line tools, e.g. Paradyn [5], use a monitoring system or instrumentation to gather data of the running application. This data is immediately available for several purposes, for example to display run-time performance, and it is typically not stored for further analysis. Integration of monitoring tools is state-of-the-art in parallel file systems.

In GPFS, monitoring of client I/O performance is possible with a separate command-line tool (mmpmon) [6]. Lustre offers monitoring via the /proc interface (LustreProc [7]). [8] presents performance data visualization in Lustre based on debugging output and the /proc interface with the help of Ganglia. The parallel file system PVFS [9] embeds a performance monitor in the server process, which counts the number of metadata and I/O operations and the amount of data accessed. Its data can be fetched with the command-line tool pvfs2-perf-mon-example directly from the server process.

In contrast, off-line tools are typically applied after program completion. In order to do so, the monitoring system and instrumentation write event traces or profiles to files that are available afterwards. Representatives of this group are Intel Trace Analyzer & Collector [10], Paraver [11], Jumpshot/MPE [12], and Vampir/VampirTrace. Jumpshot/MPE works in cooperation with the MPICH2 library. It supports the analysis of MPI functions for data communication and parallel I/O by tracing every MPI call. Vampir visualizes MPI calls and records performance data according to the Open Trace Format [13]. Although VampirTrace supports tracing of internal operations such as remote memory accesses and I/O performance, it supports them only on the client side [14]. Another framework, SCALASCA [15], supports runtime summarization of measurements during execution and event trace collection for postmortem trace analysis. TAU [16] provides a robust, flexible, and portable approach for tracing and visualization of applications, including CPU-internal hardware counters. However, these tools do not support tracing activities of parallel file systems such as PVFS in conjunction with MPI-I/O calls.

PIOviz [17, 18] is a trace-based environment which traces MPI calls and PVFS server internals such as network communication and I/O subsystem activity in conjunction with MPI-I/O calls. In addition, it also collects statistics of CPU usage and PVFS-internal statistics [18]. As PIOviz uses MPICH's SLOG2 format, a user can analyze the trace information with Jumpshot. In comparison to the tools described above, PIOviz combines explicit tracing and visualization of the I/O system's behavior with a correlation of program events and induced system events. The latest PIOviz version described here enables tracing inside collective operations on the server side as well as on the client side. To allow this, ROMIO [19] and MPICH2 were extended as shown in the following section.

3 Tracing MPI Internals

The modifications that enable tracing of MPI internals are split into two parts. First, instrumentation of MPICH2 is required to trace communication inside collective operations. Second, ROMIO is extended to allow tracing of PVFS calls and to show the MPI functions used inside MPI-I/O. As a consequence, the linked MPI program will depend on the tracing library from MPE. Note that PIOviz uses version 1.0.5p4 of MPICH2 and version of PVFS.

In MPICH2 each collective function calls only a small set of internal functions to perform blocking or non-blocking point-to-point communication, that is, (i)send, (i)receive, or send-receive operations. These functions are bundled in a single file and can easily be instrumented. Once modified, the processing inside the collective operation becomes visible. PIOviz uses a PMPI wrapper provided by MPE to trace the function calls. Once a user executes an MPI program, trace files are created on both the client and the server side.

With the normal build procedure ROMIO redirects MPI calls to their PMPI counterparts; therefore, collective function calls inside ROMIO are hidden. However, the developers provide a preprocessor macro to remove the redirection. Furthermore, we put MPE instrumentation around interesting parts of ROMIO, for instance around PVFS calls. While the latter piece of instrumentation is similar to the one shipped with ROMIO, the modified instrumentation works together with the existing PIOviz environment. Consequently, the modifications allow us to visualize I/O calls, MPI calls made inside these I/O calls, internal communication in MPI calls, and the corresponding operations in the PVFS servers.
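The interception mechanism can be sketched as follows: a wrapper library defines the MPI function, records an event, and forwards the call to the implementation through the profiling interface. This is only a minimal sketch; the actual MPE/PIOviz wrappers log SLOG2 trace events rather than printing, and MPI-2-era implementations such as MPICH2 1.0.x omit the const qualifiers.

/*
 * Sketch of a PMPI wrapper used to intercept a collective call.
 * The real instrumentation records SLOG2 trace events; this sketch only
 * measures the time spent in the call.
 */
#include <mpi.h>
#include <stdio.h>

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    int rank;
    PMPI_Comm_rank(comm, &rank);

    double t0 = PMPI_Wtime();
    /* Forward to the actual implementation via the profiling interface. */
    int err = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    double t1 = PMPI_Wtime();

    fprintf(stderr, "rank %d: MPI_Allreduce took %.6f s\n", rank, t1 - t0);
    return err;
}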

4 Evaluation

Two PC cluster systems consisting of 9 nodes each were used to evaluate PIOviz. Both clusters use COTS components and Gigabit Ethernet for the interconnect. The PVS cluster is an older 32-bit Ubuntu 8.04 cluster; in contrast, the Kindai cluster runs 64-bit CentOS 4.4. We decided to perform experiments with PIOviz on both clusters to rule out the influence of software versions and hardware. It turned out that in some cases the observed behavior differs. For a qualitative evaluation precise hardware details are unnecessary and are therefore omitted.

In section 4.1 we analyze several collective operations performed on the PVS cluster. The provided results are comparable to the ones measured on the Kindai cluster; however, the computational part during the collective calls can be observed better on the PVS cluster. The I/O-intensive HPIO benchmark [20] is run on the Kindai cluster in section 4.2 to assess a performance degradation we found on this cluster. This anomaly does not manifest on the PVS cluster.

4.1 Collective Communication

In the following, several collective MPI operations are assessed with PIOviz. A more detailed analysis of MPI_Allreduce will show potential for improvement in MPICH2. The internal processing in MPI_Scatter and MPI_Gather is briefly discussed and might be of interest from the application programmer's perspective.

In the first experiment a single MPI_Allreduce call is performed to sum an array of 10 million double values (80 MByte of data). Figure 1 shows the average, minimum, and maximum time of the operation for 10 runs of the program.
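The experiment can be reproduced with a small benchmark along the following lines. The array size and the number of runs are taken from the text; the initialization and output format are illustrative assumptions.

/* Sketch of the allreduce experiment: sum 10 million doubles (80 MByte)
 * and report the time per call. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000   /* 10 million doubles, i.e. 80 MByte */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *in  = malloc(N * sizeof(double));
    double *out = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        in[i] = (double)rank;

    for (int run = 0; run < 10; run++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("run %d: %.3f s\n", run, t1 - t0);
    }

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}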

It is noticeable that three processes need more time than four; similar behavior can be observed by comparing the execution times for 5, 6, 7, and 8 processes. One might expect that a collective allreduce with fewer processes takes at most as long as one with more processes. The reason is that MPICH2 uses a normal binary tree for process counts that are a power of two (see [2] for details). If the number of processes in the communicator is not a power of two, the algorithm first exchanges data between processes to fold in the additional processes, so that the original binary-tree algorithm can be applied. That matches the observations: two processes take about one second, four processes two seconds, and eight processes three seconds. Screenshots of the internal communication for three and four processes are given in figures 2 and 3. These screenshots show the internal activities of each process in a separate time line. However, not all optimizations mentioned in [2] are incorporated in MPICH2, thus the performance is suboptimal; compare the observable behavior for 13 processes (figure 4) with the scheme provided in [2].

Next we briefly look at MPI_Scatter and MPI_Gather. For a configuration with 9 clients and 8 MByte of data, screenshots of the internal processing are provided in figure 5 and figure 6. Looking at the internal processing, one might ask why both algorithms work similarly to a binary tree. It suggests that the root process sends data destined for multiple processes to another process. However, in contrast to a broadcast, each process receives individual data, so the intermediate node just forwards the data. Also, the nodes forwarding data are blocked while sending and receiving the additional data. There are cases in which the forwarding might be useful, for instance for small messages where the setup takes longer than sending the message. In this case work could be shared by forwarding messages from the root node. In general, however, the algorithm should be optimized for larger messages.

From the user's perspective the gather algorithm provides potential for static load balancing (in case it is called frequently). Of course, the parameters fed into gather must be known in advance; then the internal processing and dependencies are known. In the example in figure 6 the user could assign more work to the odd processes; on the PVS cluster the computation takes about 0.5 seconds. Maybe the application runs frequently with a specific parameter set and does not perform dynamic load balancing. However, figuring out the internal dependencies of the algorithm between the processes is hard without visualization.

4.2 Tracing of MPI-I/O

To examine internal collective communication in collective MPI-I/O operations, the MPI-I/O HPIO benchmark was used. It supports contiguous and non-contiguous data access patterns in both collective and independent operations. For non-contiguous data accesses, derived data types are created from an ensemble of region size, region count, and region space, where a region stands for a data area. Figure 7 illustrates an example of a data pattern for two processes. In this figure, we assume that data is stored contiguously in memory and non-contiguously in the data file. Gaps between data regions are specified by the region space of the benchmark. According to a file view created by a derived data type, each client process accesses the data file as shown in the figure. For collective I/O with derived data types, two-phase I/O is used in ROMIO [21] to make the accessed data regions as contiguous as possible. In this paper, we ran collective write operations.
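Such a non-contiguous file access can be expressed with a derived datatype used as the file view; the sketch below mimics the benchmark's region count, region size, and region space parameters. The concrete values, the interleaving of the processes, the file name, and the use of MPI_Type_vector are illustrative assumptions, not the HPIO benchmark's actual code.

/*
 * Sketch: each process holds contiguous data in memory and writes it
 * non-contiguously into a shared file through a derived-datatype file view,
 * as in the HPIO collective write case.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int region_count = 1024; /* regions per process (HPIO: region count) */
    const int region_size  = 256;  /* bytes per region    (HPIO: region size)  */
    const int region_space = 64;   /* gap after a region  (HPIO: region space) */

    /* File view: region_count blocks of region_size bytes; the stride leaves
     * room for the gap and for the regions of all other processes. */
    MPI_Datatype filetype;
    MPI_Type_vector(region_count, region_size,
                    (region_size + region_space) * size, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    char *buf = malloc((size_t)region_count * region_size); /* contiguous in memory */
    memset(buf, rank, (size_t)region_count * region_size);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "hpio.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset disp = (MPI_Offset)rank * (region_size + region_space);
    MPI_File_set_view(fh, disp, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: ROMIO applies two-phase I/O here (see section 4.2). */
    MPI_File_write_all(fh, buf, region_count * region_size, MPI_BYTE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}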

Figure 1: Time for allreduce with a variable number of processes
Figure 2: Allreduce for three processes
Figure 3: Allreduce for four processes
Figure 4: Allreduce for 13 processes

Figure 5: Scatter for 9 clients and 8 MByte of data
Figure 6: Gather for 9 clients and 8 MByte of data

An example of a PIOviz screenshot for two-phase I/O with four client processes is shown in figure 8, with textual annotations. This figure shows the internal MPI communication and PVFS I/O calls with the help of the modified PIOviz. Before starting the I/O operations, the client processes exchange information about offsets and data lengths to calculate the sizes of the associated memory and file domains. After this operation, the client processes read data from their assigned file domains using PVFS_sys_read and copy it into their collective buffers. Then the data to be written is exchanged among them using non-blocking MPI calls (MPI_Isend and MPI_Irecv) and written over the buffers according to the file view described by the derived data type. Finally, the modified data in the buffers is written back using PVFS_sys_write. If the collective buffer size is not sufficient to hold the whole assigned data, this sequence is repeated until all data has been processed.

Figure 7: Example of a derived data type in the HPIO benchmark
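The exchange step inside this two-phase sequence can be pictured with the following schematic: each process posts non-blocking receives for the pieces that fall into its own file domain and non-blocking sends for the pieces belonging to other domains, and only continues merging and writing after all transfers have completed. All names and buffer layouts here are illustrative; this is not ROMIO's actual code.

/*
 * Schematic of the exchange phase in two-phase I/O: every process receives
 * the pieces other processes contribute to its file domain and sends its own
 * contributions to the respective domain owners.
 */
#include <mpi.h>
#include <stdlib.h>

void exchange_phase(char **send_buf, const int *send_count,
                    char **recv_buf, const int *recv_count,
                    int nprocs, MPI_Comm comm)
{
    MPI_Request *req = malloc(2 * nprocs * sizeof(MPI_Request));
    int nreq = 0;

    /* Post non-blocking receives for data destined for our file domain. */
    for (int p = 0; p < nprocs; p++)
        if (recv_count[p] > 0)
            MPI_Irecv(recv_buf[p], recv_count[p], MPI_BYTE, p, 0, comm, &req[nreq++]);

    /* Post non-blocking sends for data belonging to other file domains. */
    for (int p = 0; p < nprocs; p++)
        if (send_count[p] > 0)
            MPI_Isend(send_buf[p], send_count[p], MPI_BYTE, p, 0, comm, &req[nreq++]);

    /* Only after all pieces have arrived is the collective buffer merged and
     * handed to PVFS_sys_write (not shown here). */
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    free(req);
}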

Figure 8: Screenshot of two-phase I/O in MPI_File_write_all

In this evaluation, we could see the effectiveness of client-side tracing in conjunction with tracing PVFS server internals. Figures 9 (a) and (b) show screenshots of a typical inefficient access pattern obtained by the previous and the current PIOviz, respectively, with some textual annotations. The upper four time lines stand for the client processes from rank 0 to 3, and the lower five time lines consist of one metadata server and the data servers of PVFS, in downward order. Note that every PVFS server was waiting for requests from the clients for a long time after the first pair of read and write operations on the PVFS file system. The previous PIOviz does not show any internal MPI communication or PVFS I/O calls, while the current PIOviz does. With the previous PIOviz we cannot determine where the delay occurs; the current release, on the other hand, assists in locating problematic points. In this case, we can see that one of the client processes spends a long time in PVFS_sys_read. As there are no PVFS operations on the PVFS servers while the client process of rank 1 is issuing PVFS_sys_read, there might be inefficient operations between the MPI and PVFS layers on that client process.

5 Conclusions and Future Work

This paper showed that insight into MPI and into the attached file system assists in identifying inefficient processing. This knowledge could be used by developers to tune internal layers. Also, knowledge of the internal processing and dependencies allows the application programmer to optimize the application towards the MPI implementation. While in general application users are discouraged from the latter optimization, for fixed parameter sets this insight can guide static load balancing. For complicated MPI-I/O accesses, the internal behavior shown in Jumpshot screenshots reveals bottlenecks in internal operations. The evaluated examples of several MPI calls indicate room for optimization of MPICH2 in cluster environments. However, the modifications made are experimental and not yet suitable for every application. In the future we will improve the stability of the environment and update it to the current versions of PVFS and MPICH.

Figure 9: Screenshots of inefficient I/O patterns: (a) obtained by previous PIOviz, (b) obtained by current PIOviz

References

[1] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard. Version 2.1, June.

[2] R. Thakur, R. Rabenseifner, and W. Gropp, Optimization of collective communication operations in MPICH, The International Journal of High Performance Computing Applications, vol. 19, Spring.

[3] M. Kühnemann, T. Rauber, and G. Rünger, Optimizing MPI collective communication by orthogonal structures, Cluster Computing, vol. 9, no. 3.

[4] M. K. Velamati, A. Kumar, N. Jayam, G. Senthilkumar, P. K. Baruah, R. Sharma, S. Kapoor, and A. Srinivasan, Optimization of collective communication in intra-Cell MPI, in HiPC.

[5] Paradyn Parallel Performance Tools.

[6] IBM, General Parallel File System - Advanced Administration Guide V.

[7] Sun Microsystems Inc., Lustre 1.6 Manual.

[8] Sun Microsystems Inc., Profiling Tools for IO. index.php?title=profiling Tools for IO.

[9] W. Ligon and R. Ross, PVFS: Parallel Virtual File System, in Beowulf Cluster Computing with Linux (T. Sterling, ed.), Scientific and Engineering Computation, ch. 17, Cambridge, Massachusetts: The MIT Press, Nov.

[10] Intel Trace Analyzer & Collector. products/asmo-na/eng/cluster/tanalyzer/.

[11] J. Labarta, J. Giménez, E. Martínez, P. González, H. Servat, G. Llort, and X. Aguilar, Scalability of visualization and tracing tools, in Proc. of ParCo 2005, vol. 33 of NIC Series, John von Neumann Institute for Computing, Jülich.

[12] A. Chan, W. Gropp, and E. Lusk, An efficient format for nearly constant-time access to arbitrary time intervals in large trace files, Scientific Programming, vol. 16, no. 2-3.

[13] A. Knüpfer, R. Brendel, H. Brunst, H. Mix, and W. E. Nagel, Introducing the Open Trace Format (OTF), LNCS, Springer.

[14] H. Mickler, A. Knüpfer, M. Kluge, M. S. Müller, and W. E. Nagel, Trace-Based Analysis and Optimization for the Semtex CFD Application - Hidden Remote Memory Accesses and I/O Performance.

[15] Z. Szebenyi, B. J. N. Wylie, and F. Wolf, SCALASCA parallel performance analyses of SPEC MPI2007 applications, LNCS, Springer.

[16] S. S. Shende and A. D. Malony, The TAU parallel performance system, The International Journal of High Performance Computing Applications, vol. 20, Summer.

[17] T. Ludwig, S. Krempel, M. Kuhn, J. M. Kunkel, and C. Lohse, Analysis of the MPI-IO optimization levels with the PIOViz Jumpshot enhancement, LNCS, Springer.

[18] J. M. Kunkel and T. Ludwig, Bottleneck detection in parallel file systems with trace-based performance monitoring, LNCS, Springer.

[19] R. Thakur, E. Lusk, and W. Gropp, Users Guide for ROMIO: A High-Performance, Portable MPI-IO Implementation, Technical Memorandum ANL/MCS-TM-234, Mathematics and Computer Science Division, Argonne National Laboratory, USA.

[20] A. Ching, A. Choudhary, W.-k. Liao, L. Ward, and N. Pundit, Evaluating I/O characteristics and methods for storing structured scientific data, in 20th IEEE International Parallel and Distributed Processing Symposium, p. 49, IEEE Computer Society, April.

[21] R. Thakur, W. Gropp, and E. Lusk, Optimizing noncontiguous accesses in MPI-IO, Parallel Computing, vol. 28, no. 1.
