Tracing Internal Communication in MPI and MPI-I/O
Julian M. Kunkel, Yuichi Tsujita, Olga Mordvinova, Thomas Ludwig

Abstract

MPI implementations can realize MPI operations with any algorithm that fulfills the specified semantics. To provide optimal efficiency, the MPI implementation might choose the algorithm dynamically, depending on the parameters given to the function call. However, the selection is not transparent to the user. While this is appropriate for common users, achieving best performance with fixed parameter sets requires knowledge of the internal processing. Also, for developers of collective operations it might be useful to understand timing issues inside the communication or I/O call. In this paper we extend the PIOviz environment to trace MPI-internal communication. This allows the user to see PVFS server behavior together with the behavior of the MPI application and of MPI itself. We present analysis results demonstrating these capabilities for MPICH2 on a Beowulf cluster.

1 Introduction

The Message Passing Interface (MPI) [1] is state-of-the-art in programming distributed memory architectures. This interface offers the programmer an abstraction of the underlying communication infrastructure. The MPI specification defines a wide range of operations with their corresponding semantics. Collective operations, for instance, allow a group of processes to exchange information among themselves. While there are many possible implementations that provide the semantics as defined in MPI, a fast execution on the given hardware is favorable. Therefore, the MPI definition allows vendors of a supercomputer to tune processing by adapting specific algorithms to their architecture. Depending on the operation definition and the parameters given by the participating processes, the implementation might choose the appropriate algorithm dynamically to provide the best performance on the architecture. Internally, the selected algorithm can induce complex communication patterns.
Ruprecht-Karls-Universität Heidelberg, INF 348, Heidelberg, Germany; Kinki University, 1 Umenobe, Takaya, Higashi-Hiroshima, Hiroshima, Japan; Universität Hamburg, c/o DKRZ, Bundesstrasse 55, Hamburg, Germany
As an example, broadcasting data among all processes could be done by sending the data sequentially from a root process to all peer processes. While this might be acceptable for small messages and a small number of processes, the single sender is likely to become the bottleneck of the communication. With a switched interconnect, the information exchange can instead be performed in a binary-tree fashion to address this issue: the root process sends to another process, and then each of them acts as a root for half of the processes.

An extension to the MPI standard offers routines for input/output operations. Similar to the communication definitions, MPI-I/O provides independent and collective functions. Collective I/O operations are a candidate for optimization, because I/O is typically at least an order of magnitude slower than communication. Optimization for hard-disk-based subsystems could be done, for example, by bundling small non-contiguous I/O operations together into large contiguous accesses. Depending on the underlying file system, the interplay between MPI and the file system can be very simple, or, in the case of a parallel file system, quite extensive.

In general the user is interested in reducing the runtime of his application, not in MPI-internal processing. However, in some cases such internal information might be useful, on the one hand for tuning an application, on the other hand for optimizing the MPI library itself. For instance, by knowing the algorithm used for a particular communication call, the programmer might know in advance which processes need more time to execute the call. The user could then realize static load balancing by assigning less work to these processes. While in most cases it is not favorable to tune the application to the MPI implementation, it might be necessary to get maximum performance for a well-known parameter set. By better understanding the interaction between MPI and MPI-I/O, the provided I/O infrastructure could be optimized.
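The binary-tree broadcast mentioned above can be written down as a communication schedule. The following sketch is an illustration of the principle, not MPICH2's actual code; it computes which process sends to which in each round:

```python
def bcast_schedule(nprocs, root=0):
    """Binary-tree broadcast: in every round, each process that already
    holds the data sends it to one process that does not yet have it."""
    have = {root}
    rounds = []
    while len(have) < nprocs:
        need = [r for r in range(nprocs) if r not in have]
        sends = list(zip(sorted(have), need))  # pair holders with receivers
        rounds.append(sends)
        have.update(dst for _, dst in sends)
    return rounds

# With 8 processes the data reaches everyone after 3 rounds,
# instead of 7 sequential sends from a single root.
for i, sends in enumerate(bcast_schedule(8)):
    print("round", i, sends)
```

The number of rounds grows only logarithmically with the number of processes, which is why the single-sender bottleneck disappears on a switched network.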
In the following we present an approach to visualize internal communication in MPI and the interaction between MPI and a parallel file system. This paper is structured as follows: In section 2 the state-of-the-art in visualizing MPI internals and parallel file systems is presented. Then we briefly discuss the modifications that enable us to visualize MPI processing in section 3. In section 4 we evaluate our work by presenting results obtained for collective communication and for MPI-I/O interactions.

2 State-of-the-art and Related Work

Optimization of collective operations in MPI is a hot research topic and is important for implementing HPC applications on various HPC systems [2, 3, 4]. Visualization of application behavior assists in optimizing MPI programs. Two analysis approaches are used for this purpose: on-line analysis tools and tracing tools for postmortem analysis. On-line tools, e.g. Paradyn [5], use a monitoring system or instrumentation to gather data from the running application. This data is immediately available for several purposes, for example to display run-time performance, and it is typically not stored for further analysis. Integration of monitoring tools is state-of-the-art
in parallel file systems. In GPFS, monitoring of client I/O performance is possible with a separate command-line tool (mmpmon) [6]. Lustre offers monitoring via the /proc interface (LustreProc [7]). [8] presents performance data visualization in Lustre based on debugging output and the /proc interface with the help of Ganglia. The parallel file system PVFS [9] embeds a performance monitor in the server process, which counts the number of metadata and I/O operations and the amount of data accessed. Its data can be fetched directly from the server process with the command-line tool pvfs2-perf-mon-example.

In contrast, off-line tools are typically applied after program completion. To this end, the monitoring system and instrumentation write event traces or profiles to files that are available afterwards. Representatives of this group are Intel Trace Analyzer & Collector [10], Paraver [11], Jumpshot/MPE [12] and Vampir/VampirTrace. Jumpshot/MPE works in cooperation with the MPICH2 library. It supports analyzing MPI functions for data communication and parallel I/O by tracing every MPI call. Vampir visualizes MPI calls and records performance data according to the Open Trace Format [13]. Although VampirTrace supports tracing of internal operations such as remote memory accesses and I/O performance, it supports them only on the client side [14]. Another framework, SCALASCA [15], supports runtime summarization of measurements during execution and event trace collection for postmortem trace analysis. TAU [16] provides a robust, flexible, and portable approach for tracing and visualization of applications, including CPU-internal hardware counters. However, these tools do not support tracing activities of parallel file systems such as PVFS in conjunction with MPI-I/O calls. PIOviz [17, 18] is a trace-based environment which traces MPI calls and PVFS server internals, such as network communication and I/O subsystem activity, in conjunction with MPI-I/O calls.
In addition, it also collects statistics of CPU usage and PVFS-internal statistics [18]. As PIOviz uses MPICH's SLOG2 format, a user can analyze the trace information with Jumpshot. In comparison to the tools described above, PIOviz combines explicit tracing and visualization of the I/O system's behavior with a correlation of program events and induced system events. The latest PIOviz version described here enables tracing inside collective operations on the server side as well as on the client side. To allow this, ROMIO [19] and MPICH2 were extended as shown in the following section.

3 Tracing MPI Internals

The modifications to enable tracing of MPI internals are split into two parts. First, instrumentation of MPICH2 is required to trace communication inside collective operations. Second, ROMIO is extended to allow tracing of PVFS calls and to show the MPI functions used inside MPI-I/O. As a consequence, the linked MPI program will depend on the tracing library from MPE. Note that PIOviz uses version 1.0.5p4 of MPICH2 and version of PVFS. In MPICH2 each collective function calls only a set of internal functions to perform blocking or non-blocking point-to-point communication, that is,
either (i)send, (i)receive, or sendreceive operations. These functions are bundled in a single file and can easily be instrumented. Once modified, the processing inside the collective operation becomes visible. PIOviz uses a PMPI wrapper provided by MPE to trace the function calls. Once a user executes an MPI program, trace files are created on the client and server sides.

With the normal build procedure, ROMIO changes MPI calls to their PMPI counterparts; therefore, collective function calls inside ROMIO are hidden. However, the developers provide a preprocessor macro to remove this redirection. Furthermore, we put MPE instrumentation around interesting parts of ROMIO, for instance around PVFS calls. While the latter piece of instrumentation is similar to the one shipped with ROMIO, the modified instrumentation works together with the existing PIOviz environment. Consequently, the modifications allow us to visualize I/O calls, MPI calls made inside these I/O calls, internal communication in MPI calls, and the corresponding operations in the PVFS servers.

4 Evaluation

Two PC clusters consisting of 9 nodes each were used to evaluate PIOviz. Both clusters use COTS components and Gigabit Ethernet for interconnection. The PVS cluster is an older 32-bit Ubuntu 8.04 cluster; in contrast, the Kindai cluster uses 64-bit CentOS 4.4. We decided to perform experiments with PIOviz on both clusters to rule out the influence of software versions and hardware. It turned out that in some cases the observed behavior differs. For a qualitative evaluation precise hardware details are unnecessary and thus omitted. In section 4.1 we analyze some collective operations performed on the PVS cluster. The provided results are comparable to the ones measured on the Kindai cluster; however, the computational part during the collective calls can be observed better on the PVS cluster.
The I/O-intensive HPIO benchmark [20] is run on the Kindai cluster in section 4.2 to assess a performance degradation we found on this cluster. This anomaly does not manifest on the PVS cluster.

4.1 Collective Communication

In the following, several collective MPI operations are assessed with PIOviz. A more detailed analysis of MPI_Allreduce will show potential for improvement in MPICH2. The internal processing in MPI_Scatter and MPI_Gather is briefly discussed and might be of interest from the application programmer's perspective.

In the first experiment a single MPI_Allreduce call is performed to sum an array of 10 million double values (80 MByte of data). Figure 1 shows the average, minimum and maximum time of the operation over 10 runs of the program. It is noticeable that three processes need more time than four; similar behavior can be observed by comparing the execution times for 5, 6, 7 and 8 processes. One might expect that a collective allreduce with a lower number of processes takes at most as much time as with more processes. The reason is that MPICH2 uses a normal binary tree for process numbers equal to a power of 2 (see [2] for details). If the number of processes in the communicator is not a power of 2, then the algorithm first exchanges data between processes to merge the additional processes, so that the original binary-tree algorithm can be applied. That matches the observations: two processes take about one second, four processes two seconds and 8 processes three seconds. Screenshots of the internal communication for three and four processes are given in figures 2 and 3. These screenshots show the internal activities of each process in a separate (time) line. However, not all optimizations mentioned in [2] are incorporated in MPICH2, thus the performance is suboptimal. Compare the observable behavior for 13 processes (figure 4) with the schema provided in [2].

Next we briefly look at MPI_Scatter and MPI_Gather. For a configuration with 9 clients and 8 MByte of data, screenshots of the internal processing are provided in figures 5 and 6. Looking at the internal processing, one might ask why both algorithms work similarly to a binary tree. It suggests that the root process sends data for multiple processes to another process. However, in contrast to broadcast, each process gets individual data; therefore, the intermediate node just forwards the data. Also, the nodes forwarding data are blocked while sending and receiving the additional data. There are cases in which the forwarding might be useful, for instance for small messages where the setup takes longer than sending the message. In this case work could be shared by forwarding messages from the root node. However, in general the algorithm should be optimized for larger messages. From the user's perspective the gather algorithm provides potential for static load balancing (in case it is called frequently).
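The tree structure also shows where this slack comes from. The following sketch of a binomial-tree gather schedule (our illustrative model, not the exact MPICH2 algorithm) counts how many messages each rank receives and forwards on the way to root 0:

```python
def gather_recv_counts(nprocs):
    """For a binomial-tree gather to root 0, count how many messages each
    rank receives (and must forward upward) before it is done."""
    recv = [0] * nprocs
    step = 1
    while step < nprocs:
        for r in range(0, nprocs, 2 * step):
            src = r + step
            if src < nprocs:
                recv[r] += 1   # r receives the accumulated data of src
        step *= 2
    return recv

# With 9 processes, odd ranks receive nothing: they send once and are idle.
print(gather_recv_counts(9))
```

In this model the odd ranks are idle after their single send, which is exactly the slack that static load balancing could exploit by assigning them more computation.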
Of course, the parameters fed into gather must be known in advance; then the internal processing and dependencies are known. In the example in figure 6 the user could assign more work to the odd processes. On the PVS cluster the computation takes about 0.5 seconds. The application may run frequently with a specific parameter set and not perform dynamic load balancing. However, figuring out the internal dependencies of the algorithm between the processes is hard without visualization.

Figure 1: Time for allreduce with a variable number of processes
Figure 2: Allreduce for three processes
Figure 3: Allreduce for four processes
Figure 4: Allreduce for 13 processes
Figure 5: Scatter for 9 clients and 8 MByte of data
Figure 6: Gather for 9 clients and 8 MByte of data

4.2 Tracing of MPI-I/O

To examine internal collective communication in collective MPI-I/O operations, the MPI-I/O HPIO benchmark was used. It supports contiguous and non-contiguous data access patterns in both collective and independent operations. For non-contiguous data accesses, derived datatypes are created from an ensemble of region size, region count, and region space, where a region stands for a data area. Figure 7 illustrates an example of a data pattern for two processes. In this figure, we assume that the data is stored contiguously in memory and non-contiguously in the data file. The gaps between data regions are specified by the region space of the benchmark. According to a file view created by a derived datatype, each client process accesses the data file as shown in this figure.

Figure 7: Example of a derived data type in the HPIO benchmark

In collective I/O with derived datatypes, two-phase I/O is used in ROMIO [21] to make the accessed data regions as contiguous as possible. In this paper, we ran collective write operations. An example PIOviz screenshot of two-phase I/O with four client processes is shown in figure 8 with text explanations. This figure shows the internal MPI communication and PVFS I/O calls with the help of the modified PIOviz. Before starting I/O operations, the client processes exchange information about offsets and data lengths to calculate the sizes of the associated memory and file domains. After this operation, the client processes read data from their assigned file domains using PVFS_sys_read and copy it into their collective buffers. Then, the data to be written is exchanged among them using non-blocking MPI calls (MPI_Isend and MPI_Irecv) and written into the buffers according to the file view described by the derived datatype. Finally, the modified data in the buffers is written back using PVFS_sys_write. If the collective buffer size is not sufficient to hold all of the assigned data, this sequence is repeated until all data has been processed.
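The first step of this two-phase scheme, partitioning the aggregate access range into one contiguous file domain per process, can be sketched as follows. This is a simplified, even-split illustration; the function name and the split policy are ours, not ROMIO's exact code:

```python
def file_domains(start, end, nprocs):
    """Split the aggregate byte range [start, end) into one contiguous
    file domain per client process, as in the first phase of two-phase I/O."""
    total = end - start
    base, rem = divmod(total, nprocs)
    domains, offset = [], start
    for rank in range(nprocs):
        size = base + (1 if rank < rem else 0)  # spread the remainder
        domains.append((offset, offset + size))
        offset += size
    return domains

# Four clients covering bytes 0..99 each get a contiguous quarter; data
# destined for other domains is exchanged (MPI_Isend/MPI_Irecv) before writing.
print(file_domains(0, 100, 4))
```

Each process then issues a single contiguous read and write on its own domain, which is what turns many small non-contiguous accesses into few large ones.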
Figure 8: Screenshot of two-phase I/O in MPI_File_write_all

In this evaluation, we could see the effectiveness of client-side tracing in conjunction with tracing PVFS server internals. Figures 9 (a) and (b) show screenshots of a typical inefficient access pattern obtained by the previous and the current PIOviz, respectively, with some text explanations. The upper four time lines stand for the client processes from rank 0 to 3, and the lower five time lines consist of one metadata server and the data servers for PVFS, in downward order. Note that every PVFS server was waiting for requests from the clients for a long time after the first pair of read and write operations on the PVFS file system. The previous PIOviz does not show any internal MPI communication or PVFS I/O calls, while the current PIOviz does. With the previous PIOviz, we cannot determine where the delay occurs. The current release, on the other hand, helps to pinpoint problematic spots. In this case, we can see that one of the client processes takes a long time in PVFS_sys_read. As there are no PVFS operations on the PVFS servers while the client process of rank 1 is issuing PVFS_sys_read, there might be inefficient operations between the MPI and PVFS layers on the client process.

5 Conclusions and Future Work

This paper showed that insight into MPI and the attached file systems assists in identifying inefficient processing. This knowledge could be used by developers to tune the internal layers. Also, knowledge of the internal processing and dependencies allows the application programmer to optimize the application towards the MPI implementation. While in general application users are discouraged from the latter optimization, for fixed parameter sets it improves insight for static load balancing. For complicated MPI-I/O accesses, the internal behavior shown in Jumpshot screenshots reveals bottlenecks in internal operations. The evaluated examples of several MPI calls indicate room for optimization of MPICH2 in cluster environments.
However, the modifications made are experimental and not yet suitable for every application. In the future we will improve the stability of the environment and update it to the current versions of PVFS and MPICH.
Figure 9: Screenshots of inefficient I/O patterns: (a) obtained by the previous PIOviz, (b) obtained by the current PIOviz

References

[1] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 2.1, June.
[2] R. Thakur, R. Rabenseifner, and W. Gropp, Optimization of collective communication operations in MPICH, The International Journal of High Performance Computing Applications, vol. 19, Spring.
[3] M. Kühnemann, T. Rauber, and G. Rünger, Optimizing MPI collective communication by orthogonal structures, Cluster Computing, vol. 9, no. 3.
[4] M. K. Velamati, A. Kumar, N. Jayam, G. Senthilkumar, P. K. Baruah, R. Sharma, S. Kapoor, and A. Srinivasan, Optimization of collective communication in intra-cell MPI, in HiPC.
[5] Paradyn Parallel Performance Tools.
[6] IBM, General Parallel File System - Advanced Administration Guide V
[7] Sun Microsystems Inc., Lustre 1.6 Manual.
[8] Sun Microsystems Inc., Profiling tools for IO. index.php?title=profiling Tools for IO.
[9] W. Ligon and R. Ross, PVFS: Parallel Virtual File System, in Beowulf Cluster Computing with Linux (T. Sterling, ed.), Scientific and Engineering Computation, ch. 17, Cambridge, Massachusetts: The MIT Press, Nov.
[10] Intel Trace Analyzer & Collector. products/asmo-na/eng/cluster/tanalyzer/
[11] J. Labarta, J. Giménez, E. Martínez, P. González, H. Servat, G. Llort, and X. Aguilar, Scalability of visualization and tracing tools, in Proc. of ParCo 2005, vol. 33 of NIC Series, John von Neumann Institute for Computing, Jülich.
[12] A. Chan, W. Gropp, and E. Lusk, An efficient format for nearly constant-time access to arbitrary time intervals in large trace files, Scientific Programming, vol. 16, no. 2-3.
[13] A. Knüpfer, R. Brendel, H. Brunst, H. Mix, and W. E. Nagel, Introducing the Open Trace Format (OTF), LNCS, Springer.
[14] H. Mickler, A. Knüpfer, M. Kluge, M. S. Müller, and W. E. Nagel, Trace-Based Analysis and Optimization for the Semtex CFD Application: Hidden Remote Memory Accesses and I/O Performance.
[15] Z. Szebenyi, B. J. N. Wylie, and F. Wolf, SCALASCA parallel performance analyses of SPEC MPI2007 applications, LNCS, Springer.
[16] S. S. Shende and A. D. Malony, The TAU parallel performance system, The International Journal of High Performance Computing Applications, vol. 20, Summer.
[17] T. Ludwig, S. Krempel, M. Kuhn, J. M. Kunkel, and C. Lohse, Analysis of the MPI-IO optimization levels with the PIOViz Jumpshot enhancement, LNCS, Springer.
[18] J. M. Kunkel and T. Ludwig, Bottleneck detection in parallel file systems with trace-based performance monitoring, LNCS, Springer.
[19] R. Thakur, E. Lusk, and W. Gropp, Users Guide for ROMIO: A High-Performance, Portable MPI-IO Implementation, Technical Memorandum ANL/MCS-TM-234, Mathematics and Computer Science Division, Argonne National Laboratory, USA.
[20] A. Ching, A. Choudhary, W.-k. Liao, L. Ward, and N. Pundit, Evaluating I/O characteristics and methods for storing structured scientific data, in 20th IEEE International Parallel and Distributed Processing Symposium, IEEE Computer Society, April.
[21] R. Thakur, W. Gropp, and E. Lusk, Optimizing noncontiguous accesses in MPI-IO, Parallel Computing, vol. 28, no. 1.
More informationBuilding Library Components That Can Use Any MPI Implementation
Building Library Components That Can Use Any MPI Implementation William Gropp Mathematics and Computer Science Division Argonne National Laboratory Argonne, IL gropp@mcs.anl.gov http://www.mcs.anl.gov/~gropp
More informationAn In-place Algorithm for Irregular All-to-All Communication with Limited Memory
An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de
More informationParallel Programming
Parallel Programming for Multicore and Cluster Systems von Thomas Rauber, Gudula Rünger 1. Auflage Parallel Programming Rauber / Rünger schnell und portofrei erhältlich bei beck-shop.de DIE FACHBUCHHANDLUNG
More informationGroup Management Schemes for Implementing MPI Collective Communication over IP Multicast
Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Xin Yuan Scott Daniels Ahmad Faraj Amit Karwande Department of Computer Science, Florida State University, Tallahassee,
More informationThe Fusion Distributed File System
Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique
More informationIntroduction to Parallel Computing
Institute for Advanced Simulation Introduction to Parallel Computing Bernd Mohr published in Multiscale Simulation Methods in Molecular Sciences, J. Grotendorst, N. Attig, S. Blügel, D. Marx (Eds.), Institute
More informationMPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh
MPI Optimisation Advanced Parallel Programming David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh Overview Can divide overheads up into four main categories: Lack of parallelism Load imbalance
More informationHDF5 I/O Performance. HDF and HDF-EOS Workshop VI December 5, 2002
HDF5 I/O Performance HDF and HDF-EOS Workshop VI December 5, 2002 1 Goal of this talk Give an overview of the HDF5 Library tuning knobs for sequential and parallel performance 2 Challenging task HDF5 Library
More informationImplementing Byte-Range Locks Using MPI One-Sided Communication
Implementing Byte-Range Locks Using MPI One-Sided Communication Rajeev Thakur, Robert Ross, and Robert Latham Mathematics and Computer Science Division Argonne National Laboratory Argonne, IL 60439, USA
More informationFrom Cluster Monitoring to Grid Monitoring Based on GRM *
From Cluster Monitoring to Grid Monitoring Based on GRM * Zoltán Balaton, Péter Kacsuk, Norbert Podhorszki and Ferenc Vajda MTA SZTAKI H-1518 Budapest, P.O.Box 63. Hungary {balaton, kacsuk, pnorbert, vajda}@sztaki.hu
More informationRevealing Applications Access Pattern in Collective I/O for Cache Management
Revealing Applications Access Pattern in for Yin Lu 1, Yong Chen 1, Rob Latham 2 and Yu Zhuang 1 Presented by Philip Roth 3 1 Department of Computer Science Texas Tech University 2 Mathematics and Computer
More informationInteractive Analysis of Large Distributed Systems with Scalable Topology-based Visualization
Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization Lucas M. Schnorr, Arnaud Legrand, and Jean-Marc Vincent e-mail : Firstname.Lastname@imag.fr Laboratoire d Informatique
More informationI/O Analysis and Optimization for an AMR Cosmology Application
I/O Analysis and Optimization for an AMR Cosmology Application Jianwei Li Wei-keng Liao Alok Choudhary Valerie Taylor ECE Department, Northwestern University {jianwei, wkliao, choudhar, taylor}@ece.northwestern.edu
More informationTowards a Portable Cluster Computing Environment Supporting Single System Image
Towards a Portable Cluster Computing Environment Supporting Single System Image Tatsuya Asazu y Bernady O. Apduhan z Itsujiro Arita z Department of Artificial Intelligence Kyushu Institute of Technology
More informationIntegrating Parallel Application Development with Performance Analysis in Periscope
Technische Universität München Integrating Parallel Application Development with Performance Analysis in Periscope V. Petkov, M. Gerndt Technische Universität München 19 April 2010 Atlanta, GA, USA Motivation
More informationExploiting Shared Memory to Improve Parallel I/O Performance
Exploiting Shared Memory to Improve Parallel I/O Performance Andrew B. Hastings 1 and Alok Choudhary 2 1 Sun Microsystems, Inc. andrew.hastings@sun.com 2 Northwestern University choudhar@ece.northwestern.edu
More informationParallel Programming with MPI on Clusters
Parallel Programming with MPI on Clusters Rusty Lusk Mathematics and Computer Science Division Argonne National Laboratory (The rest of our group: Bill Gropp, Rob Ross, David Ashton, Brian Toonen, Anthony
More informationParallel I/O Libraries and Techniques
Parallel I/O Libraries and Techniques Mark Howison User Services & Support I/O for scientifc data I/O is commonly used by scientific applications to: Store numerical output from simulations Load initial
More informationOptimization of non-contiguous MPI-I/O operations
Optimization of non-contiguous MPI-I/O operations Enno Zickler Arbeitsbereich Wissenschaftliches Rechnen Fachbereich Informatik Fakultät für Mathematik, Informatik und Naturwissenschaften Universität Hamburg
More informationImproving the Scalability of Performance Evaluation Tools
Improving the Scalability of Performance Evaluation Tools Sameer Suresh Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory Department of Computer and Information Science University
More informationBenefits of Quadrics Scatter/Gather to PVFS2 Noncontiguous IO
Benefits of Quadrics Scatter/Gather to PVFS2 Noncontiguous IO Weikuan Yu Dhabaleswar K. Panda Network-Based Computing Lab Dept. of Computer Science & Engineering The Ohio State University {yuw,panda}@cse.ohio-state.edu
More informationOptimization of Collective Communication in Intra- Cell MPI
Optimization of Collective Communication in Intra- Cell MPI M. K. Velamati 1, A. Kumar 1, N. Jayam 1, G. Senthilkumar 1, P.K. Baruah 1, R. Sharma 1, S. Kapoor 2, and A. Srinivasan 3 1 Dept. of Mathematics
More informationEvaluating I/O Characteristics and Methods for Storing Structured Scientific Data
Evaluating I/O Characteristics and Methods for Storing Structured Scientific Data Avery Ching 1, Alok Choudhary 1, Wei-keng Liao 1,LeeWard, and Neil Pundit 1 Northwestern University Sandia National Laboratories
More informationEvent-based Measurement and Analysis of One-sided Communication
Event-based Measurement and Analysis of One-sided Communication Marc-André Hermanns 1, Bernd Mohr 1, and Felix Wolf 2 1 Forschungszentrum Jülich, Zentralinstitut für Angewandte Mathematik, 52425 Jülich,
More informationAutomated Tracing of I/O Stack
Automated Tracing of I/O Stack Seong Jo Kim 1, Yuanrui Zhang 1, Seung Woo Son 2, Ramya Prabhakar 1, Mahmut Kandemir 1, Christina Patrick 1, Wei-keng Liao 3, and Alok Choudhary 3 1 Department of Computer
More informationIME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning
IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application
More informationMeta-data Management System for High-Performance Large-Scale Scientific Data Access
Meta-data Management System for High-Performance Large-Scale Scientific Data Access Wei-keng Liao, Xaiohui Shen, and Alok Choudhary Department of Electrical and Computer Engineering Northwestern University
More information[Scalasca] Tool Integrations
Mitglied der Helmholtz-Gemeinschaft [Scalasca] Tool Integrations Aug 2011 Bernd Mohr CScADS Performance Tools Workshop Lake Tahoe Contents Current integration of various direct measurement tools Paraver
More informationHigh Performance MPI-2 One-Sided Communication over InfiniBand
High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University
More informationIteration Based Collective I/O Strategy for Parallel I/O Systems
Iteration Based Collective I/O Strategy for Parallel I/O Systems Zhixiang Wang, Xuanhua Shi, Hai Jin, Song Wu Services Computing Technology and System Lab Cluster and Grid Computing Lab Huazhong University
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationMPIBlib: Benchmarking MPI Communications for Parallel Computing on Homogeneous and Heterogeneous Clusters
MPIBlib: Benchmarking MPI Communications for Parallel Computing on Homogeneous and Heterogeneous Clusters Alexey Lastovetsky Vladimir Rychkov Maureen O Flynn {Alexey.Lastovetsky, Vladimir.Rychkov, Maureen.OFlynn}@ucd.ie
More informationDesign and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications
Design and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications Wei-keng Liao Alok Choudhary ECE Department Northwestern University Evanston, IL Donald Weiner Pramod Varshney EECS Department
More informationEarly Experiments with the OpenMP/MPI Hybrid Programming Model
Early Experiments with the OpenMP/MPI Hybrid Programming Model Ewing Lusk 1 and Anthony Chan 2 1 Mathematics and Computer Science Division Argonne National Laboratory 2 ASCI FLASH Center University of
More informationOptimizing Assignment of Threads to SPEs on the Cell BE Processor
Optimizing Assignment of Threads to SPEs on the Cell BE Processor T. Nagaraju P.K. Baruah Ashok Srinivasan Abstract The Cell is a heterogeneous multicore processor that has attracted much attention in
More informationAccelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures
Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering
More informationEnabling Active Storage on Parallel I/O Software Stacks. Seung Woo Son Mathematics and Computer Science Division
Enabling Active Storage on Parallel I/O Software Stacks Seung Woo Son sson@mcs.anl.gov Mathematics and Computer Science Division MSST 2010, Incline Village, NV May 7, 2010 Performing analysis on large
More informationAuto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems
Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems Minesh I. Patel Ph.D. 1, Karl Jordan 1, Mattew Clark Ph.D. 1, and Devesh Bhatt
More informationDistribution of Periscope Analysis Agents on ALTIX 4700
John von Neumann Institute for Computing Distribution of Periscope Analysis Agents on ALTIX 4700 Michael Gerndt, Sebastian Strohhäcker published in Parallel Computing: Architectures, Algorithms and Applications,
More informationMulticast can be implemented here
MPI Collective Operations over IP Multicast? Hsiang Ann Chen, Yvette O. Carrasco, and Amy W. Apon Computer Science and Computer Engineering University of Arkansas Fayetteville, Arkansas, U.S.A fhachen,yochoa,aapong@comp.uark.edu
More informationA First Implementation of Parallel IO in Chapel for Block Data Distribution 1
A First Implementation of Parallel IO in Chapel for Block Data Distribution 1 Rafael LARROSA a, Rafael ASENJO a Angeles NAVARRO a and Bradford L. CHAMBERLAIN b a Dept. of Compt. Architect. Univ. of Malaga,
More informationImplementing MPI-IO Shared File Pointers without File System Support
Implementing MPI-IO Shared File Pointers without File System Support Robert Latham, Robert Ross, Rajeev Thakur, Brian Toonen Mathematics and Computer Science Division Argonne National Laboratory Argonne,
More informationRAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System
RAIDIX Data Storage Solution Clustered Data Storage Based on the RAIDIX Software and GPFS File System 2017 Contents Synopsis... 2 Introduction... 3 Challenges and the Solution... 4 Solution Architecture...
More informationHPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT:
HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms Author: Stan Posey Panasas, Inc. Correspondence: Stan Posey Panasas, Inc. Phone +510 608 4383 Email sposey@panasas.com
More informationOrthrus: A Framework for Implementing Efficient Collective I/O in Multi-core Clusters
Orthrus: A Framework for Implementing Efficient Collective I/O in Multi-core Clusters Xuechen Zhang 1 Jianqiang Ou 2 Kei Davis 3 Song Jiang 2 1 Georgia Institute of Technology, 2 Wayne State University,
More informationParallel & Cluster Computing. cs 6260 professor: elise de doncker by: lina hussein
Parallel & Cluster Computing cs 6260 professor: elise de doncker by: lina hussein 1 Topics Covered : Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster
More informationThe Optimal CPU and Interconnect for an HPC Cluster
5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance
More informationAnalyzing the High Performance Parallel I/O on LRZ HPC systems. Sandra Méndez. HPC Group, LRZ. June 23, 2016
Analyzing the High Performance Parallel I/O on LRZ HPC systems Sandra Méndez. HPC Group, LRZ. June 23, 2016 Outline SuperMUC supercomputer User Projects Monitoring Tool I/O Software Stack I/O Analysis
More informationImage-Space-Parallel Direct Volume Rendering on a Cluster of PCs
Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr
More informationEarly Experiences with KTAU on the IBM BG/L
Early Experiences with KTAU on the IBM BG/L Aroon Nataraj, Allen D. Malony, Alan Morris, and Sameer Shende Performance Research Laboratory, Department of Computer and Information Science University of
More informationA Visual Network Analysis Method for Large Scale Parallel I/O Systems
A Visual Network Analysis Method for Large Scale Parallel I/O Systems Carmen Sigovan, Chris Muelder, Kwan-Liu Ma University of California Davis {cmsigovan, cwmuelder, klma}@ucdavis.edu Jason Cope, Kamil
More informationA Buffered-Mode MPI Implementation for the Cell BE Processor
A Buffered-Mode MPI Implementation for the Cell BE Processor Arun Kumar 1, Ganapathy Senthilkumar 1, Murali Krishna 1, Naresh Jayam 1, Pallav K Baruah 1, Raghunath Sharma 1, Ashok Srinivasan 2, Shakti
More informationAdvanced Data Placement via Ad-hoc File Systems at Extreme Scales (ADA-FS)
Advanced Data Placement via Ad-hoc File Systems at Extreme Scales (ADA-FS) Understanding I/O Performance Behavior (UIOP) 2017 Sebastian Oeste, Mehmet Soysal, Marc-André Vef, Michael Kluge, Wolfgang E.
More informationFakultät Informatik, Institut für Technische Informatik, Professur Rechnerarchitektur. BenchIT. Project Overview
Fakultät Informatik, Institut für Technische Informatik, Professur Rechnerarchitektur BenchIT Project Overview Nöthnitzer Straße 46 Raum INF 1041 Tel. +49 351-463 - 38458 (stefan.pflueger@tu-dresden.de)
More informationEfficiency Evaluation of the Input/Output System on Computer Clusters
Efficiency Evaluation of the Input/Output System on Computer Clusters Sandra Méndez, Dolores Rexachs and Emilio Luque Computer Architecture and Operating System Department (CAOS) Universitat Autònoma de
More informationOnline Remote Trace Analysis of Parallel Applications on High-Performance Clusters
Online Remote Trace Analysis of Parallel Applications on High-Performance Clusters Holger Brunst, Allen D. Malony, Sameer S. Shende, and Robert Bell Department for Computer and Information Science University
More informationHigh Performance MPI-2 One-Sided Communication over InfiniBand
High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University
More information