Characterizing the I/O Behavior of Scientific Applications on the Cray XT
Philip C. Roth
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Oak Ridge, TN

ABSTRACT

Scientific applications use input/output (I/O) for obtaining initial conditions and execution parameters, as a persistent way of saving program output, and for safeguarding against system unreliability. Although system sizes are expected to continue increasing, I/O performance is not expected to keep pace with system computation and communication performance. Understanding application I/O demands and system I/O capabilities is the first step toward bridging the gap between them. In this paper, we present our approach for characterizing the I/O demands of applications on the Cray XT. We also present preliminary case studies showing the use of our I/O characterization infrastructure with climate studies and combustion simulation programs.

Categories and Subject Descriptors: C.4 [Performance of Systems]: performance attributes, measurement techniques.

General Terms: Performance, Measurement.

Keywords: Performance data collection, instrumentation, Cray XT.

This research is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. The work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. Supercomputing'07, Nov. 2007, Reno, NV. Copyright 2007 ACM ISBN /07/11...$5.00.

1. INTRODUCTION

Scientific applications from areas like climate studies, fusion, and molecular dynamics use input/output (I/O) for several purposes, such as to obtain initial conditions and execution parameters, as a persistent way of saving program output, and to safeguard against system unreliability. This last purpose is becoming increasingly important: the desire to reach ever-increasing computational targets with high-performance computing (HPC) systems has produced a trend toward systems with an increasing number of components, and system reliability is expected to decrease as the number of components increases. Current HPC systems have several tens to hundreds of thousands of processor cores, but researchers are already bracing themselves for systems with millions of cores. Even with expected technological advances, this trend is expected to continue many years into the future. Because I/O in the form of checkpointing is the technique most often used to guard against system failures, and because I/O technology is not expected to keep pace with processor technology, I/O is an area of increasing concern for both producers and consumers of HPC systems.

Understanding application I/O demands and system I/O capabilities is the first step toward bridging the gap between the two. In response to the need for tools that provide insight into this gap, performance analysis tools like Paradyn [10] and the Tuning and Analysis Utilities (TAU) [15] support measurement and problem diagnosis of I/O performance. To support our investigation into the I/O behavior of scientific applications on leadership-class systems, we have designed an I/O event tracing system and produced a prototype implementation for applications running on the Cray XT, the primary platform of the U.S. Department of Energy (DOE) Office of Science's Leadership Computing Facility at Oak Ridge National Laboratory (ORNL).

There are two primary contributions of this paper. First, we present our preliminary I/O characterization approach and its prototype implementation.
Second, we present preliminary case studies showing the use of our I/O characterization approach with the Parallel Ocean Program (POP) [7] and the S3D combustion simulation program [4].

2. THE CRAY XT

The Cray XT is a parallel computing platform that features massive parallelism and high performance [1]. An XT system consists of processing elements (PEs) connected in a three-dimensional mesh or torus topology. Each PE contains an AMD Opteron processor, memory, and a Cray proprietary router Application-Specific Integrated Circuit (ASIC) called SeaStar (see Figure 1). Single- and multi-core processors are supported. The initial Cray XT systems (the XT3 and XT4) use only Opteron-based PEs, but the Cray XT5 also supports heterogeneous systems containing vector processors and Field Programmable Gate Arrays [5].
Cray XT PEs are partitioned into compute PEs and service PEs. Compute PEs run application processes, and use either a lightweight operating system kernel called Catamount [8] or Cray's Compute Node Linux (CNL). Service PEs provide login and I/O services with a traditional Linux installation.

For this work, we used the Cray XT system from the DOE Leadership Computing Facility at ORNL. At the time of our experimentation, this system contained a combination of XT3 and XT4 cabinets and used the Catamount kernel on its compute PEs and Lustre as its parallel file system; the system has since been converted from Catamount to CNL on its compute nodes.

3. THE IOT EVENT TRACING INFRASTRUCTURE

Event tracing is a well-established technique for performance data collection, and tools like TAU [15], Paraver [13], SCALASCA [2], and SvPablo [6] have long supported collection and analysis of program event data, often including I/O events. For our application I/O characterization activity, we adopt a traditional event-based performance data collection approach that produces event trace files for ease of repeated post-mortem analysis and sharing with other researchers. We developed a prototype implementation of our event tracing infrastructure for MPI applications on the Cray XT; we call this prototype IOT. We intend IOT to be the first component of a more comprehensive performance data collection and analysis infrastructure for programs running on DOE leadership-class computing platforms.

Our data collection approach uses two components. The first component is a collection of functions that replace I/O and other interesting function calls (e.g., open() and write()) with an instrumented wrapper function.
Each wrapper generates an event trace record for function entry, calls the real function that implements the desired functionality, generates an event trace record that captures the relevant details of the I/O operation, and then generates an event trace record for the function exit. At a minimum, each event trace record includes a timestamp of when the operation occurred and the type of operation. Figure 2 shows where instrumented functions are interposed between an application process and the default runtime software stack.

Figure 1: Cray XT4 Processing Elements (Image courtesy Cray, Inc.)

The second component is an event tracing support library that implements the needed functionality to produce event trace files in the Open Trace Format [12], a file format for expressing event traces that is supported by performance tools such as TAU, SCALASCA [2], and Vampir [11].

On many traditional UNIX and UNIX-like systems (e.g., Linux), we could use shared libraries to interpose our instrumented file I/O wrapper functions into the control flow between an application function that calls a file I/O function and the system's implementation of that I/O function. In fact, Cray's CNL supports shared libraries in order to ease the use of scripting languages such as Python in scientific applications. However, the Cray XT running Catamount does not support shared libraries, and the default linking mode for XT systems running CNL is to produce statically linked executable files. Thus, we chose to use link-time function wrapping, facilitated by the GNU linker's strong support for function wrapping. Using the --wrap command-line switch, this linker causes application calls to a function like read() to become calls to a function named __wrap_read() instead, and exposes the original function under the name __real_read(). Our instrumented version of __wrap_read() uses the symbol __real_read() to access the system's implementation of the read() function.
In addition to collecting event data for system file I/O functions, our event tracing software can collect event data for MPI [9] functions, including MPI-IO functions. Because a compliant MPI implementation includes support for the PMPI profiling interface, we use the PMPI interface rather than the linker's function wrapping facility to interpose our instrumented MPI functions. To generate these wrappers, we use an automated wrapper generator script based on the generator used by the mpiP [16] lightweight MPI profiling tool.
Figure 2: IOT interposition of instrumented functions between an application process and the runtime software stack

To simplify the use of our event trace capture software, we use custom versions of the Cray Fortran, C, and C++ compiler scripts that automatically include the correct linker switches and libraries to interpose our infrastructure libraries. For basic event tracing scenarios, users need not modify their application source code to use our infrastructure; instead, they modify their makefiles to use the command iot_ftn instead of ftn to link their Fortran program.

Performance data collection using event tracing has the potential to generate massive volumes of performance data, e.g., if the events being traced occur frequently, are traced in a large number of processes, or generate a large amount of performance data each time they occur. Several techniques exist to manage the performance data volume produced by detailed event tracing. Dynamic control of the event tracing infrastructure can be used to enable and disable event tracing while a program runs. Such control may be explicitly specified using API functions provided by the tracing infrastructure, or implicitly enabled by the tracing infrastructure in response to excessive performance data volume. Sophisticated performance data collection tools like Paraver use pattern recognition to identify repetitive behavior and only keep event data for a limited, representative sequence of program events. Our IOT event tracing infrastructure currently uses simple event tracing control via explicit API functions, coupled with OTF's compressed short format, to manage performance data volume.

4. CASE STUDIES

As preliminary test cases for the prototype implementation of our I/O characterization approach, we have studied the I/O behavior of two scientific applications on the Cray XT at ORNL. At the time of our experimentation, the ORNL Cray XT used the Catamount lightweight kernel on its compute nodes.
4.1 STATISTICS: THE PARALLEL OCEAN PROGRAM

The Parallel Ocean Program (POP) [7] is an ocean simulation program produced by researchers at Los Alamos National Laboratory. It serves as the ocean model in the Community Climate System Model (CCSM) [3]. It is implemented in Fortran 90 and uses MPI message passing for communicating data between parallel processes. It can use either netCDF or Fortran I/O functions for its output. The program performs I/O for reasons common to many scientific applications:

- to obtain simulation control parameters and initial conditions, such as topography grid data and forcing data used when POP is run in standalone mode (i.e., outside of the full CCSM);
- to save time-varying results such as movie frames and calculation history; and
- to save periodic checkpoint files.

In our experimentation, we used POP and the X1 benchmark problem with a grid spacing of one degree. To limit the time required for our program runs, we limited each run to forty simulation timesteps; production runs involve many more timesteps. We also configured the program to output checkpoints every ten timesteps, movie files every five timesteps, and no calculation history files.

POP implements its own collective I/O instead of using existing parallel I/O software like MPI-IO or parallel netCDF. Because performing I/O from too many processes can overwhelm the I/O capabilities of many systems, POP can be configured to limit the number of writer processes. For this study, we configured POP to use four output tasks. After we completed our study, climate community experts notified us that the parallel I/O feature of the POP version we used was suspected to be defective and that only one output task was usually used for production runs.

For our experiments, POP's I/O data volume was modest. The input activity consisted of reading approximately 7MB of horizontal and vertical grid data and approximately 490KB of topography data during program initialization. Output activity involved writing checkpoint files consisting of a 10KB text metadata file and a 346MB binary data file, and writing 3.9MB movie files using netCDF.
Table 1: POP OTF trace file characteristics (all values in bytes)

POP output activity varied between the MPI rank 0 process and the other writer processes. The rank 0 process alone writes the checkpoint metadata file; our IOT traces showed this process used seven write function calls to write this 10KB file. The rank 0 process also writes the entire netCDF movie file itself; the IOT event traces showed two write function calls each time a movie file was saved. All writer processes, regardless of rank, made write function calls to contribute to the checkpoint file. Each writer performed eighty write operations, each of approximately 980KB.

Our analysis of the IOT event trace files suggests several potential optimizations to improve POP I/O performance. First, the program could be modified to use a parallel I/O library for writing movie files to avoid serializing this activity. Second, the program's checkpoint output activity could be adapted to use a parallel I/O library instead of its own collective communication and Fortran I/O operations. We stress, however, that any such changes must take into account any differences between the layout of data in memory and the desired layout on disk, which is intended to support post-mortem analysis of the program's results.

Table 1 describes the OTF event trace files produced when collecting event trace files describing the I/O activity of our four POP writer processes. The table shows the total performance data volume and the per-timestep data volumes for both the MPI rank 0 and non-rank-0 processes. These data volumes reflect the data volume of all OTF local (per-process) metadata files but not the global metadata file. The event trace files include events for Fortran I/O operations but not the MPI communication performed by POP to gather program data to the writer processes.
I/O operation event data in the generated trace file includes the operation timestamp, duration, and number of bytes read or written, but not the number of bytes requested to be read or written. As expected, the OTF compressed short format is the most desirable output format. That this output format produced only 67 bytes per timestep is encouraging.

4.2 VISUALIZATION: S3D

For another preliminary I/O characterization case study, we used the combustion simulation program S3D [4]. Because our initial goal was to test the functionality of our approach, we applied our software to a small test case with eight application processes running for fifty simulation timesteps. Production S3D runs use thousands of processes and run for hundreds or thousands of timesteps.

In contrast to the POP case study, where we analyzed IOT event trace files to obtain I/O operation statistics, for our S3D study we focused on event data visualization. A portion of the event trace corresponding to the writing of one checkpoint is shown in the Vampir timeline display in Figure 3. Although MPI events are shown in the timeline display, the lines indicating communication between processes are omitted for clarity.

Although our analysis of S3D's I/O behavior is in its early stages, the Vampir timeline display reveals the general checkpoint I/O strategy used by the version of S3D we obtained. The MPI rank 0 process opens and reads data from a control file, then broadcasts a message describing the parameters to use for the checkpoint operation. The other MPI tasks, waiting for the broadcast, open a file for their checkpoint data (an event not clearly visible in the timeline visualization due to the display's zoom factor). Once each process opens its checkpoint file, it writes its checkpoint data in several small write operations, at least as far as the system is concerned.
The checkpoint finishes with a barrier operation, but a slow writer process delays most processes before they can proceed with the computation. For our S3D test problem, these checkpoint files are each relatively small: only 16MB. However, because each process writes its own checkpoint file, the timeline visualization hints that runs of the version of S3D we used would present the metadata server with many nearly-simultaneous file create operations. This activity could be overwhelming for production runs with tens of thousands of processes. Spreading these file open operations across a longer time interval, and performing fewer large writes rather than many small writes, are two potential strategies for improving the I/O performance of the S3D version we used, based on this preliminary event trace analysis. Although our analysis of S3D's I/O behavior has just begun, an early visualization of detailed event trace data has already enhanced our understanding of the S3D I/O strategy and suggested potential approaches for improving the I/O performance of this application.

5. SUMMARY

Understanding application I/O behavior is critical to overcoming gaps between the I/O demands of an application and the I/O capabilities of a system. We are developing an event tracing infrastructure for characterizing the I/O behavior of applications running on the Cray XT, a primary computing platform in the DOE Office of Science leadership computing efforts. We have begun to apply our prototype implementation to characterize the I/O behavior of two scientific applications of interest to the Office of Science, obtaining insight into possible optimizations for improving their I/O performance.
Figure 3: Vampir event trace timeline visualization showing one S3D checkpoint operation

In the future, we plan to use our I/O characterization software as the foundation for a suite of simple tools for I/O performance analysis, automated performance problem diagnosis, and automated performance tuning of application I/O behavior. We also plan to improve the scalability of our I/O performance data collection and analysis functionality using our MRNet [14] scalable tool infrastructure. Finally, we plan to continue our characterization work with applications beyond POP and S3D.

6. ACKNOWLEDGMENTS

Our thanks to Jeffrey S. Vetter, Weikuan Yu, and the other members of the ORNL Future Technologies Group for their constructive criticism of our work. Thanks also to Pat Worley for providing a mechanism for access to ORNL Leadership Computing Facility systems via the Performance Evaluation and Analysis Consortium End Station. This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR.

REFERENCES

[1] S.R. Alam, R.F. Barrett et al., "An Evaluation of the ORNL Cray XT3," International Journal of High Performance Computing Applications, 21(4), 2007 (to appear).
[2] D. Becker, F. Wolf et al., "Automated Trace-Based Performance Analysis of Metacomputing Applications," Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[3] M.B. Blackmon, B. Boville et al., "The Community Climate System Model," BAMS, 82(11).
[4] J.H. Chen and H.G. Im, "Stretch Effects on the Burning Velocity of Turbulent Premixed Hydrogen-Air Flames," Proc. Comb. Inst., 2000.
[5] Cray Inc., "Cray XT5 Family of Supercomputers."
[6] L. DeRose and D.A. Reed, "SvPablo: A Multi-Language Architecture-Independent Performance Analysis System," Proc. International Conference on Parallel Processing (ICPP'99), 1999.
[7] P.W. Jones, P.H. Worley et al., "Practical Performance Portability in the Parallel Ocean Program (POP)," Concurrency and Computation: Practice and Experience, 17(10).
[8] S.M. Kelly and R. Brightwell, "Software Architecture of the Light Weight Kernel, Catamount," Cray User Group Technical Conference, Albuquerque, NM, 2005.
[9] Message Passing Interface Forum, "MPI-2: A Message Passing Interface Standard," International Journal of Supercomputer Applications and High Performance Computing, 12(1-2):1-299, 1998.
[10] B.P. Miller, M.D. Callaghan et al., "The Paradyn Parallel Performance Measurement Tools," IEEE Computer, 28(11):37-46.
[11] W.E. Nagel, A. Arnold et al., "VAMPIR: Visualization and Analysis of MPI Resources," Supercomputer 63, 12(1):69-80.
[12] ParaTools, Inc., "Open Trace Format."
[13] V. Pillet, J. Labarta et al., "PARAVER: A Tool to Visualize and Analyze Parallel Code," Proc. WoTUG-18: Transputer and Occam Developments, 1995.
[14] P.C. Roth, D.C. Arnold, and B.P. Miller, "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools," Proc. SC2003.
[15] S. Shende and A.D. Malony, "The TAU Parallel Performance System," International Journal of High Performance Computing Applications, 20(2).
[16] J.S. Vetter and M.O. McCracken, "Statistical Scalability Analysis of Communication Operations in Distributed Applications," Principles and Practice of Parallel Programming, 36(7):123-132, 2001.
More informationPerformance of a Direct Numerical Simulation Solver forf Combustion on the Cray XT3/4
Performance of a Direct Numerical Simulation Solver forf Combustion on the Cray XT3/4 Ramanan Sankaran and Mark R. Fahey National Center for Computational Sciences Oak Ridge National Laboratory Jacqueline
More informationThe Automatic Library Tracking Database
The Automatic Library Tracking Database Mark Fahey, Nick Jones, and Bilel Hadri National Institute for Computational Sciences ABSTRACT: A library tracking database has been developed and put into production
More informationSteven Carter. Network Lead, NCCS Oak Ridge National Laboratory OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY 1
Networking the National Leadership Computing Facility Steven Carter Network Lead, NCCS Oak Ridge National Laboratory scarter@ornl.gov 1 Outline Introduction NCCS Network Infrastructure Cray Architecture
More informationTAUg: Runtime Global Performance Data Access Using MPI
TAUg: Runtime Global Performance Data Access Using MPI Kevin A. Huck, Allen D. Malony, Sameer Shende, and Alan Morris Performance Research Laboratory Department of Computer and Information Science University
More informationScalable Critical Path Analysis for Hybrid MPI-CUDA Applications
Center for Information Services and High Performance Computing (ZIH) Scalable Critical Path Analysis for Hybrid MPI-CUDA Applications The Fourth International Workshop on Accelerators and Hybrid Exascale
More informationThe Cray Rainier System: Integrated Scalar/Vector Computing
THE SUPERCOMPUTER COMPANY The Cray Rainier System: Integrated Scalar/Vector Computing Per Nyberg 11 th ECMWF Workshop on HPC in Meteorology Topics Current Product Overview Cray Technology Strengths Rainier
More informationCHARACTERIZING HPC I/O: FROM APPLICATIONS TO SYSTEMS
erhtjhtyhy CHARACTERIZING HPC I/O: FROM APPLICATIONS TO SYSTEMS PHIL CARNS carns@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory April 20, 2017 TU Dresden MOTIVATION FOR
More informationParallel I/O Libraries and Techniques
Parallel I/O Libraries and Techniques Mark Howison User Services & Support I/O for scientifc data I/O is commonly used by scientific applications to: Store numerical output from simulations Load initial
More informationScalability Improvements in the TAU Performance System for Extreme Scale
Scalability Improvements in the TAU Performance System for Extreme Scale Sameer Shende Director, Performance Research Laboratory, University of Oregon TGCC, CEA / DAM Île de France Bruyères- le- Châtel,
More informationIsolating Runtime Faults with Callstack Debugging using TAU
Isolating Runtime Faults with Callstack Debugging using TAU Sameer Shende, Allen D. Malony, John C. Linford ParaTools, Inc. Eugene, OR {Sameer, malony, jlinford}@paratools.com Andrew Wissink U.S. Army
More informationIS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers
IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers Abhinav S Bhatele and Laxmikant V Kale ACM Research Competition, SC 08 Outline Why should we consider topology
More informationThe Role of InfiniBand Technologies in High Performance Computing. 1 Managed by UT-Battelle for the Department of Energy
The Role of InfiniBand Technologies in High Performance Computing 1 Managed by UT-Battelle Contributors Gil Bloch Noam Bloch Hillel Chapman Manjunath Gorentla- Venkata Richard Graham Michael Kagan Vasily
More informationJULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING
JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338
More informationImpact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation
Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation S.R. Alam, R.F. Barrett, H. Jagode, J.A. Kuehn, S.W. Poole, and R. Sankaran Oak Ridge National Laboratory, Oak Ridge, TN
More informationPerformance of Variant Memory Configurations for Cray XT Systems
Performance of Variant Memory Configurations for Cray XT Systems Wayne Joubert, Oak Ridge National Laboratory ABSTRACT: In late 29 NICS will upgrade its 832 socket Cray XT from Barcelona (4 cores/socket)
More informationEfficiency Evaluation of the Input/Output System on Computer Clusters
Efficiency Evaluation of the Input/Output System on Computer Clusters Sandra Méndez, Dolores Rexachs and Emilio Luque Computer Architecture and Operating System Department (CAOS) Universitat Autònoma de
More informationA Trace-Scaling Agent for Parallel Application Tracing 1
A Trace-Scaling Agent for Parallel Application Tracing 1 Felix Freitag, Jordi Caubet, Jesus Labarta Computer Architecture Department (DAC) European Center for Parallelism of Barcelona (CEPBA) Universitat
More informationThe Constellation Project. Andrew W. Nash 14 November 2016
The Constellation Project Andrew W. Nash 14 November 2016 The Constellation Project: Representing a High Performance File System as a Graph for Analysis The Titan supercomputer utilizes high performance
More informationExtreme I/O Scaling with HDF5
Extreme I/O Scaling with HDF5 Quincey Koziol Director of Core Software Development and HPC The HDF Group koziol@hdfgroup.org July 15, 2012 XSEDE 12 - Extreme Scaling Workshop 1 Outline Brief overview of
More informationMADNESS. Rick Archibald. Computer Science and Mathematics Division ORNL
MADNESS Rick Archibald Computer Science and Mathematics Division ORNL CScADS workshop: Leadership-class Machines, Petascale Applications and Performance Strategies July 27-30 th Managed by UT-Battelle
More informationIME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning
IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application
More informationProfiling Non-numeric OpenSHMEM Applications with the TAU Performance System
Profiling Non-numeric OpenSHMEM Applications with the TAU Performance System John Linford 2,TylerA.Simon 1,2, Sameer Shende 2,3,andAllenD.Malony 2,3 1 University of Maryland Baltimore County 2 ParaTools
More informationMetropolitan Road Traffic Simulation on FPGAs
Metropolitan Road Traffic Simulation on FPGAs Justin L. Tripp, Henning S. Mortveit, Anders Å. Hansson, Maya Gokhale Los Alamos National Laboratory Los Alamos, NM 85745 Overview Background Goals Using the
More informationToward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies
Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies François Tessier, Venkatram Vishwanath, Paul Gressier Argonne National Laboratory, USA Wednesday
More informationMPI Performance Engineering through the Integration of MVAPICH and TAU
MPI Performance Engineering through the Integration of MVAPICH and TAU Allen D. Malony Department of Computer and Information Science University of Oregon Acknowledgement Research work presented in this
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationxsim The Extreme-Scale Simulator
www.bsc.es xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa Seminar @ BSC, 28 Feb 2014 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of
More informationCray RS Programming Environment
Cray RS Programming Environment Gail Alverson Cray Inc. Cray Proprietary Red Storm Red Storm is a supercomputer system leveraging over 10,000 AMD Opteron processors connected by an innovative high speed,
More informationToward Improved Support for Loosely Coupled Large Scale Simulation Workflows. Swen Boehm Wael Elwasif Thomas Naughton, Geoffroy R.
Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows Swen Boehm Wael Elwasif Thomas Naughton, Geoffroy R. Vallee Motivation & Challenges Bigger machines (e.g., TITAN, upcoming Exascale
More informationThe Cray Programming Environment. An Introduction
The Cray Programming Environment An Introduction Vision Cray systems are designed to be High Productivity as well as High Performance Computers The Cray Programming Environment (PE) provides a simple consistent
More informationEarly Evaluation of the Cray XT5
CUG 2009 Proceedings Page of 2 Early Evaluation of the Cray XT5 P. H. Worley, R. F. Barrett, J. A. Kuehn Abstract A Cray XT5 system has recently been installed at Oak Ridge National Laboratory (ORNL).
More informationScalable Compression and Replay of Communication Traces in Massively Parallel Environments
Scalable Compression and Replay of Communication Traces in Massively Parallel Environments Michael Noeth 1, Frank Mueller 1, Martin Schulz 2, Bronis R. de Supinski 2 1 North Carolina State University 2
More informationEnabling high-speed asynchronous data extraction and transfer using DART
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (21) Published online in Wiley InterScience (www.interscience.wiley.com)..1567 Enabling high-speed asynchronous
More informationAn Introduction to OpenACC
An Introduction to OpenACC Alistair Hart Cray Exascale Research Initiative Europe 3 Timetable Day 1: Wednesday 29th August 2012 13:00 Welcome and overview 13:15 Session 1: An Introduction to OpenACC 13:15
More informationUsing R for HPC Data Science. Session: Parallel Programming Paradigms. George Ostrouchov
Using R for HPC Data Science Session: Parallel Programming Paradigms George Ostrouchov Oak Ridge National Laboratory and University of Tennessee and pbdr Core Team Course at IT4Innovations, Ostrava, October
More informationThe Fusion Distributed File System
Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique
More informationScreen Saver Science: Realizing Distributed Parallel Computing with Jini and JavaSpaces
Screen Saver Science: Realizing Distributed Parallel Computing with Jini and JavaSpaces William L. George and Jacob Scott National Institute of Standards and Technology Information Technology Laboratory
More informationChristopher Sewell Katrin Heitmann Li-ta Lo Salman Habib James Ahrens
LA-UR- 14-25437 Approved for public release; distribution is unlimited. Title: Portable Parallel Halo and Center Finders for HACC Author(s): Christopher Sewell Katrin Heitmann Li-ta Lo Salman Habib James
More informationUpdate on Cray Activities in the Earth Sciences
Update on Cray Activities in the Earth Sciences Presented to the 13 th ECMWF Workshop on the Use of HPC in Meteorology 3-7 November 2008 Per Nyberg nyberg@cray.com Director, Marketing and Business Development
More informationGetting Insider Information via the New MPI Tools Information Interface
Getting Insider Information via the New MPI Tools Information Interface EuroMPI 2016 September 26, 2016 Kathryn Mohror This work was performed under the auspices of the U.S. Department of Energy by Lawrence
More informationCommunication Patterns
Communication Patterns Rolf Riesen Sandia National Laboratories P.O. Box 5 Albuquerque, NM 715-111 rolf@cs.sandia.gov Abstract Parallel applications have message-passing patterns that are important to
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationScore-P A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir
Score-P A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir VI-HPS Team Score-P: Specialized Measurements and Analyses Mastering build systems Hooking up the
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More information[Scalasca] Tool Integrations
Mitglied der Helmholtz-Gemeinschaft [Scalasca] Tool Integrations Aug 2011 Bernd Mohr CScADS Performance Tools Workshop Lake Tahoe Contents Current integration of various direct measurement tools Paraver
More informationDynamic Load Balancing for Weather Models via AMPI
Dynamic Load Balancing for Eduardo R. Rodrigues IBM Research Brazil edrodri@br.ibm.com Celso L. Mendes University of Illinois USA cmendes@ncsa.illinois.edu Laxmikant Kale University of Illinois USA kale@cs.illinois.edu
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationSupercomputing and Mass Market Desktops
Supercomputing and Mass Market Desktops John Manferdelli Microsoft Corporation This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
More informationIntroduction to Cluster Computing
Introduction to Cluster Computing Prabhaker Mateti Wright State University Dayton, Ohio, USA Overview High performance computing High throughput computing NOW, HPC, and HTC Parallel algorithms Software
More informationChapter 14 Performance and Processor Design
Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures
More informationPerformance Analysis of Parallel Scientific Applications In Eclipse
Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains
More informationScibox: Online Sharing of Scientific Data via the Cloud
Scibox: Online Sharing of Scientific Data via the Cloud Jian Huang, Xuechen Zhang, Greg Eisenhauer, Karsten Schwan, Matthew Wolf *, Stephane Ethier, Scott Klasky * Georgia Institute of Technology, Princeton
More informationCommission of the European Communities **************** ESPRIT III PROJECT NB 6756 **************** CAMAS
Commission of the European Communities **************** ESPRIT III PROJECT NB 6756 **************** CAMAS COMPUTER AIDED MIGRATION OF APPLICATIONS SYSTEM **************** CAMAS-TR-2.3.4 Finalization Report
More informationScore-P. SC 14: Hands-on Practical Hybrid Parallel Application Performance Engineering 1
Score-P SC 14: Hands-on Practical Hybrid Parallel Application Performance Engineering 1 Score-P Functionality Score-P is a joint instrumentation and measurement system for a number of PA tools. Provide
More informationExploiting Lustre File Joining for Effective Collective IO
Exploiting Lustre File Joining for Effective Collective IO Weikuan Yu, Jeffrey Vetter Oak Ridge National Laboratory Computer Science & Mathematics Oak Ridge, TN, USA 37831 {wyu,vetter}@ornl.gov R. Shane
More informationParallel Execution of Functional Mock-up Units in Buildings Modeling
ORNL/TM-2016/173 Parallel Execution of Functional Mock-up Units in Buildings Modeling Ozgur Ozmen James J. Nutaro Joshua R. New Approved for public release. Distribution is unlimited. June 30, 2016 DOCUMENT
More informationpnfs and Linux: Working Towards a Heterogeneous Future
CITI Technical Report 06-06 pnfs and Linux: Working Towards a Heterogeneous Future Dean Hildebrand dhildebz@umich.edu Peter Honeyman honey@umich.edu ABSTRACT Anticipating terascale and petascale HPC demands,
More informationCommunication Characteristics in the NAS Parallel Benchmarks
Communication Characteristics in the NAS Parallel Benchmarks Ahmad Faraj Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 32306 {faraj, xyuan}@cs.fsu.edu Abstract In this
More information