Performance engineering of GemsFDTD computational electromagnetics solver
Ulf Andersson (1) and Brian J. N. Wylie (2)

(1) PDC Centre for High Performance Computing, KTH Royal Institute of Technology, Stockholm, Sweden, ulfa@pdc.kth.se
(2) Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany, b.wylie@fz-juelich.de

Abstract. Since modern high-performance computer systems consist of many hardware components and software layers, they present severe challenges for application developers, who are primarily domain scientists rather than experts in continually evolving hardware and system software. Effective tools for performance analysis are therefore decisive when developing performant, scalable parallel applications. Such tools must be convenient to employ in the application development process, and their analyses must be both clear to interpret and comprehensive in the level of detail provided. We describe how the Scalasca toolset was applied in engineering the GemsFDTD computational electromagnetics solver, and the dramatic performance and scalability gains thereby achieved.

Keywords: performance engineering, parallel execution tuning, scalability, MPI, computational electromagnetics

1 Introduction

Parallel applications develop from initial barely-functional implementations, via a series of debugging and performance improvement stages, into efficient and scalable versions through disciplined performance engineering practices. This process remains ongoing throughout the productive lifetime of the application, as larger and more complex computational problems are addressed and as upgraded hardware and system software become available. Although codes which become accepted benchmarks are more rigorously investigated than typical applications, they are not immune from the need for continual performance evaluation and re-engineering.
This paper considers the GemsFDTD code from the SPEC MPI2007 benchmark suite [6], which was found to perform particularly poorly at larger scales on distributed-memory computer systems. Initial analysis with the Scalasca toolset, and subsequently with other tools, pinpointed aspects of the application's initialization phase that severely limited scalability. Scalasca is an open-source toolset for analysing the execution behaviour of applications based on the MPI and OpenMP parallel programming interfaces, supporting a wide range of current HPC platforms [2,3]. It combines compact runtime summaries, which are particularly suited to obtaining an overview of execution performance, with in-depth analyses of concurrency inefficiencies via event tracing and parallel replay. With its highly scalable design, Scalasca has facilitated performance analysis and tuning of applications consisting of unprecedented numbers of processes [10,11].

(To appear in Proc. PARA 2010 (Reykjavík, Iceland), Part I, LNCS 7133, Springer.)

Based on these performance analyses, the developers of the GemsFDTD code could rework the initialization and further optimize the time-stepping loop, realizing substantial improvements in application execution performance and overall scalability, ultimately leading to an entirely updated version of the benchmark. We review both the initial and revised versions of the code, together with their performance analysis with the Scalasca toolset, which revealed both now-resolved and still-remaining performance optimization opportunities.

2 GemsFDTD code versions

2.1 GemsFDTD SPEC MPI2007 v1.1

The SPEC MPI2007 code 113.GemsFDTD solves the Maxwell equations using the finite-difference time-domain (FDTD) method [8]. The radar cross-section of a perfectly conducting object is computed. 113.GemsFDTD is written in Fortran 90 and is a parallel version of the SPEC CFP2006 (the floating-point component of SPEC CPU2006) code 459.GemsFDTD. 113.GemsFDTD is a subset of a general-purpose time-domain code for the Maxwell equations developed within the General ElectroMagnetic Solvers (GEMS) project at PSCI [5].

The core of the FDTD method is second-order accurate central-difference approximations of Faraday's and Ampère's laws. These central differences are employed on a staggered Cartesian grid, resulting in an explicit finite-difference method. These updates are performed in the module material_class. The FDTD method is also referred to as the Yee scheme; it is the standard time-domain method within computational electromagnetics [8].
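As a concrete illustration of the scheme just described, the following deliberately minimal one-dimensional sketch advances staggered E and H fields in the leapfrog fashion of the Yee scheme. GemsFDTD itself is a 3D Fortran 90 code; the grid size, source shape, and normalization below are illustrative assumptions, not taken from the benchmark.

```python
import math

def fdtd_1d(nx=200, nt=150, courant=0.5):
    """Leapfrog Yee update in 1D: E lives on integer grid points, H on
    half-integer points, and each field is advanced from the central
    difference (discrete curl) of the other. Units are normalized so the
    update coefficient is just the Courant number S = c*dt/dx (stable in
    1D for S <= 1)."""
    ez = [0.0] * nx          # electric field E_z
    hy = [0.0] * (nx - 1)    # magnetic field H_y, staggered half a cell
    for n in range(nt):
        # Ampere's law: dE/dt ~ curl H (central difference of H)
        for i in range(1, nx - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # soft Gaussian source standing in for the Huygens-surface excitation
        ez[nx // 4] += math.exp(-((n - 30) / 10.0) ** 2)
        # Faraday's law: dH/dt ~ -curl E (central difference of E)
        for i in range(nx - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
    return ez, hy
```

The 3D case updates six field components in exactly this interleaved pattern, which is why the per-timestep cost concentrates in a handful of update routines.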
An incident plane wave is generated using so-called Huygens surfaces. This means that the computational domain is split into a total-field part and a scattered-field part, where the scattered-field part surrounds the total-field part. The excite_mod module computes the shape of the incident fields. The computational domain is truncated by an absorbing layer in order to minimize artificial reflections at the boundary; the uni-axial perfectly matched layer (UPML) of Gedney [1] is used. A time-domain near-to-far-field transformation computes the radar cross-section according to Martin and Pettersson [4], handled by the module NFT_class. The execution time during time-stepping is concentrated in five subroutines: two update routines, two UPML routines, and the routine NFT_store. The problem size in 113.GemsFDTD is 580×580×580 FDTD cells (Nx = Ny = Nz = 580) surrounded by a twelve-cell UPML layer.
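A back-of-envelope sketch of what such a problem size implies may be helpful. The six-arrays-per-cell assumption below is a deliberate simplification for illustration; the real code also stores UPML auxiliary variables and near-to-far-field buffers, so this underestimates the true footprint.

```python
def fdtd_footprint(n, npml=12, fields=6, bytes_per_value=8):
    """Estimate cell count and memory for an n^3 FDTD grid surrounded by a
    UPML layer npml cells thick, assuming 'fields' double-precision arrays
    per cell (an illustrative simplification)."""
    edge = n + 2 * npml               # edge length including absorbing layer
    cells = edge ** 3
    gib = cells * fields * bytes_per_value / 2**30
    return cells, gib

cells, gib = fdtd_footprint(580)      # 113.GemsFDTD problem size
```

Applied to the 960-cell problem discussed below, the same estimate stays well under the 64 GiB memory-footprint target quoted for 145.lGemsFDTD, leaving headroom for the auxiliary storage this sketch ignores.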
2.2 lGemsFDTD SPEC MPI2007 v2.0

113.GemsFDTD was designed to be scalable up to 256 processes, according to the aim of SPEC MPI. Version 2.0 of SPEC MPI demanded that codes be scalable up to 2048 processes. 113.GemsFDTD failed miserably at this [7], so an extensive rewrite was needed. The end result of this rewrite was 145.lGemsFDTD ("l" signifies large), which was accepted into version 2.0 of SPEC MPI. (The original 113.GemsFDTD is retained in the SPEC MPI2007 v2.0 medium-sized benchmark suite for compatibility with earlier releases.) The problem size in 145.lGemsFDTD is 960×960×960 FDTD cells (Nx = Ny = Nz = 960) surrounded by a twelve-cell UPML layer. This size was selected to meet the SPEC request that the memory footprint be slightly less than 64 GiB.

Initial performance analysis with Scalasca showed clearly that 113.GemsFDTD performed a large number of 4-byte broadcasts during initialization. Time measurements inside the code itself showed that the multiblock_partition routine on the master rank was a serial bottleneck. In fact, its execution time increased with the number of blocks, which grows linearly with the number of MPI processes. This performance analysis made it clear where the hotspots in the code were and was very useful to the programmer.

2.3 Domain decomposition of GemsFDTD

The original Gems code (MBfrida) lets the user select the number of processes (p) and the number of FDTD blocks in each of the three dimensions (Nbx, Nby, Nbz), where the total number of FDTD blocks is NbF = Nbx·Nby·Nbz. If the user selects Nbx = Nby = Nbz = 0, then the code sets NbF = p and uses MPI_Dims_create to set (Nbx, Nby, Nbz). The size of each FDTD block is (Nx/Nbx, Ny/Nby, Nz/Nbz) (or slightly smaller). If a layer of UPML is added, the total number of blocks becomes NbT = (Nbx+2)·(Nby+2)·(Nbz+2). In 113.GemsFDTD it was chosen to always use MPI_Dims_create, but setting NbF to p-2 or p-3 instead of p, so that NbF is an even number.
This was done in order to get at least two processes that have only UPML blocks. NbF is then sometimes further decreased in order to avoid elongated FDTD blocks, which are undesirable since they lead to more UPML blocks and more data to communicate between the blocks. A precalculated table decides which NbF values are accepted. In the case of p = 256, 254 and 253 are not accepted values, while 252 (= 6×6×7) is. Due to a bug in the code, only 252+2 = 254 processes are used when computing a distribution of the 576 blocks (252 FDTD blocks and 324 UPML blocks). When the number of blocks and their sizes are decided, the master computes the workload of each block and then, in the routine multiblock_partition, computes a distribution of the blocks onto the MPI ranks.

For 145.lGemsFDTD, NbF is the largest approved value that is less than or equal to the number of processes p. For p <= 256, the same precalculated table as for 113.GemsFDTD is used to decide whether an NbF value is approved, whereas for p > 256 we demand that (NbT/NbF)/(1 + 6/NbF^(1/3)) < 1.3 in order to approve NbF. This ensures that NbT/NbF is bounded. After a block distribution is computed, the master loops through the blocks and broadcasts information about each block. In 113.GemsFDTD this part performed 65·NbT + 25 four-byte broadcasts. In 145.lGemsFDTD this has been reduced to 7·NbT + 11 small broadcasts by merging adjacent broadcasts. Further reductions were possible but would have required extensive rewrites of the code.

2.4 Summary of the main changes

The major improvements in 145.lGemsFDTD compared to 113.GemsFDTD, for the phases where they apply:

Initialization
1. A new multiblock_partition routine designed for larger numbers of processes, which produces a completely different domain decomposition.
2. Fewer MPI_Bcast operations within multiblock_distribute and associated routines.
3. Separation of broadcasts and allocations in the block distribution phase, such that the rank that owns a block delays its allocations until after block information has been broadcast for all blocks.

Time-stepping iterations
1. Removal of the expensive recomputation of the communication pattern used to exchange blocks in multiblock_communicate, since the same communication pattern is used in each timestep.
2. Replacement of MPI_Sendrecv with non-blocking MPI_Isend and MPI_Irecv communication.
3. Interchange of loops in the near-to-far-field transformation computations. (This improvement was found when analyzing the serial code 459.GemsFDTD.)

3 Performance measurements & analyses

During the course of development of GemsFDTD, performance measurements of each version were made on a variety of platforms and with different compilers to verify the portability and effectiveness of each of the modifications.
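The block-count bookkeeping of Section 2.3 is easy to check with a short sketch. The function names are illustrative, but the formulas, the 1.3 threshold, and the broadcast counts are those stated above.

```python
def block_counts(nbx, nby, nbz):
    """FDTD block total NbF, and overall total NbT once a one-block-thick
    shell of UPML blocks surrounds the FDTD region."""
    nbf = nbx * nby * nbz
    nbt = (nbx + 2) * (nby + 2) * (nbz + 2)
    return nbf, nbt

def approved(nbf, nbt, limit=1.3):
    """Approval rule used for p > 256 in 145.lGemsFDTD: accept NbF only if
    (NbT/NbF) / (1 + 6/NbF**(1/3)) < 1.3, which bounds NbT/NbF."""
    return (nbt / nbf) / (1 + 6 / nbf ** (1.0 / 3.0)) < limit

def broadcast_counts(nbt):
    """Initialization broadcast totals quoted in the text: v1.1 vs v2.0."""
    return 65 * nbt + 25, 7 * nbt + 11

# The p = 256 case discussed above: 252 = 6*6*7 FDTD blocks
nbf, nbt = block_counts(6, 6, 7)
```

This reproduces the 252 FDTD and 324 UPML blocks cited for p = 256 (and that NbF passes the approval test), and shows the broadcast count dropping from 37,465 to 4,043, almost a factor of nine.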
Although the benefits depend on the respective processor, network, and compiler/optimizer capabilities, execution performance analysis with Scalasca and the improvements achieved on a representative example system are studied in detail.
3.1 Execution scalability

The original version of the GemsFDTD benchmark code (113.GemsFDTD in SPEC MPI2007 v1.1) and the revised version (145.lGemsFDTD in SPEC MPI2007 v2.0) were executed on the HECToR Cray XT4 [9] with varying numbers of processes. The XT4 has quad-core 2.3 GHz AMD Opteron processors in four-socket compute blades sharing 8 GiB of memory, allowing the benchmark code to run with 32 or more MPI processes. The codes were built with PGI compilers using typical optimization flags (-fastsse -O3 -Mipa=fast,inline) as well as processor-specific optimization. Although 113.GemsFDTD would normally be run with a medium-sized input dataset (sphere.pec), to allow comparison of the two versions we ran both with the large-sized training (i.e., ltrain) dataset using RAk_funnel.pec. It was convenient to use only 50 timesteps rather than the lref benchmark reference number of 1500 timesteps, since previous analysis of GemsFDTD [7] determined that there was no significant variation in execution performance from timestep to timestep. The full benchmark execution time can therefore be estimated by multiplying the time for 50 iterations by a factor of 30 and adding the initialization/finalization time. Execution times reported for the entire code, and for only the 50 timestep-loop iterations, of both versions are shown in Figure 1. While the iterations are seen to scale well in both versions, the initialization phases only scale to 128 and 512 processes, respectively. For the original code, the initialization phase becomes particularly dominant, even with only 256 processes, and makes it prohibitive to run at larger scales.

3.2 Scalasca performance analyses

Both versions of GemsFDTD were prepared for measurement with the Scalasca instrumenter (1.3 release), which configured the Cray/PGI compiler to automatically instrument each user-level source routine entry and exit, and linked with the Scalasca measurement library, which includes instrumented wrappers for MPI library routines.
The instrumented executables were then run under the control of the Scalasca measurement and analysis nexus within single batch jobs (to avoid the impact of acquiring different partitions of compute nodes in separate jobs). By default, a Scalasca runtime summarization experiment, consisting of a full call-path execution profile with auxiliary MPI statistics for each process, is collected and stored in a unique archive directory. With all of the user-level source routines instrumented, there can be undesirable measurement overheads for small, frequently-executed routines. An initial summary experiment is therefore scored to identify which routines should be filtered: with GemsFDTD, the names of nine routines were placed in a text file and specified to be filtered from subsequent measurements in Scalasca summary and trace experiments.

Fig. 1. Execution times of original and revised GemsFDTD versions with ltrain dataset for a range of configuration sizes on Cray XT4.

Fig. 2. Scalasca summary analysis breakdown of executions of original and revised GemsFDTD versions with ltrain dataset on Cray XT4 (averages of all processes).

Scalasca summary analysis reports contain a breakdown of the total run time into pure (local) computation time and MPI time, the latter split into time for collective synchronization (i.e., barriers), collective communication, and point-to-point communication, as detailed in Figure 2. Both code versions show good scaling of the computation time, with the new version demonstrating better absolute performance (at any particular scale) and better scalability overall. The extremely poor scalability of the original code version is due to the rapidly increasing time for collective communication, isolated to the numerous MPI_Bcast calls during initialization. Collective communication in the revised version is seen growing significantly at the largest scale; however, the primary scalability impediment is the increasing time for explicit barrier synchronization (which is not a factor in the original version). Point-to-point communication time is notably reduced in the revised version, but it also scales less well than the local computation, which it exceeds at 1024 or more processes.

Scalasca automatic trace analysis profiles are similar to those produced by runtime summarization; however, they additionally include assessments of waiting times inherent in MPI collective and point-to-point operations, such as Late Sender time, when an early receiver process must block until the associated message transfer is initiated. Trace analysis profiles from 256-process experiments, as presented by the Scalasca analysis report explorer GUI, are shown in Figures 3 and 4, comparing the original and improved GemsFDTD execution performance. In Figure 3, with the metric for total time selected from the metric trees in the leftmost panes, and the routines that constitute the initialization phase of GemsFDTD selected from the central call-tree panes, the distribution of times per process is shown with the automatically acquired Cray XT4 machine topology in the right panes. The MPI process with rank 0 is selected in the v2.0 display, and the two processes in the uppermost row of compute nodes that idled throughout the v1.1 execution can be distinguished. (Only the subset of the HECToR Cray XT4 associated with the measured execution is shown, with non-allocated compute nodes grayed or dashed.)
In contrast, Figure 4 features point-to-point communication time selected from the metric tree, together with the associated MPI routines within multiblock_communicate in the solver iteration phase. The Scalasca analysis reports from GemsFDTD v1.1 and v2.0 trace experiments with 256 processes on the Cray XT4 are compared in Table 1 to examine the 12-fold speedup in total execution time of the revised version. Dilation of the application execution time with instrumentation and measurement processing was under 2% compared to the uninstrumented reference version. Initialization is more than 40 times faster through the combination of reworking the multiblock_partition calculation and using almost 9 times fewer broadcasts in multiblock_distribute (even though 3% more bytes are broadcast). Note that in v1.1 the majority of broadcast time (measured in MPI_Bcast) is actually Late Broadcast time on the processes waiting for rank 0 to complete multiblock_partition; however, the non-waiting time for broadcasts is also reduced 20-fold in v2.0. Improvement in the solver iterations is a more modest 33%, still with speedups in both calculation and communication times. Significant gains were therefore realized through the use of non-blocking communication and improved load balance (including exploiting the two previously unused processes), despite 5% more data being transferred in multiblock_communicate.
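The Late Sender wait state quantified in Table 1 has a simple operational definition, sketched below on hypothetical event timestamps. The real analysis replays full traces in parallel; this is only the per-message arithmetic.

```python
def late_sender_time(send_enter, recv_enter):
    """Late Sender: if the receiver enters its blocking receive before the
    matching send has started, the receiver waits for the difference
    between the two enter timestamps; otherwise there is no wait state."""
    return max(0.0, send_enter - recv_enter)

def total_late_sender(message_events):
    """Sum the wait state over (send_enter, recv_enter) timestamp pairs."""
    return sum(late_sender_time(s, r) for s, r in message_events)
```

Late Broadcast time, dominant in v1.1, is the collective analogue: ranks entering MPI_Bcast early must wait for rank 0, which is still busy in multiblock_partition.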
Fig. 3. Scalasca analysis report explorer presentations of GemsFDTD trace experiments with ltrain dataset for 256 processes on Cray XT4 (v1.1 above and v2.0 below), highlighting the 675-fold improvement of total time for the initialization phase.
Fig. 4. Scalasca analysis report explorer presentations of GemsFDTD trace experiments with ltrain dataset for 256 processes on Cray XT4 (v1.1 above and v2.0 below), highlighting the 2-fold improvement of point-to-point communication time in the solver phase.
Table 1. Selected performance metrics and statistics for Scalasca trace experiments for GemsFDTD versions with ltrain dataset for 256 MPI processes on Cray XT4. (Maximum of any individual measured process where qualified.)

GemsFDTD                                      v1.1    v2.0
Reference execution time [s]
Measured execution time [s]
Initialization
  multiblock_partition time [s] (max)
  multiblock_distribute time [s] (max)
  broadcast time [s] (max)
  Late Broadcast time [s] (max)
  number of broadcasts (max)
  bytes incoming by broadcast [KiB] (max)
Time-stepping iterations
  Calculation time [s] (max)
  Communication time [s] (max)
  number of sends/receives (max)
  bytes sent [MiB] (max)
  Late Sender time [s] (max)
  number of Late Senders (max)
Scalasca tracing
  Trace total size [MiB]
  Trace collection time [s]
  Trace analysis time [s]
    event replay analysis [s]
    event timestamp correction [s]
  number of violations corrected

The Scalasca traces for v2.0 are less than one quarter of the size of those for v1.1, due to the reduced number of broadcasts, and require less than half a second to write to disk and to unify the associated definitions. Replay of the recorded events requires correspondingly less time; moreover, the time spent processing event timestamps to correct logical inconsistencies, which arise from the unsynchronized clocks of Cray XT compute nodes, is reduced by over 170-fold, since correcting timestamps of collective operations is particularly expensive. For the v2.0 version of GemsFDTD at this scale, Scalasca trace collection and automatic analysis require only 6% additional job runtime compared to the usual execution time.

4 Conclusions

The comprehensive analyses provided by the Scalasca toolset were instrumental in directing the developers' performance engineering of a much more efficient and highly scalable GemsFDTD code. Since these analyses show that there are still significant optimization opportunities at larger scales, further engineering improvements can be expected in future.
Acknowledgements

John Baron of SGI helped identify some bottlenecks in the original benchmark using the MPInside profiling tool, which led the developers to replace MPI_Sendrecv with non-blocking communication. EPCC service staff provided valuable assistance in using their Cray XT4 system for the performance measurements. Part of this work has been performed under the HPC-Europa2 project (number ) with the support of the European Commission Capacities Area Research Infrastructures. Part of this work has been supported by the computing resources of PDC, KTH, and the Swedish National Infrastructure for Computing (SNIC).

References

1. Gedney, S.D.: An anisotropic perfectly matched layer-absorbing medium for the truncation of FDTD lattices. IEEE Transactions on Antennas and Propagation 44(12) (Dec 1996)
2. Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience 22(6) (Apr 2010)
3. Jülich Supercomputing Centre, Germany: Scalasca toolset for scalable performance analysis of large-scale parallel applications (2010)
4. Martin, T., Pettersson, L.: Dispersion compensation for Huygens sources and far-zone transformation in FDTD. IEEE Transactions on Antennas and Propagation 48(4) (Apr 2000)
5. Parallel & Scientific Computing Institute (PSCI), Sweden: GEMS: General ElectroMagnetic Solvers project (2005)
6. Standard Performance Evaluation Corporation (SPEC), USA: SPEC MPI2007 benchmark suite (version 2.0) (2010)
7. Szebenyi, Z., Wylie, B.J.N., Wolf, F.: SCALASCA parallel performance analyses of SPEC MPI2007 applications. In: Proc. 1st SPEC Int'l Performance Evaluation Workshop (SIPEW, Darmstadt, Germany). Lecture Notes in Computer Science, vol. 5119. Springer (June 2008)
8. Taflove, A., Hagness, S.C.: Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House, 3rd edn. (2005)
9. University of Edinburgh HPCx, UK: HECToR: United Kingdom National Supercomputing Service (2010)
10. Wylie, B.J.N., Böhme, D., Frings, W., Geimer, M., Mohr, B., Szebenyi, Z., Becker, D., Hermanns, M.A., Wolf, F.: Scalable performance analysis of large-scale parallel applications on Cray XT systems with Scalasca. In: Proc. 52nd CUG meeting (Edinburgh, Scotland). Cray User Group (May 2010)
11. Wylie, B.J.N., Böhme, D., Mohr, B., Szebenyi, Z., Wolf, F.: Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset. In: Proc. 24th Int'l Parallel & Distributed Processing Symposium, Workshop on Large-Scale Parallel Processing (IPDPS LSPP, Atlanta, GA, USA). IEEE Computer Society (Apr 2010)
More informationMPI Casestudy: Parallel Image Processing
MPI Casestudy: Parallel Image Processing David Henty 1 Introduction The aim of this exercise is to write a complete MPI parallel program that does a very basic form of image processing. We will start by
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationAcknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text
Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center
More informationAteles performance assessment report
Ateles performance assessment report Document Information Reference Number Author Contributor(s) Date Application Service Level Keywords AR-4, Version 0.1 Jose Gracia (USTUTT-HLRS) Christoph Niethammer,
More informationPerformance Analysis with Periscope
Performance Analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München periscope@lrr.in.tum.de October 2010 Outline Motivation Periscope overview Periscope performance
More informationImplementation of an integrated efficient parallel multiblock Flow solver
Implementation of an integrated efficient parallel multiblock Flow solver Thomas Bönisch, Panagiotis Adamidis and Roland Rühle adamidis@hlrs.de Outline Introduction to URANUS Why using Multiblock meshes
More informationIntroduction to parallel Computing
Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts
More informationDetermining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace
Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace James Southern, Jim Tuccillo SGI 25 October 2016 0 Motivation Trend in HPC continues to be towards more
More informationAnalyzing Cache Bandwidth on the Intel Core 2 Architecture
John von Neumann Institute for Computing Analyzing Cache Bandwidth on the Intel Core 2 Architecture Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger published in Parallel Computing: Architectures, Algorithms
More informationWhatÕs New in the Message-Passing Toolkit
WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2
More informationMVAPICH2 vs. OpenMPI for a Clustering Algorithm
MVAPICH2 vs. OpenMPI for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland, Baltimore
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationThe Hopper System: How the Largest* XE6 in the World Went From Requirements to Reality! Katie Antypas, Tina Butler, and Jonathan Carter
The Hopper System: How the Largest* XE6 in the World Went From Requirements to Reality! Katie Antypas, Tina Butler, and Jonathan Carter CUG 2011, May 25th, 2011 1 Requirements to Reality Develop RFP Select
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationAMBER 11 Performance Benchmark and Profiling. July 2011
AMBER 11 Performance Benchmark and Profiling July 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource -
More informationCommunication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart
More informationLAPI on HPS Evaluating Federation
LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of
More informationUnit 9 : Fundamentals of Parallel Processing
Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing
More informationPerformance of Variant Memory Configurations for Cray XT Systems
Performance of Variant Memory Configurations for Cray XT Systems Wayne Joubert, Oak Ridge National Laboratory ABSTRACT: In late 29 NICS will upgrade its 832 socket Cray XT from Barcelona (4 cores/socket)
More informationANALYSIS OF CMAQ PERFORMANCE ON QUAD-CORE PROCESSORS. George Delic* HiPERiSM Consulting, LLC, P.O. Box 569, Chapel Hill, NC 27514, USA
ANALYSIS OF CMAQ PERFORMANCE ON QUAD-CORE PROCESSORS George Delic* HiPERiSM Consulting, LLC, P.O. Box 569, Chapel Hill, NC 75, USA. INTRODUCTION CMAQ offers a choice of three gas chemistry solvers: Euler-Backward
More informationParallel Performance Studies for a Clustering Algorithm
Parallel Performance Studies for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland,
More informationFirst Experiences with Intel Cluster OpenMP
First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May
More informationQLogic TrueScale InfiniBand and Teraflop Simulations
WHITE Paper QLogic TrueScale InfiniBand and Teraflop Simulations For ANSYS Mechanical v12 High Performance Interconnect for ANSYS Computer Aided Engineering Solutions Executive Summary Today s challenging
More informationA New High Order Algorithm with Low Computational Complexity for Electric Field Simulation
Journal of Computer Science 6 (7): 769-774, 1 ISSN 1549-3636 1 Science Publications A New High Order Algorithm with Low Computational Complexity for lectric Field Simulation 1 Mohammad Khatim Hasan, Jumat
More informationJoe Wingbermuehle, (A paper written under the guidance of Prof. Raj Jain)
1 of 11 5/4/2011 4:49 PM Joe Wingbermuehle, wingbej@wustl.edu (A paper written under the guidance of Prof. Raj Jain) Download The Auto-Pipe system allows one to evaluate various resource mappings and topologies
More informationImplementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU
Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU 1 1 Samara National Research University, Moskovskoe Shosse 34, Samara, Russia, 443086 Abstract.
More informationGROMACS Performance Benchmark and Profiling. September 2012
GROMACS Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource
More informationA configurable binary instrumenter
Mitglied der Helmholtz-Gemeinschaft A configurable binary instrumenter making use of heuristics to select relevant instrumentation points 12. April 2010 Jan Mussler j.mussler@fz-juelich.de Presentation
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More informationParallel I/O and Portable Data Formats I/O strategies
Parallel I/O and Portable Data Formats I/O strategies Sebastian Lührs s.luehrs@fz-juelich.de Jülich Supercomputing Centre Forschungszentrum Jülich GmbH Jülich, March 13 th, 2017 Outline Common I/O strategies
More informationTerminal Services Scalability Study
Terminal Services Scalability Study Part 1 The Effect of CPS 4.0 Microsoft Windows Terminal Services Citrix Presentation Server 4.0 June 2007 Table of Contents 1 Executive summary 3 2 Introduction 4 2.1
More informationScore-P A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir (continued)
Score-P A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir (continued) VI-HPS Team Congratulations!? If you made it this far, you successfully used Score-P
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationExperiences from the Fortran Modernisation Workshop
Experiences from the Fortran Modernisation Workshop Wadud Miah (Numerical Algorithms Group) Filippo Spiga and Kyle Fernandes (Cambridge University) Fatima Chami (Durham University) http://www.nag.co.uk/content/fortran-modernization-workshop
More informationA Graphical User Interface (GUI) for Two-Dimensional Electromagnetic Scattering Problems
A Graphical User Interface (GUI) for Two-Dimensional Electromagnetic Scattering Problems Veysel Demir vdemir@olemiss.edu Mohamed Al Sharkawy malshark@olemiss.edu Atef Z. Elsherbeni atef@olemiss.edu Abstract
More informationAchieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation
Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College
More informationSIMULATION OF AN IMPLANTED PIFA FOR A CARDIAC PACEMAKER WITH EFIELD FDTD AND HYBRID FDTD-FEM
1 SIMULATION OF AN IMPLANTED PIFA FOR A CARDIAC PACEMAKER WITH EFIELD FDTD AND HYBRID FDTD- Introduction Medical Implanted Communication Service (MICS) has received a lot of attention recently. The MICS
More informationUsing Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh
Using Java for Scientific Computing Mark Bul EPCC, University of Edinburgh markb@epcc.ed.ac.uk Java and Scientific Computing? Benefits of Java for Scientific Computing Portability Network centricity Software
More informationEvaluating Algorithms for Shared File Pointer Operations in MPI I/O
Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Ketan Kulkarni and Edgar Gabriel Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston {knkulkarni,gabriel}@cs.uh.edu
More informationThe Parallel Software Design Process. Parallel Software Design
Parallel Software Design The Parallel Software Design Process Deborah Stacey, Chair Dept. of Comp. & Info Sci., University of Guelph dastacey@uoguelph.ca Why Parallel? Why NOT Parallel? Why Talk about
More informationParallel Code Optimisation
April 8, 2008 Terms and terminology Identifying bottlenecks Optimising communications Optimising IO Optimising the core code Theoretical perfomance The theoretical floating point performance of a processor
More informationComparison of TLM and FDTD Methods in RCS Estimation
International Journal of Electrical Engineering. ISSN 0974-2158 Volume 4, Number 3 (2011), pp. 283-287 International Research Publication House http://www.irphouse.com Comparison of TLM and FDTD Methods
More informationDISP: Optimizations Towards Scalable MPI Startup
DISP: Optimizations Towards Scalable MPI Startup Huansong Fu, Swaroop Pophale*, Manjunath Gorentla Venkata*, Weikuan Yu Florida State University *Oak Ridge National Laboratory Outline Background and motivation
More informationHybrid Programming with MPI and SMPSs
Hybrid Programming with MPI and SMPSs Apostolou Evangelos August 24, 2012 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2012 Abstract Multicore processors prevail
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate
More informationUsing Lamport s Logical Clocks
Fast Classification of MPI Applications Using Lamport s Logical Clocks Zhou Tong, Scott Pakin, Michael Lang, Xin Yuan Florida State University Los Alamos National Laboratory 1 Motivation Conventional trace-based
More informationDesign and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications
Design and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications Wei-keng Liao Alok Choudhary ECE Department Northwestern University Evanston, IL Donald Weiner Pramod Varshney EECS Department
More informationUsing Intel VTune Amplifier XE and Inspector XE in.net environment
Using Intel VTune Amplifier XE and Inspector XE in.net environment Levent Akyil Technical Computing, Analyzers and Runtime Software and Services group 1 Refresher - Intel VTune Amplifier XE Intel Inspector
More informationMulticore Computing and Scientific Discovery
scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research
More informationHigh performance computing and numerical modeling
High performance computing and numerical modeling Volker Springel Plan for my lectures Lecture 1: Collisional and collisionless N-body dynamics Lecture 2: Gravitational force calculation Lecture 3: Basic
More informationSCALASCA v1.0 Quick Reference
General SCALASCA is an open-source toolset for scalable performance analysis of large-scale parallel applications. Use the scalasca command with appropriate action flags to instrument application object
More informationMaxwell: a 64-FPGA Supercomputer
Maxwell: a 64-FPGA Supercomputer Copyright 2007, the University of Edinburgh Dr Rob Baxter Software Development Group Manager, EPCC R.Baxter@epcc.ed.ac.uk +44 131 651 3579 Outline The FHPCA Why build Maxwell?
More informationLarge-scale performance analysis of PFLOTRAN with Scalasca
Mitglied der Helmholtz-Gemeinschaft Large-scale performance analysis of PFLOTRAN with Scalasca 2011-05-26 Brian J. N. Wylie & Markus Geimer Jülich Supercomputing Centre b.wylie@fz-juelich.de Overview Dagstuhl
More informationHimeno Performance Benchmark and Profiling. December 2010
Himeno Performance Benchmark and Profiling December 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationDetection and Analysis of Iterative Behavior in Parallel Applications
Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationImproving the Scalability of Performance Evaluation Tools
Improving the Scalability of Performance Evaluation Tools Sameer Suresh Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory Department of Computer and Information Science University
More informationAnalysis and Optimization of Yee Bench Using Hardware Performance Counters
1 Analysis and Optimization of Yee Bench Using Hardware Performance Counters Ulf Andersson a, Philip Mucci a,b a PDC, KTH, SE-100 44 Stockholm, Sweden, ulfa@nada.kth.se b Innovative Computing Laboratory,
More informationDevelopment of a Modeling Tool for Collaborative Finite Element Analysis
Development of a Modeling Tool for Collaborative Finite Element Analysis Åke Burman and Martin Eriksson Division of Machine Design, Department of Design Sciences, Lund Institute of Technology at Lund Unversity,
More information