FORSCHUNGSZENTRUM JÜLICH GmbH
Zentralinstitut für Angewandte Mathematik
D-52425 Jülich, Tel. (02461) 61-6402

Interner Bericht

Particle Simulations on Cray MPP Systems

Christian M. Dury (*), Renate Knecht, Gerald H. Ristow (*)

FZJ-ZAM-IB-9714
September 1997

(*) Fachbereich Physik, Philipps-Universität Marburg, Renthof 6, D Marburg, Germany

Third European CRAY-SGI MPP Workshop, Paris
Particle Simulations on Cray MPP Systems

Christian M. Dury (a), Renate Knecht (b), Gerald H. Ristow (a)

(a) Fachbereich Physik, Philipps-Universität Marburg, Renthof 6, D Marburg, Germany, dury@mailer.uni-marburg.de, ristow@physik.uni-marburg.de
(b) Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH, D-52425 Jülich, Germany, r.knecht@fz-juelich.de

Abstract

Particle simulations were among the first applications to be implemented on scalar computers over forty years ago and have since played an important role in many science and engineering applications. Because of the inherent parallelism in all particle algorithms, the advent of parallel computers has revolutionized this field: basically, the same set of calculations has to be performed for every particle in the system. At present, realistic simulations with a few million particles are possible using large, general-purpose parallel computers. In this paper the parallel simulation of the size segregation of a binary mixture of granular materials in a half-filled, three-dimensional rotating drum, using the discrete element method with linear contact forces, is investigated. Performance results of an implementation in Fortran 90 using MPI for data communication on the CRAY T3D, CRAY T3E-600, and CRAY T3E-900 are presented. They have been determined with the help of the Cray tools MPP Apprentice and the performance analysis tool PAT, as well as the message passing visualization tool VAMPIR developed at the Research Centre Jülich.

1 Introduction

The study of granular materials has long been an active field of research, partly due to the many interesting physical phenomena which granular materials give rise to and partly because of their importance for industrial applications [1, 6]. With the advent of more powerful computers, many scientists and engineers believe that some of the phenomena known in this field can be better understood through well-planned computer simulations [2].
This belief rests on the premise that these phenomena are collective or emergent in nature, i.e., the constituent grains experience simple, well-understood interactions with each other, but unexpected behavior emerges due to the large number of grains involved. Hence, if the grain-grain interactions can be programmed efficiently enough that a sufficiently large system can be simulated, then it should be possible to study phenomena which are still poorly understood. In the past such simulations were performed on vector computers. However, since most of the computational time is spent calculating the collision forces acting between the particles, this limits the consideration of interactions to those which can be vectorized. For many problems of scientific interest this limitation is not very restrictive, especially if one ignores factors like the price/performance ratio of the computation. One of the promises of massively parallel computers consisting of scalar or super-scalar processors is the ability to perform cost-effective simulations of systems with more complicated, and more realistic, interactions.

Parallelization techniques for particle algorithms depend on the range of the particle interactions and the number of particles. For short-range interactions and simulations with more than a few thousand particles, the link-cell approach, a form of domain parallelization, is the most appropriate choice. This method divides the physical space into small cells and assigns each particle to a given cell. If the cell size is larger than the particles' interaction radius, then only the neighboring cells need to be checked in order to find all possible collision partners. Parallelization is then accomplished by allocating all cells within a given physical domain to a given processor. For homogeneous systems, and systems where fluctuations in the particle density are small, a static allocation of the domains to the processors is adequate. In the general case, however, statically allocated partitions lead to poorly distributed computing loads. This problem can be overcome by mapping the domains to the processors dynamically [7].

Nevertheless, the basic physical understanding of granular materials is far from complete. One of their most intriguing properties is the tendency to segregate. It is observed in many industrial particle-handling situations, such as transporting grains or mixing pharmaceutical pills. The rotating-cylinder geometry is an archetype of numerous devices used in industrial material processing, where radial segregation can occur on short time scales and axial segregation is observed on longer time scales.
The mechanism of the segregation process is based on the surface flow: small particles are more likely than large ones to get stuck along the inclined surface and hence accumulate near the center of the rotating drum (see figure 1). Many parameters are involved in the process of radial segregation and mixing, such as size, shape, mass, frictional forces, angular velocity, and the filling of the drum.

Figure 1: 2D drum; small particles are drawn as filled circles and large particles as open circles. (a) Snapshot of the drum right before the first avalanche. (b) Snapshot of the drum after rotating for t = 60 s with angular velocity ω = 1.0 Hz, i.e., after 9 rotations.
2 Parallelization

Distinct element simulations are based on the use of distinct, individual elements, each of which is free to move according to some given rules [3]. For granular materials, the most important interactions are the inelastic, soft-sphere collisions. For such short-range interactions, the link-cell algorithm is the most efficient programming technique [2] (see figure 2). This method starts by dividing the physical space into either square or cubic cells, depending on the dimension of the physical space, with a side length R_L. For polydisperse systems, i.e. systems with particles of varying diameter, one normally takes R_L = R_max + ε, where ε is a small positive number and R_max is the diameter of the largest particle. For the monodisperse case, where all particles have the same diameter, it is more efficient to take R_L = R_max − ε [5].

Figure 2: Link-cell algorithm (2-D)

Once the space has been sectioned, all particles whose physical coordinates lie inside a given cell are placed into a linked list associated with that cell (see figure 3). The problem of finding all particles colliding with a given particle is then reduced to searching over all neighboring cells, for the case R_L > R_max. (In practice one searches only over half the neighboring cells because the collisions are symmetric.) All interacting particle pairs can now be placed into a list which can then be processed efficiently in order to determine the forces acting on each particle due to the collisions. Usually one tries to find an ε such that this list needs to be recreated at most every 10th time step. After the forces acting on each element are calculated, the Hamiltonian equations of motion are integrated to find the new position of each particle. Normally a simple leap-frog integration method suffices; however, predictor-corrector schemes are also in widespread use [1].
Typically, the time spent integrating the equations of motion is negligible compared to the time needed for calculating the particle interactions.

In this paper the parallel simulation of the size segregation of a binary mixture of granular material in a half-filled three-dimensional rotating drum, using the distinct element method with linear contact forces, is investigated. The rotation axis in this study is the x-axis, and the cylinder is parallelized along this axis (see figure 4).
Figure 3: Linked list

Each processing element (PE) owns the data of its local particles and the data of the halo regions, which contain the particles from the neighboring PEs. The particles in these halo cells are not updated; rather, their positions are used for the force calculations of the particles in the true cells. During the course of the simulation, particles will migrate outside of the spatial region controlled by the processor on which they reside. Such particles need to be removed from the list in which they are registered and transmitted to the appropriate processor, where they are then registered. The performance measurements presented here have been performed without dynamic load balancing, because in this application the flow of particles from one PE to another is approximately balanced, so no accumulation of particles on one PE can occur.

Figure 4: Parallelization of the 3D drum (rotation axis along the cylinder axis)

This approach has been implemented in Fortran 90 using MPI [8] for data communication on the systems CRAY T3D, CRAY T3E-600, and CRAY T3E-900 [9]. The numerical methods and the parameters used, as well as a quantitative analysis of the segregation for different rotational velocities, are described in [4].

3 Performance Investigations

Performance measurements have been carried out on a CRAY T3D, on a CRAY T3E-600 with stream buffers disabled, and on a CRAY T3E-900 with stream buffers both enabled and disabled. External stream buffers in a CRAY T3E system are used to maximize local memory bandwidth, leading to better performance for vector-like data references. The CRAY T3E-600 at the Research Centre Jülich is equipped with the older PE modules. A hardware design problem in the memory control chip may lead to stability problems of the system when the stream buffers are activated. Therefore they are disabled and may not be activated via user-controlled environment variables. The characteristics of the Cray MPP systems used here are shown in table 1.
                          T3D             T3E-600      T3E-900
  Processor               DEC Alpha EV4   EV5          EV5
  Clock                   150 MHz         300 MHz      450 MHz
  Peak performance        150 MFLOPS      600 MFLOPS   900 MFLOPS
  3D torus clock          150 MHz         150 MHz      150 MHz
  3D torus link bandwidth 300 MB/s        500 MB/s     500 MB/s
  Primary cache           8 KB            8 KB         8 KB
  Secondary cache         -               96 KB        96 KB
  Memory bandwidth        300 MB/s        1200 MB/s    1200 MB/s

Table 1: Cray MPP systems characteristics

On the CRAY T3E-600 the processor clock rate is doubled in comparison to the CRAY T3D. Furthermore, the CRAY T3E processor can perform 2 operations per clock period as opposed to 1 operation on a CRAY T3D. On the CRAY T3E, applications can additionally benefit from the secondary cache, which is not available on the CRAY T3D.

The application's performance has been investigated using the Cray tools MPP Apprentice and the Performance Analysis Tool (PAT), as well as the message passing visualization tool VAMPIR (Visualization and Analysis of MPI Resources) developed at the Research Centre Jülich. MPP Apprentice and PAT can be used to identify the most time-consuming routines. MPP Apprentice assists the user in determining the performance characteristics of a parallel application on a CRAY T3D or T3E system and gives some indication of the causes of the observed behavior. Due to the large overhead induced by the MPP Apprentice run-time library, the reported timings are only an indication of the real execution times. Moreover, the reported MFLOPS or integer operation rates cannot be used to measure the real performance. To provide more thorough information, the PAT performance analysis tool is available on CRAY T3E systems. PAT uses hardware performance counters and the profil(2) system call on UNICOS/mk systems. It provides a fast, low-overhead method for estimating the amount of time consumed in procedures, determining load balance across PEs, generating and viewing trace files, timing individual calls to routines, and displaying hardware performance counter information.
A program that gathers PAT performance data runs much faster than a program instrumented to collect performance data for MPP Apprentice: on average, a program instrumented for MPP Apprentice runs three times slower than the uninstrumented program. VAMPIR, on the other hand, provides detailed information on the message passing communication and the load balancing on the PEs. VAMPIR translates a trace file generated on a Cray MPP system at runtime into a variety of graphical views, e.g. state diagrams, activity charts, time-line displays (see figure 5), and statistics. Time-line displays are helpful to get an overview of the load balancing of the program. Colors are used to represent different kinds of activities; in this example MPI routines are shown in blue, whereas the computation part is shown in green. Zooming is possible to analyze the program on any level of detail, and each message sent from one PE to another can be identified. The execution time for one iteration in this example is about 24 ms and can be determined using a VAMPIR popup menu. To generate trace data in the current version, the source code has to be instrumented with calls to a run-time library. A future version of PAT will be capable of object code instrumentation, which will make the usage of VAMPIR independent of preprocessors for special programming languages.

Figure 5: Time-line display showing one iteration out of 200 with particles on 16 PEs of a CRAY T3E-900 (stream buffers activated)

Table 2 shows the measured execution times of the application without I/O. 200 iterations of the simulation were performed for a drum with particles on 16 PEs. The performance gain of the CRAY T3E-900 over the CRAY T3E-600 can be at most 50 % because of the higher clock rate. Moreover, the stream buffer usage may additionally speed up the program.

                                        T3D   T3E-600       T3E-900       T3E-900
                                              (streams off) (streams off) (streams on)
  Execution time                        s     5.86 s        5.19 s        4.39 s
  Speedup in relation to CRAY T3D
  Speedup in relation to CRAY T3E-600

Table 2: 200 iterations with particles on 16 PEs

The upper window in figure 6 shows the sum of the execution times of all user routines on 16 PEs of a CRAY T3D in comparison to a CRAY T3E-600 without stream buffer usage. As mentioned above, the most time-consuming routine is the computation of the particle-particle interactions; this part of the program is about 3 times faster on the CRAY T3E-600. The window below displays the sum of the MPI routines, showing a considerable amount of synchronization overhead (MPI barrier). In figure 7 the effect of the stream buffer usage can be seen. The overhead induced by MPI communication routines is about the same on both CRAY T3E systems; only the amount of barrier synchronization is reduced by 50 % on the CRAY T3E-900. The most time-consuming barrier synchronization is at the beginning of the program: PE 0 has to read the input data and broadcast the appropriate subsets to the other PEs, which have to wait until PE 0 has finished this preparatory work.

The performance counters of PAT give about 90 to 100 million integer operations per second per PE for a large system of particles and 50 iterations of the whole program including I/O on 32 PEs of a CRAY T3E-600, which is about 16 % of the theoretical peak performance. The measured wall-clock time is about 2.5 minutes for the iterations of the simulation.

4 Summary and Discussion

We have studied the performance of a parallel algorithm simulating the size segregation of a binary mixture of granular materials in a half-filled three-dimensional rotating drum. The algorithm has been implemented on the Cray MPP systems CRAY T3D, CRAY T3E-600, and CRAY T3E-900. The measurements on the CRAY T3E-600 have been performed without stream buffer usage, whereas on the CRAY T3E-900 the effect of the stream buffers has been considered as well. The CRAY T3E-600 is about 2.6 times faster than the CRAY T3D for the application described in this paper. Using a CRAY T3E-900 without stream buffers, a speedup of 11 % can be achieved in comparison to the CRAY T3E-600. Furthermore, for this application the stream buffer usage gives an additional speedup of 15 % compared to a CRAY T3E-900 with the stream buffers not activated.
The performance improvement is then about 25 % in comparison to a CRAY T3E-600 with stream buffers disabled. These performance measurements confirm the results which have been observed for other applications and benchmark tests on Cray MPP systems.

Acknowledgements

The authors are grateful to the University of Rostock and the Konrad-Zuse-Zentrum für Informationstechnik Berlin for granting access to their CRAY T3E-900 and CRAY T3D, respectively.
Figure 6: Timings for calculation and MPI overhead on CRAY T3D (upper bars) and CRAY T3E-600 (lower bars) without stream buffer usage, summed over 16 PEs
Figure 7: Timings for calculation and MPI overhead on a CRAY T3E-900 with stream buffers disabled (upper bars) and enabled (lower bars), summed over 16 PEs
References

1. M. P. Allen and D. J. Tildesley, Computer Simulations of Liquids, Clarendon Press, Oxford.
2. D. M. Beazley and P. S. Lomdahl, Message-Passing Multi-Cell Molecular Dynamics on the Connection Machine 5, Parallel Computing 20, 2 (1994).
3. P. A. Cundall and O. D. L. Strack, A discrete numerical model for granular assemblies, Géotechnique 29, 1 (1979).
4. C. M. Dury and G. H. Ristow, Radial Segregation in a Two-Dimensional Rotating Drum, Journal de Physique I France 7 (1997).
5. W. Form, N. Ito, and G. A. Kohring, Vectorized and Parallelized Algorithms for Multi-Million Particle MD-Simulations, Int. J. Mod. Phys. C 4 (1993).
6. R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, Adam Hilger, Bristol.
7. R. Knecht and G. A. Kohring, Dynamic Load Balancing for the Simulation of Granular Materials, Proceedings of ICS 95, Barcelona, 3-7 July 1995.
8. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard.
9. T3E overview, obtainable from:
Tutorial 4. Simulation of Flow Development in a Pipe Introduction The purpose of this tutorial is to illustrate the setup and solution of a 3D turbulent fluid flow in a pipe. The pipe networks are common
More informationInfluence of mesh quality and density on numerical calculation of heat exchanger with undulation in herringbone pattern
Influence of mesh quality and density on numerical calculation of heat exchanger with undulation in herringbone pattern Václav Dvořák, Jan Novosád Abstract Research of devices for heat recovery is currently
More informationCFD MODELING FOR PNEUMATIC CONVEYING
CFD MODELING FOR PNEUMATIC CONVEYING Arvind Kumar 1, D.R. Kaushal 2, Navneet Kumar 3 1 Associate Professor YMCAUST, Faridabad 2 Associate Professor, IIT, Delhi 3 Research Scholar IIT, Delhi e-mail: arvindeem@yahoo.co.in
More informationResearch Collection. Localisation of Acoustic Emission in Reinforced Concrete using Heterogeneous Velocity Models. Conference Paper.
Research Collection Conference Paper Localisation of Acoustic Emission in Reinforced Concrete using Heterogeneous Velocity Models Author(s): Gollob, Stephan; Vogel, Thomas Publication Date: 2014 Permanent
More informationAPS Sixth Grade Math District Benchmark Assessment NM Math Standards Alignment
SIXTH GRADE NM STANDARDS Strand: NUMBER AND OPERATIONS Standard: Students will understand numerical concepts and mathematical operations. 5-8 Benchmark N.: Understand numbers, ways of representing numbers,
More informationNIA CFD Futures Conference Hampton, VA; August 2012
Petascale Computing and Similarity Scaling in Turbulence P. K. Yeung Schools of AE, CSE, ME Georgia Tech pk.yeung@ae.gatech.edu NIA CFD Futures Conference Hampton, VA; August 2012 10 2 10 1 10 4 10 5 Supported
More informationPerformance Prediction for Parallel Local Weather Forecast Programs
Performance Prediction for Parallel Local Weather Forecast Programs W. Joppich and H. Mierendorff GMD German National Research Center for Information Technology Institute for Algorithms and Scientific
More informationSIMULATION OF FLOW FIELD AROUND AND INSIDE SCOUR PROTECTION WITH PHYSICAL AND REALISTIC PARTICLE CONFIGURATIONS
XIX International Conference on Water Resources CMWR 2012 University of Illinois at Urbana-Champaign June 17-22, 2012 SIMULATION OF FLOW FIELD AROUND AND INSIDE SCOUR PROTECTION WITH PHYSICAL AND REALISTIC
More informationBenchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques.
I. Course Title Parallel Computing 2 II. Course Description Students study parallel programming and visualization in a variety of contexts with an emphasis on underlying and experimental technologies.
More informationSingle Pass Connected Components Analysis
D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected
More informationParallel Computer Architecture II
Parallel Computer Architecture II Stefan Lang Interdisciplinary Center for Scientific Computing (IWR) University of Heidelberg INF 368, Room 532 D-692 Heidelberg phone: 622/54-8264 email: Stefan.Lang@iwr.uni-heidelberg.de
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationUsing a Single Rotating Reference Frame
Tutorial 9. Using a Single Rotating Reference Frame Introduction This tutorial considers the flow within a 2D, axisymmetric, co-rotating disk cavity system. Understanding the behavior of such flows is
More informationKinematics of Machines Prof. A. K. Mallik Department of Mechanical Engineering Indian Institute of Technology, Kanpur. Module 10 Lecture 1
Kinematics of Machines Prof. A. K. Mallik Department of Mechanical Engineering Indian Institute of Technology, Kanpur Module 10 Lecture 1 So far, in this course we have discussed planar linkages, which
More informationA geometric algorithm for discrete element method to generate composite materials
A geometric algorithm for discrete element method to generate composite materials J.F. Jerier, F.V. Donzé, D. Imbault & P. Doremus Laboratoire Sols, Solides, Structures, Risques Grenoble, France Jerier@hmg.inpg.fr
More informationUCLA UCLA Previously Published Works
UCLA UCLA Previously Published Works Title Parallel Markov chain Monte Carlo simulations Permalink https://escholarship.org/uc/item/4vh518kv Authors Ren, Ruichao Orkoulas, G. Publication Date 2007-06-01
More informationLecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
More informationLecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationConstrained Diffusion Limited Aggregation in 3 Dimensions
Constrained Diffusion Limited Aggregation in 3 Dimensions Paul Bourke Swinburne University of Technology P. O. Box 218, Hawthorn Melbourne, Vic 3122, Australia. Email: pdb@swin.edu.au Abstract Diffusion
More informationATLAS NOTE. December 4, ATLAS offline reconstruction timing improvements for run-2. The ATLAS Collaboration. Abstract
ATLAS NOTE December 4, 2014 ATLAS offline reconstruction timing improvements for run-2 The ATLAS Collaboration Abstract ATL-SOFT-PUB-2014-004 04/12/2014 From 2013 to 2014 the LHC underwent an upgrade to
More informationAge Related Maths Expectations
Step 1 Times Tables Addition Subtraction Multiplication Division Fractions Decimals Percentage & I can count in 2 s, 5 s and 10 s from 0 to 100 I can add in 1 s using practical resources I can add in 1
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationComputation of Three-Dimensional Electromagnetic Fields for an Augmented Reality Environment
Excerpt from the Proceedings of the COMSOL Conference 2008 Hannover Computation of Three-Dimensional Electromagnetic Fields for an Augmented Reality Environment André Buchau 1 * and Wolfgang M. Rucker
More informationCHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS. Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison
CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison Support: Rapid Innovation Fund, U.S. Army TARDEC ASME
More informationpc++/streams: a Library for I/O on Complex Distributed Data-Structures
pc++/streams: a Library for I/O on Complex Distributed Data-Structures Jacob Gotwals Suresh Srinivas Dennis Gannon Department of Computer Science, Lindley Hall 215, Indiana University, Bloomington, IN
More informationCS 475: Parallel Programming Introduction
CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.
More information1 Serial Implementation
Grey Ballard, Razvan Carbunescu, Andrew Gearhart, Mehrzad Tartibi CS267: Homework 2 1 Serial Implementation For n particles, the original code requires O(n 2 ) time because at each time step, the apply
More informationA Chromium Based Viewer for CUMULVS
A Chromium Based Viewer for CUMULVS Submitted to PDPTA 06 Dan Bennett Corresponding Author Department of Mathematics and Computer Science Edinboro University of PA Edinboro, Pennsylvania 16444 Phone: (814)
More informationScope and Sequence for the New Jersey Core Curriculum Content Standards
Scope and Sequence for the New Jersey Core Curriculum Content Standards The following chart provides an overview of where within Prentice Hall Course 3 Mathematics each of the Cumulative Progress Indicators
More informationIntroduction to Parallel Performance Engineering
Introduction to Parallel Performance Engineering Markus Geimer, Brian Wylie Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance:
More informationENERGY-224 Reservoir Simulation Project Report. Ala Alzayer
ENERGY-224 Reservoir Simulation Project Report Ala Alzayer Autumn Quarter December 3, 2014 Contents 1 Objective 2 2 Governing Equations 2 3 Methodolgy 3 3.1 BlockMesh.........................................
More informationPulsating flow around a stationary cylinder: An experimental study
Proceedings of the 3rd IASME/WSEAS Int. Conf. on FLUID DYNAMICS & AERODYNAMICS, Corfu, Greece, August 2-22, 2 (pp24-244) Pulsating flow around a stationary cylinder: An experimental study A. DOUNI & D.
More informationComparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster G. Jost*, H. Jin*, D. an Mey**,F. Hatay*** *NASA Ames Research Center **Center for Computing and Communication, University of
More informationCHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song
CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed
More informationParallel Computer Architecture and Programming Written Assignment 3
Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the
More informationUnit 1: Area Find the value of the variable(s). If your answer is not an integer, leave it in simplest radical form.
Name Per Honors Geometry / Algebra II B Midterm Review Packet 018-19 This review packet is a general set of skills that will be assessed on the midterm. This review packet MAY NOT include every possible
More informationData Analytics on RAMCloud
Data Analytics on RAMCloud Jonathan Ellithorpe jdellit@stanford.edu Abstract MapReduce [1] has already become the canonical method for doing large scale data processing. However, for many algorithms including
More informationDomain Decomposition for Colloid Clusters. Pedro Fernando Gómez Fernández
Domain Decomposition for Colloid Clusters Pedro Fernando Gómez Fernández MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2004 Authorship declaration I, Pedro Fernando
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationTwo main topics: `A posteriori (error) control of FEM/FV discretizations with adaptive meshing strategies' `(Iterative) Solution strategies for huge s
. Trends in processor technology and their impact on Numerics for PDE's S. Turek Institut fur Angewandte Mathematik, Universitat Heidelberg Im Neuenheimer Feld 294, 69120 Heidelberg, Germany http://gaia.iwr.uni-heidelberg.de/~ture
More informationSystems Programming and Computer Architecture ( ) Timothy Roscoe
Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture
More informationDistributed Individual-Based Simulation
Distributed Individual-Based Simulation Jiming Liu, Michael B. Dillencourt, Lubomir F. Bic, Daniel Gillen, and Arthur D. Lander University of California Irvine, CA 92697 bic@ics.uci.edu http://www.ics.uci.edu/
More informationWorkloads Programmierung Paralleler und Verteilter Systeme (PPV)
Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment
More informationA Source Localization Technique Based on a Ray-Trace Technique with Optimized Resolution and Limited Computational Costs
Proceedings A Source Localization Technique Based on a Ray-Trace Technique with Optimized Resolution and Limited Computational Costs Yoshikazu Kobayashi 1, *, Kenichi Oda 1 and Katsuya Nakamura 2 1 Department
More information