Partitioning Effects on MPI LS-DYNA Performance
Jeffrey G. Zais
IBM
138 Third Street
Hudson, WI

Abbreviations:
MPI - message-passing interface
RISC - reduced instruction set computing
SMP - shared memory parallel
MPP - massively parallel processing

Keywords: Crash Simulation, LS-DYNA, Metal Stamping, Parallel Computing, Performance, Workstation
ABSTRACT

The MPI version of LS-DYNA includes several options for decomposition of the finite element model. In this paper, the use of these options is explored for both metalforming and automotive crash simulations, with input decks ranging in size from small to very large. The effect on elapsed-time performance and scalability is measured for different partitioning options. In addition, the performance characteristics of a workstation cluster are evaluated.

INTRODUCTION

The MPI version of LS-DYNA has been developed by LSTC throughout the 1990s. A primary goal for the code has been superior scalability, enabling many processors to work together efficiently in the execution of LS-DYNA. Important factors affecting performance of the MPI version are processor speed, model characteristics (size and type of elements), communication, and load balance. Processor performance has a very direct impact on MPI LS-DYNA elapsed time, but for the purposes of this study the processor type is held fixed, so performance variations due to the processor are not considered. Conversely, communication and load balance can be controlled to some extent by the end user. During the initialization phase of the MPI code, the model is partitioned into several domains; the arrangement of these domains determines how much communication is required between them and also influences the load imbalance among them. This study demonstrates, for several representative input decks, how various partitioning options affect overall performance.

SMP PARALLELISM IN LS-DYNA

The first multi-processor version of LS-DYNA was an SMP (shared memory parallel) implementation available on the CRAY Y-MP. Subsequently, SMP parallel LS-DYNA has been ported to many computer platforms and architectures, ranging from high-end vector supercomputers to RISC-based servers to machines built on low-cost commodity microprocessors.
On all of these computers, SMP parallelism is achieved in basically the same fashion. Compilers use directives inserted into the LS-DYNA source code to generate machine instructions that divide the work of key loops among the processors specified by the user. Only loops which are safe - those that produce correct results when run in parallel - carry these parallel constructs.

In SMP parallelism, several barriers cause less-than-ideal scaling. Chief among them is the fact that only a certain fraction of the loops in the code can run in parallel. Even for the parallel loops, there is startup time while the work is divided among the processors, adding system overhead. Finally, even though the work is scheduled as evenly as possible, load imbalance among the processors means that some processors finish their portion of a loop and must wait for the others before execution can advance to the next section of the LS-DYNA code. Because of these factors, SMP parallel performance is limited. While experts from the LSTC development team and the various computer hardware vendors continually work on improving SMP parallel performance, parallel scalability beyond 4 or 6 processors is not very effective and is not used very often in a production environment.

MPI PARALLELISM IN LS-DYNA

Characteristics of MPI LS-DYNA

The limits on SMP parallelism, not just in LS-DYNA but in many codes, have led to the development of domain decomposition parallel methods. Their chief advantage is that a far greater percentage of the instructions can run in parallel, so the parallel efficiency is higher. In the domain decomposition version of LS-DYNA, the geometry is divided into several domains, one per processor. During every time step, each processor advances the solution for its own domain to the end of that time step. This process is independent of all other domains, so it is highly parallel. However, before work on the next time step can begin, communication must occur to pass information about the state of the solution to neighboring domains. Once this communication is complete, the solution phase of the next time step begins.

The communication in LS-DYNA takes place according to MPI (the message-passing interface), a standard communication protocol consisting of a portable set of communication calls available on all popular computer systems. The code is therefore referred to as either the MPI version or the domain decomposition version of LS-DYNA. Another common name is MPP-DYNA. The MPP moniker was originally associated with the massively parallel processing machines of the mid-1990s, but as parallel computer architectures evolved, "massively parallel" became something of a misnomer. Today MPP-DYNA more accurately refers to the message-passing parallel version of the code, as opposed to the shared memory parallel version of LS-DYNA. The MPP machines were envisioned with thousands of processors applied to execution of the same binary. Parallel speedup of computers is limited by Amdahl's Law.
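The scaling curves that follow in Figure 1 come directly from Amdahl's law: speedup = 1 / ((1 - p) + p/N) for a parallel fraction p on N processors. A minimal Python sketch (illustrative only, not part of the paper's tooling):

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Amdahl's law: elapsed-time speedup on n_procs processors for a
    code whose parallel_fraction of work runs perfectly in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_procs)

# A 98%-parallel code still scales usefully at 64 processors...
speedup_64 = amdahl_speedup(0.98, 64)      # about 28x
# ...but its ceiling is 1/0.02 = 50x no matter how many processors are
# used, while hundreds of processors demand >99.9% parallel content:
ceiling = amdahl_speedup(0.98, 10**9)
```

The ceiling set by the serial fraction is why scaling to hundreds or thousands of processors requires more than 99.9% parallel content.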
Figure 1 shows how a code can scale as a function of the percentage of time spent in parallel routines. In order to use hundreds (or thousands) of processors, a code must be more than 99.9% parallel. For a full-featured industrial code like LS-DYNA, this is very difficult. Fortunately, LSTC has been able to make the MPI version contain enough parallel content that in many cases it is more than 98% parallel and can scale efficiently to 64 or more processors. Since microprocessors have become very powerful in recent years, this level of scalability coupled with fast processors allows the user to solve large LS-DYNA simulations in a reasonable elapsed time.

It is apparent that a well-selected set of domains can influence MPI LS-DYNA performance. Load balance between the domains is achieved by ensuring that each domain requires an equal amount of work at each time step. This is, of course, more complicated than simply dividing the total number of elements in the model into groups with the same number of elements. The computational cost of different types of elements and materials is one factor which complicates the load balance. In addition, time spent in contact has a great influence on computational cost. The time spent in contact also changes during the solution, so a domain decomposition which initially produces good load balance may end up with much worse load balance characteristics. The relative cost of communication is also important: if the domains are established poorly, large amounts of communication may be required between domains.
For these reasons, selection of the proper domain decomposition does influence performance.

Figure 1. Amdahl's law for parallel speedup. Each line corresponds to the speedup for a particular percentage of parallel instructions, from 5% to 99.9% parallel.

Default Partitioning

There are several options available for partitioning the LS-DYNA model; these are documented in the LS-DYNA Users Manual. The default method used for partitioning is RCB (recursive coordinate bisection). The other methods available are RSB (recursive spectral bisection) and greedy. The default RCB method usually produces the shortest elapsed time, so the others are not investigated here. The partitioning method can be specified in the partitioning file, an optional file read by the MPI LS-DYNA binary in which the user can specify several options relating to partitioning. A useful tool for investigating the partitioning is the show command in the partitioning file. Enabling this option causes LS-DYNA to halt just after initialization, with the partitioning information contained graphically in the D3PLOT file. Figure 2 shows the NCAC Taurus model partitioned into four domains.
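Recursive coordinate bisection itself is simple to sketch. The toy version below (an illustration only, not LSTC's implementation) recursively splits the point set at the median of the axis with the largest extent:

```python
def rcb(points, n_domains):
    """Toy recursive coordinate bisection over (x, y, z) points.
    Assumes n_domains is a power of two, as in bisection schemes."""
    if n_domains == 1:
        return [points]
    # choose the coordinate axis with the largest spatial extent
    axis = max(range(len(points[0])),
               key=lambda a: max(p[a] for p in points) - min(p[a] for p in points))
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2  # median split keeps the element counts balanced
    return rcb(pts[:mid], n_domains // 2) + rcb(pts[mid:], n_domains // 2)

# 8 element centroids spread along x, partitioned into 4 equal domains:
domains = rcb([(float(i), 0.0, 0.0) for i in range(8)], 4)
```

Because each split halves the point count, the resulting domains have equal numbers of points; the real code must also weigh element type, material, and contact cost, as discussed above.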
Figure 2. Default partitioning of the NCAC Taurus model - 4 domains.

Geometric Partitioning

When using RCB, the partitioning can be controlled using both the geometry and the contact surfaces of the model. This is useful because the default partitioning is sometimes less than ideal. For instance, the NCAC Taurus model of Figure 2 is involved in a frontal crash, so most of the contact searching takes place at the front of the vehicle. The two domains at the front of the vehicle therefore require more computation than those at the rear, leading to load imbalance. A better domain decomposition is something like that displayed in Figure 3, where most domains run from the front of the vehicle to the back, so that all domains have similar contact characteristics and better overall load balance.
Figure 3. Customized partitioning of the NCAC Taurus model - 8 domains.

This type of domain decomposition can be achieved using the geometry options expdir and expsf. These options expand the model in a certain direction (expdir) by a certain scale factor (expsf) before the domain decomposition occurs. A typical usage is

decomposition ( expdir 2 expsf 10 )

which stretches the model by a factor of 10 in the Y direction. Once this is done, the regular partitioning algorithm will generally partition the model into long strips from the front to the rear of the vehicle, as desired.

Partitioning According to Contact Surfaces

For some crash input decks, the model can be efficiently partitioned according to the contact surfaces (the sliding interfaces) through use of the silist option in the partitioning file. A typical frontal crash model may have one or perhaps a few sliding interfaces which involve the elements toward the front end of the vehicle, where the contact occurs. Overall computational cost is lessened by not making elements toward the rear of the vehicle members of these sliding interface sets. Some less important sliding interfaces may exist to handle particular cases of contact. By assigning the more important sliding interfaces to the silist, MPI LS-DYNA first partitions the elements associated with those interfaces and then follows with the remainder of the elements. This helps prevent load balance problems, since the primary contact surfaces are then generally evenly distributed among the processors.
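The effect of expdir and expsf is easy to see in a toy setting: scaling one coordinate before bisection changes which axis is cut first. A sketch under assumed names (axes are 0-based here, whereas LS-DYNA's expdir is 1-based):

```python
def first_cut_axis(points, expdir=None, expsf=1.0):
    """Axis a coordinate-bisection scheme would cut first, optionally
    after stretching coordinate `expdir` by `expsf` (mimicking the
    expdir/expsf partitioning options)."""
    if expdir is not None:
        points = [tuple(c * expsf if a == expdir else c
                        for a, c in enumerate(p)) for p in points]
    extents = [max(p[a] for p in points) - min(p[a] for p in points)
               for a in range(len(points[0]))]
    return extents.index(max(extents))

# A car-like bounding box: 5 units long (x), 2 wide (y), 1 tall (z)
corners = [(x, y, z) for x in (0.0, 5.0) for y in (0.0, 2.0) for z in (0.0, 1.0)]
ax_default = first_cut_axis(corners)                       # x is cut: domains cross the car
ax_custom = first_cut_axis(corners, expdir=1, expsf=10.0)  # stretched y is cut instead,
                                                           # yielding front-to-rear strips
```

Since the scaling is applied only for the decomposition, the analysis itself runs on the unmodified geometry.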
LS-DYNA PERFORMANCE IN STAMPING SIMULATIONS

Even though it is a general-purpose code, LS-DYNA has historically been used primarily for crash simulation and metal stamping, so those applications are examined in this study. This portion of the study examines some of the performance characteristics of stamping simulations.

General Scalability of Stamping Input Decks

The scalability of MPI LS-DYNA improves as the number of elements in the model grows. In order to assess the practical limits of scalability, several metal stamping input decks of various sizes were provided by Volvo Car Corporation:

Model Identifier   Size (elements)
Stamp-141          141,000
Stamp-257          257,000
Stamp-655          655,000
Stamp-19           1,900,000

Each input deck was run on a range of processor counts, between 1 and 64, and the timing information was recorded from the D3HSP file. For each input deck, the elapsed times of both the initialization phase and the simulation phase decrease as more processors are used for the analysis. Figure 4 shows an example of such data.

Figure 4. Parallel speedup for the Stamp-655 input deck (calculation and initialization elapsed time in seconds versus number of processors).
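Elapsed-time data of this kind reduces naturally to the speedup gained at each doubling of the processor count. A small helper (the timings here are made up for illustration):

```python
def doubling_speedups(elapsed_by_procs):
    """Given {processor_count: elapsed_seconds}, return the speedup
    achieved at each doubling of the processor count (ideal: 2.0)."""
    counts = sorted(elapsed_by_procs)
    return {f"{a}->{b}": elapsed_by_procs[a] / elapsed_by_procs[b]
            for a, b in zip(counts, counts[1:])}

# Hypothetical timings showing the usual diminishing returns:
times = {1: 6400.0, 2: 3300.0, 4: 1750.0, 8: 980.0, 16: 600.0, 32: 420.0, 64: 380.0}
ratios = doubling_speedups(times)
# ratios["1->2"] is close to the ideal 2.0; ratios["32->64"] shows the gain fading
```

A per-doubling ratio near 2.0 indicates near-ideal scaling; values approaching 1.0 (or below) mean additional processors are no longer paying for their overhead.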
Data for all four input decks are summarized in Figure 5. In this plot, the vertical axis represents the elapsed-time speedup obtained as the number of processors is doubled. Two effects are obvious:

1. As more and more processors are used, the gain from parallelism decreases. However, for the larger models it is still reasonable to use between 32 and 64 processors efficiently.
2. Better parallelism is observed for the larger models. For the smallest model used in this study, the elapsed time actually increased as the number of processors went from 32 to 64. Smaller models generate too much overhead, which can't be overcome with parallelism at these processor counts.

Figure 5. Parallel speedup of all stamping input decks (Stamp-141, Stamp-257, Stamp-655, Stamp-19) for each doubling of the processor count, from 1->2 through 32->64.

Partitioning Options for Stamping

The LS-DYNA Users Manual recommends the following partitioning file options for stamping input decks (assuming that the direction of travel for the punch is the Z direction):

decomposition ( expdir 3 expsf )

This essentially flattens the input deck geometry before partitioning, so the model is partitioned into a series of columns in the Z direction.

Partitioning Study with a Metal Stamping Input Deck

Two of the Volvo input decks were run to completion as a test of the partitioning options. Each deck was run with the LSTC recommendation, and also with the other two directions as a comparison. Results of this experiment are shown in Figures 6 and 7. These figures verify that the recommendations in the Users Manual are correct, demonstrating that partitioning in the Z direction can substantially improve elapsed-time performance. In addition, the recommended option appears more robust, since partitioning in the other
directions sometimes results in the simulation stopping at a point just before the planned completion of the analysis.

Figure 6. Partitioning results for the Stamp-257 model (elapsed time in hours for expdir 1, expdir 2, and expdir 3).

Figure 7. Partitioning results for the Stamp-141 model (elapsed time in hours for expdir 1, expdir 2, and expdir 3).
LS-DYNA PERFORMANCE IN AUTOMOTIVE CRASH SIMULATIONS

General Scalability of Automotive Crash Input Decks

General scalability of MPI LS-DYNA for automotive crash was measured by testing a variety of input decks:

Model Identifier        Size (elements)
WPI Rigid Pole          5,500
NCAC Taurus             28,000
Customer Front Offset   100,000
NCAC Neon               275,000

Figures 8-11 show scalability results for these input decks, showing how the elapsed time decreases as the number of processors increases. In general, the performance of these input decks is very similar to the tendencies observed in the stamping scalability study.

Figure 8. WPI Rigid Pole scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

Scalability on a Workstation Cluster

Use of the domain decomposition version of LS-DYNA allows users to run simulations on a network of computers. As a first step in evaluating the efficiency of that concept, all of the automotive crash input decks and one of the stamping input decks were run on a small network of workstations in addition to the IBM RS/6000 SP computer. This network of workstations consisted of two IBM 43P Model 260 workstations, each with a pair of 200 MHz POWER3 processors, identical to those found in the RS/6000 SP system. Therefore, the only significant difference between the two configurations was the interconnect:
System                RS/6000 SP   RS/6000 workstation cluster
Communication         switch       1T Ethernet
Latency (microsec)    24           high
Bandwidth (MB/sec)

Figure 9. NCAC Taurus scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

This experiment was therefore able to demonstrate the importance of the role of communication in the performance of the domain decomposition version of LS-DYNA. As expected, the results for one and two processors (see Figures 8-12) are very close, since at this processor count all calculations reside on one node of the RS/6000 SP system or on one workstation, and there is no communication between nodes.
Figure 10. Customer 100,000-element model scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

Figure 11. NCAC Neon scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).
Figure 12. Stamp-141 scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

When using four processors, the slower Ethernet communication causes a slight increase in the overall elapsed time of the jobs, but many users should find this an attractive and economical way to utilize their computing resources. In order to gather data within convenient elapsed times, the larger simulations in this study were shortened in duration. One possible concern with this approach is that the communication characteristics could change as the simulation progresses, where contact plays a more important role. Therefore, the metalforming and 100,000-element crash simulations were both run to completion. A comparison of the elapsed times shows that even for the full simulations, the workstation cluster is only 3% slower than the RS/6000 SP system, so the partial simulation results are valid for comparing timings.

Partitioning Study with Automotive Crash Input Decks

The three largest representative crash simulation input decks were run using various settings for partitioning, to determine the effect on performance. The results are summarized in Figures 13-15. These bar charts show how the elapsed times of the jobs change with the number of processors, for both default partitioning and partitioning using options such as:

decomposition ( expdir 2 expsf 10 silist 1,2 )
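For reference, the options discussed in this paper can be combined in a single partitioning file entry. The values below are illustrative only (the exact spelling and defaults should be checked against the LS-DYNA Users Manual for the release in use):

```
decomposition ( expdir 2 expsf 10 silist 1,2 show )
```

Here show halts the run after initialization so the resulting domains can be inspected in the D3PLOT file before committing to a full simulation.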
Figure 13. Partitioning effects on the NCAC Taurus model (elapsed time in hours, default versus custom partitioning).

Figure 14. Partitioning effects on the 100,000-element customer model (elapsed time in hours, default versus custom partitioning).
Figure 15. Partitioning effects on the NCAC Neon model (elapsed time in hours, default versus custom partitioning).

Two general trends concerning usage of these partitioning options can be observed from the performance data:

1. Customized partitioning does improve performance when using 4 or 8 processors.
2. Customized partitioning degrades performance when using 16 or more processors.

The most significant performance change is observed for the 100,000-element model, where customized partitioning with 4 processors improves the elapsed time from 7.03 hours to 5.16 hours, a gain of 36%.

SUMMARY

For stamping simulations, it is prudent to follow the recommended partitioning settings described in the LS-DYNA Users Manual. This has been shown to provide a faster and more robust solution. For automotive crash simulations using up to 8 processors, partitioning with the silist, expdir, and expsf options can improve performance, by up to 36% for one of the examples studied here. For 16 or more processors, default partitioning provides better performance than customized partitioning.

ACKNOWLEDGEMENTS

Many people and organizations contributed to this effort. Particular thanks go to the following:

- Volvo Car Corporation - Olofstrom stamping group, for use of the stamping input decks
- NCAC - for use of the public domain Taurus and Neon crash simulation input decks
- WPI - for use of the public domain rigid pole input deck
- LSTC - for overall assistance, guidance, and provision of required binaries and license files
More informationApplication of Parallel Processing to Rendering in a Virtual Reality System
Application of Parallel Processing to Rendering in a Virtual Reality System Shaun Bangay Peter Clayton David Sewry Department of Computer Science Rhodes University Grahamstown, 6140 South Africa Internet:
More informationAn Analysis of Object Orientated Methodologies in a Parallel Computing Environment
An Analysis of Object Orientated Methodologies in a Parallel Computing Environment Travis Frisinger Computer Science Department University of Wisconsin-Eau Claire Eau Claire, WI 54702 frisintm@uwec.edu
More informationHybrid (MPP+OpenMP) version of LS-DYNA
Hybrid (MPP+OpenMP) version of LS-DYNA LS-DYNA Forum 2011 Jason Wang Oct. 12, 2011 Outline 1) Why MPP HYBRID 2) What is HYBRID 3) Benefits 4) How to use HYBRID Why HYBRID LS-DYNA LS-DYNA/MPP Speedup, 10M
More informationParallel and distributed crashworthiness simulation
Parallel and distributed crashworthiness simulation G. Lonsdale C&C Research Laboratories, NEC Europe Ltd., St. Augustin, Germany J. Clinckemaillie ESI SA, Rungis, France S. Meliciani ESI SA, Rungis, France
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationHigh Performance Computing. Introduction to Parallel Computing
High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials
More informationLS-DYNA Performance on 64-Bit Intel Xeon Processor-Based Clusters
9 th International LS-DYNA Users Conference Computing / Code Technology (2) LS-DYNA Performance on 64-Bit Intel Xeon Processor-Based Clusters Tim Prince, PhD ME Hisaki Ohara, MS IS Nick Meng, MS EE Intel
More informationReal Parallel Computers
Real Parallel Computers Modular data centers Overview Short history of parallel machines Cluster computing Blue Gene supercomputer Performance development, top-500 DAS: Distributed supercomputing Short
More informationIntroduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014
Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational
More informationThe Role of Performance
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware
More informationLS-DYNA Performance Benchmark and Profiling. April 2015
LS-DYNA Performance Benchmark and Profiling April 2015 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource
More informationCSC630/CSC730 Parallel & Distributed Computing
CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2
More informationReasons for Scatter in Crash Simulation Results
4 th European LS-DYNA Users Conference Crash / Automotive Applications III Reasons for Scatter in Crash Simulation Results Authors: Clemens-August Thole, Liquan Mei Fraunhofer Institute for Algorithms
More informationData Partitioning. Figure 1-31: Communication Topologies. Regular Partitions
Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy
More informationHistory. Jurassic Period (-1945)
History Jurassic Period (-1945) Astronomical/Geometric tools Schickard, Leibniz, Pascal (17th Cen.) Babbage: Analytic engine (1842) first programmable machine. Scheutz (1853) Hollerith (1911) IBM (1924)
More informationPARALLEL BUCKET SORTING ALGORITHM
PARALLEL BUCKET SORTING ALGORITHM Hiep Hong Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 hiepst@yahoo.com ABSTRACT Bucket sorting algorithm achieves O(n) running
More informationPerformance Benefits of NVIDIA GPUs for LS-DYNA
Performance Benefits of NVIDIA GPUs for LS-DYNA Mr. Stan Posey and Dr. Srinivas Kodiyalam NVIDIA Corporation, Santa Clara, CA, USA Summary: This work examines the performance characteristics of LS-DYNA
More informationDell EMC Ready Bundle for HPC Digital Manufacturing Dassault Systѐmes Simulia Abaqus Performance
Dell EMC Ready Bundle for HPC Digital Manufacturing Dassault Systѐmes Simulia Abaqus Performance This Dell EMC technical white paper discusses performance benchmarking results and analysis for Simulia
More informationRecent Developments and Roadmap Part 0: Introduction. 12 th International LS-DYNA User s Conference June 5, 2012
Recent Developments and Roadmap Part 0: Introduction 12 th International LS-DYNA User s Conference June 5, 2012 1 Outline Introduction Recent developments. See the separate PDFs for: LS-PrePost Dummies
More informationWhat is Parallel Computing?
What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing
More informationSeminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm
Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationDevelopment of a Modeling Tool for Collaborative Finite Element Analysis
Development of a Modeling Tool for Collaborative Finite Element Analysis Åke Burman and Martin Eriksson Division of Machine Design, Department of Design Sciences, Lund Institute of Technology at Lund Unversity,
More informationThe Use of Cloud Computing Resources in an HPC Environment
The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes
More informationResource allocation and utilization in the Blue Gene/L supercomputer
Resource allocation and utilization in the Blue Gene/L supercomputer Tamar Domany, Y Aridor, O Goldshmidt, Y Kliteynik, EShmueli, U Silbershtein IBM Labs in Haifa Agenda Blue Gene/L Background Blue Gene/L
More informationMulti-core Programming - Introduction
Multi-core Programming - Introduction Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationz/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2
z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os 1.2 introduced a new heuristic for determining whether it is more efficient in terms
More informationPARALLEL DECOMPOSITION OF 100-MILLION DOF MESHES INTO HIERARCHICAL SUBDOMAINS
Technical Report of ADVENTURE Project ADV-99-1 (1999) PARALLEL DECOMPOSITION OF 100-MILLION DOF MESHES INTO HIERARCHICAL SUBDOMAINS Hiroyuki TAKUBO and Shinobu YOSHIMURA School of Engineering University
More informationAnalytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.
Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationItanium a viable next-generation technology for crash simulation?
4 th European LS-DYNA Users Conference MPP / Linux Cluster / Hardware I Itanium a viable next-generation technology for crash simulation? Author: Adrian Hillcoat Hewlett-Packard Ltd Bracknell, UK Correspondence:
More informationChapter 13 Strong Scaling
Chapter 13 Strong Scaling Part I. Preliminaries Part II. Tightly Coupled Multicore Chapter 6. Parallel Loops Chapter 7. Parallel Loop Schedules Chapter 8. Parallel Reduction Chapter 9. Reduction Variables
More informationHigh Performance Computing
The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical
More informationScript for Automated One Step Forming Analysis using LS-DYNA and LS-Prepost
12 th International LS-DYNA Users Conference Automotive(2) Script for Automated One Step Forming Analysis using LS-DYNA and LS-Prepost Amit Nair, Dilip Bhalsod Livermore Software Technology Corporation
More informationPARAMETRIC FINITE ELEMENT MODEL OF A SPORTS UTILITY VEHICLE - DEVELOPMENT AND VALIDATION
7 th International LS-DYNA Users Conference Crash/Safety (3) PARAMETRIC FINITE ELEMENT MODEL OF A SPORTS UTILITY VEHICLE - DEVELOPMENT AND VALIDATION Gustavo A. Aramayo Computation Materials Science Group
More informationChapter 3: AIS Enhancements Through Information Technology and Networks
Accounting Information Systems: Essential Concepts and Applications Fourth Edition by Wilkinson, Cerullo, Raval, and Wong-On-Wing Chapter 3: AIS Enhancements Through Information Technology and Networks
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.
More informationTEST CASE DOCUMENTATION AND TESTING RESULTS TEST CASE ID AWG-ERIF-10. MAT 224 Dynamic Punch Test Aluminium 2024 LSTC-QA-LS-DYNA-AWG-ERIF-10-9
TEST CASE DOCUMENTATION AND TESTING RESULTS LSTC-QA-LS-DYNA-AWG-ERIF-10-9 TEST CASE ID AWG-ERIF-10 MAT 224 Dynamic Punch Test Aluminium 2024 Tested with LS-DYNA R R10 Revision 116539 Thursday 11 th May,
More informationImage-Space-Parallel Direct Volume Rendering on a Cluster of PCs
Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr
More informationDesign and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications
Design and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications Wei-keng Liao Alok Choudhary ECE Department Northwestern University Evanston, IL Donald Weiner Pramod Varshney EECS Department
More informationExperiences with the Parallel Virtual File System (PVFS) in Linux Clusters
Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Kent Milfeld, Avijit Purkayastha, Chona Guiang Texas Advanced Computing Center The University of Texas Austin, Texas USA Abstract
More informationLECTURE 1. Introduction
LECTURE 1 Introduction CLASSES OF COMPUTERS When we think of a computer, most of us might first think of our laptop or maybe one of the desktop machines frequently used in the Majors Lab. Computers, however,
More informationWhite Paper. Low Cost High Availability Clustering for the Enterprise. Jointly published by Winchester Systems Inc. and Red Hat Inc.
White Paper Low Cost High Availability Clustering for the Enterprise Jointly published by Winchester Systems Inc. and Red Hat Inc. Linux Clustering Moves Into the Enterprise Mention clustering and Linux
More informationSystem Architecture PARALLEL FILE SYSTEMS
Software and the Performance Effects of Parallel Architectures Keith F. Olsen,, Poughkeepsie, NY James T. West,, Austin, TX ABSTRACT There are a number of different parallel architectures: parallel hardware
More information1 Motivation for Improving Matrix Multiplication
CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n
More informationChapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationParallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism
Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large
More information