Partitioning Effects on MPI LS-DYNA Performance


Jeffrey G. Zais
IBM
138 Third Street
Hudson, WI

Abbreviations:
MPI - message-passing interface
RISC - reduced instruction set computing
SMP - shared memory parallel
MPP - massively parallel processing

Keywords: Crash Simulation, LS-DYNA, Metal Stamping, Parallel Computing, Performance, Workstation

ABSTRACT

The MPI version of LS-DYNA includes several options for decomposition of the finite element model. In this paper, the use of these options will be explored for both metal forming and automotive crash simulations, with input decks ranging in size from small to very large. The effect on elapsed time performance and scalability will be measured for different partitioning options. In addition, the performance characteristics of a workstation cluster will be evaluated.

INTRODUCTION

The MPI version of LS-DYNA has been developed by LSTC throughout the 1990s. A primary goal for the code has been superior scalability, enabling many processors to work together efficiently in the execution of LS-DYNA. The important aspects which affect performance of the MPI version are processor speed, model characteristics (size and type of elements), communication, and load balance. Processor performance has a very direct impact on MPI LS-DYNA elapsed time, but for the purposes of this study the type of processor is considered fixed, so variations in performance due to processor choice will not be considered. Conversely, communication and load balance can be controlled to some extent by the end user. During the initialization phase of the MPI code, the model is partitioned into several domains, and the arrangement of these domains determines how much communication is required between the domains and also influences the load imbalance between them. This study demonstrates, for several representative input decks, how various partitioning options affect overall performance.

SMP PARALLELISM IN LS-DYNA

The first LS-DYNA multi-processor version was an SMP (shared memory parallel) implementation available on the CRAY Y-MP. Subsequently, SMP parallel LS-DYNA has been ported to many computer platforms and architectures, ranging from high-end vector supercomputers to RISC-based servers to machines based on low-cost commodity microprocessors. For all of these computers, SMP parallelism is achieved in basically the same fashion: compilers use directives inserted into the LS-DYNA source code to generate machine instructions which divide the work for key loops among the processors specified by the user. Only loops which are safe - those that will produce correct results when run in parallel - have these parallel constructs.

In SMP parallelism, there are several barriers which cause less than ideal scaling. Chief among them is the fact that only a certain number of the loops in the code are able to run in parallel. Even for those loops that are parallel, there is startup time while the work is divided among the processors, adding to system overhead. Finally, even though the work is scheduled as evenly as possible, load imbalance among the processors means that some processors will complete their portion of the work in a loop and must wait for the others before the machine can advance to the next section of the LS-DYNA code. Because of all these factors, SMP parallel performance is limited. While experts from the LSTC development team and the various computer hardware vendors continually work on improving SMP parallel performance, parallel scalability beyond 4 or 6 processors is not very effective, and is not used very often in a production environment.

MPI PARALLELISM IN LS-DYNA

Characteristics of MPI LS-DYNA

The limits on SMP parallelism, not just in LS-DYNA but in many codes, have led to the development of domain decomposition parallel methods. The chief advantage is that with domain decomposition a far greater percentage of the instructions can be run in parallel, so the parallel efficiency is greater. In the domain decomposition version of LS-DYNA, the geometry is divided into several domains, one for each processor. During every time step, each processor advances the solution for its own domain to the end of that time step. This process is independent of all other domains, so it is highly parallel. However, before work on the next time step can begin, communication must occur to relate information on the state of the solution to the neighboring domains. Once this communication is complete, the solution phase of the next time step begins.

The communication in LS-DYNA takes place according to MPI (the message-passing interface), a standard communication protocol. This is a portable set of communication calls available on all popular computer systems. Therefore, the code is referred to as either the MPI version of LS-DYNA or the domain decomposition version of LS-DYNA. Another common name is MPP-DYNA. The MPP moniker was originally associated with the massively parallel processing machines of the mid-1990s, but as parallel computer architectures evolved, "massively parallel" became something of a misnomer. Today MPP-DYNA more accurately refers to the message-passing parallel version of the code, as opposed to the shared memory parallel version of LS-DYNA. The MPP machines were envisioned with thousands of processors applied to the execution of the same binary.

Parallel speedup of computers is limited by Amdahl's law. Figure 1 shows how a code can scale as a function of the percentage of time spent in parallel routines. In order to use hundreds (or thousands) of processors, a code must be more than 99.9% parallel. For a full-featured industrial code like LS-DYNA, this is very difficult. Fortunately, LSTC has been able to make the MPI version contain enough parallel content that in many cases it is more than 98% parallel and can scale efficiently up to 64 or more processors. Since microprocessors have become very powerful in recent years, this level of scalability coupled with fast processors allows the user to solve large LS-DYNA simulations in a reasonable elapsed time.
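The curves in Figure 1 follow from Amdahl's law. Stated explicitly (a standard formula, reproduced here for reference), the speedup available on N processors for a code whose parallel fraction is p is

speedup(N) = 1 / ((1 - p) + p/N)

For example, a code that is 98% parallel is limited to a speedup of about 28 on 64 processors, since 1/(0.02 + 0.98/64) is approximately 28.3, and can never exceed 50 (the limit 1/0.02) no matter how many processors are applied. Reaching a speedup near 1000 requires the parallel fraction to exceed 99.9%.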

It is apparent that a well-selected set of domains can influence MPI LS-DYNA performance. Load balance between the domains is achieved by ensuring that each domain requires an equal amount of work at each time step. This is, of course, more complicated than just dividing the total number of elements in the model into groups with the same number of elements. The computational cost of different types of elements and materials is one factor which complicates the load balance. In addition, time spent in contact has a great influence on computational cost. The time spent in contact will also change during the solution, so a decomposition which initially produces good load balance may end up with much worse load balance characteristics. The relative cost of communication is also important: if the domains are established incorrectly, great amounts of communication can be required between domains. Because of these reasons, selection of the proper domain decomposition does influence performance.

Figure 1. Amdahl's law for parallel speedup. Each line corresponds to the speedup for a particular percentage of parallel instructions, from 50% to 99.9% parallel.

Default Partitioning

There are several options available for partitioning the LS-DYNA model. These are documented in the LS-DYNA Users Manual. The default method used for partitioning is RCB (recursive coordinate bisection). The other methods available are RSB (recursive spectral bisection) and greedy. The default RCB method usually produces the shortest elapsed time, so the others will not be investigated here. The partitioning method can be specified in the partitioning file, an optional file read by the MPI LS-DYNA binary, in which the user can specify several options relating to partitioning. A useful tool for investigating the partitioning is the show command in the partitioning file. Enabling this option causes LS-DYNA to halt just after initialization, with the partitioning information graphically contained in the D3PLOT file. Figure 2 shows the NCAC Taurus model partitioned into four domains.
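As an illustration, a minimal partitioning file using this option could contain nothing more than the line below. The parenthesized form follows the decomposition examples quoted later in this paper; the exact keyword syntax for a given LS-DYNA version should be taken from the Users Manual.

decomposition ( show )

Running the MPI binary with such a file halts the job after initialization, and images such as those in Figures 2 and 3 can then be produced from the resulting D3PLOT file.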

Figure 2. Default partitioning of the NCAC Taurus model - 4 domains.

Geometric Partitioning

While using RCB, the partitioning can be controlled using both the geometry and the contact surfaces of the model. This is useful because the default partitioning is sometimes less than ideal. For instance, the NCAC Taurus model of Figure 2 is involved in a frontal crash, so most of the contact searching takes place at the front of the vehicle. Therefore, the two domains at the front of the vehicle will require more computation than those at the rear, leading to load imbalance. A better domain decomposition is something like that displayed in Figure 3, where most domains run from the front of the vehicle to the back, so that all domains have similar contact characteristics and better overall load balance.

Figure 3. Customized partitioning of the NCAC Taurus model - 8 domains.

This type of domain decomposition can be achieved using the geometry options expdir and expsf. These options expand the model in a certain direction (expdir) by a certain scale factor (expsf) before the domain decomposition occurs. A typical usage is

decomposition ( expdir 2 expsf 10 )

which will stretch the model out by a factor of 10 in the Y direction. Once this is done, the regular partitioning algorithm will generally partition the model into long strips running from the front to the rear of the vehicle, as desired.

Partitioning According to Contact Surfaces

For some crash input decks, the model can be efficiently partitioned according to the contact surfaces (the sliding interfaces) through use of the silist option in the partitioning file. A typical frontal crash model may have one or perhaps a few sliding interfaces which involve the elements toward the front end of the vehicle, where the contact occurs. Overall computational cost is lessened by not making elements toward the rear of the vehicle members of these sliding interface sets. Some less important sliding interfaces may exist in order to handle particular cases of contact. By assigning the more important sliding interfaces to the silist, MPI LS-DYNA will first partition the elements associated with those interfaces, and then follow with the remainder of the elements. This helps prevent load balance problems, since the primary contact surfaces are then evenly distributed among the processors.
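The geometric and contact-surface options can be combined in a single partitioning file. The sketch below is illustrative only - the interface numbers given to silist are hypothetical and depend on the numbering of the sliding interfaces in a particular input deck - but it has the same form as the options used for the crash studies later in this paper:

decomposition ( expdir 2 expsf 10 silist 1,2 )

Here the elements belonging to sliding interfaces 1 and 2 are partitioned first, and the Y-direction expansion then causes the remaining elements to be distributed in front-to-rear strips.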

LS-DYNA PERFORMANCE IN STAMPING SIMULATIONS

Even though it is a general-purpose code, LS-DYNA has historically been used primarily for crash simulation and metal stamping, so those applications will be examined in this study. This portion of the study examines some of the performance characteristics of stamping simulations.

General Scalability of Stamping Input Decks

The scalability of MPI LS-DYNA increases as the number of elements in the model grows. In order to assess the practical limits of scalability, several metal stamping input decks of various sizes were provided by Volvo Car Corporation:

Model Identifier    Size (elements)
Stamp-141           141,000
Stamp-257           257,000
Stamp-655           655,000
Stamp-1900          1,900,000

Each input deck was run on a range of processor counts between 1 and 64, and the timing information was recorded from the D3HSP file. For each input deck, both the initialization phase and the simulation phase elapsed times decrease as more processors are used for the analysis. Figure 4 shows an example of such data.

Figure 4. Parallel speedup for the Stamp-655 input deck (elapsed time in seconds, for the initialization and calculation phases, versus number of processors).

Data for all four input decks is summarized in Figure 5. In this plot, the vertical axis represents the elapsed time speedup obtained each time the number of processors is doubled (that is, the ratio T(N)/T(2N), which equals 2.0 for ideal scaling). Two effects are obvious:

1. As more and more processors are used, the gain from parallelism decreases. However, for the larger models it is still reasonable to use between 32 and 64 processors efficiently.
2. Better parallelism is observed for the larger models. For the smallest model used in this study, the elapsed time actually increased as the number of processors went from 32 to 64. Smaller models generate too much overhead, which can't be overcome by parallelism at these processor counts.

Figure 5. Parallel speedup of all stamping input decks (speedup for each doubling of the processor count, from 1->2 through 32->64, for Stamp-141, Stamp-257, Stamp-655, and Stamp-1900).

Partitioning Options for Stamping

The LS-DYNA Users Manual recommends the following partitioning file options for stamping input decks, assuming that the direction of travel for the punch is the Z direction (a complete example file is sketched below):

decomposition ( expdir 3 expsf )

This essentially flattens out the input deck geometry before partitioning, and the model is partitioned into a series of columns in the Z direction.

Partitioning Study with a Metal Stamping Input Deck

Two of the Volvo input decks were run to completion as a test of the partitioning options. Each deck was run according to the LSTC recommendation, plus according to the other two directions as a comparison. Results of this experiment are shown in Figures 6 and 7. These figures verify that the recommendations in the Users Manual are correct, demonstrating that partitioning in the Z direction can substantially improve elapsed time performance. In addition, the recommended setting appears more robust, since partitioning in the other directions sometimes results in the simulation stopping at a point just before the planned completion of the analysis.
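The complete example file referred to above might take the following form. The scale factor shown is purely illustrative, on the assumption that a value well below 1 compresses the punch-travel direction so that RCB makes no cuts along Z and the desired columns result; the Users Manual gives the actual recommended setting.

decomposition ( expdir 3 expsf 0.1 )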

Figure 6. Partitioning results for the Stamp-257 model (elapsed time in hours for expdir 1, expdir 2, and expdir 3).

Figure 7. Partitioning results for the Stamp-141 model (elapsed time in hours for expdir 1, expdir 2, and expdir 3).

LS-DYNA PERFORMANCE IN AUTOMOTIVE CRASH SIMULATIONS

General Scalability of Automotive Crash Input Decks

General scalability of MPI LS-DYNA for automotive crash was measured by testing a variety of input decks:

Elements    Model Identifier
5,500       WPI Rigid Pole
28,000      NCAC Taurus
100,000     Customer Front Offset
275,000     NCAC Neon

Figures 8-11 show scalability results for these input decks, showing how the elapsed time decreases as the number of processors increases. In general, the performance of these input decks is very similar to the tendencies observed in the stamping scalability study.

Figure 8. WPI Rigid Pole scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

Scalability on a Workstation Cluster

Use of the domain decomposition version of LS-DYNA allows users to run simulations on a network of computers. As a first step in evaluating the efficiency of that concept, all of the automotive crash input decks and one of the stamping input decks were run on a small network of workstations in addition to the IBM RS/6000 SP computer. This network of workstations consisted of two IBM 43P Model 260 workstations, each with a pair of 200 MHz POWER3 processors, identical to those found in the RS/6000 SP system. Therefore, the only significant difference between the two configurations was the interconnect:

System                         Communication     Latency (microsec)    Bandwidth (MB/sec)
RS/6000 SP                     switch            24                    -
RS/6000 workstation cluster    100T Ethernet     high                  -

Figure 9. NCAC Taurus scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

Therefore, this experiment was able to demonstrate the importance of the role of communication in the performance of the domain decomposition version of LS-DYNA. As expected, the results for one and two processors (see Figures 8-12) are very close, since at these processor counts all calculations reside on one node of the RS/6000 SP system or on one workstation, and there is no communication between nodes.
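The role of the interconnect can be understood with a simple cost model (standard background, not from the original study): the time to pass a single message is approximately

t(message) = latency + message_size / bandwidth

The boundary exchanges in an explicit code consist of many relatively small messages sent every time step, so the latency term dominates. This is why the 24 microsecond switch holds a growing advantage over Ethernet as the processor count, and with it the number of domain boundaries, increases.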

Figure 10. Customer 100,000 element model scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

Figure 11. NCAC Neon scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

Figure 12. Stamp-141 scalability (elapsed time in seconds on the RS/6000 SP and the workstation cluster).

When using four processors, the slower Ethernet communication causes a slight increase in the overall elapsed time of the jobs, but many users should find this an attractive and economical way to utilize their computing resources. In order to gather data within convenient elapsed times, the larger simulations in this study were shortened in duration. One possible objection to this approach is that the communication characteristics change as the simulation progresses and contact plays a more important role. Therefore, the metal forming and 100,000 element crash simulations were both run to completion. A comparison of the elapsed times shows that even for the full simulations the workstation cluster is only 3% slower than the RS/6000 SP system, so the partial simulation results are valid for comparing timings.

Partitioning Study with some Automotive Crash Input Decks

The three largest representative crash simulation input decks were run using various settings for partitioning, to determine the effect on performance. The results are summarized in Figures 13-15. These bar charts show how the elapsed times of the jobs change for various numbers of processors, for both default partitioning and partitioning using options such as:

decomposition ( expdir 2 expsf 10. silist 1,2 )

Figure 13. Partitioning effects on the NCAC Taurus model (elapsed time in hours, default versus custom partitioning).

Figure 14. Partitioning effects on the 100,000 element customer model (elapsed time in hours, default versus custom partitioning).

Figure 15. Partitioning effects on the NCAC Neon model (elapsed time in hours, default versus custom partitioning).

Two general trends concerning usage of these partitioning options can be observed from the performance data:

1. Customized partitioning does improve performance when using 4 or 8 processors.
2. Customized partitioning degrades performance when using 16 or more processors.

The most significant performance change is observed for the 100,000 element model, where customized partitioning with 4 processors improves the performance from 7.03 hours to 5.16 hours elapsed time, a gain of 36%.

SUMMARY

For stamping simulations, it is prudent to follow the recommended partitioning settings described in the LS-DYNA Users Manual. This has been shown to provide a faster and more robust solution. For automotive crash simulations using up to 8 processors, partitioning with the silist, expdir, and expsf options can improve performance, by up to 36% for one of the examples studied here. For 16 or more processors, default partitioning provides better performance than customized partitioning.

ACKNOWLEDGEMENTS

Many people and organizations contributed to this effort. Particular thanks go to the following:

Volvo Car Corporation - Olofstrom stamping group, for use of the stamping input decks
NCAC - for use of the public domain Taurus and Neon crash simulation input decks
WPI - for use of the public domain rigid pole input deck
LSTC - for overall assistance, guidance, and provision of required binaries and license files

