Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Executive Summary

Complex simulations of structural and systems performance, such as car crash simulations, require a massive number of computations. Today's simulation software relies on highly parallelized codes to take advantage of the large number of cores in high-performance computing (HPC) clusters and so increase processing speed for such simulations. Altair RADIOSS* software, optimized for Intel Xeon processors, uses Hybrid Massively Parallel Processing to deliver outstanding performance on a wide range of computing configurations, from single-node workstations to clusters with thousands of cores.

To evaluate the performance and scalability of Altair RADIOSS on the Intel Xeon processor E7-4890 v2, Altair benchmarked RADIOSS using a modified version of a publicly available crash simulation model of a Chrysler Neon* passenger car. The test ran on a single-node platform with 4 sockets, 60 cores, 120 threads, and 256 GB of memory. RADIOSS took advantage of all 60 cores, running the workload 2.75X faster than on a comparable 24-core platform based on the Intel Xeon processor E5-2695 v2. This paper summarizes the findings of the benchmark.

KEY BENCHMARK RESULTS (see note 1):
- 2.75X faster on a single 60-core node
- 10X performance evolution from the Intel Xeon processor 5100 series to the Intel Xeon processor E5 v2 family
- Message Passing Interface (MPI) delivers excellent performance for single-node systems
- Hybrid Massively Parallel Processing supports massive scalability for large systems

Altair: Driving HPC and Engineering Innovation

Altair knows HPC. The company not only provides product design services, but also develops, uses, and markets its own HPC engineering simulation software, 3D industrial design suites, enterprise analytics solutions, and HPC management tools. Altair understands the needs and demands of companies that rely on computing clusters and compute-intensive applications to develop products and solve problems. The company's 30-year track record for high-end software and consulting services enables Altair to consistently deliver high-value software solutions for its 5000+ customers.

Intel and Altair: Optimizing HPC Solutions Together

Intel and Altair have a long history of collaboration, resulting in Altair software solutions designed and optimized for high-performance computing on Intel architecture. Altair and Intel software developers continue to work closely to tune and optimize Altair codes using the Intel Parallel Studio XE suite, including Intel MPI Library, Intel Fortran and C/C++ compilers, Intel Math Kernel Library, and Intel VTune Amplifier, as well as other Intel products such as Intel Trace Analyzer. Altair has worked closely with Intel to optimize message passing interface (MPI) codes in Altair PBS Works* and to certify PBS Professional* and the RADIOSS*, OptiStruct*, and AcuSolve* solvers as Intel Cluster Ready applications. PBS Professional supports the Intel Xeon Phi coprocessor for accelerated problem solving on Intel architecture-based clusters.

[Figure 1. Hybrid Massively Parallel Processing (HMPP) enables scalability on large clusters. Plot of performance (X times) versus number of cores (8 to 512) for the ideal case, pure MPP, and hybrid runs with 2, 4, and 8 SMP threads.]

About RADIOSS

Altair RADIOSS is a leading structural analysis solver for highly non-linear problems under dynamic loads. For over 25 years, RADIOSS has been an industry standard for automotive crash and impact analysis. Companies around the world and across all industries use RADIOSS to improve the crashworthiness, safety, and manufacturability of their structural designs. The software is known for scalability, quality, and robustness, as well as for providing multi-physics simulation capabilities and supporting advanced materials, such as composites.

Optimized for Parallelism

Altair has optimized RADIOSS to take advantage of multi-core Intel processors in systems ranging from single-node platforms to large clusters. Using Hybrid Massively Parallel Processing (HMPP), Altair engineers combined MPI with OpenMP coding to create a multi-level parallelism model, which enables high scalability and allows users to fine-tune RADIOSS to the workload and the hardware it runs on (Figure 1). The result is strong performance and flexibility for a wide range of structural analysis problems on systems of any size.

RADIOSS Performance on Intel Xeon Processors

Working with Intel engineers over the years to achieve key hardware and software optimizations, Altair developers have continually improved RADIOSS performance on Intel architecture, resulting in a 10X performance increase from the Intel Xeon processor 5100 series to the Intel Xeon processor E5 v2 family (Figure 2). The expanding core counts on Intel processor-based platforms continue to improve RADIOSS scalability. Indeed, a single node today can deliver the performance of large clusters from just a few years ago.

Benchmarking RADIOSS on Intel Xeon Processor E7-4890 v2

Altair has benchmarked RADIOSS performance in numerous scenarios using a modified version of a publicly available model of the Chrysler Neon (see Figure 7). This freely available model has been used in previous benchmarks in the automotive industry; the modified version has 1 million elements, versus only 270,000 elements in the public version. By today's standards this is a moderately sized model. Note that a car crash is a short-term event, lasting only 80 milliseconds (ms) in the present case. The model was run with RADIOSS on a 60-core, single-node platform based on the Intel Xeon processor E7-4890 v2. RADIOSS performed beyond expectations during these benchmarks, showing that time to solution for complex crash simulations can be shortened on smaller systems as well as on large HPC clusters.

[Figure 2. RADIOSS single-node performance evolution for Intel processor-based platforms. Axes: performance (X times), number of cores per node, and frequency (GHz), plotted for successive dual-socket Intel Xeon platforms up to the E5-2696 v2 (2 x 12 cores at 2.5 GHz).]

Intel Xeon Processor E7-4890 v2

The Intel Xeon processor E7-4890 v2 integrates 15 cores with Intel Hyper-Threading technology (30 simultaneous threads; see note 2), 37.5 MB of cache, three 8 gigatransfer-per-second (GT/s) Intel QuickPath Interconnect (Intel QPI) links, and Intel Advanced Vector Extensions (Intel AVX; see note 3) onto a single 22nm-process die. Built into a 4-socket platform with 256 GB of memory, this configuration offers highly scalable performance for complex simulations using RADIOSS. With 60 cores and 120 threads, Altair was able to analyze the capabilities of HMPP and the response from a large number of cores on a node. At low node counts, pure MPI generally outperforms HMPP, while at high core counts, it is advised to use only one MPI process per socket with a number of OpenMP threads that matches the number of cores per socket.

Test Platforms

The systems used for testing are listed in Table 1.

Key Findings

The benchmarks showed outstanding performance, revealing the efficiency of RADIOSS on the Intel Xeon processor E7-4890 v2-based platform for systems with just a few nodes.

Pure MPI Parallelism Benchmarking

Using only MPI parallelism, RADIOSS processed the benchmark simulation 2.75X faster on the 4-socket Intel Xeon processor E7-4890 v2-based platform than on the 2-socket platform based on the Intel Xeon processor E5-2695 v2 with a total of 24 cores (Figure 3).

HMPP Benchmarking

Hybrid MPP parallelization makes it possible to maintain scalability when pure MPI efficiency drops, that is, when the number of MPI domains and the associated inter-MPI communication cost increase on systems with higher core counts. Using HMPP, RADIOSS has been shown to be highly scalable and efficient on large HPC clusters with up to thousands of cores (Figure 4). Depending on the hardware platform, users can adjust the decomposition of their problems across MPI domains, while maximizing performance across cores with multithreading based on OpenMP.
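As a rough illustration of the process layouts discussed above, the following Python sketch computes how many MPI processes and OpenMP threads per process each strategy would use on the 4-socket test node. The Node class and the function names are hypothetical helpers introduced only for this sketch; they are not part of RADIOSS or of any Intel tool, and the sketch does not launch a solver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one compute node (illustration only)."""
    sockets: int
    cores_per_socket: int
    threads_per_core: int = 1  # 2 when Hyper-Threading is enabled

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def hardware_threads(self) -> int:
        return self.cores * self.threads_per_core


def pure_mpi(node: Node) -> tuple[int, int]:
    """One single-threaded MPI process per physical core."""
    return node.cores, 1


def one_process_per_socket(node: Node) -> tuple[int, int]:
    """One MPI process per socket, one OpenMP thread per core of that socket."""
    return node.sockets, node.cores_per_socket


def one_process_per_core_with_ht(node: Node) -> tuple[int, int]:
    """One MPI process per core, one OpenMP thread per hardware thread."""
    return node.cores, node.threads_per_core


# The 4-socket, 15-cores-per-socket node from the text, Hyper-Threading on.
e7_node = Node(sockets=4, cores_per_socket=15, threads_per_core=2)

for label, layout in (
    ("pure MPI", pure_mpi),
    ("one MPI process per socket", one_process_per_socket),
    ("MPI per core with Hyper-Threading", one_process_per_core_with_ht),
):
    processes, threads = layout(e7_node)
    print(f"{label}: {processes} MPI x {threads} OpenMP "
          f"= {processes * threads} threads")
```

The first layout corresponds to the pure MPI runs on 60 cores, the second to the recommendation for high core counts, and the third to the 2-threads-per-process configuration used with Hyper-Threading.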
At a high number of cores, a configuration that runs one MPI process per socket and executes as many OpenMP threads as there are cores per socket has been shown to deliver optimal scalability. As illustrated in Figure 4, scalability with pure MPI drops after 16 nodes, while for HMPP with 8 OpenMP threads per MPI process, RADIOSS continues to scale up to 64 nodes. Thus, with HMPP, it is possible to scale up to 1,024 cores even when running a moderately sized model of 1 million elements.

For smaller systems, Altair believes that, because of its optimized codes and the highly efficient communications of the Intel Xeon processor E7-4890 v2-based platform, single-domain decomposition under MPI delivers the best computational performance (best data locality). By using the Intel MPI Library, which is optimized to take advantage of the virtual shared memory of the Intel Xeon processor E7 family, the communication cost on the 4-socket platform stays extremely low.

Table 1. Benchmarking test platforms.

                       E5-2695 v2-based platform         E7-4890 v2-based platform
CPU                    Intel Xeon processor E5-2695 v2   Intel Xeon processor E7-4890 v2
Sockets                2                                 4
Cores/threads          12/24                             15/30
Total cores/threads    24/48                             60/120
Cache                  30 MB                             37.5 MB
Memory                 128 GB; 16 DDR3                   256 GB; 16 DDR3
Frequency              2.5 GHz                           2.8 GHz

Hyper-Threading

Intel Hyper-Threading technology provides an additional performance boost of approximately 5 percent when activated, by taking advantage of all 120 threads. Hyper-Threading is particularly well suited to single-node systems. As shown in Figure 5, Altair used HMPP with 2 OpenMP threads per MPI process to take advantage of the 2 threads per core available with Hyper-Threading.

Single- Versus Double-Precision Floating-Point Benchmark

Comparisons between single-precision (SP) and double-precision (DP) floating point showed that RADIOSS performed up to 1.5X better overall with single precision on a single-node system. Here, too, Intel Hyper-Threading technology offers a 5 percent gain in performance.

Note: The standard version of RADIOSS uses double-precision floating point. Altair has developed a single-precision version that enables users to improve performance by a ratio of around 1.5X. This is an extended single-precision version, which continues to use double precision at certain critical places, keeping very good accuracy while maximizing performance.

RADIOSS DIFFERENTIATORS:
- Optimization-enabled crash simulation, including advanced materials such as composites.
- High scalability due to the advanced multi-processor solution, Hybrid Massively Parallel Processing (HMPP).
- High-quality formulation and complete material and rupture library.
- Fully repeatable results, regardless of the number of cores, nodes, or threads used in parallel computation.
- Full integration with Altair's PBS Professional* for HPC workload management.

[Figure 3. Scalability improvement on Intel Xeon processor E7-4890 v2. Chart title: Neon 1M 80ms. Elapsed time (s) for 24 MPI processes on the processor E5 v2 at 2.5 GHz versus 60 MPI processes on the processor E7 v2 at 2.8 GHz (293 s).]
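To make the pure MPI versus HMPP trade-off concrete, the short Python sketch below counts how many MPI domains the model would be split into for each node count and thread count in the cluster study. It assumes 16 cores per node, as stated in the Figure 4 configuration note; the function name mpi_domains is a hypothetical helper for illustration and not a RADIOSS setting.

```python
CORES_PER_NODE = 16  # dual 8-core processors per node, per the Figure 4 note


def mpi_domains(nodes: int, omp_threads: int,
                cores_per_node: int = CORES_PER_NODE) -> int:
    """Number of MPI domains when every core runs exactly one OpenMP thread."""
    total_cores = nodes * cores_per_node
    if total_cores % omp_threads != 0:
        raise ValueError("thread count must divide the total core count")
    return total_cores // omp_threads


thread_counts = (1, 2, 4, 8)
print("nodes  cores  " + "  ".join(f"{t} thr" for t in thread_counts))
for nodes in (2, 4, 8, 16, 32, 64):
    domains = [mpi_domains(nodes, t) for t in thread_counts]
    row = "  ".join(f"{d:5d}" for d in domains)
    print(f"{nodes:5d}  {nodes * CORES_PER_NODE:5d}  {row}")
```

With one thread per process, the 1-million-element model is cut into as many domains as there are cores, so each domain shrinks and communication grows as nodes are added. With 8 threads per process, the same core count is reached with one eighth as many domains, which is consistent with the observation that the hybrid runs keep scaling after pure MPI levels off.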

[Figure 4. Tuning HMPP performance for RADIOSS across clusters. Chart title: Neon Refined 1 Million 80ms Scalability Study vs Number of SMP Threads, Up To 1,024 Cores. Elapsed time (s) on 2, 4, 8, 16, 32, and 64 nodes for 1, 2, 4, and 8 threads. Each node is an HP SL230 Gen8 with dual Intel Xeon processor E5-2670 at 2.6 GHz, 16 cores, and 128 GB of 1600 MHz DIMM memory per node; InfiniBand FDR.]

[Figure 5. Tuning HMPP performance with Hyper-Threading. Chart title: Neon 1M 80ms Hyper-Threading Test. Elapsed time (s): 293 for 60 MPI (Beta v13.0 DP) and 276 for 60 MPI x 2 OpenMP (Beta v13.0 DP).]

[Figure 6. Double Precision (DP) versus Single Precision (SP) performance comparison (with Hyper-Threading enabled). Chart title: Neon 1M 80ms DP vs SP. Elapsed time (s) for 60 MPI x 2 OpenMP with Beta v13.0 DP (276) and with Beta v13.0 SP.]
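The Hyper-Threading gain can be checked with a short calculation. The Python sketch below assumes the two Figure 5 bars are 293 s (60 MPI processes, Hyper-Threading unused) and 276 s (60 MPI processes with 2 OpenMP threads each), as read from the figure; the function names are hypothetical helpers for illustration.

```python
def speedup(reference_seconds: float, new_seconds: float) -> float:
    """Ratio of elapsed times; values above 1.0 mean the new run is faster."""
    return reference_seconds / new_seconds


def time_reduction_percent(reference_seconds: float,
                           new_seconds: float) -> float:
    """Percentage of elapsed time saved relative to the reference run."""
    return 100.0 * (reference_seconds - new_seconds) / reference_seconds


# Elapsed times assumed from Figure 5 (double precision, single node).
without_ht = 293.0  # 60 MPI processes
with_ht = 276.0     # 60 MPI processes x 2 OpenMP threads

print(f"speedup: {speedup(without_ht, with_ht):.3f}X")
print(f"time saved: {time_reduction_percent(without_ht, with_ht):.1f} percent")
```

Under these assumed values the elapsed time drops by a little under 6 percent, which is in line with the approximately 5 percent boost quoted for Hyper-Threading.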

Conclusion

Optimized for massively parallel processing on Intel Xeon processors, Altair RADIOSS software delivers highly scalable performance by taking advantage of all the system's cores. RADIOSS performs well on both small and large systems, making it a good fit for businesses stepping up from a workstation to a more powerful system to run compute-intensive simulations. Benchmarks using the modified crash model of a Neon car with 1 million elements in an 80 ms crash show that, when run on an Intel Xeon processor E7-4890 v2 product family-based platform with 60 cores and 120 threads, the software scales well due to the high efficiency of the hardware and the optimization of the code for Intel architecture.

For companies looking to refresh their technical computing systems for running RADIOSS, whether on a single workstation or a large cluster, now is the time to consider an Intel Xeon processor-based platform or cluster to take advantage of the scalability and performance optimizations of RADIOSS on Intel architecture.

For more information on the Intel Xeon processor family, visit www.intel.com/content/www/us/en/servers/server-products.html.

For more information about Altair's collaboration with Intel, visit www.altair.com/partner-intel.

[Figure 7. Modified Chrysler Neon* model (1 million elements) used in RADIOSS* crash simulation benchmark testing.]

Notes

1. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. For more complete information about performance and benchmark results, visit Performance Test Disclosure.

2. Available on select Intel processors. Requires an Intel HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

3. Intel Advanced Vector Extensions (Intel AVX) and Intel Advanced Vector Extensions 2 (Intel AVX 2) are designed to achieve higher throughput in certain integer and floating point operations. Intel AVX and Intel AVX 2 instructions may run at lower frequency to maintain reliable operation. Consult your system manufacturer. Performance may vary depending on hardware, software, and system configuration. For more information see product specification update.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web site at www.intel.com.

Copyright 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, VTune, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Printed in USA 414/DW/HBD/PDF 33525-1US