Scalability Considerations for Compute Intensive Applications on Clusters
Christian Tanasescu, Daniel Thomas, SGI Inc.
2 Agenda
- Application segments
- HPC computational requirements
- Scalability and application profiles
- Standard benchmarks vs. applications
- Communication vs. computation ratio
- BandeLa: profiling and modeling tool
- Platform directions
- Conclusions
3 The Move to Technology Transparency
The application drives the platform.
[Chart: sellable feature (high to low) over time; the emphasis shifts from feeds and speeds alone, to architecture plus feeds and speeds, to the application plus architecture plus feeds and speeds.]
4 Application Segments
CSM - Computational Structural Mechanics
CFD - Computational Fluid Dynamics
CCM - Computational Chemistry and Material Science
BIO - Bioinformatics
SPI - Seismic Processing and Interpretation
RES - Reservoir Simulation
CWO - Climate/Weather/Ocean Simulation
5 Sweet Spot Scalability per Application Segment
[Chart: typical sweet-spot processor counts for CSM, CFD, CCM, BIO, SPI, RES, and CWO; legible ranges include 8-64p, 16-64p, 4-32p, and 4-16p.]
6 HPC Resource Demands for Energy

Energy Segment          Software                  CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
Seismic Processing (1)  ProMAX, Omega, GeoDepth   H    H          H       L         M (2)    H, ~4 to 500
Reservoir Simulation    VIP, Eclipse              M    H          L       H         M        ~100

(1) Seismic processing packages such as ProMAX are comprised of a large number of executables. The data in this row are for the subset of executables that are most time-consuming.
(2) There are modules for which this entry would be H, but they comprise only about 10% of the total seismic processing workload.
7 HPC Resource Demands for CAE

MCAE Segment      Software                     CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
IFEA Statics      ABAQUS, ANSYS, MSC.Nastran   H    H          M       L         L        < 10p
IFEA Dynamics     ABAQUS, ANSYS, MSC.Nastran   L    H          H       H         L        < 10p
EFEA              LS-DYNA, PAM-CRASH, RADIOSS  H    L          L       M         M        ~ 50p
CFD Unstructured  FLUENT, STAR-CD, PowerFLOW   M    H          M       H         H        ~ 100p
CFD Structured    OVERFLOW                     H    H          L       M         M        ~ 100p
8 HPC Resource Demands for Bioinformatics

Bio Segment               Software                               CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
Sequence Matching         BLAST, FASTA, Smith-W., HMMER, Wise2   H    M          L       L         M        4-32
HTC + seq. matching code                                         H    H          M       L         M        ~100
Sequence Alignment        ClustalW, Phylip                       H    M          L       M         L        24
Sequence Assembly         Phrap, Paracel                         H    M          M       M         L        16
9 HPC Resource Demands for Computational Chemistry

Segment           Software                       CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
QM "ab-initio"    Gaussian, Gamess, ADF, CASTEP  H    H/M        H/M     L         L        1-32
QM Semiempirical  Mopac, Ampac                   H    L          L       L         M        1-4
MM/MD             Amber, Charmm, NAMD            H    M          M       M         H        1-64
Docking           Dock, FleXx                    H    L          L       L         L        1-64

QM: Quantum Mechanics (large variation in memory BW, I/O BW, and scalability)
MM/MD: Molecular Mechanics/Molecular Dynamics
Docking: scalability via throughput
10 HPC Resource Demands for Weather and Climate Models
Software: MM5, HIRLAM, CCM3/CAM, NOGAPS, IFS, ALADIN, CCSM2, FMS

Segment                          CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
Explicit finite difference       H    M          L       L         H        ~ p
Semi-implicit finite difference  H    M          L       H         H        ~ p
Spectral climate models          H    M          L       H         M        ~ p
Spectral weather models          H    M          L       H         M        ~ 200p
Coupled climate models           H    M          L       H         H        ~ 100p
11 Performance Dependency on Architecture (8 MB L2 in O2000 and O3000)
[Bar chart: relative performance improvement for Linpack, Specfp2000, STREAM, Abaqus/std-1, Nastran(103), Nastran(101), StarCD-1, LS-Dyna-1, Pamcrash, Radioss-1, Vectis-1, Fluent-1, CASTEP, Amber, and Gaussian, grouped by CPU, CPU+cache, and memory sensitivity.]
A performance corridor is defined by Linpack (lower limit) and STREAM (upper limit). The relative performance improvement in applications is greater than the factor indicated by Specfp2000; exceptions are I/O-intensive applications like Nastran-NVH or Gaussian, where bandwidth steers performance.
12 Performance Dependency on Microprocessor Clock Rate (same Architecture)
[Bar chart: performance relative to CPU for Linpack, Specfp2000, STREAM, Abaqus/std-1, Abaqus/Exp-1, Nastran(103/111/101/108), Ansys, StarCD-1, StarCD-8, LS-Dyna-1, Pamcrash, Radioss-1, Madymo, Vectis-8, Fluent-8, Fluent-1, and Fire-1, grouped by CPU, CPU+cache, and memory sensitivity.]
A performance corridor is defined by Linpack (upper limit) and STREAM (lower limit).
13 Performance Dependency on Microprocessor Cache Size (same Architecture)
[Bar chart: relative performance with reduced cache size for the same benchmark and application set, plus Powerflow-16, grouped by CPU, CPU+cache, and memory sensitivity.]
A performance corridor is defined by Specfp2000 (lower limit) and STREAM (upper limit).
14 Key Applications Instruction Mix
[Stacked bar chart, 0-100%: floating point operations, integer operations, memory access instructions, and branch instructions for Nastran, Ansys, Pamcrash, LS-Dyna, Radioss, Powerflow, Fluent, StarHPC, Fire, Gaussian, Gamess, Amber, CASTEP, ADF, BLAST, FASTA, ClustalW, MM5, HIRLAM, CCM3, IFS, ProMAX, Omega, Eclipse, and VIP, across the CSM, CFD, CCM, BIO, CWO, SPI, and RES segments.]
15 Instruction Mix
Real applications have between 5% and 45% FP instructions (average 22%), while memory access instructions average 39%.
Most applications issue more INT than FP instructions; exceptions are BLAS-like solvers such as Nastran, Abaqus, and ProMAX.
The ratio of graduated loads and stores to FP operations is 1.7x: compute-intensive applications are also data-intensive applications.
Vector systems had a system balance of 1 (one flop per byte). Next-generation architectures need to address the memory bandwidth issue, and I/O puts an additional burden on memory bandwidth.
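The bandwidth argument above can be made concrete with a back-of-the-envelope model. This is an illustrative sketch, not taken from the slides: `required_bandwidth`, `WORD_BYTES`, and the `cache_hit_rate` parameter are assumptions layered on top of the measured 1.7 loads/stores per flop.

```python
# Back-of-the-envelope sketch relating the measured instruction mix
# to the memory bandwidth a processor would need to stay balanced.
WORD_BYTES = 8            # double-precision operand size
LDST_PER_FLOP = 1.7       # graduated loads+stores per FP op (slide 15)

def required_bandwidth(gflops, cache_hit_rate):
    """GB/s of main-memory traffic, assuming only cache misses reach DRAM.

    cache_hit_rate is a hypothetical parameter: 0.0 means every access
    goes to memory (vector-machine style), 0.95 is a cache-friendly code.
    """
    bytes_per_flop = LDST_PER_FLOP * WORD_BYTES * (1.0 - cache_hit_rate)
    return gflops * bytes_per_flop

# A 1 GFLOP/s core with no cache reuse would need 13.6 GB/s of bandwidth;
# with 95% cache hits, only 0.68 GB/s reaches main memory.
print(required_bandwidth(1.0, 0.0))
print(required_bandwidth(1.0, 0.95))
```

The point of the sketch is the slide's own: without substantial cache reuse, data traffic (not flops) sets the performance ceiling.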
16 System Balance
Supercomputing platforms must balance microprocessor power, memory size, bandwidth, and latency. I/O balance is another important consideration.
[Chart: system balance vs. number of CPUs (lower is better) for CRAY T90, NEC SX-5, IBM, SGI, HP, and SGI Altix systems.]
Supercomputers after the Cray 1 began to lose balance.
17 Communication vs. Computation Ratio in Key Applications (measured with BandeLa)
[Stacked bar chart, 0-100%: computation, wait, MPI SW latency, and data transfer time for Nastran/4p, Ansys/2p, Pam-Crash/32p, LS-Dyna/48p, Radioss/96p, PowerFLOW/64p, Fluent/64p, StarHPC, Fire/32p, Gaussian/16p, Gamess/32p, Amber/8p, CASTEP/128p, ADF/32p, BLAST/16p, FASTA/16p, ClustalW/16p, MM5, HIRLAM, CCM3/16p, IFS, ProMAX, Omega, Eclipse/52p, and VIP/32p, across the CSM, CFD, CCM, BIO, CWO, SPI, and RES segments.]
18 Communication Details
Computation: the time spent outside MPI.
Wait: the time a CPU is locked in mpi_wait, caused by load imbalance or by contention of the traffic through the interconnect fabric or the switch.
MPI SW Latency: the time accounted to the MPI library; sensitive to MPI latency.
Data Transfer: the time the transfer engine is active (bcopy on Origin 3000 or Altix 3000); sensitive to MPI bandwidth.
An important inhibiting factor for scalability is load imbalance (Wait). It needs to be addressed by future architectures and programming models.
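The Wait component can be illustrated with a minimal load-imbalance model (an assumption for illustration, not BandeLa code): at a synchronization point every rank waits for the slowest one, so each rank's wait time is the gap to the maximum.

```python
# Minimal sketch of how load imbalance turns into Wait time:
# at a synchronization point, wait_i = max(t) - t_i.
def wait_times(compute_times):
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]

per_rank = [9.0, 10.0, 7.5, 10.0]      # seconds of computation per MPI rank
print(wait_times(per_rank))            # [1.0, 0.0, 2.5, 0.0]
# Fraction of machine time lost to imbalance:
print(sum(wait_times(per_rank)) / (len(per_rank) * max(per_rank)))
```

Even a modest spread in per-rank compute time produces a visible Wait fraction, which is why the slide singles out load imbalance as the main scalability inhibitor.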
19 BandeLa Profiling Tool
An MPI tool to answer the question: what if the BANDwidth and LAtency change up or down?
1) Run the application with the targeted number of CPUs to capture the timings outside the MPI calls and the sequence of MPI kernels generated by the MPI library (isend, irecv, wait, test).
2) Replay the timings, applying a simple model to time the above kernels.
20 BandeLa Profiling Tool
Several topologies can be specified: single host, clusters.
Several communication schemes:
- Origin 3000-like (the receiving CPU does the transfer)
- synchronous and asynchronous transfers
- interleaving (an arriving message shares the hardware immediately)
- no interleaving (an arriving message waits for the previous messages to fully complete)
21 BandeLa: Basic Functionality
The MPI library transforms any MPI function into a sequence of four kernels:
MPI_SGI_request_send (mpi_isend)
MPI_SGI_request_recv (mpi_irecv)
MPI_SGI_request_test (mpi_test)
MPI_SGI_request_wait (mpi_wait)
The BandeLa instrumentation catches these sequences and records the computational time outside MPI. This is an application signature independent of the communication hardware.
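The interception idea can be sketched schematically; this is not the actual SGI instrumentation library, and `record` and the `trace` tuple format are invented names for illustration.

```python
# Schematic sketch of how an interposed MPI layer records an application
# signature: the time spent outside MPI between kernel calls, plus the
# sequence of kernels, independent of the network hardware.
import time

trace = []                 # (kernel_name, message_bytes, compute_seconds)
_last_exit = time.perf_counter()

def record(kernel, nbytes=0):
    global _last_exit
    now = time.perf_counter()
    trace.append((kernel, nbytes, now - _last_exit))  # time outside MPI
    # ... the real library would execute the kernel here ...
    _last_exit = time.perf_counter()

record("isend", 4096)
record("wait")
print([k for k, _, _ in trace])   # ['isend', 'wait']
```

Because only compute gaps and kernel sequences are stored, the same trace can later be replayed under any hypothetical latency/bandwidth model.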
22 BandeLa: Instrumentation Example
No need to relink; environment variables can also be set in order to partially instrument the application without relinking.

setenv LD_LIBRARY64_PATH .../ACQUIRE_64
setenv RLD64_LIST libbandela.so:default
f77 -o test_bcast test_bcast.f -lmpi
setenv MPI_BUFFER_MAX 2000
mpirun -np 4 test_bcast

One trace file is created for each process; four files are created for this example: fort.177, fort.178, fort.179, fort.180 (the starting file number, .177, may be changed with an environment variable).
23 BandeLa: Parameters (single host)
MPI Latency: the time accounted to the MPI software for doing its work (queuing messages, checking message arrivals, ...). In the model this is simply an amount of time added to the communication table of the particular CPU on entry of an MPI kernel function: 2.25 µs on Origin 3000, or 4.5 µs for a full send-receive.
MPI Bandwidth: the speed at which bcopy does its job; 250 MB/s on average on the Origin 3000.
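The replay model can be sketched directly from these two parameters. This is a simplified reconstruction in the spirit of the slide, not BandeLa's actual code; `kernel_time` and the trace format are assumptions.

```python
# Simplified replay model: every MPI kernel entry is charged a fixed
# software latency, and each data-moving kernel is charged size/bandwidth.
LATENCY_S = 2.25e-6          # Origin 3000 MPI software latency (slide 23)
BANDWIDTH = 250e6            # bytes/s, average bcopy speed (slide 23)

def kernel_time(message_bytes, is_transfer):
    t = LATENCY_S                       # charged on entry of every kernel
    if is_transfer:
        t += message_bytes / BANDWIDTH  # data-transfer cost
    return t

# Replaying a captured trace: (bytes, does_this_kernel_move_data)
trace = [(0, False), (1_000_000, True), (0, False), (1_000_000, True)]
total = sum(kernel_time(b, x) for b, x in trace)
print(total)   # 4 * 2.25e-6 + 2 * 0.004 = 0.008009 s
```

Changing LATENCY_S or BANDWIDTH and re-summing is exactly the "what if" question the tool is built to answer.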
24 BandeLa: Validating the Model Against Measurement (Origin 3000, single host)
CCM3, spectral climate model, on a 16-CPU Origin 3000.
[Chart: elapsed time (s) per MPI rank, measured communication vs. model, split into data physical transfer, MPI SW latency, wait, and computation; model using the default bandwidth of 250 MB/s.]

25 BandeLa: Validating the Model Against Measurement (Origin 3000, single host)
CCM3, spectral climate model, on a 16-CPU Origin 3000.
[Chart: the same comparison, with the model using a tuned bandwidth of 225 MB/s.]
26 BandeLa: What-If Analysis
Topologies available:
- single shared-memory host (Origin 3000 or Altix 3000)
- switch (clusters)
27 BandeLa: Data Transfer Methods
- Transfer synchronously done by the receiving CPU (Origin 3000 or Altix 3000 host)
- Transfer synchronously/asynchronously done with interleaving/no interleaving, constant bandwidth
- Transfer asynchronously with interleaving, bandwidth depending on request size (Myrinet)
28 BandeLa: Myrinet Parameters
MPI SW Latency: the Myrinet 2000 ping-pong latency on Origin 300 has been measured at 17 µs.
Bandwidth: as for the Origin 300 single host, the workload may change the adapter performance, but:
- the bandwidth also depends on the message size
- the CPUs have to share the adapter(s) (this is considered in the model used in BandeLa)
- the number of adapters used changes the per-adapter performance
29 BandeLa: Myrinet Parameters
[Chart: modeled Myrinet bandwidth (Mb/s) vs. message size, compared with the bandwidth chart given by Myricom. The model depends only on the asymptotic bandwidth, which is not 250 Mb/s on the real system.]
30 BandeLa: Myrinet Parameters
We use the following asymptotic values:
1 adapter: 93 Mb/s
2 adapters: 85 Mb/s
4 adapters: 75 Mb/s
These values were set from runs with two Origin 300 systems linked with 4 adapters on both machines. We think these are the asymptotic bandwidths really seen by the applications.
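A size-dependent bandwidth curve that approaches these asymptotic values can be sketched with a classic half-peak model. This is an assumed functional form, not BandeLa's actual curve, and the half-peak message size `N_HALF` is illustrative.

```python
# Sketch of a size-dependent bandwidth model that approaches the measured
# asymptotic values per adapter count (slide 30).
ASYMPTOTIC_MBPS = {1: 93.0, 2: 85.0, 4: 75.0}   # per number of adapters
N_HALF = 8192   # bytes at which half of peak is reached (assumption)

def effective_bandwidth(message_bytes, adapters):
    peak = ASYMPTOTIC_MBPS[adapters]
    # classic n_1/2 model: bw(n) = peak * n / (n + n_1/2)
    return peak * message_bytes / (message_bytes + N_HALF)

print(round(effective_bandwidth(8192, 1), 1))     # 46.5 (half of 93 at n = n_1/2)
print(effective_bandwidth(1 << 20, 1))            # approaches 93 for large messages
```

Under this form, small messages see far less than the asymptotic bandwidth, which matches the slide's point that the Myricom chart cannot be summarized by a single 250 Mb/s figure.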
31 CCM3 Study: Bandwidth Effect
CCM3, large case, on 64 CPUs.
[Chart: elapsed time (s) per MPI rank, split into computation, load unbalance, MPI latency, and physical data transfer; current performance model (MPI latency 5.5 µs, bandwidth 275 Mb/s) vs. 4x bandwidth (1.1 Gb/s).]
32 CCM3 Study
CCM3, large case, on a 64-CPU Origin 3000 SSI and on 4 x 16p Origin 300.
[Chart: elapsed time (s), split into data physical transfer, MPI SW latency, wait, and computation, for 64p O3000 SSI, 4x16p O300 with 1 Myrinet board, and 4x16p O300 with 4 Myrinet boards.]
CCM3 performance depends strongly on the number of communication channels.
33 CASTEP: Latency Effect
[Chart: CASTEP 24p execution modeled for different latencies (HIPPI 800, GSN, Origin 3000 SSI); elapsed time (s) vs. number of processors, split into computation, load unbalancing, MPI latency, and physical data transfer.]
34 Communication vs. Computation: Sweep3D
[Chart: elapsed time (s) vs. number of MPI tasks, split into data transfer, MPI latency, wait, and computation.]
35 BandeLa Pam-Crash Study: Scalability on Altix 3000
Tests on a 64-CPU Altix 3000 (1.5 MB L3 cache, single system image, SSI).
Pam-Crash V2003 DMP-SP, using a BMW model with 284,000 elements, run for 5000 time steps.
A special library is used to time or model the communication.
36 Pam-Crash V2003 on Altix 3000
Automotive model, 5000 time steps.
[Chart: Pam-Crash speed-up on Altix 900 MHz for the BMW model vs. number of CPUs: computation speed-up of rank 0, global speed-up, and perfect speed-up.]
Computation scales up to 64p, but the communication overhead is too high at 64p.
37 Pam-Crash V2003 on Altix 3000
BMW6 model, 5000 time steps.
The BandeLa model estimates that a perfect MPI machine (zero latency, infinite bandwidth) would not help: the Altix 3000 run is close to a perfect MPI model.
[Chart: elapsed time per MPI rank for the 16-CPU Altix (model), a perfect MPI machine (model), and the 16-CPU Altix measured, split into physical transfer, MPI SW latency, wait, and computation.]
38 BandeLa: Vampir Compatibility
BandeLa can generate a trace compatible with the Vampir browser. Using Vampir, you can zoom into the latency and bandwidth changes at any level of detail.
39 Requirements for Petaflops Applications
- Memory and cache footprint: amount of memory required at each level of the memory hierarchy
- Degree of data reuse associated with the core kernels of the applications, the scaling of these kernels, and the associated estimate of memory BW required at each level of the memory hierarchy
- Instruction mix (FP, integer, ld/st)
- I/O requirements and storage for temporary results and checkpoints
- Amount of concurrency available in the applications, and communication requirements: bisection BW, latency, fast synchronization patterns
- Communication/computation ratio and degree of overlap
[Diagram: memory hierarchy; processor to caches (L1, L2, L3) at 3-50 cycles, then main memory, then disk at 1.5 million cycles.]
40 Big Datasets: Generic Tera-Scale
x(1000), y(1000), z(1000), t(1000)

      program main
      real*8 pressure(1000,1000,1000,1000)
      real*8 volume(1000,1000,1000,1000)
      real*8 temperature(1000,1000,1000,1000)
      do k = 1, 1000
        do j = 1, 1000
          do i = 1, 1000
            do time_step = 1, 1000
              pressure(i,j,k,time_step)    = 0.
              volume(i,j,k,time_step)      = 0.
              temperature(i,j,k,time_step) = 0.
            end do
          end do
        end do
      end do
      print *, pressure(1,1,1,1), volume(1,1,1,1), temperature(1,1,1,1)
      stop
      end

C $ f77 -64 main.f          (to compile)
C $ limit stacksize ...     (before run)
C Only 3 attributes already require 24 TB of memory (3 arrays x 10^12 elements x 8 bytes).
41 Supercluster Concept: SGI Altix 3000
First industry-standard Linux cluster with global shared memory.
- NUMA support for large nodes: a single node up to 64 CPUs and 512 GB of memory
- Global shared memory: clusters up to 2,048 CPUs and 16 TB of memory
All nodes can access one large memory space efficiently, so complex communication and data passing between nodes isn't needed; big data sets fit entirely in memory, and less disk I/O is needed.
[Diagram: conventional clusters (one OS and memory per node, commodity interconnect) vs. the SGI Altix 3000 supercluster (global shared memory across nodes over the NUMAFlex interconnect).]
42 Parallel Programming Models on Altix 3000
Intra-partition: OpenMP threads, MPI, SHMEM
Inter-partition: MPI, SHMEM, XPMEM
43 MPI in Clusters and Globally Addressable Memory
MPI-1 two-sided send/receive latencies (short 8-byte messages):

Gigabit Ethernet (TCP/IP)  100 µs   (low-cost clusters)
Myrinet                    13 µs    (mid-range clusters)
Quadrics                   4-5 µs   (high-end clusters)
MPT [SMP]                  ... µs
Altix [Supercluster]       1.5 µs
Goal                       ... µs
44 SGI Altix 3000 Scalability for Compute-Intensive Applications
[Chart: speedup vs. number of processors (higher is better) for Gaussian (CCM), Amber (CCM), Fasta (BIO), Star-CD (CFD), Vectis (CFD), LS-Dyna (CSM), TAU (CFD), HTC-Blast (BIO), Fastx (BIO), MM5 (CWO), CASTEP (CCM), GAMESS (CCM), NAMD (CCM), NWChem (CCM), and VASP (CCM), against the ideal.]
Scalability on Altix 3000 is in general similar to Origin 3000.
45 Platform Directions: Mainframe Era
A corporate resource is expensive and needs to be shared.
[Chart: total cost of computing vs. cost of communication bandwidth over time; around 1985, centralised computing is more cost effective.]
Total cost of computing = cost of HW + SW + related support costs.
46 Platform Directions: Decentralized Computing
A corporate resource is cheap enough that I don't have to share.
[Chart: over time, decentralised client-server computing becomes more cost effective than centralised computing.]
47 Platform Directions: Server Consolidation
A corporate resource is cheap and I can have as much as I want.
[Chart: around 2000, two axes emerge: scale out (nodes: NOWs, clusters of SMPs) and scale up (processors per node: SMPs).]
48 Platform Directions: Grid Computing
A corporate resource is cheap and I can have as much as I want, but I don't have to own it.
[Chart: scale out toward super-clusters of SMPs and scale up with SMPs, with grid computing spanning both.]
49 Conclusions
- Compute-intensive applications are also data intensive.
- Standard benchmarks define a performance corridor for applications.
- Communication vs. computation profiling of compute-intensive applications is essential in designing scalable parallel computer systems.
- Load imbalance is the most influential factor on scalability.
- Preserving globally addressable memory beyond the boundary of a single node in a cluster improves not only the communication efficiency but also the load balancing.
- The Altix 3000 supercluster is a very efficient MPI machine.
50 Thank you. Any questions?
WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2
More informationStockholm Brain Institute Blue Gene/L
Stockholm Brain Institute Blue Gene/L 1 Stockholm Brain Institute Blue Gene/L 2 IBM Systems & Technology Group and IBM Research IBM Blue Gene /P - An Overview of a Petaflop Capable System Carl G. Tengwall
More informationAdvanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED
Advanced Software for the Supercomputer PRIMEHPC FX10 System Configuration of PRIMEHPC FX10 nodes Login Compilation Job submission 6D mesh/torus Interconnect Local file system (Temporary area occupied
More informationMPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA
MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationFuture Routing Schemes in Petascale clusters
Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract
More informationTESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications
TESLA P PERFORMANCE GUIDE Deep Learning and HPC Applications SEPTEMBER 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important
More informationSecond Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications
Second Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications Pawe l Gepner, David L. Fraser, and Micha l F. Kowalik Intel Corporation {pawel.gepner,david.l.fraser,michal.f.kowalik}@intel.com
More informationOptimizing LS-DYNA Productivity in Cluster Environments
10 th International LS-DYNA Users Conference Computing Technology Optimizing LS-DYNA Productivity in Cluster Environments Gilad Shainer and Swati Kher Mellanox Technologies Abstract Increasing demand for
More informationScalable Single System Image SGI Altix 3700, 512p Architecture and Software Environment
Silicon Graphics, Inc. Scalable Single System Image SGI Altix 3700, 512p Architecture and Software Environment Presented by: Jean-Pierre Panziera Principal Engineer Altix 3700 SSSI - Architecture and Software
More informationFull Vehicle Dynamic Analysis using Automated Component Modal Synthesis. Peter Schartz, Parallel Project Manager ClusterWorld Conference June 2003
Full Vehicle Dynamic Analysis using Automated Component Modal Synthesis Peter Schartz, Parallel Project Manager Conference Outline Introduction Background Theory Case Studies Full Vehicle Dynamic Analysis
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete
More informationParallel Programming with MPI
Parallel Programming with MPI Science and Technology Support Ohio Supercomputer Center 1224 Kinnear Road. Columbus, OH 43212 (614) 292-1800 oschelp@osc.edu http://www.osc.edu/supercomputing/ Functions
More informationSupercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?
Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA
More informationAltair RADIOSS Performance Benchmark and Profiling. May 2013
Altair RADIOSS Performance Benchmark and Profiling May 2013 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Altair, AMD, Dell, Mellanox Compute
More informationCP2K Performance Benchmark and Profiling. April 2011
CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox
More informationDell HPC System for Manufacturing System Architecture and Application Performance
Dell HPC System for Manufacturing System Architecture and Application Performance This Dell technical white paper describes the architecture of the Dell HPC System for Manufacturing and discusses performance
More informationWhy Multiprocessors?
Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software
More informationAdapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES
Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Tom Atwood Business Development Manager Sun Microsystems, Inc. Takeaways Understand the technical differences between
More informationClustering Optimizations How to achieve optimal performance? Pak Lui
Clustering Optimizations How to achieve optimal performance? Pak Lui 130 Applications Best Practices Published Abaqus CPMD LS-DYNA MILC AcuSolve Dacapo minife OpenMX Amber Desmond MILC PARATEC AMG DL-POLY
More informationNew Features in LS-DYNA HYBRID Version
11 th International LS-DYNA Users Conference Computing Technology New Features in LS-DYNA HYBRID Version Nick Meng 1, Jason Wang 2, Satish Pathy 2 1 Intel Corporation, Software and Services Group 2 Livermore
More informationComputer Comparisons Using HPCC. Nathan Wichmann Benchmark Engineer
Computer Comparisons Using HPCC Nathan Wichmann Benchmark Engineer Outline Comparisons using HPCC HPCC test used Methods used to compare machines using HPCC Normalize scores Weighted averages Comparing
More informationUniprocessor Computer Architecture Example: Cray T3E
Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure
More informationThe STREAM Benchmark. John D. McCalpin, Ph.D. IBM eserver Performance ^ Performance
The STREAM Benchmark John D. McCalpin, Ph.D. IBM eserver Performance 2005-01-27 History Scientific computing was largely based on the vector paradigm from the late 1970 s through the 1980 s E.g., the classic
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationChapter 2: Computer-System Structures. Hmm this looks like a Computer System?
Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding
More informationScheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications
Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications Sep 2009 Gilad Shainer, Tong Liu (Mellanox); Jeffrey Layton (Dell); Joshua Mora (AMD) High Performance Interconnects for
More informationBirds of a Feather Presentation
Mellanox InfiniBand QDR 4Gb/s The Fabric of Choice for High Performance Computing Gilad Shainer, shainer@mellanox.com June 28 Birds of a Feather Presentation InfiniBand Technology Leadership Industry Standard
More informationPerformance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA
Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to
More informationSingle-Points of Performance
Single-Points of Performance Mellanox Technologies Inc. 29 Stender Way, Santa Clara, CA 9554 Tel: 48-97-34 Fax: 48-97-343 http://www.mellanox.com High-performance computations are rapidly becoming a critical
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationParallel & Cluster Computing. cs 6260 professor: elise de doncker by: lina hussein
Parallel & Cluster Computing cs 6260 professor: elise de doncker by: lina hussein 1 Topics Covered : Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster
More informationGPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester
NVIDIA GPU Computing A Revolution in High Performance Computing GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester John Ashley Senior Solutions Architect
More informationCluster Computing. Cluster Architectures
Cluster Architectures Overview The Problem The Solution The Anatomy of a Cluster The New Problem A big cluster example The Problem Applications Many fields have come to depend on processing power for progress:
More informationDetermining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace
Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace James Southern, Jim Tuccillo SGI 25 October 2016 0 Motivation Trend in HPC continues to be towards more
More informationHigh Performance Computing Course Notes HPC Fundamentals
High Performance Computing Course Notes 2008-2009 2009 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs
More informationBlue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft
Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small
More informationCOSC4201. Multiprocessors and Thread Level Parallelism. Prof. Mokhtar Aboelaze York University
COSC4201 Multiprocessors and Thread Level Parallelism Prof. Mokhtar Aboelaze York University COSC 4201 1 Introduction Why multiprocessor The turning away from the conventional organization came in the
More informationThe State of Accelerated Applications. Michael Feldman
The State of Accelerated Applications Michael Feldman Accelerator Market in HPC Nearly half of all new HPC systems deployed incorporate accelerators Accelerator hardware performance has been advancing
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationReducing Network Contention with Mixed Workloads on Modern Multicore Clusters
Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational
More informationManaging CAE Simulation Workloads in Cluster Environments
Managing CAE Simulation Workloads in Cluster Environments Michael Humphrey V.P. Enterprise Computing Altair Engineering humphrey@altair.com June 2003 Copyright 2003 Altair Engineering, Inc. All rights
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More information