
1 Scalability Considerations for Compute Intensive Applications on Clusters Christian Tanasescu, Daniel Thomas, SGI Inc.

2 Agenda
- Application Segments
- HPC Computational Requirements
- Scalability and Application Profiles
- Standard Benchmarks vs. Application
- Communication vs. Computation Ratio
- BandeLa - profiling and modeling tool
- Platform Directions
- Conclusions

3 The Move to Technology Transparency
Application drives the platform
(chart: sellable feature (low to high) vs. time; the sellable feature moves from feeds and speeds, to architecture plus feeds and speeds, to application plus architecture plus feeds and speeds)

4 Application Segments
CSM - Computational Structural Mechanics
CFD - Computational Fluid Dynamics
CCM - Computational Chemistry and Material Science
BIO - Bioinformatics
SPI - Seismic Processing and Interpretation
RES - Reservoir Simulation
CWO - Climate/Weather/Ocean Simulation

5 Sweet Spot Scalability per Application Segment
(chart: typical processor-count sweet spots for CSM, CFD, CCM, BIO, SPI, RES and CWO; ranges shown include 8-64p, 16-64p, 4-32p and 4-16p)

6 HPC Resource Demands for Energy
Energy Segment | Software | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Seismic Processing (1) | ProMAX, Omega, GeoDepth | H | H | H | L | M (2) | ~4 to 500
Reservoir Simulation | VIP, Eclipse | M | H | L | H | M | ~100
(1) Seismic processing packages such as ProMAX are comprised of a large number of executables. The data in this row are for the subsets of executables that are most time-consuming.
(2) There are modules for which this entry would be H, but they comprise only about 10% of the total seismic processing workload.

7 HPC Resource Demands for CAE
MCAE Segment | Software | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
IFEA Statics | ABAQUS, ANSYS, MSC.Nastran | H | H | M | L | L | < 10p
IFEA Dynamics | ABAQUS, ANSYS, MSC.Nastran | L | H | H | H | L | < 10p
EFEA | LS-DYNA, PAM-CRASH, RADIOSS | H | L | L | M | M | ~ 50p
CFD Unstructured | FLUENT, STAR-CD, PowerFLOW | M | H | M | H | H | ~ 100p
CFD Structured | OVERFLOW | H | H | L | M | M | ~ 100p

8 HPC Resource Demands for Bioinformatics
Bio Segment | Software | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Sequence Matching | Blast, Fasta, Smith-W., HMMER, Wise2 | H | M | L | L | M | 4-32
Sequence Matching | HTC + seq. matching code | H | H | M | L | M | ~100
Sequence Alignment | ClustalW, Phylip | H | M | L | M | L | 24
Sequence Assembly | Phrap, Paracel | H | M | M | M | L | 16

9 HPC Resource Demands for Computational Chemistry
Segment | Software | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
QM "ab-initio" | Gaussian, Gamess, ADF, CASTEP | H | H/M | H/M | L | L | 1-32
QM Semiempirical | Mopac, Ampac | H | L | L | L | M | 1-4
MM/MD | Amber, Charmm, NAMD | H | M | M | M | H | 1-64
Docking | Dock, FleXx | H | L | L | L | L | 1-64
QM: Quantum Mechanics. A large variation for Memory BW, I/O BW and scalability.
MM/MD: Molecular Mechanics/Molecular Dynamics
Docking: Scalability via throughput

10 HPC Resource Demands for Weather and Climate Models
Segment | Software | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Explicit finite difference | MM5 | H | M | L | L | H | ~ p
Semi-implicit finite difference | HIRLAM | H | M | L | H | H | ~ p
Spectral climate models | CCM3/CAM | H | M | L | H | M | ~ p
Spectral weather models | NOGAPS, IFS, ALADIN | H | M | L | H | M | ~ 200p
Coupled climate models | CCSM2, FMS | H | M | L | H | H | ~ 100p

11 Performance Dependency on Architecture (8MB L2 in O2000 and O3000)
(chart: relative performance improvement for Linpack, Specfp2000, STREAM and the applications Abaqus/std, Nastran(103), Nastran(101), StarCD, LS-Dyna, Pamcrash, Radioss, Vectis, Fluent, CASTEP, Amber, Gaussian; legend: Cpu, Cpu+cache, Memory)
Performance corridor defined by Linpack (lower limit) and STREAM (higher limit)
Relative performance improvement in applications is greater than the factor indicated by Specfp2000
Exceptions are the I/O intensive apps like Nastran-NVH or Gaussian (BW steers performance)

12 Performance Dependency on Microprocessor Clock Rate (same architecture)
(chart: performance relative to CPU for Linpack, Specfp2000, STREAM, Abaqus/std, Abaqus/Exp, Nastran(103/111/101/108), Ansys, StarCD, LS-Dyna, Pamcrash, Radioss, Madymo, Vectis, Fluent, Fire; legend: Cpu, Cpu+cache, Memory)
Performance corridor defined by Linpack (higher limit) and STREAM (lower limit)

13 Performance Dependency on Microprocessor Cache Size (same architecture)
(chart: relative performance for Linpack, Specfp2000, STREAM, Abaqus/std, Abaqus/Exp, Nastran(103/111/101/108), Ansys, StarCD, LS-Dyna, Pamcrash, Radioss, Madymo, Vectis, Fluent, Fire, Powerflow; legend: Cpu, Cpu+cache, Memory)
Performance corridor defined by Specfp2000 (lower limit) and STREAM (higher limit)

14 Key Applications Instruction Mix
(chart: instruction mix, 0-100%, of Floating Point Operations, Integer Operations, Memory Access Instructions and Branch Instructions for Nastran, Ansys, Pamcrash, Ls-Dyna, Radioss, Powerflow, Fluent, StarHPC, Fire, Gaussian, Gamess, Amber, CASTEP, ADF, BLAST, FASTA, ClustalW, MM5, HIRLAM, CCM3, IFS, ProMAX, Omega, Eclipse, VIP, grouped by segment: CSM, CFD, CCM, BIO, CWO, SPI, RES)

15 Instruction Mix
- Real applications have between 5% and 45% FP instructions, with an average of 22%, while the average share of memory access instructions is 39%
- There are more INT than FP instructions; exceptions are BLAS-like solvers such as Nastran, Abaqus and ProMAX
- The ratio of graduated loads and stores to FP operations is 1.7x
- Compute intensive applications are also data intensive applications
- Vector systems had a system balance of 1 (one Flop per Byte)
- Next generation architectures need to address the memory bandwidth issue
- I/O puts an additional burden on memory bandwidth
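As a rough cross-check of the 1.7x figure: the average shares above give 39% memory access instructions against 22% FP instructions, i.e. about 1.8 loads and stores per FP operation.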

16 System Balance
Supercomputing platforms must balance:
- Microprocessor power
- Memory size, bandwidth and latency
- I/O balance is another important consideration
(chart: balance vs. number of CPUs for CRAY T90, NEC SX-5, IBM, SGI, HP, SGI Altix; lower is better)
Supercomputers after the Cray 1 began to lose balance
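As a minimal sketch of the balance figure plotted here, taking the previous slide's definition of one Flop per Byte as the reference point; the two machine numbers below are placeholders, not data read from the chart:

c     Illustrative only: system balance expressed as peak flops per
c     byte of sustained memory bandwidth (lower is better; the
c     vector-era reference point from the previous slide is 1).
c     The inputs are placeholders.
      program balance
      implicit none
      double precision peak_flops, mem_bytes_s
      peak_flops  = 6.0d9
      mem_bytes_s = 12.0d9
      write(*,*) 'balance (flop per byte) =', peak_flops/mem_bytes_s
      end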

17 Communication vs. Computation Ratio in Key Applications - measured with BandeLa
(chart: share of Computation, Wait, MPI SW Latency and Data Transfer, 0-100%, for Nastran/4, Ansys/2, Pam-Crash/32, Ls-Dyna/48p, Radioss/96, PowerFLOW/64, Fluent/64, StarHPC, Fire/32, Gaussian/16, Gamess/32, Amber/8, CASTEP/128, ADF/32, BLAST/16, FASTA/16, ClustalW/16, MM5, HIRLAM, CCM3/16, IFS, ProMAX, Omega, Eclipse/52, VIP/32, grouped by segment: CSM, CFD, CCM, BIO, CWO, SPI, RES)

18 Communication Details
Computation: the time outside MPI
Wait: the time a CPU is locked on mpi_wait (load unbalance, contention of the traffic through the interconnect fabric or the switch)
MPI SW Latency: the time accounted to the MPI library; sensitive to MPI latency
Data Transfer: the time the transfer engine is active (bcopy on Origin 3000 or Altix 3000); sensitive to MPI bandwidth
An important inhibiting factor for scalability is the load imbalance (WAIT). It needs to be addressed by future architectures and programming models
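The slides do not show how these buckets are collected; as an illustration only (not BandeLa's actual code, and assuming the MPI implementation provides the Fortran PMPI entry points), a wrapper along these lines is one way the time spent in mpi_wait could be charged to the Wait bucket:

c     Hypothetical sketch: intercept mpi_wait, forward to the real
c     implementation through the PMPI profiling interface, and add
c     the elapsed time to a "Wait" accumulator.
      subroutine mpi_wait(request, status, ierror)
      implicit none
      include 'mpif.h'
      integer request, status(MPI_STATUS_SIZE), ierror
      double precision t0
      double precision wait_time
      common /comm_buckets/ wait_time
      t0 = MPI_WTIME()
      call pmpi_wait(request, status, ierror)
      wait_time = wait_time + (MPI_WTIME() - t0)
      end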

19 BandeLa Profiling Tool
An MPI tool to answer the question: what if the BANDwidth and LAtency change up or down?
1) Run the application with the targeted number of CPUs in order to capture the timings outside the MPI calls and to capture the sequence of MPI kernels generated by the MPI library (isend, irecv, wait, test)
2) Replay the timings, applying a simple model to time the above kernels
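The replay step can be pictured as follows; this is a minimal sketch under the assumption that the recorded signature is a per-rank list of compute intervals and message sizes (not BandeLa's actual data layout), and it ignores the waits between ranks that the real replay also has to model:

c     Hypothetical replay loop for a single rank (illustrative only):
c     keep the measured compute time between MPI calls and re-time
c     each recorded MPI kernel with a what-if cost function of the
c     message size, assumed to be provided elsewhere.
      subroutine replay(nev, compute_t, nbytes, total)
      implicit none
      integer nev, i
      integer nbytes(nev)
      double precision compute_t(nev), total
      double precision msg_cost
      external msg_cost
      total = 0.0d0
      do i = 1, nev
         total = total + compute_t(i) + msg_cost(nbytes(i))
      end do
      return
      end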

20 BandeLa Profiling Tool
Several topologies can be specified: single host, clusters
Several communication schemes:
- O3000-like (receiving CPU does the transfer)
- Synchronous and asynchronous transfers
- Interleaving (an arriving message shares the hardware immediately)
- No interleaving (an arriving message waits for the previous messages to fully complete)

21 BandeLa - Basic Functionality
The MPI library transforms any MPI function into a sequence of four kernels:
MPI_SGI_request_send (mpi_isend)
MPI_SGI_request_recv (mpi_irecv)
MPI_SGI_request_test (mpi_test)
MPI_SGI_request_wait (mpi_wait)
The BandeLa instrumentation catches these sequences and records the computational time outside MPI. This is an application signature independent of the communication hardware.

22 BandeLa - Instrumentation Example
No need to relink. Some environment variables can also be set in order to partially instrument the application without relinking.
setenv LD_LIBRARY64_PATH .../ACQUIRE_64
setenv RLD64_LIST libbandela.so:default
f77 -o test_bcast test_bcast.f -lmpi
setenv MPI_BUFFER_MAX 2000
mpirun -np 4 test_bcast
- One file is created for each process
- 4 files are created for this example: fort.177, fort.178, fort.179, fort.180 (the starting file .177 may be changed with an environment variable)

23 BandeLa - Parameters (single host)
MPI Latency: the time accounted to the MPI software for doing its work (queuing messages, checking message arrivals, ...). For the model this is simply an amount of time added to the communication table of the particular CPU on entry of an MPI kernel function: 2.25 µs on Origin 3000, or 4.5 µs for a full send-receive on Origin 3000.
MPI Bandwidth: the speed at which bcopy is doing its job, 250 Mb/s on average on the Origin 3000.
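A minimal sketch of the per-message cost model these two parameters imply, and a concrete instance of the cost function assumed in the replay sketch above; the 250 figure is read here as megabytes per second, which is an assumption about the slide's Mb/s notation:

c     Illustrative cost model: the time charged to one MPI kernel
c     entry is the MPI software latency plus size over bandwidth.
c     2.25 microseconds and 250 MB/s are the Origin 3000 figures
c     quoted on this slide.
      double precision function msg_cost(nbytes)
      implicit none
      integer nbytes
      double precision lat_s, bw_bytes_s
      parameter (lat_s = 2.25d-6, bw_bytes_s = 250.0d6)
      msg_cost = lat_s + dble(nbytes)/bw_bytes_s
      return
      end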

24 BandeLa - Validating the Model with Measurement (Origin 3000, single host)
CCM3 - spectral climate model on a 16 CPU Origin 3000
(chart: elapsed time per MPI rank, measured communication vs. the model using the default bandwidth of 250 Mb/s; categories: Computation, Wait, MPI SW Latency, Physical Data Transfer)

25 BandeLa - Validating the Model with Measurement (Origin 3000, single host)
CCM3 - spectral climate model on a 16 CPU Origin 3000
(chart: elapsed time per MPI rank, measured communication vs. the model using a tuned bandwidth of 225 Mb/s; categories: Computation, Wait, MPI SW Latency, Physical Data Transfer)

26 BandeLa - What-If Analysis
Topologies available:
- Single shared memory host (Origin 3000 or Altix 3000)
- Switch (clusters)

27 BandeLa - Data Transfer Methods
- Transfer done synchronously by the receiving CPU (Origin 3000 or Altix 3000 host)
- Transfer done synchronously/asynchronously, with interleaving/no interleaving, constant bandwidth
- Transfer done asynchronously with interleaving, bandwidth depending on request size (Myrinet)

28 BandeLa Myrinet parameters?
MPI SW Latency: the Myrinet 2000 ping-pong latency on Origin 300 has been measured at 17 µs
Bandwidth: as for the Origin 300 single host, the workload may change the adapter performance, but:
- The bandwidth also depends on the message size
- The CPUs have to share the adapter(s) (this is considered in the model used in BandeLa)
- The number of adapters used changes the per-adapter performance

29 BandeLa Myrinet parameters?
(chart: modeled Myrinet bandwidth, Mb/s vs. message size; the bandwidth chart given by Myricom vs. the bandwidth chart used by the model, which depends only on the asymptotic bandwidth, and that asymptote is not 250 Mb/s on the real system)

30 BandeLa Myrinet parameters?
We use the following asymptotic values:
- 1 adapter: 93 Mb/s
- 2 adapters: 85 Mb/s
- 4 adapters: 75 Mb/s
These values were set from runs with two Origin 300 systems linked with 4 adapters on both machines. We think these are the asymptotic bandwidths really seen by the applications.
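The slides state that the modeled bandwidth depends on message size and approaches these asymptotic values; a half-performance-length curve is one common way to express that (not necessarily the exact curve BandeLa uses):

c     Hypothetical size-dependent bandwidth: the effective rate
c     approaches the asymptotic value bw_inf (bytes/s) as the
c     message grows well past n_half, the size in bytes at which
c     half of bw_inf is reached. Both parameters are placeholders.
      double precision function bw_eff(nbytes, bw_inf, n_half)
      implicit none
      integer nbytes
      double precision bw_inf, n_half
      bw_eff = bw_inf*dble(nbytes)/(dble(nbytes) + n_half)
      return
      end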

31 CCM3 Study - Bandwidth Effect
CCM3 - large case on 64 CPUs
Current performance model: MPI latency 5.5 msec, bandwidth 275 Mb/s, compared against 4x the bandwidth (1.1 Gb/s)
(chart: elapsed time and unbalance ratio per MPI rank; categories: Computation, Load Unbalance, MPI Latency, Physical Data Transfer)

32 CCM3 Study
CCM3 - large case on 64 CPUs: Origin 3000 SSI and 4 x 16p Origin 300 clusters
(chart: elapsed time per MPI rank for 64p Origin 3000, 4x16p Origin 300 with 1 Myrinet board, and 4x16p Origin 300 with 4 Myrinet boards; categories: Computation, Wait, MPI SW Latency, Physical Data Transfer)
CCM3 performance depends highly on the number of communication channels

33 CASTEP - Latency Effect
CASTEP - 24p execution modelled for different latencies
(chart: elapsed time vs. number of processors for Origin 3000 SSI, HIPPI 800 latency, and GSN latency; categories: Computation, Load Unbalance, MPI Latency, Physical Data Transfer)

34 Communication vs. Computation - Sweep3D
(chart: elapsed time vs. number of MPI tasks; categories: Computation, Wait, MPI Latency, Data Transfer)

35 BandeLa Pam-Crash Study - Scalability on Altix 3000
- Tests on a 64-CPU Altix 3000, 1.5 MB L3 cache, Single System Image (SSI)
- Pam-Crash V2003 DMP-SP
- Using the BMW model with 284,000 elements
- Run for 5000 time steps
- A special library is used to time or model the communication

36 Pam-Crash V2003 on Altix 3000
Automotive model (BMW), 5000 time steps, Altix 900 MHz
(chart: speed-up vs. # CPU: computation speed-up of rank 0, global speed-up, and perfect speed-up)
Computation scales up to 64p
Communication overhead is too high at 64p

37 Pam-Crash V2003 on Altix 3000
BMW6 model, 5000 time steps
The BandeLa model estimates that a perfect MPI machine (zero latency, infinite bandwidth) would not help. The Altix 3000 run is close to a perfect MPI model.
(chart: elapsed time per MPI rank for 16 CPU Altix measured, 16 CPU Altix model, and perfect MPI machine model; categories: Computation, WAIT, MPI SW Latency, Physical Transfer)

38 BandeLa - Vampir Compatibility
BandeLa can generate a trace compatible with the Vampir browser. Using Vampir you can zoom in on the latency and bandwidth changes at any degree of detail.

39 Requirements for Petaflops Applications
- Memory and cache footprint: amount of memory required at each level of the memory hierarchy
- Degree of data reuse associated with the core kernels of the apps, the scaling of these kernels, and the associated estimate of memory BW required at each level of the memory hierarchy
- Instruction mix (FP, integer, ld/st)
- I/O requirements and storage for temporary results and checkpoints
- Amount of concurrency available in the apps, and communication requirements: bisection BW, latency, fast synchronization patterns
- Communication/computation ratio and degree of overlap
(diagram: memory hierarchy from processor through L1/L2/L3 caches (3-50 cycles) and main memory to disk (about 1.5 million cycles))

40 Big Datasets: Generic Tera-Scale
4D grid: x(1000), y(1000), z(1000), t(1000)

      program main
      real*8 pressure(1000,1000,1000,1000)
      real*8 volume(1000,1000,1000,1000)
      real*8 temperature(1000,1000,1000,1000)
      do k = 1,1000
        do j = 1,1000
          do i = 1,1000
            do time_step = 1,1000
              pressure(i,j,k,time_step) = 0.
              volume(i,j,k,time_step) = 0.
              temperature(i,j,k,time_step) = 0.
            end do
          end do
        end do
      end do
      print *, pressure(1,1,1,1), volume(1,1,1,1), temperature(1,1,1,1)
      stop
      end
C $ f77 -64 main.f (to compile)
C $ limit stacksize m (before run), 24TB
C Only 3 attributes
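Each real*8 array above holds 1000**4 = 10**12 elements, i.e. 8 TB per array, so the three arrays together need roughly 24 TB, the figure in the stacksize comment.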

41 Supercluster Concept - SGI Altix 3000
First industry-standard Linux cluster with global shared memory
NUMA support for large nodes: single node up to 64 CPU, 512 GB of memory
Global shared memory: clusters up to 2,048 CPU, 16 TB of memory
All nodes can access one large memory space efficiently, so complex communication and data passing between nodes isn't needed; big data sets fit entirely in memory and less disk I/O is needed
(diagram: conventional clusters, with per-node memory and OS over a commodity interconnect, vs. the SGI Altix 3000 supercluster, with nodes sharing global memory over the NUMAflex interconnect)

42 Parallel Programming Models on Altix 3000
Intra-partition: OpenMP, threads, MPI, SHMEM
Inter-partition: MPI, SHMEM, XPMEM

43 MPI in Clusters and Globally Addressable Memory
MPI-1 two-sided send/receive latencies (short 8-byte messages):
Gigabit Ethernet (TCP/IP): 100 us (low-cost clusters)
Myrinet: 13 us (mid-range clusters)
Quadrics: 4-5 us (high-end clusters)
MPT (SMP): us
Altix (Supercluster): 1.5 us
Goal: us

44 SGI Altix 3000 Scalability for Compute Intensive Applications
(chart: speedup vs. number of processors, higher is better, for Gaussian (CCM), Amber (CCM), Fasta (BIO), Star-CD (CFD), Vectis (CFD), Ls-Dyna (CSM), TAU (CFD), HTC-Blast (BIO), Fastx (BIO), MM5 (CWO), CASTEP (CCM), GAMESS (CCM), NAMD (CCM), NWChem (CCM), VASP (CCM), against ideal speedup)
Scalability on Altix 3000 is in general similar to Origin 3000

45 Platform Directions - Mainframe Era
Corporate resource is expensive and needs to be shared
(chart: total cost of computing vs. cost of communication bandwidth, around 1985; centralised computing is more cost effective)
Total cost of computing = cost of HW + SW + related support costs

46 Platform Directions - Decentralized Computing
Corporate resource is cheap enough that I don't have to share
(chart: total cost of computing vs. cost of communication bandwidth; regions where centralised computing and decentralised client-server computing are more cost effective)
Total cost of computing = cost of HW + SW + related support costs

47 Platform Directions - Server Consolidation
Corporate resource is cheap and I can have as much as I want
(chart: scale out in nodes (NOWs, clusters of SMPs) vs. scale up in processors per node (SMPs), around 2000; total cost of computing vs. cost of communication bandwidth, with regions where centralised and decentralised client-server computing are more cost effective)
Total cost of computing = cost of HW + SW + related support costs

48 Platform Directions - Grid Computing
Corporate resource is cheap and I can have as much as I want, but I don't have to own it
(chart: scale out in nodes (NOWs, super-clusters of SMPs) vs. scale up in processors per node (SMPs); total cost of computing vs. cost of communication bandwidth, with regions where centralised and decentralised client-server computing are more cost effective)
Total cost of computing = cost of HW + SW + related support costs

49 Conclusions
- Compute intensive applications are also data intensive
- Standard benchmarks define a performance corridor for applications
- Communication vs. computation profiling of compute intensive applications is essential in designing scalable parallel computer systems
- Load imbalance is the most influential factor on scalability
- Preserving globally addressable memory beyond the boundary of a single node in a cluster not only improves communication efficiency but also improves load balancing
- The Altix 3000 Super-Cluster is a very efficient MPI machine

50 Thank You
Any questions?
