Scalability Considerations for Compute Intensive Applications on Clusters
Christian Tanasescu, Daniel Thomas, SGI Inc.
2 Agenda
- Application segments
- HPC computational requirements
- Scalability and application profiles
- Standard benchmarks vs. applications
- Communication vs. computation ratio
- BandeLa: profiling and modeling tool
- Platform directions
- Conclusions
3 The Move to Technology Transparency
The application drives the platform.
[Chart: sellable feature (high to low) over time; the emphasis shifts from feeds and speeds alone, to architecture plus feeds and speeds, to the application plus architecture plus feeds and speeds.]
4 Application Segments
CSM - Computational Structural Mechanics
CFD - Computational Fluid Dynamics
CCM - Computational Chemistry and Material Science
BIO - Bioinformatics
SPI - Seismic Processing and Interpretation
RES - Reservoir Simulation
CWO - Climate/Weather/Ocean Simulation
5 Sweet Spot Scalability per Application Segment
[Chart: typical sweet-spot processor counts for CSM, CFD, CCM, BIO, SPI, RES, and CWO; legible ranges include 8-64p, 16-64p, 4-32p, and 4-16p.]
6 HPC Resource Demands for Energy

Energy Segment          Software                  CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
Seismic Processing (1)  ProMAX, Omega, GeoDepth   H    H          H       L         M (2)    H, ~4 to 500
Reservoir Simulation    VIP, Eclipse              M    H          L       H         M        ~100

(1) Seismic processing packages such as ProMAX are comprised of a large number of executables. The data in this row are for the subset of executables that are most time-consuming.
(2) There are modules for which this entry would be H, but they comprise only about 10% of the total seismic processing workload.
7 HPC Resource Demands for CAE

MCAE Segment      Software                     CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
IFEA Statics      ABAQUS, ANSYS, MSC.Nastran   H    H          M       L         L        < 10p
IFEA Dynamics     ABAQUS, ANSYS, MSC.Nastran   L    H          H       H         L        < 10p
EFEA              LS-DYNA, PAM-CRASH, RADIOSS  H    L          L       M         M        ~ 50p
CFD Unstructured  FLUENT, STAR-CD, PowerFLOW   M    H          M       H         H        ~ 100p
CFD Structured    OVERFLOW                     H    H          L       M         M        ~ 100p
8 HPC Resource Demands for Bioinformatics

Bio Segment               Software                               CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
Sequence Matching         BLAST, FASTA, Smith-W., HMMER, Wise2   H    M          L       L         M        4-32
HTC + seq. matching code                                         H    H          M       L         M        ~100
Sequence Alignment        ClustalW, Phylip                       H    M          L       M         L        24
Sequence Assembly         Phrap, Paracel                         H    M          M       M         L        16
9 HPC Resource Demands for Computational Chemistry

Segment           Software                       CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
QM "ab-initio"    Gaussian, Gamess, ADF, CASTEP  H    H/M        H/M     L         L        1-32
QM Semiempirical  Mopac, Ampac                   H    L          L       L         M        1-4
MM/MD             Amber, Charmm, NAMD            H    M          M       M         H        1-64
Docking           Dock, FleXx                    H    L          L       L         L        1-64

QM: Quantum Mechanics (large variation in memory BW, I/O BW, and scalability)
MM/MD: Molecular Mechanics/Molecular Dynamics
Docking: scalability via throughput
10 HPC Resource Demands for Weather and Climate Models
Software: MM5, HIRLAM, CCM3/CAM, NOGAPS, IFS, ALADIN, CCSM2, FMS

Segment                          CPU  Memory BW  I/O BW  Comm. BW  Latency  Scalability
Explicit finite difference       H    M          L       L         H        ~ p
Semi-implicit finite difference  H    M          L       H         H        ~ p
Spectral climate models          H    M          L       H         M        ~ p
Spectral weather models          H    M          L       H         M        ~ 200p
Coupled climate models           H    M          L       H         H        ~ 100p
11 Performance Dependency on Architecture (8 MB L2 in O2000 and O3000)
[Bar chart: relative performance improvement for Linpack, Specfp2000, STREAM, Abaqus/std-1, Nastran(103), Nastran(101), StarCD-1, LS-Dyna-1, Pamcrash, Radioss-1, Vectis-1, Fluent-1, CASTEP, Amber, and Gaussian, grouped by CPU, CPU+cache, and memory sensitivity.]
A performance corridor is defined by Linpack (lower limit) and STREAM (upper limit). The relative performance improvement in applications is greater than the factor indicated by Specfp2000; exceptions are I/O-intensive applications like Nastran-NVH or Gaussian, where bandwidth steers performance.
12 Performance Dependency on Microprocessor Clock Rate (same Architecture)
[Bar chart: performance relative to CPU for Linpack, Specfp2000, STREAM, Abaqus/std-1, Abaqus/Exp-1, Nastran(103/111/101/108), Ansys, StarCD-1, StarCD-8, LS-Dyna-1, Pamcrash, Radioss-1, Madymo, Vectis-8, Fluent-8, Fluent-1, and Fire-1, grouped by CPU, CPU+cache, and memory sensitivity.]
A performance corridor is defined by Linpack (upper limit) and STREAM (lower limit).
13 Performance Dependency on Microprocessor Cache Size (same Architecture)
[Bar chart: relative performance with reduced cache size for the same benchmark and application set, plus Powerflow-16, grouped by CPU, CPU+cache, and memory sensitivity.]
A performance corridor is defined by Specfp2000 (lower limit) and STREAM (upper limit).
14 Key Applications Instruction Mix
[Stacked bar chart, 0-100%: floating point operations, integer operations, memory access instructions, and branch instructions for Nastran, Ansys, Pamcrash, LS-Dyna, Radioss, Powerflow, Fluent, StarHPC, Fire, Gaussian, Gamess, Amber, CASTEP, ADF, BLAST, FASTA, ClustalW, MM5, HIRLAM, CCM3, IFS, ProMAX, Omega, Eclipse, and VIP, across the CSM, CFD, CCM, BIO, CWO, SPI, and RES segments.]
15 Instruction Mix
Real applications have between 5% and 45% FP instructions (average 22%), while memory access instructions average 39%.
Most applications issue more INT than FP instructions; exceptions are BLAS-like solvers such as Nastran, Abaqus, and ProMAX.
The ratio of graduated loads and stores to FP operations is 1.7x: compute-intensive applications are also data-intensive applications.
Vector systems had a system balance of 1 (one flop per byte). Next-generation architectures need to address the memory bandwidth issue, and I/O puts an additional burden on memory bandwidth.
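The bandwidth argument above can be made concrete with a back-of-the-envelope model. This is an illustrative sketch, not taken from the slides: `required_bandwidth`, `WORD_BYTES`, and the `cache_hit_rate` parameter are assumptions layered on top of the measured 1.7 loads/stores per flop.

```python
# Back-of-the-envelope sketch relating the measured instruction mix
# to the memory bandwidth a processor would need to stay balanced.
WORD_BYTES = 8            # double-precision operand size
LDST_PER_FLOP = 1.7       # graduated loads+stores per FP op (slide 15)

def required_bandwidth(gflops, cache_hit_rate):
    """GB/s of main-memory traffic, assuming only cache misses reach DRAM.

    cache_hit_rate is a hypothetical parameter: 0.0 means every access
    goes to memory (vector-machine style), 0.95 is a cache-friendly code.
    """
    bytes_per_flop = LDST_PER_FLOP * WORD_BYTES * (1.0 - cache_hit_rate)
    return gflops * bytes_per_flop

# A 1 GFLOP/s core with no cache reuse would need 13.6 GB/s of bandwidth;
# with 95% cache hits, only 0.68 GB/s reaches main memory.
print(required_bandwidth(1.0, 0.0))
print(required_bandwidth(1.0, 0.95))
```

The point of the sketch is the slide's own: without substantial cache reuse, data traffic (not flops) sets the performance ceiling.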
16 System Balance
Supercomputing platforms must balance microprocessor power, memory size, bandwidth, and latency. I/O balance is another important consideration.
[Chart: system balance vs. number of CPUs (lower is better) for CRAY T90, NEC SX-5, IBM, SGI, HP, and SGI Altix systems.]
Supercomputers after the Cray 1 began to lose balance.
17 Communication vs. Computation Ratio in Key Applications (measured with BandeLa)
[Stacked bar chart, 0-100%: computation, wait, MPI SW latency, and data transfer time for Nastran/4p, Ansys/2p, Pam-Crash/32p, LS-Dyna/48p, Radioss/96p, PowerFLOW/64p, Fluent/64p, StarHPC, Fire/32p, Gaussian/16p, Gamess/32p, Amber/8p, CASTEP/128p, ADF/32p, BLAST/16p, FASTA/16p, ClustalW/16p, MM5, HIRLAM, CCM3/16p, IFS, ProMAX, Omega, Eclipse/52p, and VIP/32p, across the CSM, CFD, CCM, BIO, CWO, SPI, and RES segments.]
18 Communication Details
Computation: the time spent outside MPI.
Wait: the time a CPU is locked in mpi_wait, caused by load imbalance or by contention of the traffic through the interconnect fabric or the switch.
MPI SW Latency: the time accounted to the MPI library; sensitive to MPI latency.
Data Transfer: the time the transfer engine is active (bcopy on Origin 3000 or Altix 3000); sensitive to MPI bandwidth.
An important inhibiting factor for scalability is load imbalance (Wait). It needs to be addressed by future architectures and programming models.
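The Wait component can be illustrated with a minimal load-imbalance model (an assumption for illustration, not BandeLa code): at a synchronization point every rank waits for the slowest one, so each rank's wait time is the gap to the maximum.

```python
# Minimal sketch of how load imbalance turns into Wait time:
# at a synchronization point, wait_i = max(t) - t_i.
def wait_times(compute_times):
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]

per_rank = [9.0, 10.0, 7.5, 10.0]      # seconds of computation per MPI rank
print(wait_times(per_rank))            # [1.0, 0.0, 2.5, 0.0]
# Fraction of machine time lost to imbalance:
print(sum(wait_times(per_rank)) / (len(per_rank) * max(per_rank)))
```

Even a modest spread in per-rank compute time produces a visible Wait fraction, which is why the slide singles out load imbalance as the main scalability inhibitor.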
19 BandeLa Profiling Tool
An MPI tool to answer the question: what if the BANDwidth and LAtency change up or down?
1) Run the application with the targeted number of CPUs to capture the timings outside the MPI calls and the sequence of MPI kernels generated by the MPI library (isend, irecv, wait, test).
2) Replay the timings, applying a simple model to time the above kernels.
20 BandeLa Profiling Tool
Several topologies can be specified: single host, clusters.
Several communication schemes:
- Origin 3000-like (the receiving CPU does the transfer)
- synchronous and asynchronous transfers
- interleaving (an arriving message shares the hardware immediately)
- no interleaving (an arriving message waits for the previous messages to fully complete)
21 BandeLa: Basic Functionality
The MPI library transforms any MPI function into a sequence of four kernels:
MPI_SGI_request_send (mpi_isend)
MPI_SGI_request_recv (mpi_irecv)
MPI_SGI_request_test (mpi_test)
MPI_SGI_request_wait (mpi_wait)
The BandeLa instrumentation catches these sequences and records the computational time outside MPI. This is an application signature independent of the communication hardware.
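The interception idea can be sketched schematically; this is not the actual SGI instrumentation library, and `record` and the `trace` tuple format are invented names for illustration.

```python
# Schematic sketch of how an interposed MPI layer records an application
# signature: the time spent outside MPI between kernel calls, plus the
# sequence of kernels, independent of the network hardware.
import time

trace = []                 # (kernel_name, message_bytes, compute_seconds)
_last_exit = time.perf_counter()

def record(kernel, nbytes=0):
    global _last_exit
    now = time.perf_counter()
    trace.append((kernel, nbytes, now - _last_exit))  # time outside MPI
    # ... the real library would execute the kernel here ...
    _last_exit = time.perf_counter()

record("isend", 4096)
record("wait")
print([k for k, _, _ in trace])   # ['isend', 'wait']
```

Because only compute gaps and kernel sequences are stored, the same trace can later be replayed under any hypothetical latency/bandwidth model.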
22 BandeLa: Instrumentation Example
No need to relink; environment variables can also be set in order to partially instrument the application without relinking.

setenv LD_LIBRARY64_PATH .../ACQUIRE_64
setenv RLD64_LIST libbandela.so:default
f77 -o test_bcast test_bcast.f -lmpi
setenv MPI_BUFFER_MAX 2000
mpirun -np 4 test_bcast

One trace file is created for each process; four files are created for this example: fort.177, fort.178, fort.179, fort.180 (the starting file number, .177, may be changed with an environment variable).
23 BandeLa: Parameters (single host)
MPI Latency: the time accounted to the MPI software for doing its work (queuing messages, checking message arrivals, ...). In the model this is simply an amount of time added to the communication table of the particular CPU on entry of an MPI kernel function: 2.25 µs on Origin 3000, or 4.5 µs for a full send-receive.
MPI Bandwidth: the speed at which bcopy does its job; 250 MB/s on average on the Origin 3000.
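The replay model can be sketched directly from these two parameters. This is a simplified reconstruction in the spirit of the slide, not BandeLa's actual code; `kernel_time` and the trace format are assumptions.

```python
# Simplified replay model: every MPI kernel entry is charged a fixed
# software latency, and each data-moving kernel is charged size/bandwidth.
LATENCY_S = 2.25e-6          # Origin 3000 MPI software latency (slide 23)
BANDWIDTH = 250e6            # bytes/s, average bcopy speed (slide 23)

def kernel_time(message_bytes, is_transfer):
    t = LATENCY_S                       # charged on entry of every kernel
    if is_transfer:
        t += message_bytes / BANDWIDTH  # data-transfer cost
    return t

# Replaying a captured trace: (bytes, does_this_kernel_move_data)
trace = [(0, False), (1_000_000, True), (0, False), (1_000_000, True)]
total = sum(kernel_time(b, x) for b, x in trace)
print(total)   # 4 * 2.25e-6 + 2 * 0.004 = 0.008009 s
```

Changing LATENCY_S or BANDWIDTH and re-summing is exactly the "what if" question the tool is built to answer.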
24 BandeLa: Validating the Model Against Measurement (Origin 3000, single host)
CCM3, spectral climate model, on a 16-CPU Origin 3000.
[Chart: elapsed time (s) per MPI rank, measured communication vs. model, split into data physical transfer, MPI SW latency, wait, and computation; model using the default bandwidth of 250 MB/s.]

25 BandeLa: Validating the Model Against Measurement (Origin 3000, single host)
CCM3, spectral climate model, on a 16-CPU Origin 3000.
[Chart: the same comparison, with the model using a tuned bandwidth of 225 MB/s.]
26 BandeLa: What-If Analysis
Topologies available:
- single shared-memory host (Origin 3000 or Altix 3000)
- switch (clusters)
27 BandeLa: Data Transfer Methods
- Transfer synchronously done by the receiving CPU (Origin 3000 or Altix 3000 host)
- Transfer synchronously/asynchronously done with interleaving/no interleaving, constant bandwidth
- Transfer asynchronously with interleaving, bandwidth depending on request size (Myrinet)
28 BandeLa: Myrinet Parameters
MPI SW Latency: the Myrinet 2000 ping-pong latency on Origin 300 has been measured at 17 µs.
Bandwidth: as for the Origin 300 single host, the workload may change the adapter performance, but:
- the bandwidth also depends on the message size
- the CPUs have to share the adapter(s) (this is considered in the model used in BandeLa)
- the number of adapters used changes the per-adapter performance
29 BandeLa: Myrinet Parameters
[Chart: modeled Myrinet bandwidth (Mb/s) vs. message size, compared with the bandwidth chart given by Myricom. The model depends only on the asymptotic bandwidth, which is not 250 Mb/s on the real system.]
30 BandeLa: Myrinet Parameters
We use the following asymptotic values:
1 adapter: 93 Mb/s
2 adapters: 85 Mb/s
4 adapters: 75 Mb/s
These values were set from runs with two Origin 300 systems linked with 4 adapters on both machines. We think these are the asymptotic bandwidths really seen by the applications.
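A size-dependent bandwidth curve that approaches these asymptotic values can be sketched with a classic half-peak model. This is an assumed functional form, not BandeLa's actual curve, and the half-peak message size `N_HALF` is illustrative.

```python
# Sketch of a size-dependent bandwidth model that approaches the measured
# asymptotic values per adapter count (slide 30).
ASYMPTOTIC_MBPS = {1: 93.0, 2: 85.0, 4: 75.0}   # per number of adapters
N_HALF = 8192   # bytes at which half of peak is reached (assumption)

def effective_bandwidth(message_bytes, adapters):
    peak = ASYMPTOTIC_MBPS[adapters]
    # classic n_1/2 model: bw(n) = peak * n / (n + n_1/2)
    return peak * message_bytes / (message_bytes + N_HALF)

print(round(effective_bandwidth(8192, 1), 1))     # 46.5 (half of 93 at n = n_1/2)
print(effective_bandwidth(1 << 20, 1))            # approaches 93 for large messages
```

Under this form, small messages see far less than the asymptotic bandwidth, which matches the slide's point that the Myricom chart cannot be summarized by a single 250 Mb/s figure.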
31 CCM3 Study: Bandwidth Effect
CCM3, large case, on 64 CPUs.
[Chart: elapsed time (s) per MPI rank, split into computation, load unbalance, MPI latency, and physical data transfer; current performance model (MPI latency 5.5 µs, bandwidth 275 Mb/s) vs. 4x bandwidth (1.1 Gb/s).]
32 CCM3 Study
CCM3, large case, on a 64-CPU Origin 3000 SSI and on 4 x 16p Origin 300.
[Chart: elapsed time (s), split into data physical transfer, MPI SW latency, wait, and computation, for 64p O3000 SSI, 4x16p O300 with 1 Myrinet board, and 4x16p O300 with 4 Myrinet boards.]
CCM3 performance depends strongly on the number of communication channels.
33 CASTEP: Latency Effect
[Chart: CASTEP 24p execution modeled for different latencies (HIPPI 800, GSN, Origin 3000 SSI); elapsed time (s) vs. number of processors, split into computation, load unbalancing, MPI latency, and physical data transfer.]
34 Communication vs. Computation: Sweep3D
[Chart: elapsed time (s) vs. number of MPI tasks, split into data transfer, MPI latency, wait, and computation.]
35 BandeLa Pam-Crash Study: Scalability on Altix 3000
Tests on a 64-CPU Altix 3000 (1.5 MB L3 cache, single system image, SSI).
Pam-Crash V2003 DMP-SP, using a BMW model with 284,000 elements, run for 5000 time steps.
A special library is used to time or model the communication.
36 Pam-Crash V2003 on Altix 3000
Automotive model, 5000 time steps.
[Chart: Pam-Crash speed-up on Altix 900 MHz for the BMW model vs. number of CPUs: computation speed-up of rank 0, global speed-up, and perfect speed-up.]
Computation scales up to 64p, but the communication overhead is too high at 64p.
37 Pam-Crash V2003 on Altix 3000
BMW6 model, 5000 time steps.
The BandeLa model estimates that a perfect MPI machine (zero latency, infinite bandwidth) would not help: the Altix 3000 run is close to a perfect MPI model.
[Chart: elapsed time per MPI rank for the 16-CPU Altix (model), a perfect MPI machine (model), and the 16-CPU Altix measured, split into physical transfer, MPI SW latency, wait, and computation.]
38 BandeLa: Vampir Compatibility
BandeLa can generate a trace compatible with the Vampir browser. Using Vampir, you can zoom into the latency and bandwidth changes at any level of detail.
39 Requirements for Petaflops Applications
- Memory and cache footprint: amount of memory required at each level of the memory hierarchy
- Degree of data reuse associated with the core kernels of the applications, the scaling of these kernels, and the associated estimate of memory BW required at each level of the memory hierarchy
- Instruction mix (FP, integer, ld/st)
- I/O requirements and storage for temporary results and checkpoints
- Amount of concurrency available in the applications, and communication requirements: bisection BW, latency, fast synchronization patterns
- Communication/computation ratio and degree of overlap
[Diagram: memory hierarchy; processor to caches (L1, L2, L3) at 3-50 cycles, then main memory, then disk at 1.5 million cycles.]
40 Big Datasets: Generic Tera-Scale
x(1000), y(1000), z(1000), t(1000)

      program main
      real*8 pressure(1000,1000,1000,1000)
      real*8 volume(1000,1000,1000,1000)
      real*8 temperature(1000,1000,1000,1000)
      do k = 1, 1000
        do j = 1, 1000
          do i = 1, 1000
            do time_step = 1, 1000
              pressure(i,j,k,time_step)    = 0.
              volume(i,j,k,time_step)      = 0.
              temperature(i,j,k,time_step) = 0.
            end do
          end do
        end do
      end do
      print *, pressure(1,1,1,1), volume(1,1,1,1), temperature(1,1,1,1)
      stop
      end

C $ f77 -64 main.f          (to compile)
C $ limit stacksize ...     (before run)
C Only 3 attributes already require 24 TB of memory (3 arrays x 10^12 elements x 8 bytes).
41 Supercluster Concept: SGI Altix 3000
First industry-standard Linux cluster with global shared memory.
- NUMA support for large nodes: a single node up to 64 CPUs and 512 GB of memory
- Global shared memory: clusters up to 2,048 CPUs and 16 TB of memory
All nodes can access one large memory space efficiently, so complex communication and data passing between nodes isn't needed; big data sets fit entirely in memory, and less disk I/O is needed.
[Diagram: conventional clusters (one OS and memory per node, commodity interconnect) vs. the SGI Altix 3000 supercluster (global shared memory across nodes over the NUMAFlex interconnect).]
42 Parallel Programming Models on Altix 3000
Intra-partition: OpenMP threads, MPI, SHMEM
Inter-partition: MPI, SHMEM, XPMEM
43 MPI in Clusters and Globally Addressable Memory
MPI-1 two-sided send/receive latencies (short 8-byte messages):

Gigabit Ethernet (TCP/IP)  100 µs   (low-cost clusters)
Myrinet                    13 µs    (mid-range clusters)
Quadrics                   4-5 µs   (high-end clusters)
MPT [SMP]                  ... µs
Altix [Supercluster]       1.5 µs
Goal                       ... µs
44 SGI Altix 3000 Scalability for Compute-Intensive Applications
[Chart: speedup vs. number of processors (higher is better) for Gaussian (CCM), Amber (CCM), Fasta (BIO), Star-CD (CFD), Vectis (CFD), LS-Dyna (CSM), TAU (CFD), HTC-Blast (BIO), Fastx (BIO), MM5 (CWO), CASTEP (CCM), GAMESS (CCM), NAMD (CCM), NWChem (CCM), and VASP (CCM), against the ideal.]
Scalability on Altix 3000 is in general similar to Origin 3000.
45 Platform Directions: Mainframe Era
A corporate resource is expensive and needs to be shared.
[Chart: total cost of computing vs. cost of communication bandwidth over time; around 1985, centralised computing is more cost effective.]
Total cost of computing = cost of HW + SW + related support costs.
46 Platform Directions: Decentralized Computing
A corporate resource is cheap enough that I don't have to share.
[Chart: over time, decentralised client-server computing becomes more cost effective than centralised computing.]
47 Platform Directions: Server Consolidation
A corporate resource is cheap and I can have as much as I want.
[Chart: around 2000, two axes emerge: scale out (nodes: NOWs, clusters of SMPs) and scale up (processors per node: SMPs).]
48 Platform Directions: Grid Computing
A corporate resource is cheap and I can have as much as I want, but I don't have to own it.
[Chart: scale out toward super-clusters of SMPs and scale up with SMPs, with grid computing spanning both.]
49 Conclusions
- Compute-intensive applications are also data intensive.
- Standard benchmarks define a performance corridor for applications.
- Communication vs. computation profiling of compute-intensive applications is essential in designing scalable parallel computer systems.
- Load imbalance is the most influential factor on scalability.
- Preserving globally addressable memory beyond the boundary of a single node in a cluster improves not only the communication efficiency but also the load balancing.
- The Altix 3000 supercluster is a very efficient MPI machine.
50 Thank you. Any questions?
WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2
More informationStockholm Brain Institute Blue Gene/L
Stockholm Brain Institute Blue Gene/L 1 Stockholm Brain Institute Blue Gene/L 2 IBM Systems & Technology Group and IBM Research IBM Blue Gene /P - An Overview of a Petaflop Capable System Carl G. Tengwall
More informationAdvanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED
Advanced Software for the Supercomputer PRIMEHPC FX10 System Configuration of PRIMEHPC FX10 nodes Login Compilation Job submission 6D mesh/torus Interconnect Local file system (Temporary area occupied
More informationMPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA
MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationFuture Routing Schemes in Petascale clusters
Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract
More informationTESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications
TESLA P PERFORMANCE GUIDE Deep Learning and HPC Applications SEPTEMBER 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important
More informationSecond Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications
Second Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications Pawe l Gepner, David L. Fraser, and Micha l F. Kowalik Intel Corporation {pawel.gepner,david.l.fraser,michal.f.kowalik}@intel.com
More informationOptimizing LS-DYNA Productivity in Cluster Environments
10 th International LS-DYNA Users Conference Computing Technology Optimizing LS-DYNA Productivity in Cluster Environments Gilad Shainer and Swati Kher Mellanox Technologies Abstract Increasing demand for
More informationScalable Single System Image SGI Altix 3700, 512p Architecture and Software Environment
Silicon Graphics, Inc. Scalable Single System Image SGI Altix 3700, 512p Architecture and Software Environment Presented by: Jean-Pierre Panziera Principal Engineer Altix 3700 SSSI - Architecture and Software
More informationFull Vehicle Dynamic Analysis using Automated Component Modal Synthesis. Peter Schartz, Parallel Project Manager ClusterWorld Conference June 2003
Full Vehicle Dynamic Analysis using Automated Component Modal Synthesis Peter Schartz, Parallel Project Manager Conference Outline Introduction Background Theory Case Studies Full Vehicle Dynamic Analysis
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete
More informationParallel Programming with MPI
Parallel Programming with MPI Science and Technology Support Ohio Supercomputer Center 1224 Kinnear Road. Columbus, OH 43212 (614) 292-1800 oschelp@osc.edu http://www.osc.edu/supercomputing/ Functions
More informationSupercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?
Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA
More informationAltair RADIOSS Performance Benchmark and Profiling. May 2013
Altair RADIOSS Performance Benchmark and Profiling May 2013 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Altair, AMD, Dell, Mellanox Compute
More informationCP2K Performance Benchmark and Profiling. April 2011
CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox
More informationDell HPC System for Manufacturing System Architecture and Application Performance
Dell HPC System for Manufacturing System Architecture and Application Performance This Dell technical white paper describes the architecture of the Dell HPC System for Manufacturing and discusses performance
More informationWhy Multiprocessors?
Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software
More informationAdapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES
Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Tom Atwood Business Development Manager Sun Microsystems, Inc. Takeaways Understand the technical differences between
More informationClustering Optimizations How to achieve optimal performance? Pak Lui
Clustering Optimizations How to achieve optimal performance? Pak Lui 130 Applications Best Practices Published Abaqus CPMD LS-DYNA MILC AcuSolve Dacapo minife OpenMX Amber Desmond MILC PARATEC AMG DL-POLY
More informationNew Features in LS-DYNA HYBRID Version
11 th International LS-DYNA Users Conference Computing Technology New Features in LS-DYNA HYBRID Version Nick Meng 1, Jason Wang 2, Satish Pathy 2 1 Intel Corporation, Software and Services Group 2 Livermore
More informationComputer Comparisons Using HPCC. Nathan Wichmann Benchmark Engineer
Computer Comparisons Using HPCC Nathan Wichmann Benchmark Engineer Outline Comparisons using HPCC HPCC test used Methods used to compare machines using HPCC Normalize scores Weighted averages Comparing
More informationUniprocessor Computer Architecture Example: Cray T3E
Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure
More informationThe STREAM Benchmark. John D. McCalpin, Ph.D. IBM eserver Performance ^ Performance
The STREAM Benchmark John D. McCalpin, Ph.D. IBM eserver Performance 2005-01-27 History Scientific computing was largely based on the vector paradigm from the late 1970 s through the 1980 s E.g., the classic
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationChapter 2: Computer-System Structures. Hmm this looks like a Computer System?
Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding
More informationScheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications
Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications Sep 2009 Gilad Shainer, Tong Liu (Mellanox); Jeffrey Layton (Dell); Joshua Mora (AMD) High Performance Interconnects for
More informationBirds of a Feather Presentation
Mellanox InfiniBand QDR 4Gb/s The Fabric of Choice for High Performance Computing Gilad Shainer, shainer@mellanox.com June 28 Birds of a Feather Presentation InfiniBand Technology Leadership Industry Standard
More informationPerformance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA
Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to
More informationSingle-Points of Performance
Single-Points of Performance Mellanox Technologies Inc. 29 Stender Way, Santa Clara, CA 9554 Tel: 48-97-34 Fax: 48-97-343 http://www.mellanox.com High-performance computations are rapidly becoming a critical
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationParallel & Cluster Computing. cs 6260 professor: elise de doncker by: lina hussein
Parallel & Cluster Computing cs 6260 professor: elise de doncker by: lina hussein 1 Topics Covered : Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster
More informationGPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester
NVIDIA GPU Computing A Revolution in High Performance Computing GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester John Ashley Senior Solutions Architect
More informationCluster Computing. Cluster Architectures
Cluster Architectures Overview The Problem The Solution The Anatomy of a Cluster The New Problem A big cluster example The Problem Applications Many fields have come to depend on processing power for progress:
More informationDetermining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace
Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace James Southern, Jim Tuccillo SGI 25 October 2016 0 Motivation Trend in HPC continues to be towards more
More informationHigh Performance Computing Course Notes HPC Fundamentals
High Performance Computing Course Notes 2008-2009 2009 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs
More informationBlue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft
Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small
More informationCOSC4201. Multiprocessors and Thread Level Parallelism. Prof. Mokhtar Aboelaze York University
COSC4201 Multiprocessors and Thread Level Parallelism Prof. Mokhtar Aboelaze York University COSC 4201 1 Introduction Why multiprocessor The turning away from the conventional organization came in the
More informationThe State of Accelerated Applications. Michael Feldman
The State of Accelerated Applications Michael Feldman Accelerator Market in HPC Nearly half of all new HPC systems deployed incorporate accelerators Accelerator hardware performance has been advancing
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationReducing Network Contention with Mixed Workloads on Modern Multicore Clusters
Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational
More informationManaging CAE Simulation Workloads in Cluster Environments
Managing CAE Simulation Workloads in Cluster Environments Michael Humphrey V.P. Enterprise Computing Altair Engineering humphrey@altair.com June 2003 Copyright 2003 Altair Engineering, Inc. All rights
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More information