1 HPC Applications Performance and Optimizations Best Practices Pak Lui

2 130 Applications Best Practices Published: Abaqus, CPMD, LS-DYNA, MILC, AcuSolve, Dacapo, miniFE, OpenMX, Amber, Desmond, PARATEC, AMG, DL-POLY, MSC Nastran, PFA, AMR, Eclipse, MR Bayes, PFLOTRAN, ABySS, FLOW-3D, MM5, Quantum ESPRESSO, ANSYS CFX, GADGET-2, MPQC, RADIOSS, ANSYS FLUENT, GROMACS, NAMD, SPECFEM3D, ANSYS Mechanics, Himeno, Nekbone, WRF, BQCD, HOOMD-blue, NEMO, CCSM, HYCOM, NWChem, CESM, ICON, Octopus, COSMO, Lattice QCD, OpenAtom, CP2K, LAMMPS, OpenFOAM. For more information, visit: 2

3 Agenda
Overview of HPC application performance
Ways to inspect/profile/optimize HPC applications: CPU/memory, file I/O, network
System configurations and tuning
Case studies, performance optimization and highlights: LS-DYNA (CPU application), HOOMD-blue (GPU application)
Conclusions 3

4 HPC Application Performance Overview
Achieving scalable performance on HPC applications involves understanding the workload through profile analysis, then tuning for where the most time is spent (CPU, network, I/O, etc.)
Underlying implicit requirement: each node must perform similarly; run CPU/memory/network tests or a cluster checker to identify bad node(s), as shown in the example below
Comparing behaviors when using different HW components helps pinpoint bottlenecks in different areas of the HPC cluster
A selection of HPC applications will be shown to demonstrate the method of profiling and analysis, to determine the bottleneck in SW/HW, and to determine the effectiveness of tuning to improve performance 4
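A minimal sketch of such a per-node health check, assuming password-less ssh, a host list in ~/hosts, and a pre-built STREAM binary in ~/bin (paths and file names are illustrative); nodes whose Triad bandwidth stands out from the rest are suspect:
$ for h in $(cat ~/hosts); do echo -n "$h: "; ssh $h ~/bin/stream | grep Triad; done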

5 Ways To Inspect and Profile Applications
Computation (CPU/Accelerators) - Tools: top, htop, perf top, pstack, Visual Profiler, etc. Tests and benchmarks: HPL, STREAM
File I/O - Bandwidth and block size: iostat, collectl, darshan, etc. Characterization tools and benchmarks: iozone, ior, etc.
Network Interconnect - Tools and profilers: perfquery, MPI profilers (IPM, TAU, etc.). Characterization tools and benchmarks for latency and bandwidth: OSU benchmarks, IMB (see the example below) 5
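A minimal sketch of a two-node network characterization run, assuming the OSU micro-benchmarks are built under ~/osu and that node01/node02 are valid hostnames (both are assumptions, not from the slides):
$ mpirun -np 2 -host node01,node02 ~/osu/osu_latency
$ mpirun -np 2 -host node01,node02 ~/osu/osu_bw
$ mpirun -np 2 -host node01,node02 ~/osu/osu_bibw
Comparing the measured small-message latency and large-message bandwidth against the fabric's expected values (roughly 1us and over 6GB/s for FDR InfiniBand) quickly exposes a misconfigured link or HCA.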

6 Case Study: LS-DYNA 6

7 LS-DYNA
LS-DYNA: a general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems
Developed by the Livermore Software Technology Corporation (LSTC)
LS-DYNA is used in the automobile, aerospace, construction, military, manufacturing and bioengineering industries 7

8 Objectives
The presented research was done to provide best practices: performance benchmarking, CPU/memory performance comparison, MPI library performance comparison, interconnect performance comparison
The presented results will demonstrate the scalability of the compute environment/application and considerations for higher productivity and efficiency 8

9 Test Cluster Configuration
Dell PowerEdge R720xd/R720 32-node (640-core) cluster
Dual-Socket 10-Core Intel Xeon E5-2680 V2 2.80 GHz CPUs (Turbo Mode enabled)
Memory: 64GB DDR3 1600MHz Dual Rank Memory Modules (Static max Perf in BIOS)
Hard Drives: 24x 250GB 7.2K RPM SATA 2.5" on RAID 0
OS: RHEL 6.2, MLNX_OFED InfiniBand SW stack
Intel Cluster Ready certified cluster
Mellanox Connect-IB FDR InfiniBand and ConnectX-3 Ethernet adapters
Mellanox SwitchX 6036 VPI InfiniBand and Ethernet switches
MPI: Intel MPI 4.1, Platform MPI 9.1, Open MPI 1.7.4 w/ FCA 2.5 & MXM 2.1
Application: LS-DYNA mpp971_s_r3.2.1_intel_linux86-64 (for TopCrunch), ls-dyna_mpp_s_r7_0_0_79069_x64_ifort120_sse2
Benchmark datasets: 3 Vehicle Collision, Neon refined revised 9

10 PowerEdge R720xd
Massive flexibility for data-intensive operations
Performance and efficiency: intelligent hardware-driven systems management with extensive power management features; innovative tools including automation for parts replacement and lifecycle manageability; broad choice of networking technologies from GigE to IB; built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
Benefits: designed for performance workloads, from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments; high-performance scale-out compute and low-cost dense storage in one package
Hardware capabilities: flexible compute platform with dense storage capacity; 2S/2U server, 6 PCIe slots; large memory footprint (up to 768GB / 24 DIMMs); high I/O performance and optional storage configurations
HDD options: 12 x 3.5" or 24 x 2.5" HDDs, plus 2 x 2.5" HDDs in the rear of the server; up to 26 HDDs, with the 2 hot-plug drives in the rear of the server used for boot or scratch 10

11 LS-DYNA Performance Turbo Mode
Turbo Boost enables processors to run above their base frequency: CPU cores can run dynamically above the nominal CPU clock when thermal headroom allows
The 2.8GHz base clock speed can boost up to a Max Turbo Frequency of 3.6GHz
Running with Turbo Boost provides a ~8% performance boost, while drawing more power (~17%); see the check below
(Chart: 8%, 17%; higher is better; FDR InfiniBand) 11
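A quick way to confirm that Turbo is actually engaged during a run, a minimal sketch assuming the cpupower and turbostat utilities (kernel-tools) are installed and run as root:
$ cpupower frequency-info | grep "current CPU frequency"
$ turbostat sleep 30
Observed core frequencies above the 2.8GHz base clock (up to 3.6GHz) confirm that Turbo Boost is active; turbostat also reports package power, which makes the roughly 17% extra draw visible.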

12 LS-DYNA Performance Memory Module
Dual-rank memory modules provide better speedup for LS-DYNA: using Dual Rank 1600MHz DIMMs is 8.9% faster than Single Rank 1866MHz DIMMs
The rank of the memory has a larger effect on performance than its speed
System components used: DDR3-1866MHz PC14900 CL13 Single Rank Memory Module; DDR3-1600MHz PC12800 CL11 Dual Rank Memory Module
(Chart: 8.9%; higher is better; single node) 12

13 LS-DYNA Performance File I/O
The effect of parallel file I/O appears as more cluster nodes participate: when parallel file I/O access happens, additional traffic appears on the network
Staging I/O on local /dev/shm was used as a test (see the sketch below), but is not a solution for a production environment
This test demonstrates the importance of a good parallel file system; it is advised to use a cluster file system instead of temporary storage
(Chart: 22%; higher is better; FDR InfiniBand) 13
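A minimal sketch of the staging test, assuming the input deck lives on NFS under /nfs/lsdyna, that pdsh is available, and that every node has enough free memory to hold a copy in /dev/shm (all paths and file names are hypothetical):
$ pdsh -w node[01-32] 'mkdir -p /dev/shm/lsdyna && cp /nfs/lsdyna/3cars/* /dev/shm/lsdyna/'
$ cd /dev/shm/lsdyna && mpirun -np 640 -hostfile ~/hosts <ls-dyna binary> i=<input deck>
Data written to /dev/shm disappears at reboot and consumes node RAM, which is why this is only a diagnostic for file-system bottlenecks rather than a production setup.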

14 I/O Profiling with collectl
Collectl (collect for Linux)
Example to run it:
$ collectl -sdf >> ~/$HOSTNAME.$(date +%y%m%d_%H%M%S).txt
Example of the collectl report:
waiting for 1 second sample...
#<----------Disks-----------><------NFS Totals------>
#KBRead Reads KBWrit Writes Reads Writes Meta Comm

15 LS-DYNA Profiling File I/O
Some file I/O takes place at the beginning of the run, on the node hosting rank 0 as well as on the other cluster nodes 15

16 LS-DYNA Performance Interconnects
FDR InfiniBand delivers superior scalability in application performance: over 7 times (707%) higher performance than 1GbE, and almost 5 times faster than 10GbE and 40GbE
1GbE stops scaling beyond 4 nodes, and 10GbE stops scaling beyond 8 nodes
Only FDR InfiniBand demonstrates continuous performance gain at scale
(Chart: 492%, 474%, 707%; higher is better; Intel E5-2680v2) 16

17 LS-DYNA Performance Open MPI Tuning
FCA and MXM enhance LS-DYNA performance at scale for Open MPI: FCA offloads MPI collective operations to hardware, while MXM provides memory and transport enhancements to parallel communication libraries
FCA and MXM provide a speedup of 18% over the untuned baseline run at 32 nodes
MCA parameters for enabling FCA and MXM (see the full command sketch below):
For enabling MXM: -mca mtl mxm -mca pml cm -mca mtl_mxm_np 0 -x MXM_TLS=ud,shm,self -x MXM_SHM_RNDV_THRESH= -x MXM_RDMA_PORTS=mlx5_0:1
For enabling FCA: -mca coll_fca_enable 1 -mca coll_fca_np 0 -x fca_ib_dev_name=mlx5_0
(Chart: 18%, 2%, 10%; higher is better; NFS file system) 17
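A minimal launch sketch combining the MXM and FCA flags above, assuming Open MPI 1.7.4 with a hostfile and 20 ranks per node (the rank count, hostfile name and input deck are illustrative, not from the slides):
$ mpirun -np 640 -hostfile ~/hosts -bind-to-core \
  -mca mtl mxm -mca pml cm -mca mtl_mxm_np 0 -x MXM_TLS=ud,shm,self -x MXM_RDMA_PORTS=mlx5_0:1 \
  -mca coll_fca_enable 1 -mca coll_fca_np 0 -x fca_ib_dev_name=mlx5_0 \
  <ls-dyna binary> i=<input deck>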

18 LS-DYNA Performance MPI
Platform MPI performs better than Open MPI and Intel MPI: up to 1% better than Open MPI and 16% better than Intel MPI
Tuning parameters used:
Open MPI: -bind-to-core, FCA, MXM, KNEM
Platform MPI: -cpu_bind, -xrc
Intel MPI: -IB, I_MPI_FABRICS shm:ofa, I_MPI_PIN on, I_MPI_ADJUST_BCAST 1, -genv MV2_USE_APM 0, -genv I_MPI_OFA_USE_XRC 1, -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx5_0-1u (see the launch sketch below)
(Chart: 16%, 1%; higher is better; FDR InfiniBand) 18
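A minimal Intel MPI launch sketch using the environment settings listed above, assuming a hostfile and 20 ranks per node (the rank count, hostfile name and input deck are illustrative):
$ mpirun -IB -np 640 -f ~/hosts \
  -genv I_MPI_FABRICS shm:ofa -genv I_MPI_PIN on -genv I_MPI_ADJUST_BCAST 1 \
  -genv I_MPI_OFA_USE_XRC 1 <ls-dyna binary> i=<input deck>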

19 LS-DYNA Performance System Generations
The Intel E5-2600 v2 series (Ivy Bridge) outperforms prior generations: up to 20% higher performance than an Intel Xeon E5 (Sandy Bridge) cluster, up to 70% higher than an Intel Xeon X5670 (Westmere) cluster, and up to 167% higher than an Intel Xeon X5570 (Nehalem) cluster
System components used:
Ivy Bridge: 2-socket 10-core, 1600MHz DIMMs, Connect-IB FDR
Sandy Bridge: 2-socket 8-core, 1600MHz DIMMs, ConnectX-3 FDR
Westmere: 2-socket 6-core, 1333MHz DIMMs, ConnectX-2 QDR
Nehalem: 2-socket 4-core, 1333MHz DIMMs, ConnectX-2 QDR
(Chart: 70%, 167%, 20%; higher is better; FDR InfiniBand) 19

20 LS-DYNA Performance TopCrunch
The HPC Advisory Council results beat the previously best published results: TopCrunch publishes LS-DYNA performance results, and HPCAC achieved better performance on a per-node basis
9% to 27% higher performance than the best published results on TopCrunch (Feb 2014), comparing against all platforms on TopCrunch
The HPC Advisory Council results are the world's best for systems of 2 to 32 nodes, achieving higher performance than larger node-count systems
(Chart: 10%, 24%, 21%, 10%, 14%, 25%, 9%, 27%, 21%, 11%; higher is better; FDR InfiniBand) 20

21 LS-DYNA Performance System Utilization
Maximizing system productivity by running 2 jobs in parallel: up to 77% increase in system utilization at 32 nodes
Each job runs separately in parallel by splitting the system resources in half (see the sketch below)
System components used: Single job: use all cores and both IB ports to run a single job; 2 jobs in parallel: the cores of 1 CPU and 1 IB port for one job, the rest for the other
(Chart: 77%, 22%, 38%; higher is better; FDR InfiniBand) 21
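A minimal sketch of the split-resource setup with Open MPI, assuming 10-core sockets with socket 0 mapped to cores 0-9, and the two Connect-IB ports appearing as mlx5_0 port 1 and port 2 (device and port naming, rank counts and hostfile are assumptions):
$ mpirun -np 320 -hostfile ~/hosts --bind-to core --cpu-set 0-9 \
  -mca btl_openib_if_include mlx5_0:1 <job 1> &
$ mpirun -np 320 -hostfile ~/hosts --bind-to core --cpu-set 10-19 \
  -mca btl_openib_if_include mlx5_0:2 <job 2> &
Each job is pinned to one socket's cores and restricted to one HCA port, so the two jobs do not contend for the same CPU or IB link.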

22 MPI Profiling with IPM
IPM (Integrated Performance Monitoring)
Example to run it:
$ LD_PRELOAD=/usr/mpi/gcc/openmpi-1.7.4/tests/ipm-2.0.2/lib/libipm.so mpirun -x LD_PRELOAD <rest of the cmd>
To generate the HTML report:
$ export IPM_KEYFILE=/usr/mpi/gcc/openmpi-1.7.4/tests/ipm-2.0.2/etc/ipm_key_mpi
$ export IPM_REPORT=full
$ export IPM_LOG=full
$ ipm_parse -html <username>.<timestamp>.ipm.xml 22

23 LS-DYNA Profiling MPI/User Time Ratio
Computation time is dominant compared to MPI communication time, and the MPI communication ratio increases as the cluster scales
Both computation time and communication time decline as the cluster scales: the InfiniBand infrastructure allows spreading the work without adding overheads
Computation time drops faster than communication time
Compute bound: tuning for computation performance could yield better results
(FDR InfiniBand) 23

24 LS-DYNA Profiling MPI Calls
MPI_Wait, MPI_Send and MPI_Recv are the most used MPI calls: MPI_Wait (31%), MPI_Send (21%), MPI_Irecv (18%), MPI_Recv (14%), MPI_Isend (12%)
The majority of LS-DYNA's MPI calls are point-to-point data transfers, both blocking and non-blocking
LS-DYNA makes extensive use of the MPI API: over 23 MPI routines are used 24

25 LS-DYNA Profiling Time Spent by MPI Calls
The majority of the MPI time is spent on MPI_Recv and MPI collective operations: MPI_Recv (36%), MPI_Allreduce (27%), MPI_Bcast (24%)
MPI communication time lowers gradually as the cluster scales, due to the faster total runtime: as more CPUs work on completing the job faster, the communication time of each MPI call is reduced 25

26 LS-DYNA Profiling MPI Message Sizes
Most of the MPI messages are small to medium in size: most message sizes are between 0 and 64B
For the most time-consuming MPI calls: MPI_Recv: most messages are under 4KB; MPI_Bcast: the majority are less than 16B, but larger messages exist; MPI_Allreduce: most messages are less than 256B
(3 Vehicle Collision, 32 nodes) 26

27 LS-DYNA Profiling MPI Data Transfer
As the cluster grows, substantially less data is transferred between MPI processes: dropping from ~10GB per rank at 2 nodes to ~4GB at 32 nodes
Rank 0 has higher transfer volumes than the rest of the MPI ranks: rank 0 is responsible for file I/O and uses MPI to communicate with the rest of the ranks 27

28 LS-DYNA Profiling Aggregated Transfer
Aggregated data transfer refers to the total amount of data being transferred in the network between all MPI ranks collectively
Large data transfers take place in LS-DYNA: around 2TB of data is exchanged between the nodes at 32 nodes
(FDR InfiniBand) 28

29 LS-DYNA Profiling Memory Usage By Node
A uniform amount of memory is consumed for running LS-DYNA: about 38GB is used per node; each node runs 20 ranks, thus about 2GB per rank is needed
The same trend continues from 2 to 32 nodes
(3 Vehicle Collision, 2 nodes and 32 nodes) 29

30 LS-DYNA Summary
Performance
Compute: Xeon E5-2680v2 (Ivy Bridge) and FDR InfiniBand enable LS-DYNA to scale: up to 20% over Sandy Bridge, up to 70% over Westmere, 167% over Nehalem
Compute: running with Turbo Boost provides a ~8% performance boost, while drawing more power (~17%)
Memory: dual-rank memory modules provide better speedup than single-rank: Dual Rank 1600MHz DIMMs are ~9% faster than Single Rank 1866MHz DIMMs
I/O: file I/O becomes important at scale; running in /dev/shm performs up to ~22% better than NFS
Network: FDR InfiniBand delivers superior scalability in application performance: over 7 times higher performance than 1GbE and over 5 times higher than 10/40GbE at 32 nodes
Tuning
FCA and MXM enhance LS-DYNA performance at scale for Open MPI, providing a speedup of 7% over the untuned baseline run at 32 nodes
Profiling
The CPU/MPI time ratio shows that significantly more computation than communication is taking place
The majority of MPI calls are (blocking and non-blocking) point-to-point communications 30

31 Case Study: HOOMD-blue 31

32 HOOMD-blue
Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition
Performs general-purpose particle dynamics simulations, taking advantage of NVIDIA GPUs
Free, open source; simulations are configured and run using simple Python scripts
The development effort is led by the Glotzer group at the University of Michigan; many groups from different universities have contributed code to HOOMD-blue 32

33 Note
The following research was performed under the HPC Advisory Council activities
Participating vendors: Dell, Mellanox, NVIDIA
Compute resource: the Wilkes cluster at the University of Cambridge, HPC Advisory Council Cluster Center
The following was done to provide best practices: HOOMD-blue performance overview, understanding HOOMD-blue communication patterns, MPI library comparisons
For more info please refer to: J. A. Anderson, C. D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics 227(10):5342-5359, May 2008. doi:10.1016/j.jcp.2008.01.047 33

34 Objectives
The following was done to provide best practices: HOOMD-blue performance benchmarking, interconnect performance comparisons, MPI performance comparison, understanding HOOMD-blue communication patterns
The presented results will demonstrate the scalability of the compute environment to provide nearly linear application scalability, and the capability of HOOMD-blue to achieve scalable productivity 34

35 Test Cluster Configuration 1
Dell PowerEdge R720xd/R720 cluster
Dual-Socket Octa-core Intel E5 series CPUs (Static max Perf in BIOS)
Memory: 64GB DDR3 Dual Rank Memory Modules
Hard Drives: 24x 250GB 7.2K RPM SATA 2.5" on RAID 0
OS: RHEL 6.2, MLNX_OFED InfiniBand SW stack
Mellanox Connect-IB FDR InfiniBand
Mellanox SwitchX SX6036 InfiniBand VPI switch
NVIDIA Tesla K40 GPUs (1 GPU per node), ECC enabled via nvidia-smi
NVIDIA CUDA 5.5 Development Tools and Display Driver
GPUDirect RDMA (nvidia_peer_memory tar.gz)
MPI: Open MPI 1.7.4rc1
Application: HOOMD-blue (git master 28Jan14)
Benchmark datasets: Lennard-Jones Liquid Benchmarks (16K, 64K Particles) 35

36 GPUDirect RDMA
GPUDirect RDMA (using PeerDirect) eliminates CPU bandwidth and latency bottlenecks
It uses remote direct memory access (RDMA) transfers between GPUs, resulting in significantly improved MPI_Send/MPI_Recv efficiency between GPUs in remote nodes 36

37 GPUDirect RDMA Requirements
1. Mellanox OFED
2. NVIDIA Driver (minimum supported version or later)
3. NVIDIA CUDA Toolkit 5.5 or 6.0
4. Plugin to enable GPUDirect RDMA: nv_peer_mem (see the check below) 37
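A minimal sketch for loading and verifying the nv_peer_mem plugin once the nvidia_peer_memory package is installed (run as root; service and module names as shipped with that package):
# modprobe nv_peer_mem
# lsmod | grep nv_peer_mem
# service nv_peer_mem status
If the module is not loaded, the MPI library cannot use the GPUDirect RDMA path and falls back to copying GPU buffers through host memory.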

38 GPUDirect RDMA MPI Stack
MVAPICH2-GDR 2.0b: requires CUDA 5.5
Required parameters: MV2_USE_CUDA, MV2_USE_GPUDIRECT
Optional tuning parameters: MV2_GPUDIRECT_LIMIT, MV2_USE_GPUDIRECT_RECEIVE_LIMIT, MV2_RAIL_SHARING_POLICY=FIXED_MAPPING, MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1, MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G
Open MPI: requires CUDA 6.0
Required parameter: btl_openib_want_cuda_gdr
Optional tuning parameters: btl_openib_cuda_rdma_limit (default 30,000B in size), btl_openib_cuda_eager_limit, btl_smcuda_use_cuda_ipc 38

39 HOOMD-blue Performance GPUDirect RDMA
GPUDirect RDMA enables higher performance on a small GPU cluster: up to 20% higher performance at 4 nodes for 16K particles, and up to 10% at 4 nodes for 64K particles
Adjusting the Open MPI MCA parameters can maximize GPUDirect RDMA usage: based on MPI profiling, the GDR limit for 64K particles was tuned to 65KB
MCA parameters to enable and tune GPUDirect RDMA for Open MPI: -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_cuda_rdma_limit XXXX
(Chart: 20%, 10%; higher is better; Open MPI) 39

40 HOOMD-blue Profiling MPI Message Sizes
HOOMD-blue uses non-blocking and collective operations for most data transfers
16K particles: MPI_Isend/MPI_Irecv are concentrated between 4B and 24576B
64K particles: MPI_Isend/MPI_Irecv are concentrated between 4B and 65536B
MCA parameter used to enable and tune GPUDirect RDMA: for 16K particles, the default allows all send/recv to use GPUDirect RDMA; for 64K particles, GDR usage is maximized by tuning -mca btl_openib_cuda_rdma_limit to include messages up to 65KB (see the sketch below)
(4 Nodes, 16K Particles; 4 Nodes, 64K Particles; 1 MPI Process/Node) 40
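A minimal sketch of such a tuned launch, assuming Open MPI with 1 rank per node across 4 nodes; 65536 is simply 64KB expressed in bytes to cover the observed message range, and the script name is illustrative:
$ mpirun -np 4 -npernode 1 -hostfile ~/hosts \
  -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_cuda_rdma_limit 65536 \
  python lj_liquid_bench.py
Any CUDA-enabled HOOMD-blue script launched through mpirun this way picks up these MCA settings.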

41 Test Cluster Configuration 2
Dell PowerEdge T620 128-node (1536-core) Wilkes cluster at the University of Cambridge
Dual-Socket Hexa-Core Intel E5 series CPUs
Memory: 64GB DDR3 memory
OS: Scientific Linux release 6.4 (Carbon), MLNX_OFED InfiniBand SW stack
Hard Drives: 2x 500GB 7.2K RPM 64MB Cache SATA 3.0Gb/s 3.5"
Mellanox Connect-IB FDR InfiniBand adapters
Mellanox SwitchX SX6036 InfiniBand VPI switch
NVIDIA Tesla K20 GPUs (2 GPUs per node), ECC enabled via nvidia-smi
NVIDIA CUDA 5.5 Development Tools and Display Driver
GPUDirect RDMA (nvidia_peer_memory tar.gz)
MPI: Open MPI 1.7.4rc1, MVAPICH2-GDR 2.0b
Application: HOOMD-blue (git master 28Jan14)
Benchmark datasets: Lennard-Jones Liquid Benchmarks (256K and 512K Particles) 41

42 The Wilkes Cluster at University of Cambridge
Built by the University of Cambridge in partnership with Dell, NVIDIA and Mellanox
The UK's fastest academic cluster, deployed November 2013: produces a LINPACK performance of 240TF, at position 166 on the November 2013 Top500 list
Ranked the most energy-efficient air-cooled supercomputer in the world and second in the worldwide Green500 ranking, with an extremely high performance per watt of 3631 MFLOP/W
Architected to utilize NVIDIA RDMA communication acceleration and significantly increase the system's parallel efficiency 42

43 GPUDirect RDMA at Wilkes Cluster
GPUDirect peer-to-peer: two GPUs on the same PCIe lane; intra-node communication
GPUDirect over RDMA: NIC and GPU on the same PCIe lane; inter-node communication 43

44 GPUDirect RDMA
Open MPI 1.7.5:
$ mpirun -np $NP -map-by ppr:1:socket --bind-to socket -mca rmaps_base_dist_hca mlx5_0:1 -mca coll_fca_enable 0 -mca mtl ^mxm -mca btl_openib_want_cuda_gdr 1 -mca btl_smcuda_use_cuda_ipc 0 -mca btl_smcuda_use_cuda_ipc_same_gpu 0 <app>
MVAPICH2-GDR 2.0b:
$ mpirun -np $NP -ppn 2 -genvall -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING mlx5_0 -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER -genv MV2_USE_CUDA 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT 1 <app> 44

45 HOOMD-blue Performance GPUDirect RDMA
GPUDirect RDMA unlocks performance between GPU and IB: demonstrated up to 102% higher performance at 64 nodes
GPUDirect RDMA provides a direct P2P data path between GPU and IB; this new technology significantly lowers GPU-GPU communication latency and completely offloads the CPU from all GPU communications across the network
MCA parameter to enable GPUDirect RDMA between 1 GPU and IB per node: --mca btl_openib_want_cuda_gdr 1 (default value for btl_openib_cuda_rdma_limit)
(Chart: 102%; higher is better; Open MPI) 45

46 HOOMD-blue Performance GPUDirect RDMA
(Chart: higher is better; Open MPI) 46

47 HOOMD-blue Performance Scalability
FDR InfiniBand empowers Wilkes to surpass Titan on scalability: Titan showed higher per-node performance, but Wilkes outperformed it in scalability
Titan: K20x GPUs, which compute at a higher clock rate than the K20 GPU; Wilkes: K20 GPUs at PCIe Gen2, and FDR InfiniBand at the Gen3 rate
Wilkes exceeds Titan in scalability performance with FDR InfiniBand, outperforming Titan by up to 114% at 32 nodes
(Chart: 114%, 91%; 1 Process/Node) 47

48 HOOMD-blue Performance MPI
Open MPI performs better than MVAPICH2-GDR: at lower scale, MVAPICH2-GDR performs slightly better; at higher scale, Open MPI shows better scalability
The locality of the IB interface used was explicitly specified with flags when the tests were run
Both MPI implementations are in their beta releases; scalability performance is expected to improve in their release versions
An issue prevented MVAPICH2-GDR from running on 8 and 64 nodes
(Chart: higher is better; 1 Process/Node) 48

49 HOOMD-blue Performance Host-Buffer Staging
HOOMD-blue can run with a non-CUDA-aware MPI using Host Buffer Staging (HBS): HOOMD-blue is built with the "ENABLE_MPI=ON" and "ENABLE_MPI_CUDA=OFF" flags (see the build sketch below)
A non-CUDA-aware (or host) MPI has lower latency than a CUDA-aware MPI: with GDR, buffers are copied individually by the CUDA-aware MPI, giving slightly higher MPI latency; with HBS, only single large buffers are copied as needed, giving lower MPI latency
GDR performs on par with HBS at large scale, and better in some cases: at large scale, HBS appears to perform slightly faster than GDR; at small scale, GDR can be faster than HBS with a small number of particles per GPU
(Chart: higher is better; 1 Process/Node) 49
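A minimal sketch of the two build variants, assuming an out-of-tree CMake build of a HOOMD-blue checkout in ../hoomd-blue (ENABLE_MPI and ENABLE_MPI_CUDA are the flags named on the slide; ENABLE_CUDA is assumed to be on for GPU builds):
$ cmake ../hoomd-blue -DENABLE_CUDA=ON -DENABLE_MPI=ON -DENABLE_MPI_CUDA=ON   # CUDA-aware MPI / GDR path
$ cmake ../hoomd-blue -DENABLE_CUDA=ON -DENABLE_MPI=ON -DENABLE_MPI_CUDA=OFF  # host-buffer staging path
$ make -j 16 && make install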

50 HOOMD-blue Profiling % Time Spent on MPI
HOOMD-blue utilizes both non-blocking and collective operations for communication, and the mix of network communications changes as the cluster scales
4 nodes: MPI_Waitall (75%); the rest is MPI_Bcast and MPI_Allreduce
96 nodes: MPI_Bcast (35%); the rest is MPI_Allreduce and MPI_Waitall
(4 Nodes and 96 Nodes, 512K Particles; Open MPI) 50

51 HOOMD-blue Profiling MPI Communication
Each rank engages in similar network communication, except for rank 0, which spends less time in MPI_Bcast
(4 Nodes and 96 Nodes, 512K Particles; 1 MPI Process/Node) 51

52 HOOMD-blue Profiling MPI Message Sizes
HOOMD-blue utilizes non-blocking and collective operations for most data transfers
4 nodes: MPI_Isend/MPI_Irecv are concentrated between 28KB and 229KB; 96 nodes: MPI_Isend/MPI_Irecv are concentrated between 64B and 16KB
GPUDirect RDMA is enabled for messages between 0B and 30KB: MPI_Isend/MPI_Irecv messages can take advantage of GPUDirect RDMA, and messages that fit within the (tunable) default 30KB window benefit from it
(4 Nodes and 96 Nodes, 512K Particles; 1 MPI Process/Node) 52

53 HOOMD-blue Profiling Point-to-Point Transfer
Distribution of data transfers between the MPI processes: non-blocking point-to-point data communications between processes are involved
Less data is transferred as more ranks take part in the run
(4 Nodes and 96 Nodes, 512K Particles; 1 MPI Process/Node) 53

54 HOOMD-blue Summary
HOOMD-blue demonstrates good use of GPUs and InfiniBand at scale: FDR InfiniBand is the interconnect that allows HOOMD-blue to scale; Ethernet solutions would not scale beyond 1 node
GPUDirect RDMA: this new technology provides a direct P2P data path between GPU and IB, giving a significant decrease in GPU-GPU communication latency
GPUDirect RDMA unlocks performance between GPU and IB: demonstrated up to 20% higher performance at 4 nodes for the 16K case, and up to 102% higher performance at 96 nodes for the 512K case
InfiniBand empowers Wilkes to surpass Titan on scalability performance: Titan has higher per-node performance, but Wilkes outperforms it in scalability, by up to 114% at 32 nodes
GPUDirect RDMA performs on par with Host Buffer Staging: at large scale, HBS appears to perform slightly faster than GDR; at small scale, GDR can be faster than HBS with a small number of particles per GPU 54

55 Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein 55
