
ABySS Performance Benchmark and Profiling May 2010

Note
The following research was performed under the HPC Advisory Council activities
Participating vendors: AMD, Dell, Mellanox
Compute resource: HPC Advisory Council Cluster Center
For more info please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com, http://www.bcgsc.ca/platform/bioinfo/software/abyss

ABySS
ABySS is a de novo, parallel, paired-end sequence assembler designed for short reads
Capable of assembling larger genomes
Implemented using MPI
ABySS was developed at Canada's Michael Smith Genome Sciences Centre

Objectives
The presented research was done to provide best practices for:
  ABySS performance benchmarking
  Performance tuning with different communication libraries and compilers
  Interconnect performance comparisons
  Understanding ABySS communication patterns
  Power-efficient simulations
The presented results will demonstrate that a balanced compute system enables:
  Good application scalability
  Power savings

Test Cluster Configuration
Dell PowerEdge SC 1435 16-node cluster
Quad-Core AMD Opteron 2382 ("Shanghai") CPUs
Mellanox InfiniBand ConnectX 20Gb/s (DDR) HCAs
Mellanox InfiniBand DDR switch
Memory: 16GB DDR2 800MHz per node
OS: RHEL5U3, OFED 1.5 InfiniBand software stack
Compilers: GCC 4.1.2, Open64 4.2.3.1
Filesystem: Lustre 1.10.0.36
MPI: Open MPI 1.3.3, MVAPICH2 1.4
Application: ABySS 1.1.2
Benchmark workload: Illumina transcriptome sequencing of Drosophila melanogaster embryo

Mellanox Connectivity: Taking HPC to New Heights
World's highest efficiency
  The world's only full transport offload
  CORE-Direct: MPI and SHMEM offloads
  GPU-Direct: direct GPU-to-InfiniBand connectivity
World's fastest InfiniBand
  40Gb/s node to node, 120Gb/s IB switch to switch
  Highest-density switch solutions: 51.8Tb/s in a single switch
  World's lowest switch latency: 100ns at 100% load
HPC topologies for scale
  Fat-tree, mesh, 3D-Torus, hybrid
  Advanced adaptive routing capabilities
  Highest reliability, lowest bit error rate, real-time adjustments

Quad-Core AMD Opteron Processor
Performance
  Quad-core with enhanced CPU IPC
  4x 512KB L2 cache, 6MB L3 cache
  Direct Connect Architecture: HyperTransport technology, up to 24 GB/s peak per processor
  Floating point: 128-bit FPU per core, 4 FLOPS/clk peak per core
  Integrated memory controller: up to 12.8 GB/s, DDR2-800 MHz or DDR2-667 MHz
Scalability
  48-bit physical addressing
Compatibility
  Same power/thermal envelopes as 2nd/3rd generation AMD Opteron processors
[Block diagram: processor with dual-channel registered DDR2 memory, connected via 8 GB/s HyperTransport links to the I/O hub and PCI-E bridges]
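
A quick sanity check on the figures above (a sketch, not from the slide): the quoted memory bandwidth follows from the dual-channel DDR2-800 configuration in the block diagram, and peak floating-point throughput per node follows from the 4 FLOPS/clk figure if one assumes the 2.6 GHz clock of the Opteron 2382 used in the test cluster:

  Memory bandwidth: 2 channels x 8 bytes x 800 MT/s = 12.8 GB/s
  Peak node floating point: 2 sockets x 4 cores x 4 FLOPS/clk x 2.6 GHz (assumed clock) = 83.2 GFLOPS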

Dell PowerEdge Servers Helping Simplify IT
System structure and sizing guidelines
  16-node cluster built with Dell PowerEdge SC 1435 servers
  Servers optimized for High Performance Computing environments
  Building block foundations for best price/performance and performance/watt
Dell HPC solutions
  Scalable architectures for high performance and productivity
  Dell's comprehensive HPC services help manage the lifecycle requirements
  Integrated, tested and validated architectures
Workload modeling
  Optimized system size, configuration and workloads
  Test-bed benchmarks
  ISV applications characterization
  Best practices and usage analysis

Dell PowerEdge Server Advantage
Dell PowerEdge servers incorporate AMD Opteron processors and Mellanox ConnectX InfiniBand to provide leading-edge performance and reliability
  Building block foundations for best price/performance and performance/watt
  Investment protection and energy efficiency
Longer-term server investment value
  Faster DDR2-800 memory
  Enhanced AMD PowerNow!
  Independent Dynamic Core Technology
  AMD CoolCore and Smart Fetch Technology
  Mellanox InfiniBand end-to-end for highest networking performance

Lustre File System Configuration
Lustre configuration
  1 MDS
  4 OSS (each with 2 OSTs)
  InfiniBand-based backend storage
  All components are connected through the InfiniBand interconnect
[Diagram: Lustre MDS/OSS servers, storage, IB QDR switch, IB DDR switch, and cluster nodes]

ABySS Benchmark Results
InfiniBand enables higher performance and scalability
  Up to 88% higher performance than GigE and 81% higher than 10GigE
  Both GigE and 10GigE don't scale well beyond 8 nodes
[Chart: higher is better; 8 cores per node]

ABySS Benchmark Results - Compilers
Two different compilers were used to compile ABySS
  Open64 provides up to 13% better performance than GCC at 128 cores
[Chart: higher is better; 8 cores per node]

ABySS Benchmark Results - MPI Libraries
Open MPI enables better performance at higher core counts
  Up to 5% better at 128 cores
[Chart: higher is better; 8 cores per node]

Power Cost Savings with Different Interconnects
Dell's economical integration of AMD CPUs and Mellanox InfiniBand
  To achieve the same number of ABySS jobs as over GigE, InfiniBand saves up to $5700 in power versus GigE and $5300 versus 10GigE
  Yearly basis for the 16-node cluster
  As cluster size increases, more power can be saved
Power cost is computed as energy consumed (kWh) * $0.20 per kWh
For more information: http://enterprise.amd.com/downloads/svrpwrusecompletefinal.pdf
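
As a rough back-of-the-envelope reading of these figures (derived only from the numbers on the slide, not from additional measurements), the quoted savings correspond to:

  $5700 per year / $0.20 per kWh = 28,500 kWh per year
  28,500 kWh / 8760 hours = roughly 3.3 kW lower average draw across the 16-node cluster (or, equivalently, fewer node-hours at a similar draw, since jobs finish faster over InfiniBand)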

ABySS Benchmark Summary
Interconnect comparison shows InfiniBand delivers superior performance at every cluster size versus GigE and 10GigE
  The performance advantage extends as cluster size increases
Open64 compiler delivers higher performance
Open MPI provides higher performance
InfiniBand enables power savings
  Up to $5700/year power savings versus GigE and $5300 versus 10GigE on the 16-node cluster
Dell PowerEdge servers provide linear scalability (maximum scalability) and a balanced system
  By integrating the InfiniBand interconnect and AMD processors
  Maximum return on investment through efficiency and utilization

ABySS MPI Profiling - MPI Functions
Most frequently used MPI functions
  MPI_Send/MPI_Irecv and MPI_Test are the most frequently used MPI functions
  The number of collective calls increases as the cluster size scales
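
The deck does not show ABySS's messaging code; the following is a minimal C sketch of the kind of pattern this profile implies: each rank posts a nonblocking receive, polls it with MPI_Test while doing local work, and pushes data to peers with MPI_Send. The ring-shaped peer choice, message size, and buffer contents are purely illustrative, not taken from ABySS.

/* Minimal sketch (not ABySS source) of the Send / Irecv / MPI_Test pattern
 * seen in the profile: post a nonblocking receive, poll it with MPI_Test
 * while doing local work, send data with MPI_Send. */
#include <mpi.h>
#include <stdio.h>

#define MSG_LEN 1024

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char inbuf[MSG_LEN], outbuf[MSG_LEN];
    MPI_Request req;
    MPI_Status status;

    /* Post a nonblocking receive for data from any peer. */
    MPI_Irecv(inbuf, MSG_LEN, MPI_CHAR, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);

    /* Send a message to the next rank (stands in for assembly data traffic). */
    int peer = (rank + 1) % size;
    snprintf(outbuf, MSG_LEN, "data from rank %d", rank);
    MPI_Send(outbuf, MSG_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD);

    /* Poll for completion with MPI_Test, overlapping local work. */
    int done = 0;
    while (!done) {
        MPI_Test(&req, &done, &status);
        /* ... local work would go here ... */
    }
    printf("rank %d received: %s\n", rank, inbuf);

    MPI_Finalize();
    return 0;
}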

ABySS MPI Profiling - Communication Overhead
MPI_Send, MPI_Irecv, and MPI_Allreduce are the three functions responsible for most of the communication overhead
Communication overhead increases as the cluster size scales
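
How ABySS uses MPI_Allreduce is not shown in the deck; as a hedged illustration of why a collective appears alongside the point-to-point traffic, the sketch below shows a common use in distributed codes: every rank contributes a local count and all ranks learn the global total (for example, to agree that a stage is finished). The variable names and the work metric are hypothetical.

/* Minimal sketch (not ABySS source) of an MPI_Allreduce-based global check:
 * each rank contributes its local remaining work and all ranks learn the sum. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long local_work = (rank % 2) ? 10 : 0;  /* stand-in for local queue size */
    long global_work = 0;

    /* Every rank gets the sum of all local_work values. */
    MPI_Allreduce(&local_work, &global_work, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) {
        if (global_work == 0)
            printf("all ranks idle, stage complete\n");
        else
            printf("global remaining work: %ld\n", global_work);
    }

    MPI_Finalize();
    return 0;
}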

ABySS Profiling Summary
ABySS was profiled to identify its communication patterns
  MPI point-to-point operations create most of the communication overhead
  The number of collective calls increases with cluster size
Interconnect effects on ABySS performance
  Both latency and bandwidth are critical to application performance
A balanced system, with CPU, memory and interconnect that match each other's capabilities, is essential for application efficiency

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.