DELIVERABLE D5.5
Report on ICARUS visualization cluster installation
John BIDDISCOMBE (CSCS), Jerome SOUMAGNE (CSCS)
02 May 2011

NextMuSE: Next generation Multi-mechanics Simulation Environment

Cluster configuration

The original EIGER visualization and analysis cluster (installed in April 2010) consists of 19 nodes based on the six-core, dual-socket AMD Istanbul Opteron 2427 processor running at 2.2 GHz. Four nodes are reserved for specific tasks: one for login, one for administration and two for file system IO routing, leaving 15 nodes for visualization, to which we have now added (March 2011) an extension of four nodes based on the 12-core, dual-socket AMD Magny-Cours Opteron 6174, also running at 2.2 GHz. Standard nodes offer 24 GB of main system memory, whereas fat (memory) nodes and the extension nodes offer 48 GB per node. This gives a total of 276 cores and 664 GB of memory. In addition to the CPUs, each visualization node hosts one or two GPU cards, either GeForce or Tesla (see Table 1). The newest nodes come with Fermi cards providing 448 CUDA cores each and either 3 GB or 6 GB of on-board memory. More details are given in Table 1.

For the high-speed interconnect, the cluster relies on a dedicated InfiniBand QDR fabric able to carry both parallel MPI traffic and parallel file system traffic to the IO nodes. In addition, a commodity 10 GbE LAN provides interactive login access as well as home, project and application file sharing among the cluster nodes. A separate 1 GbE administration network is reserved for cluster management. Altair PBS Professional 10.2 is the batch queuing system installed and supported on the cluster. A CSCS user project has been created which allows external partners to access the cluster. The accounting system has a partition reserved for NextMuSE so that CPU hours consumed by (external) NextMuSE users are recorded automatically.

Node configuration (extension)

The extension nodes are dual-socket nodes with 48 GB of memory. As shown in Figure 1, one socket has 32 GB of memory whereas the other has 16 GB; NUMA effects must therefore be considered whenever a job uses more memory than a single socket provides (i.e. 32 GB or 16 GB). Each core has a 64 KB L1 cache and a 512 KB L2 cache, and shares an L3 cache of 10 MB per socket (2x6 MB, of which only 10 MB is visible). A minimal sketch for verifying this memory layout on a node is given after Figure 1.

Figure 1: Magny-Cours node topology
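The asymmetric per-socket memory sizes can be checked directly on an extension node using the Linux NUMA library. The following is a minimal sketch, assuming libnuma and its development headers are available on the node; it is an illustrative helper, not part of the delivered cluster software.

```c
/* numainfo.c - hypothetical helper: print the memory attached to each
 * NUMA node (socket) to verify the 32 GB / 16 GB split shown in Figure 1.
 * Build (assumption): gcc -std=gnu99 -o numainfo numainfo.c -lnuma
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int max_node = numa_max_node();
    for (int node = 0; node <= max_node; node++) {
        long long free_bytes = 0;
        long long size_bytes = numa_node_size64(node, &free_bytes);
        printf("NUMA node %d: %lld MB total, %lld MB free\n",
               node, size_bytes >> 20, free_bytes >> 20);
    }
    return 0;
}
```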

#   Node name  Node type  CPU type         Cores/node  Sockets/node  Memory/node  CPU freq.  GPU type  GPUs/node
    eiger160   login      AMD Istanbul     12          2             24 GB        2.2 GHz    Matrox    1
    eiger170   admin      AMD Istanbul     6           1             8 GB         2.2 GHz    Matrox    1
    eiger180   gpfs       AMD Istanbul     12          2             24 GB        2.2 GHz    Matrox    1
    eiger181   gpfs       AMD Istanbul     12          2             24 GB        2.2 GHz    Matrox    1
1   eiger200   vis        AMD Istanbul     12          2             24 GB        2.2 GHz    GTX 285   1
2   eiger201   vis        AMD Istanbul     12          2             24 GB        2.2 GHz    GTX 285   1
3   eiger202   vis        AMD Istanbul     12          2             24 GB        2.2 GHz    GTX 285   1
4   eiger203   vis        AMD Istanbul     12          2             24 GB        2.2 GHz    GTX 285   1
5   eiger204   vis        AMD Istanbul     12          2             24 GB        2.2 GHz    GTX 285   1
6   eiger205   vis        AMD Istanbul     12          2             24 GB        2.2 GHz    GTX 285   1
7   eiger206   vis        AMD Istanbul     12          2             24 GB        2.2 GHz    GTX 285   1
8   eiger207   visfat     AMD Magny-Cours  24          2             48 GB        2.2 GHz    M2050     2
9   eiger208   visfat     AMD Magny-Cours  24          2             48 GB        2.2 GHz    M2050     2
10  eiger209   visfat     AMD Magny-Cours  24          2             48 GB        2.2 GHz    C2070     2
11  eiger210   visfat     AMD Magny-Cours  24          2             48 GB        2.2 GHz    C2070     2
12  eiger220   visfat     AMD Istanbul     12          2             48 GB        2.2 GHz    GTX 285   1
13  eiger221   visfat     AMD Istanbul     12          2             48 GB        2.2 GHz    GTX 285   1
14  eiger222   visfat     AMD Istanbul     12          2             48 GB        2.2 GHz    GTX 285   1
15  eiger223   visfat     AMD Istanbul     12          2             48 GB        2.2 GHz    GTX 285   1
16  eiger240   a.d.n.     AMD Istanbul     12          2             24 GB        2.2 GHz    S1070     2
17  eiger241   a.d.n.     AMD Istanbul     12          2             24 GB        2.2 GHz    S1070     2
18  eiger242   a.d.n.     AMD Istanbul     12          2             24 GB        2.2 GHz    C2070     2
19  eiger243   a.d.n.     AMD Istanbul     12          2             24 GB        2.2 GHz    C2070     2

Table 1: Eiger system configuration with the newly installed NextMuSE extension
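The per-node GPU inventory in Table 1 can be cross-checked on any node with the CUDA runtime API. Below is a minimal sketch, assuming the CUDA toolkit and runtime library are installed; it is illustrative only and not part of the delivered software. On the Fermi cards of the extension (M2050/C2070) each streaming multiprocessor contains 32 CUDA cores, so the 14 multiprocessors reported correspond to the 448 CUDA cores mentioned above.

```c
/* gpuinfo.c - hypothetical helper: enumerate the GPUs visible on a node.
 * Build (assumption): gcc -std=gnu99 -o gpuinfo gpuinfo.c \
 *   -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart
 */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA-capable device detected\n");
        return 1;
    }
    for (int dev = 0; dev < count; dev++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("GPU %d: %s, %d multiprocessors, %.1f GB memory, "
               "compute capability %d.%d\n",
               dev, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.major, prop.minor);
    }
    return 0;
}
```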

MPI configuration

The default MPI distribution installed on the system is MVAPICH2, which provides a good and reliable implementation of MPI over InfiniBand; more details are available at http://mvapich.cse.ohio-state.edu. Benchmarks of the measured bandwidth and latency on Eiger, for both inter-node and intra-node communication, are presented below. Note that an additional kernel module (KNEM) is used for single-copy intra-node message passing, optimizing performance for this configuration.

Between two nodes connected by an InfiniBand QDR 4X link the theoretical data bandwidth is 4 GB/s (40 Gbit/s signalling with 8b/10b encoding); here the achieved bandwidth is only about 3 GB/s. Note also that since these measurements were made the system has been regularly updated and the achievable bandwidth should now be slightly higher, although on the newly installed nodes the bandwidth is slightly lower owing to their internal hardware configuration.

[Benchmark plots: inter-node two-sided operations (OFA-IB-Nemesis); intra-node two-sided operations (KNEM)]
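For reference, a point-to-point measurement of this kind can be reproduced with a simple MPI ping-pong test. The sketch below is a minimal illustration written against the standard MPI C API and is not the benchmark code used to produce the plots; running the two ranks on different nodes exercises the InfiniBand path, while running them on the same node exercises the KNEM intra-node path.

```c
/* pingpong.c - minimal bandwidth sketch (assumption: any MPI-compliant
 * library, such as the installed MVAPICH2).
 * Build/run: mpicc -std=gnu99 -o pingpong pingpong.c && mpirun -np 2 ./pingpong
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define MSG_SIZE   (4 * 1024 * 1024)  /* 4 MB message */
#define ITERATIONS 100

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char *buf = malloc(MSG_SIZE);
    int peer = 1 - rank;

    /* warm-up exchange so that connection setup is not timed */
    if (rank == 0) {
        MPI_Send(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t_start = MPI_Wtime();
    for (int i = 0; i < ITERATIONS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t_start;

    if (rank == 0) {
        /* each iteration moves MSG_SIZE bytes in each direction */
        double gbytes_per_s = (double)MSG_SIZE * ITERATIONS * 2.0 / elapsed / 1e9;
        printf("message size %d bytes, uni-directional bandwidth %.2f GB/s\n",
               MSG_SIZE, gbytes_per_s);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```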

Remote access configuration

Remote access to the cluster is provided either via ssh through the main CSCS front-end machine ELA, or via remote desktop solutions such as TurboVNC or TigerVNC, which allow OpenGL applications (e.g. ParaView) to be used at a reliable frame rate. An example of the connection procedure using the TurboVNC software is available at the following address: http://user.cscs.ch/systems/dalco_sm_system_eiger/eiger_as_visualization_facility/remote_visualization_access_procedure/index.html. Below is a screenshot of what any NextMuSE partner should be able to obtain:

[Screenshot: TurboVNC remote desktop session on Eiger]

Launching parallel ParaView server jobs

Additional information on how to configure ParaView to launch reverse-connection jobs for HPC visualization is available via the pv-meshless wiki: https://hpcforge.org/plugins/mediawiki/wiki/pvmeshless/index.php/launching_paraview_on_hpc_machines. pv-meshless is a ParaView plugin developed by CSCS which forms the main host for the SPH analysis modules developed within the NextMuSE project.