LSF HPC :: getting the most out of your NUMA machine

Leopold-Franzens-Universität Innsbruck, Zentraler Informatikdienst (ZID)
LSF HPC :: getting the most out of your NUMA machine
Platform Computing conference, Michael Fink

who we are & what we do
University of Innsbruck: founded 1669, state funded; 25000 students, 5000 employees, external partners (university spin-off)
central compute services (ZID): complete IT infrastructure for research, teaching and administration; central servers, computer labs (teaching); city-wide network (3 campuses + scattered sites); applications (ISP for all university members, databases, HPC et al.); staff: 80
ZID HPC group: clusters and NUMA machines, mass storage; staff: 4
HPC user consortium: 15 member institutes; coordination, exchange of knowledge and methods (seminar)

our SGI Altix ccNUMA machine
SGI Altix 350. why?
plan: 32 CPUs + 128 GB ccNUMA memory; SLES 4, SGI ProPack 4; hierarchical cpusets; efficient shared memory (OpenMP, POSIX threads) + message passing (MPI); large-memory jobs (esp. Abaqus)
strategic preference: "open source" software, so use Sun Grid Engine. did not work out.
decision: Grid Engine is not NUMA-aware; stay with LSF (Origin 3800, compute cluster)

motivation :: parallel job in a distributed memory cluster
mpirun or the batch system (LSF) places threads on n nodes
processes stay within their nodes; memory access is strictly intranode
internode traffic (over the switch) is limited to message passing
LSF is aware of the layout: physical node = LSF node

parallel job in an SMP machine
OS assumes the SMP paradigm: n CPUs, 1 shared memory, shared I/O (disk, IP)
uniform access: same cost for accessing any part of memory
arbitrary placement of processes, arbitrary migration of processes
LSF view: 1 LSF node with n CPUs
but SMP does not scale beyond ~8 CPUs: need NUMA

parallel job in a NUMA machine
NUMA = non-uniform memory access (virtual shared memory)
logical view: SMP; memory + I/O (disk, IP) globally visible to all CPUs, single OS instance
physical view: interconnect topology; latency grows with the number of hops (~60 ns/hop); internode traffic = memory access + message passing
OS (+ LSF): behaves as in SMP (1 LSF node); arbitrary placement + migration of processes, but no dynamic memory page migration
why is this bad?
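
to see this topology on a live system, the NUMA nodes and their distances can be listed from the shell; a minimal check, assuming the numactl package is installed (not part of the setup described here):
numactl --hardware   # per-node CPUs, memory sizes, and the node distance matrix
numactl --show       # NUMA policy and allowed CPUs/nodes of the current shell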

parallel job in NUMA machine :: job start
experiment: what happens when a job with 4 threads starts? it uses 4 CPUs and memory
the OS arbitrarily assigns 4 CPUs
initially, internode traffic is limited to message passing
non-optimal placement: more hops than necessary

parallel job in NUMA machine :: the problem
some time later, threads migrate to different CPUs
already-used memory stays put (first touch); threads get separated from their memory; new memory is allocated on the new nodes
internode traffic = message passing + memory access
the same happens to other jobs: fragmentation, interconnect & I/O contention, poor performance/throughput
vanilla LSF is OS-instance granular: it does not address this problem

solution :: SGI ProPack 4 cpusets + LSF HPC
cpuset layout: boot (2 CPUs): OS, I/O (boot cpuset); login (2 CPUs): interactive work; batch (28 CPUs): LSF
what are cpusets? they tell the OS scheduler where to allocate CPUs and memory; hierarchical: nesting allowed; LSF HPC can create cpusets
implementation: activate the boot cpuset, develop persistent cpusets, restrain interactive logins
Platform support: "secret" LSF HPC option LSF_ROOT_SET
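
for illustration, a minimal sketch of creating a cpuset by hand through the cpuset pseudo-filesystem, assuming it is mounted at /dev/cpuset; ProPack and LSF HPC provide their own tools, and the name "demo" is just an example:
mkdir /dev/cpuset/demo              # create a child cpuset
echo 4-7 > /dev/cpuset/demo/cpus    # CPUs this cpuset may use
echo 2-3 > /dev/cpuset/demo/mems    # memory nodes this cpuset may allocate from
echo $$ > /dev/cpuset/demo/tasks    # move the current shell (and its children) into it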

boot cpuset
goal: bind all OS + I/O processes to the boot cpuset
how: have the kernel start /sbin/bootcpuset instead of /etc/init
in /etc/elilo.conf add the line
append = "init=/sbin/bootcpuset"
create the file /etc/bootcpuset.conf:
cpus 0-1
mems 0
how it works: /sbin/bootcpuset reads the config file /etc/bootcpuset.conf, creates the boot cpuset, binds itself to it, and exec's /etc/init
see http://techpubs.sgi.com - Linux Resource Administration Guide
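
a quick check after reboot that early userspace really ended up in the boot cpuset; this assumes the kernel exposes /proc/<pid>/cpuset and that bootcpuset names the set /boot (both may differ on a given installation):
cat /proc/1/cpuset          # cpuset that init is bound to, e.g. /boot
cat /dev/cpuset/boot/cpus   # should match the "cpus 0-1" line in /etc/bootcpuset.conf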

persistent cpusets
fact: ProPack 4 cpusets are dynamic and lost on reboot
goal: cpusets that persist across boots
how: a startup script /var/local/adm/cpuset/init.d/cpuset reads cpuset descriptions from files in /var/local/adm/cpuset/defs; executed on system boot, it creates all cpusets defined in defs
/.../defs/login:
cpus 2-3
mems 1
/.../defs/lsfroot:
cpus 4-31
mems 2-15
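
a minimal sketch of what such a startup script could look like, assuming each file under defs uses the cpus/mems format shown above and the cpuset filesystem is mounted at /dev/cpuset (the actual site script may differ):
#!/bin/sh
# recreate all cpusets described under .../defs after a reboot
DEFS=/var/local/adm/cpuset/defs
for f in $DEFS/*; do
    name=`basename $f`
    mkdir -p /dev/cpuset/$name
    awk '$1 == "cpus" {print $2}' $f > /dev/cpuset/$name/cpus
    awk '$1 == "mems" {print $2}' $f > /dev/cpuset/$name/mems
done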

restrain interactive logins
goal: bind interactive logins (we allow only ssh) to the login cpuset
how: in /etc/init.d/sshd replace
startproc -f -p $SSHD_PIDFILE \
  /usr/sbin/sshd $SSHD_OPTS -o "PidFile=$SSHD_PIDFILE"
by
/usr/bin/cpuset -i /login -I startproc -- -f -p $SSHD_PIDFILE \
  /usr/sbin/sshd $SSHD_OPTS -o "PidFile=$SSHD_PIDFILE"
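
after restarting sshd, a fresh login can be used to verify the binding; a small check, assuming /proc/<pid>/cpuset is available:
# from a newly opened ssh session:
cat /proc/self/cpuset   # should print /login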

lsf root cpuset
fact: by default, LSF manages all CPUs
goal: restrict LSF to the batch cpuset
how: create the persistent cpuset /lsfroot and add the line LSF_ROOT_SET=/lsfroot to lsf.conf
result: LSF creates sub-cpusets /dev/cpuset/lsfroot/hostname@jobid

how to use
simple:
bsub -n 4 mpirun -np 4 program arg...
OMP_NUM_THREADS=4 bsub -n 4 program arg...
advanced (control allocation within the LSF-created cpuset):
bsub -n 4 dplace -s 1 -c 0-3 mpirun -np 4 program arg...
OMP_NUM_THREADS=4 bsub -n 4 dplace -x 2 -c 0-3 program arg...
how it works: LSF knows about the topology and the running jobs; it picks an optimal set of CPUs, creates a cpuset, and places the job on it; within the cpuset, CPU numbering always starts at 0
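
from inside a batch job the LSF-created cpuset can be inspected as well; a small check, assuming /proc/<pid>/cpuset is available (the output file name is just an example):
bsub -n 4 -o placement.out cat /proc/self/cpuset
# placement.out then contains something like /lsfroot/hostname@jobid; the CPUs assigned to that
# cpuset can be read from the matching cpus file under /dev/cpuset, as in the result below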

result :: LSF HPC manages the batch load
benefits: threads + memory stay together; internode traffic is reduced to program semantics; minimal distance, minimal contention
it really works this way!
/dev/cpuset/lsfroot # head */cpus
==> altix32@1225/cpus <==
4-7,24-25
==> altix32@1250/cpus <==
8-13
==> altix32@1256/cpus <==
18-19,26-27
==> altix32@1257/cpus <==
20-21,28-29
/dev/cpuset/lsfroot # uptime
5:25pm up 56 days 2:28, 8 users, load average: 19.72, 19.64, 19.64

parerga & paralipomena
the setup is available at http://homepage.uibk.ac.at/~c102mf, follow the link altix-cpusets
acknowledgments:
Platform Computing: invitation to this conference
Platform support: very fast and effective response
Martin Pöll: sysadmin, 3rd-party software
questions?