Leopold-Franzens-Universität Innsbruck, Zentraler Informatikdienst (ZID)
LSF HPC :: getting the most out of your NUMA machine
Platform Computing conference, Michael Fink
who we are & what we do
university of innsbruck: founded 1669, state funded; 25000 students, 5000 employees, external partners (university spinoffs)
central compute services (ZID): complete IT infrastructure for research, teaching and administration; central servers, computer labs (teaching); city-wide network (3 campuses + scattered sites); applications (ISP for all university members, databases, HPC etc.); staff: 80
ZID HPC group: clusters and NUMA machines, mass storage; staff: 4
HPC user consortium: 15 member institutes; coordination, exchange of knowledge and methods (seminar)
our SGI altix ccnuma machine
SGI altix 350: 32 CPUs + 128 GB ccnuma memory; SLES 9, SGI propack 4; hierarchical cpusets
why? efficient shared memory (openmp, posix threads) + message passing (MPI); large memory jobs (esp. abaqus)
plan: strategic preference for "open source" software, use SUN grid engine
did not work out: grid engine not NUMA-aware
decision: stay with LSF (already used on origin 3800 and compute cluster)
motivation :: parallel job in distributed memory cluster
started by mpirun or batch system (LSF), which places threads on n nodes
processes stay within nodes: memory access strictly intranode
internode traffic (via the switch) limited to message passing
LSF aware of layout: physical node = LSF node
parallel job in SMP machine
OS assumes SMP paradigm: n CPUs, 1 shared memory, uniform access (same cost for accessing any part of memory)
arbitrary placement of processes, arbitrary migration of processes
LSF view: 1 LSF node, n CPUs
but: SMP does not scale beyond 8 CPUs, need NUMA
parallel job in NUMA machine
NUMA = non uniform memory access (virtual shared memory)
logical view: SMP; memory + I/O globally visible to all CPUs, single OS instance
physical view: interconnect topology; latency grows with no. of hops (~60 ns/hop); internode traffic = memory access + message passing
OS (+ LSF) behaves as in SMP: 1 LSF node, arbitrary placement + migration, no dynamic memory page migration
why is this bad?
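rough illustration of the hop penalty (using the ~60 ns/hop figure above): a memory reference that has to cross 3 router hops pays about 3 x 60 ns = 180 ns of extra latency on top of the local access time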
parallel job in NUMA machine :: job start
experiment: job with 4 threads, uses 4 CPUs and memory
what happens: OS arbitrarily assigns 4 CPUs
initially: internode traffic limited to message passing
but: non-optimal placement, more hops than necessary
parallel job in NUMA machine :: the problem
some time later: threads migrate to different CPUs, used memory stays put (first touch)
threads get separated from their memory, new memory allocated on new nodes
internode traffic = message passing + memory access
the same happens to other jobs: fragmentation, interconnect & I/O contention, poor performance/throughput
vanilla LSF is OS-instance granular and does not address this problem
solution :: SGI propack4 cpusets + LSF HPC
cpuset layout: boot (2 CPUs): OS, I/O; login (2 CPUs): interactive work; batch (28 CPUs): LSF
what are cpusets: tell the OS scheduler where to allocate CPUs and memory; hierarchical: nesting allowed; LSF HPC can create cpusets
implementation: activate boot cpuset, develop persistent cpusets, restrain interactive logins
platform support: "secret" LSF HPC option LSF_ROOT_SET
boot cpuset
goal: bind all OS + I/O processes to the boot cpuset
how: have the kernel start /sbin/bootcpuset instead of /etc/init
in /etc/elilo.conf add the line: append = "init=/sbin/bootcpuset"
create the file /etc/bootcpuset.conf
how it works: /sbin/bootcpuset reads the config file /etc/bootcpuset.conf, creates the boot cpuset, binds itself to the boot cpuset, then exec's /etc/init
/etc/bootcpuset.conf:
cpus 0-1
mems 0
see http://techpubs.sgi.com - linux resource administration guide
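for reference, a minimal sketch of the two files involved (contents taken from this slide; the /dev/cpuset/boot path used for the check is an assumption and may differ per installation):

  # /etc/elilo.conf (fragment): make the kernel start bootcpuset instead of init
  append = "init=/sbin/bootcpuset"

  # /etc/bootcpuset.conf: reserve CPUs 0-1 and memory node 0 for OS + I/O
  cpus 0-1
  mems 0

  # after the next reboot, check where the boot cpuset landed (path assumed)
  cat /dev/cpuset/boot/cpus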
persistent cpusets
fact: propack4 cpusets are dynamic, lost on reboot
goal: cpusets persistent across boots
how: startup script /var/local/adm/cpuset/init.d/cpuset reads cpuset descriptions from files in /var/local/adm/cpuset/defs; executed on system boot, creates all cpusets in defs
/.../defs/login:
cpus 2-3
mems 1
/.../defs/lsfroot:
cpus 4-31
mems 2-15
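a minimal sketch of what such a startup script could do, using the plain cpuset pseudo-filesystem mounted at /dev/cpuset (the mount point and the use of mkdir/echo instead of SGI's cpuset(1) utility are assumptions, not the actual ZID script):

  #!/bin/sh
  # create one cpuset per definition file found in the defs directory
  DEFS=/var/local/adm/cpuset/defs
  for f in "$DEFS"/*; do
      name=$(basename "$f")
      mkdir -p /dev/cpuset/"$name"
      # each defs file contains lines like "cpus 2-3" and "mems 1"
      awk '$1 == "cpus" {print $2}' "$f" > /dev/cpuset/"$name"/cpus
      awk '$1 == "mems" {print $2}' "$f" > /dev/cpuset/"$name"/mems
  done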
restrain interactive logins
goal: bind interactive logins (we allow only ssh) to the login cpuset
how: in /etc/init.d/sshd replace
  startproc -f -p $SSHD_PIDFILE \
    /usr/sbin/sshd $SSHD_OPTS -o "PidFile=$SSHD_PIDFILE"
by
  /usr/bin/cpuset -i /login -I startproc -- -f -p $SSHD_PIDFILE \
    /usr/sbin/sshd $SSHD_OPTS -o "PidFile=$SSHD_PIDFILE"
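a quick sanity check (assuming the standard /proc cpuset interface): run inside a fresh ssh session, it should print /login

  cat /proc/self/cpuset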
lsf root cpuset
fact: by default, LSF manages all CPUs
goal: restrict LSF to manage the batch cpuset
how: create persistent cpuset /lsfroot; add the line LSF_ROOT_SET=/lsfroot to lsf.conf
result: LSF creates sub-cpusets /dev/cpuset/lsfroot/hostname@jobid
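with the defs file from the previous slide this can be verified directly in the cpuset filesystem (paths as configured above):

  # CPUs handed over to LSF; should match the lsfroot definition (4-31)
  cat /dev/cpuset/lsfroot/cpus
  # per-job sub-cpusets created by LSF HPC show up underneath
  ls /dev/cpuset/lsfroot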
how to use
simple:
  bsub -n 4 mpirun -np 4 program arg...
  OMP_NUM_THREADS=4 bsub -n 4 program arg...
advanced: control allocation within the LSF-created cpuset
  bsub -n 4 dplace -s 1 -c 0-3 mpirun -np 4 program arg...
  OMP_NUM_THREADS=4 bsub -n 4 dplace -x 2 -c 0-3 program arg...
how it works: LSF knows about topology and running jobs, picks an optimal set of CPUs and creates a cpuset, places the job on that cpuset; cpu numbers (e.g. for dplace -c) always start at 0 within the job's cpuset
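a small usage illustration (job name, output file and program name are hypothetical; the dplace options are the ones from this slide): submit a 4-thread openmp run so that LSF HPC confines it to a 4-CPU cpuset and dplace pins the threads inside it

  # hypothetical 4-thread OpenMP job, stdout to out.<jobid>
  export OMP_NUM_THREADS=4
  bsub -n 4 -J omp4 -o out.%J dplace -x 2 -c 0-3 ./my_openmp_prog input.dat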
result :: LSF HPC manages batch load
benefits:
threads + memory stay together
internode traffic reduced to program semantics
minimal distance, minimal contention
it really works this way!
  /dev/cpuset/lsfroot # head */cpus
  ==> altix32@1225/cpus <==
  4-7,24-25
  ==> altix32@1250/cpus <==
  8-13
  ==> altix32@1256/cpus <==
  18-19,26-27
  ==> altix32@1257/cpus <==
  20-21,28-29
  /dev/cpuset/lsfroot # uptime
  5:25pm up 56 days 2:28, 8 users, load average: 19.72, 19.64, 19.64
parerga & paralipomena
setup is available at http://homepage.uibk.ac.at/~c102mf (follow link altix-cpusets)
acknowledgments:
platform computing: invitation to this conference
platform support: very fast and effective response
martin pöll: sysadmin, 3rd party software
questions?