Cerebro Quick Start Guide


Overview of the system

Cerebro consists of a total of 64 Ivy Bridge E5-4650 v2 processors with 10 cores each, 14 TB of memory and 24 TB of local disk. Table 1 shows the hardware details of Cerebro compared to the ThinKing cluster:

                                 ThinKing                      Cerebro
Total nodes                      112 / 32                      2
Processor type                   Ivy Bridge                    Ivy Bridge
Base clock speed                 2.8 GHz                       2.4 GHz
Cores per node                   20                            480 / 160
Total cores                      2,880                         640
Memory per node (GB)             112 x 64 / 32 x 128           1 x 12,000 / 1 x 2,000
Memory per core (GB)             3.2 / 6.4                     12.8 / 25.6
Peak performance (TF)            64.5                          12.3
Network                          InfiniBand QDR 2:1            NUMAlink 6
Cache (L1 KB / L2 KB / L3 MB)    10x(32i+32d) / 10x256 / 25    10x(32i+32d) / 10x256 / 25
Total L3 cache per core          2.5 MB                        2.5 MB

Table 1: Hardware overview

The system is split into two partitions: a large partition with 480 cores and 12 TB of RAM, and a smaller partition with 160 cores and 2 TB of RAM. Each partition has its own 10 TB of local scratch disk space. Each partition also has 10 cores dedicated to the operating system and 10 cores dedicated to interactive batch jobs, so the maximum number of cores available for batch jobs is 460 and 140, respectively. Figure 1 shows a scheme of the system configuration.

Connecting to Cerebro

Cerebro does not have a dedicated login node, so to work with the system users must connect to the same login node used for ThinKing. All users with an active VSC account on ThinKing can connect to the login node with their usual credentials using the command:

$ ssh vscxxxxx@login.hpc.kuleuven.be

Users without an active VSC account on ThinKing who want to test the machine can request an account and introductory credits through the webpage:

https://account.vscentrum.be/

Note that this site is only accessible from within the VSC universities' domains, so the page will not load from, e.g., home, unless you use a VPN or a proxy.
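If you connect regularly, an entry in your SSH client configuration can save some typing. A minimal sketch (the host alias is arbitrary; replace vscxxxxx with your own VSC account number):

$ cat ~/.ssh/config
Host cerebro
    HostName login.hpc.kuleuven.be
    User vscxxxxx

$ ssh cerebro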

Accessing your Data

All global storage areas available on ThinKing are also available on the shared memory machine, so no data migration is required. Table 2 summarizes the available storage areas and their characteristics:

Name                           Variable                          Type   Access   Backup   Default quota
/user/leuven/30x/vsc30xxx      $VSC_HOME                         NFS    Global   YES      3 GB
/data/leuven/30x/vsc30xxx      $VSC_DATA                         NFS    Global   YES      75 GB
/scratch/leuven/30x/vsc30xxx   $VSC_SCRATCH, $VSC_SCRATCH_SITE   GPFS   Global   NO       100 GB
/node_scratch                  $VSC_SCRATCH_NODE                 ext4   Local    NO       10 TB

Table 2: Storage areas overview

$VSC_HOME: a regular home directory which contains all files that a user might need to log on to the system, as well as small 'utility' scripts/programs/source code. The capacity that can be used is restricted by quota, and this directory should not be used for I/O-intensive programs. Regular backups are performed.

$VSC_DATA: a data directory which can be used to store programs and their results. Regular backups are performed. This area should not be used for I/O-intensive programs. There is a default quota of 75 GB, but it can be enlarged. You can find more information about the price and conditions here: https://icts.kuleuven.be/sc/hpc

$VSC_SCRATCH / $VSC_SCRATCH_SITE: on each cluster you have access to a scratch directory that is shared by all nodes of the cluster. This directory is also accessible from the login nodes, so it is accessible while your jobs run and after they finish. No backups are made of this area, and files can be removed automatically if they have not been accessed for 21 days.

$VSC_SCRATCH_NODE: a scratch space local to each compute node. On each node this directory points to a different physical location, and its contents are only accessible from that particular node, and only during the runtime of your job.
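As an illustration of how these areas fit together, the sketch below stages input data from $VSC_DATA to the node-local scratch space, runs there to benefit from fast local I/O, and copies the results back before the job (and with it the local scratch content) ends. The resource values and file names are placeholders.

#!/bin/bash -l
#PBS -l nodes=1:ppn=10
#PBS -l walltime=1:00:00
#PBS -A project-name
cd $VSC_SCRATCH_NODE
cp $VSC_DATA/myprogram $VSC_DATA/myinput.dat .   # stage in program and input
./myprogram myinput.dat > myoutput               # run with node-local I/O
cp myoutput $VSC_DATA/                           # stage out results before the job ends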

Software

The software stack installed for ThinKing is also available on Cerebro, and in most cases those binaries can be used on Cerebro directly. In addition, some applications have been recompiled specifically for Cerebro and placed in a separate directory that is only visible from Cerebro. If your application uses MPI libraries, we strongly recommend recompiling your code with the SGI MPI library (MPT), as it is tuned for the system.

The modules software manager tool is available on Cerebro, just as it is on ThinKing. By default, once connected to the login node, all software for ThinKing is available. In order to also list the software available specifically for Cerebro, users must load the module:

$ module load cerebro/2014a

The naming scheme for modules is as follows:

PackageName/version-ToolchainName-ToolchainVersion

where PackageName is the official name of the software, keeping upper and lower case letters. On ThinKing/Cerebro:

$ module av Python
-------------- /apps/leuven/thinking/2014a/modules/all ---------------
Python/2.7.6-foss-2014a    Python/2.7.6-intel-2014a    Python/3.2.5-foss-2014a

TIP: Revise your batch scripts and your .bashrc to ensure the appropriate software package name is used. Always use the complete name of the package (name and version) and do not rely on defaults.
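For example, the following sequence makes the Cerebro-specific modules visible and loads one of the fully versioned Python modules listed above (swap in the package and version your own work needs):

$ module load cerebro/2014a
$ module av Python
$ module load Python/2.7.6-intel-2014a
$ module list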

Currently, only a few software packages have been recompiled specifically for Cerebro, e.g., NWChem and Picard. However, as mentioned before, most of the ThinKing software can be used on Cerebro. Table 3 lists some of the software available on ThinKing, grouped by category:

Biosciences      BEAST, BEDTools, BWA, MrBayes, SAMTools, Bowtie/Bowtie2
Chemistry        GAMESS, Gromacs, NAMD, NWChem, TRIQS, Quantum ESPRESSO, Siesta
Mathematics      JAGS, Matlab, Octave, R, SAS
Visualization    Paraview, Molden
Others           Abaqus, Ansys, Comsol

Table 3: Available software

To obtain a complete list of the software available on both systems, execute:

$ module load cerebro/2014a
$ module av

If you need additional software installed, please send a request to: hpcinfo@kuleuven.be

Compiling and Running your Code

Several compilers and libraries are available on Cerebro, as well as two toolchain flavors: intel (based on Intel software components) and foss (based on free and open source software). On Cerebro, each toolchain flavor also has an extra version in which the MPI library (Intel MPI or OpenMPI) has been replaced by the SGI MPT library.

A toolchain is a collection of tools to build (HPC) software consistently. It consists of compilers for C/C++ and Fortran, a communications library (MPI) and mathematical libraries (linear algebra, FFT). Toolchains are versioned and refreshed twice a year; all software available on the cluster is rebuilt when a new toolchain version is defined, to ensure consistency. Version numbers consist of the year of their definition, followed by either a or b, e.g., 2014a. Note that the software components are not necessarily the most recent releases; rather, they are selected for stability and reliability. Table 4 summarizes the toolchains available on Cerebro and their components:

Name        Version   Compilers                   MPI library   Math libraries
intel       2014a     Intel (icc, icpc, ifort)    Intel MPI     Intel MKL
intel-mpt   2014a     Intel (icc, icpc, ifort)    SGI MPT       Intel MKL
foss        2014a     GNU (gcc, g++, gfortran)    OpenMPI       OpenBLAS, LAPACK, FFTW, ScaLAPACK
foss-mpt    2014a     GNU (gcc, g++, gfortran)    SGI MPT       OpenBLAS, LAPACK, FFTW, ScaLAPACK

Table 4: Toolchains on Cerebro

TIP: When recompiling your codes for use on Cerebro, check the results of the recompiled codes before starting production runs, and use the available toolchains for compiling whenever possible. If your application uses MPI libraries, use the SGI MPI library (MPT), as it is tuned for the system (intel-mpt and foss-mpt toolchains).

To compile programs specifically for Cerebro you will need to start an interactive job on the machine, as in the sketch below.
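A minimal sketch of that workflow, with illustrative resource values and file names (the exact compile and link commands depend on the toolchain you choose):

$ module load cerebro/2014a
$ qsub -I -l nodes=1:ppn=10,walltime=1:00:00 -A project-name
# ... once the interactive shell opens on Cerebro:
$ module load foss/2014a
$ mpicc -O2 -o myprogram myprogram.c    # OpenMPI compiler wrapper from the foss toolchain
# with the MPT-based toolchains (intel-mpt, foss-mpt) codes are typically built with the
# plain compiler and linked against SGI MPT instead, e.g. icc ... -lmpi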

Running Jobs

Torque/Moab is used for scheduling jobs on Cerebro, so the same commands and scripts used on ThinKing will work. Jobs can be submitted from the login node, but note that in order to submit jobs to Cerebro you must always first load the module:

$ module load cerebro/2014a

Without loading this module your jobs will be submitted to ThinKing and will never run on Cerebro.

A node in Cerebro

Due to the shared memory architecture of the system, the definition of a node in Cerebro is not a physical node as it is in classical thin-node clusters like ThinKing. A node unit in Cerebro is a physical CPU, which in this case contains 10 cores. It is therefore important, when submitting jobs to Cerebro, to keep in mind that the ppn option accepts a maximum value of 10. The scheduling policy is SINGLE_JOB, which means that only one job per CPU is allowed, even when the jobs are single-core jobs of the same user.

Queues

The available queues on Cerebro are: q1h, q24h, q72h and q7d.

However, we strongly recommend that instead of specifying queue names in your batch scripts you use the PBS -l option to define your needs. Some useful -l options for resource usage are:

-l walltime=4:30:00    (job will last 4 h 30 min)
-l nodes=2:ppn=10      (job needs 2 nodes and 10 cores per node)
-l mem=40gb            (job requests 40 GB of memory, summed over all processes)
-l pmem=4gb            (job requests 4 GB of memory per core)

Partitions

As explained before, Cerebro is split into two partitions with different numbers of cores and memory configurations. Table 5 summarizes the differences:

                        smp1               smp2
Max. available cores    460                140
Max. memory             11.77 TB           1.79 TB
Memory per CPU          256 GB             128 GB
Memory type             DDR3-1600 32 GB    DDR3-1600 16 GB

Table 5: Cerebro partitions

By default, jobs will be scheduled by the system in one of the two partitions according to the requested resources and their availability. However, it is also possible to manually select one partition and have full control over where the jobs are executed. To specify a partition, use the following PBS option:

-l partition=partition_name

where partition_name can be either smp1 or smp2. An example of a job submitted using resource requests could be:

$ qsub -l nodes=10:ppn=10 -l walltime=1:00:00 \
  -l pmem=4gb -l partition=smp2 -A project-name \
  myprogram.pbs

In this case the job requests 10 nodes with 10 cores each, a walltime of 1 hour and 4 GB of memory per core, and it is requested to start on the smp2 partition.

Note that since Cerebro nodes are equivalent to a physical processor (CPU), a judicious choice of X and Y in the option -l nodes=X:ppn=Y allows the user to efficiently submit jobs that require fewer cores per CPU than the total available number. Suppose the application should run on 64 cores and is started by the following line in the PBS script myprogram.pbs:

mpirun -np 64 myprogram

This MPI job can be submitted using the following command:

$ qsub -l nodes=8:ppn=8,walltime=1:00:00,pmem=4gb -A project-name \
  myprogram.pbs

This will allocate 8 nodes (8 physical CPUs) and 8 cores on each node. This distribution will be much more efficient than the one obtained with the command:

$ qsub -l nodes=7:ppn=10,walltime=1:00:00,pmem=4gb -A project-name \
  myprogram.pbs

where the first 6 nodes would each execute 10 processes while the last one would execute only 4.

TIP: Remember to load the cerebro/2014a module to submit jobs to Cerebro. Revise your batch scripts to remove any references to queue names and base them on resource requests instead. Also make sure the ppn option does not have a value greater than 10.

Cpusets

Cpusets are an operating system construct for restricting processes to certain cores and memory. In a shared memory system such as Cerebro, cpusets help guarantee that every job gets the requested resources, and prevent different jobs running simultaneously on the system from interfering with one another. In addition, when used together with process placement tools, they help achieve better data locality and hence better performance. For each job, the batch system will dynamically create a cpuset containing only the number of cpus (cores) and the memory requested for the job. All processes started inside the batch job script will be restricted to run inside the cpuset, and only the allocated cpus and memory will be used.
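If you want to check which cores your job actually received, the cpuset can be inspected from inside the job script using standard Linux interfaces; a minimal sketch (generic Linux commands, not a Cerebro-specific tool):

# logical CPUs this job is allowed to run on
grep Cpus_allowed_list /proc/self/status
# cpuset the job script is running in
cat /proc/self/cpuset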

Process placement

In order to work efficiently on Cerebro, it is important to ensure data locality as much as possible. There are several ways to achieve this, but we recommend using the SGI dplace tool to bind a related set of processes to specific cpus or nodes and prevent process migration. In most cases this will improve performance, since a higher percentage of memory accesses will go to the local node. As mentioned before, the batch system creates a cpuset for every job, and dplace can then be used to pin OpenMP, MPI and hybrid codes to the cpus assigned to that cpuset. The syntax of the command is:

$ dplace --help
Usage: dplace [-e] [-v 1|2] [-o <file>] [-c <cpu_list>] [-n <process_name>] [-r <opts>] [-s <skip_count>] [-x <no_place>] command [args...]
       dplace [-v 1|2] [-rn|-rs|-rt] [-o <file>] -p <placement_file> command [args...]
       dplace [-q|-qq|-qqq]

where:

-e  determines exact placement. As processes are created, they are bound to CPUs in the exact order specified in the CPU list. CPU numbers may appear multiple times in the list.

-c cpu_numbers  specifies a list of CPUs, optionally strided CPU ranges, or a striding pattern. For example:
    -c 1
    -c 12-8      (equivalent to -c 12,11,10,9,8)
    -c 1,4-8,3   (equivalent to -c 1,4,5,6,7,8,3)
    -c 2-8:3     (equivalent to -c 2,5,8)
    -c CS
    -c BT
Note that CPU numbers are not physical CPU numbers but logical CPU numbers within the current cpuset. CPU numbers start at 0, so if a job requests 20 cores the allowed cpu_numbers for the dplace command will be in the range 0-19. A CPU value of "x" (or "*") in the argument list of the -c option indicates that binding should not be done for that process; the value "x" should be used only if the -e option is also used.

For stride patterns, any subset of the characters (B)lade, (S)ocket, (C)ore and (T)hread may be used; their ordering specifies the nesting of the iteration. For example, SC means to iterate over all the cores in a socket before moving to the next socket, while CB means to pin to the first core of each blade, then the second core of every blade, and so on. For best results, use the -e option when using stride patterns. If the -c option is not specified, all CPUs of the current cpuset are available. The command itself (which is "executed" by dplace) is the first process to be placed according to the -c cpu_numbers option. Without the -e option, the order of the numbers given to -c is not important.

-x skip_mask  provides the ability to skip the placement of processes. The skip_mask argument is a bitmask: if bit N of skip_mask is set, the (N+1)-th process that is forked is not placed. For example, setting the mask to 6 prevents the second and third processes from being placed; the first process (the process named by the command) is assigned to the first CPU, the second and third processes are not placed, the fourth process is assigned to the second CPU, and so on. This option is useful for certain classes of threaded applications that spawn a few helper processes which typically do not use much CPU time.

-s skip_count  skips the first skip_count processes before starting to place processes onto CPUs. This option is useful if the first skip_count processes are "shepherd" processes used only for launching the application. If skip_count is not specified, a default value of 0 is used.

-o log_file  writes a trace file to log_file describing the placement actions made for each fork, exec, etc. Each line contains a time stamp, the process id:thread number, the CPU the task was executing on, the task name and the placement action. It is useful for checking that the intended placement scheme is being used.

Some examples of how to use dplace in different scenarios are shown below.

MPI using SGI MPT

When using SGI MPT, the best way to start MPI jobs is with the command mpiexec_mpt. If the program is started using all the cores allocated by the batch system, mpiexec_mpt will pin the processes to the cores sequentially, starting from core 0 of the cpuset. As an example, to start an MPI job that uses 40 cores the following batch script is needed:

#!/bin/bash -l
#PBS -l nodes=4:ppn=10,pmem=6gb
#PBS -l walltime=2:00:00
#PBS -A project-name
module load intel-mpt/2014a
cd $PBS_O_WORKDIR
mpiexec_mpt ./myprogram > myoutput

If for performance reasons it is necessary not to use all the cores per CPU, this must be controlled using the -l nodes=X:ppn=Y option. So, if a job must be started with only 1 process per CPU and a total of 8 processes, only the PBS requirements need to be modified:

#!/bin/bash -l
#PBS -l nodes=8:ppn=1,pmem=6gb
#PBS -l walltime=2:00:00
#PBS -A project-name
module load intel-mpt/2014a
cd $PBS_O_WORKDIR
mpiexec_mpt ./myprogram > myoutput

The batch system will allocate 8 nodes and create a cpuset containing 8 cores (one on each CPU) and the 8 memory boards. The mpiexec_mpt command will then pin the processes to the allocated CPUs.

MPI jobs using Intel MPI

A job that uses all allocated cores and requests 40 cores should be submitted as:

#!/bin/bash -l
#PBS -l nodes=4:ppn=10,pmem=6gb
#PBS -l walltime=2:00:00
#PBS -A project-name
module load intel/2014a
cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE | wc -l`
mpirun -np $NP dplace -c 0-$((NP-1)) ./myprogram > myoutput

Again, if for performance reasons it is necessary not to use all the cores per CPU, this must be controlled using the -l nodes=X:ppn=Y option.

MPI jobs using OpenMPI

A job that uses all allocated cores and requests 40 cores should be submitted as:

#!/bin/bash -l
#PBS -l nodes=4:ppn=10,pmem=6gb
#PBS -l walltime=2:00:00
#PBS -A project-name
module load foss/2014a
cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE | wc -l`
mpirun dplace -c 0-$((NP-1)) ./myprogram > myoutput

TIP: Use the SGI MPT MPI library on Cerebro whenever possible.

OpenMP program compiled using Intel compilers

To execute an OpenMP program compiled with the Intel compilers using 10 cores, use:

#!/bin/bash -l
#PBS -l nodes=1:ppn=10,pmem=6gb
#PBS -l walltime=2:00:00
#PBS -A project-name
module load intel/2014a
cd $PBS_O_WORKDIR
export KMP_AFFINITY=disabled
NP=`cat $PBS_NODEFILE | wc -l`
export OMP_NUM_THREADS=$NP
dplace -x 2 -c 0-$((NP-1)) ./myprogram > myoutput

With Intel compiler versions 10.1.015 and later it is necessary to set KMP_AFFINITY to disabled to avoid interference between dplace and Intel's thread affinity interface. The -x 2 option provides a skip map of 010 (binary 2) to specify that the second thread should not be bound: under newer kernels the master thread (first thread) forks off a monitor thread (second thread) which does not need to be pinned.

OpenMP program compiled using GNU compilers

To execute an OpenMP program compiled with the GNU compilers using 10 cores, use:

#!/bin/bash -l
#PBS -l nodes=1:ppn=10
#PBS -l walltime=2:00:00
#PBS -A project-name
module load foss/2014a
cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE | wc -l`
export OMP_NUM_THREADS=$NP
dplace -c 0-$((NP-1)) ./myprogram > myoutput
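TIP: If you want to verify where your processes actually ended up, the dplace -o option described above can record the placement actions. A minimal sketch (the log file name is illustrative):

dplace -o placement.log -c 0-$((NP-1)) ./myprogram > myoutput
# afterwards, inspect placement.log: each line records a process/thread and the CPU it was bound to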