Genius - introduction
HPC team ICTS, Leuven
5th June 2018

VSC HPC environment GENIUS

[Diagram: Tier-2 infrastructure at KU Leuven, connected to Belnet]
Interconnect labels in the diagram: IB QDR, IB FDR, IB EDR, Numalink6/IB FDR, Ethernet
ThinKing
  176+32 nodes, 4160 cores, 2x Intel Ivy Bridge 10 cores, 64 GB RAM / 128 GB RAM
  48+96 nodes, 3456 cores, 2x Intel Haswell 12 cores, 64 GB RAM / 128 GB RAM
Cerebro
  1 node, 480 cores, 48x Intel Ivy Bridge 10 cores, 12 TB RAM, 20 TB scratch
  1 node, 160 cores, 16x Intel Ivy Bridge 10 cores, 2 TB RAM
Accelerators
  8 nodes, 2x NVIDIA Tesla K20X, 2688 GPGPU cores, 6 GB RAM
  5 nodes, 2x NVIDIA Tesla K40, 2880 GPGPU cores, 12 GB RAM
  8 nodes, Intel Xeon Phi 5110P, 120 CoCPU cores, 8 GB RAM
Genius (2018)
  86+10 nodes, 3456 cores, 2x Intel Skylake 18 cores, 192 GB RAM / 768 GB RAM
  20 nodes, 720 cores, 2x Intel Skylake 18 cores, 4 x NVIDIA P100
Login nodes and visualisation
  8 nodes, 160 cores, 2x Intel Ivy Bridge 10 cores, 64 GB RAM
  2 nodes, Haswell, 2x NVIDIA Quadro K5200 (8 GB RAM), 64 GB RAM
  2 nodes, 72 cores, 2x Intel Skylake 18 cores, 384 GB RAM
  2 nodes, 72 cores, 2x Intel Skylake 18 cores, 384 GB RAM, 1 x NVIDIA P600
Storage
  NAS, 70 TB: HOME, DATA
  GPFS on DDN, 1.2 PB: Scratch
  GPFS on DDN, 600 TB: Archive

Genius overview
Racks: r22, r23, r24
GPU nodes, distributed over 3 racks: r22g35..41, r23g34..39, r24g35..41
Compute nodes and large memory nodes, 24 nodes per chassis/enclosure: r22i13n01..24, r22i27n01..24, r23i13n01..24, r23i27n01..24

Genius overview
Type of node | CPU type | Interconnect | # cores | Installed mem | Local discs | # nodes
skylake | Xeon 6140 | IB-EDR | 36 | 192 GB | 800 GB | 86
skylake large mem | Xeon 6140 | IB-EDR | 36 | 768 GB | 800 GB | 10
skylake GPU | Xeon 6140, 4x P100 SXM2 | IB-EDR | 36 | 192 GB | 800 GB | 20

System comparison
Tier 2 cluster | ThinKing | ThinKing | Genius (2018)
Total nodes | 176 / 32 | 48 / 96 | 86 / 10
Processor type | Ivy Bridge | Haswell | Skylake
Base clock speed | 2.8 GHz | 2.5 GHz | 2.3 GHz
Cores per node | 20 | 24 | 36
Total cores | 4,160 | 3,456 | 3,456
Memory per node (GB) | 64 / 128 | 64 / 128 | 192 / 768
Memory per core (GB) | 3.2 / 6.4 | 2.7 / 5.3 | 5.3 / 21.3
Peak performance (Flops/cycle) | 4 DP FLOPs/cycle: 4-wide AVX addition OR 4-wide AVX multiplication | 8 DP FLOPs/cycle: 4-wide FMA (fused multiply-add) instructions, AVX2 | 16 DP FLOPs/cycle: 8-wide FMA (fused multiply-add) instructions, AVX-512
Network | Infiniband QDR 2:1 | Infiniband FDR | Infiniband EDR
Cache (L1 KB / L2 KB / L3 MB) | 10x(32i+32d) / 10x256 / 25 MB | 12x(32i+32d) / 12x256 / 30 MB | 18x(32i+32d) / 18x1024 / 25 MB
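The "memory per core" row and a rough per-node peak can be recomputed from the other rows of the comparison. The sketch below is only an illustration: the dictionary layout and function names are made up for this example, and the peak figure simply multiplies cores, base clock and the FLOPs/cycle value quoted on the slide, so it ignores AVX clock throttling and any other hardware detail.

```python
# Figures copied from the system comparison slide; the structure and the
# names used here are illustrative assumptions, not part of any VSC tool.
SYSTEMS = {
    "ThinKing Ivy Bridge": {"cores": 20, "ghz": 2.8, "flops_per_cycle": 4, "mem_gb": (64, 128)},
    "ThinKing Haswell": {"cores": 24, "ghz": 2.5, "flops_per_cycle": 8, "mem_gb": (64, 128)},
    "Genius Skylake": {"cores": 36, "ghz": 2.3, "flops_per_cycle": 16, "mem_gb": (192, 768)},
}


def mem_per_core(mem_gb, cores):
    """Memory per core in GB, rounded to one decimal as on the slide."""
    return round(mem_gb / cores, 1)


def peak_gflops_per_node(cores, ghz, flops_per_cycle):
    """Naive double-precision peak: cores x base clock (GHz) x FLOPs/cycle."""
    return cores * ghz * flops_per_cycle


for name, spec in SYSTEMS.items():
    per_core = [mem_per_core(m, spec["cores"]) for m in spec["mem_gb"]]
    peak = peak_gflops_per_node(spec["cores"], spec["ghz"], spec["flops_per_cycle"])
    print(f"{name}: {per_core} GB/core, about {peak:.0f} GFLOP/s per node")
```

The per-core values reproduce the table row; the peak values are an upper bound under the stated assumptions, not a measured result.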

Skylake compute node
[Diagram: two-socket node]
Socket 0 (NUMA node 0): core0..core17, shared L3, DDR4 memory channels
Socket 1 (NUMA node 1): core18..core35, shared L3, DDR4 memory channels
The sockets are linked by QPI; IB and I/O are attached to the node.

GPU comparison
 | K20Xm | K40c | P100 (2018)
Total number of nodes | 8 | 5 | 20
GPUs per node | 2 | 2 | 4
Total CUDA cores | 2688 | 2880 | 3584
Memory | 6 GB | 12 GB | 16 GB
Base clock speed cores | 732 MHz | 745 MHz | 1328 MHz
Max clock speed cores | 784 MHz | 874 MHz | 1480 MHz
Memory bandwidth | 249.6 GB/s | 288 GB/s | 732 GB/s
Peak double precision floating point performance | 1.31 Tflops | 1.43 Tflops | 5.3 Tflops
Peak single precision floating point performance | 3.95 Tflops | 4.29 Tflops | 10.6 Tflops
Features | SMX, Dynamic Parallelism, Hyper-Q, GPUBoost | SMX, Dynamic Parallelism, Hyper-Q, GPUBoost | NVLink, GPUBoost

[Diagram: workload management in the pilot phase and in the production phase]
Components shown: Viewpoint, MOAB/Torque, new MOAB/Torque, MAM
ThinKing
  176+32 nodes, 4160 cores, 2x Intel Ivy Bridge 10 cores, 64 GB RAM / 128 GB RAM
  48+96 nodes, 3456 cores, 2x Intel Haswell 12 cores, 64 GB RAM / 128 GB RAM
Cerebro
  1 node, 480 cores, 48x Intel Ivy Bridge 10 cores, 12 TB RAM, 20 TB scratch
  1 node, 160 cores, 16x Intel Ivy Bridge 10 cores, 2 TB RAM
Genius
  86+10 nodes, 3,456 cores, 2x Intel Skylake 18 cores, 192 GB RAM / 768 GB RAM
  20 nodes, 720 cores, 2x Intel Skylake 18 cores, 4 x NVIDIA P100
Storage: GPFS on DDN 14K
Login nodes

Login nodes
$ ssh vsc3xxxx@login-node-name
Login nodes: different purpose, different limits.
2 login nodes (different from the ThinKing login nodes), basic command line login:
  login1-tier2.hpc.kuleuven.be
  login2-tier2.hpc.kuleuven.be
2 login nodes with visualization capabilities (NVIDIA Quadro P6000 GPU), command line login + GPU rendering:
  login3-tier2.hpc.kuleuven.be
  login4-tier2.hpc.kuleuven.be
2 NX nodes (nx1, nx2), access through server: GUI login to ThinKing, terminal to Genius. Use them to open Viewpoint.

Storage areas (same as on ThinKing)
Name | Variable | Type | Access | Backup | Quota
/user/leuven/30x/vsc30xxx | $VSC_HOME | NFS | Global | YES | 3 GB
/data/leuven/30x/vsc30xxx | $VSC_DATA | NFS | Global | YES | 75 GB
/scratch/leuven/30x/vsc30xxx | $VSC_SCRATCH, $VSC_SCRATCH_SITE | GPFS | Global | NO | 100 GB
/node_scratch (ThinKing) | $VSC_SCRATCH_NODE | ext4 | Local | NO | 100-250 GB
/node_scratch (Cerebro) | $VSC_SCRATCH_NODE | xfs | Local | NO | 10 TB
/node_scratch (Genius) | $VSC_SCRATCH_NODE | ext4 | Local | NO | 100 GB
/staging/leuven/stg_xxxxx | n/a | GPFS | Global | NO | Minimum 1 TB
/archive/leuven/arc_xxxxx | n/a | Object | Global | NO (Mirror) | Minimum 1 TB
/mnt/beeond/ (Genius) | $VSC_SCRATCH_JOB | BeeGFS | Nodes in the job | NO | 300 GB
To check available space:
$ quota -s    ($VSC_HOME and $VSC_DATA)
$ mmlsquota vol_ddn2:leuven_scratch --block-size auto    ($VSC_SCRATCH)
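Scripts usually refer to these areas through the environment variables rather than hard-coded paths. The sketch below picks a scratch location from the variables named on the slide; the preference order (job-level BeeOND scratch, then node-local scratch, then the shared scratch) and the function name are assumptions made for illustration, not a site recommendation.

```python
import os

# Variable names come from the storage table; the order is an assumption.
SCRATCH_VARIABLES = ("VSC_SCRATCH_JOB", "VSC_SCRATCH_NODE", "VSC_SCRATCH")


def pick_scratch(env=None):
    """Return (variable name, path) for the first scratch variable that is set.

    `env` defaults to the process environment; pass a dict to try it out
    away from the cluster. Returns None when no variable is set.
    """
    env = os.environ if env is None else env
    for name in SCRATCH_VARIABLES:
        path = env.get(name)
        if path:
            return name, path
    return None


# On a machine without the VSC variables this prints None.
print(pick_scratch())
```

Passing an explicit dictionary, for example `pick_scratch({"VSC_SCRATCH": "/scratch/leuven/30x/vsc30xxx"})`, makes the helper easy to check outside a job.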

Available GPUs at KU Leuven/UHasselt
 | Tesla K20 | Tesla K40 | Pascal P100
SP cores | 14x192=2688 | 15x192=2880 | 56x64=3584
DP cores | 14x64=896 | 15x64=960 | 56x32=1792
Clock freq. (MHz) | 732 | 745 | 1481
DRAM (GB) | 5.7 | 11.2 | 16
DRAM freq. (GHz) | 2.6 (384-bit) | 3.0 (384-bit) | 0.71 (4096-bit)
Compute capability | 3.5 | 3.5 | 6.0
Cache (MB) | 1.5 | 1.5 | 4.0
Constant mem. (KB) | 64 | 64 | 64
Shared mem. per block (KB) | 48 | 48 | 48
Registers per block (x1024) | 64 | 64 | 64

Peer-to-peer bandwidth
[Diagram: GPU 0, GPU 1, GPU 2 and GPU 3 with PCI-e links; legend: High / Med / Low]
Bi-directional P2P: PCIe
Dev. ID | 0 | 1 | 2 | 3
0 | 509 | 10 | 19 | 18
1 | 10 | 508 | 18 | 18
2 | 19 | 18 | 508 | 10
3 | 18 | 18 | 10 | 507
Bi-directional P2P: NVLink (P100@Leuven)
Dev. ID | 0 | 1 | 2 | 3
0 | 508 | 37 | 37 | 61
1 | 37 | 507 | 61 | 37
2 | 36 | 61 | 508 | 37
3 | 62 | 37 | 37 | 506

How to Start-to-GPU? Approach 1: Users
Does your software already use GPUs?
Check the Nvidia Application Catalog: https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/
Machine Learning: TensorFlow, Keras, PyTorch, CAFFE2, ...
Chemistry: Abinit, BigDFT, CP2K, Gaussian, Quantum ESPRESSO, BEAGLE-lib, VASP, ...
Phys. & Eng.: OpenFOAM, Fluent, COSMO, ...
Biophysics: NAMD, CHARMM, GROMACS, ...
Tools: Allinea Forge, CMake, MAGMA, ...

How to Start-to-GPU? Approach 2: Porting
Incrementally port your code to use GPUs!
Check the Nvidia libraries: https://developer.nvidia.com/gpu-accelerated-libraries
cuBLAS, cuFFT, cuSPARSE, cuRAND, Thrust
Replace function calls in your application with their counterparts from the CUDA libraries, e.g. SGEMM() -> cublasSgemm()
(Image taken from Nvidia CUDA 9.2 Libraries)

How to Start-to-GPU? Approach 3: Developer
Tailor your software development to the GPU hardware!
[Diagram: options ordered from high-level APIs to low-level APIs]
Language
  Python: Numba, Numbapro, pycuda, Quasar
  Matlab: overloaded functions and gpuArrays
  R: rcuda, rpud
Directives: OpenACC, CUDA CUF kernels
Programming model: CUDA (C/C++/Fortran), OpenCL

Torque/Moab
Jobs have to be submitted from the new (Genius) login nodes.
Some commands:
$ qsub : submit a job, returns a job ID
  $ qsub test.sh
  50001435.tier2-p-moab-2.icts.hpc.kuleuven.be
$ qdel <job-id> : delete a queued or running job
  $ qdel 50001435
$ qsub -A lpt2_pilot_2018 : credits during the pilot phase
Later: a project specified with -A (even default_project for introductory credits) will be required.
CPU nodes: SINGLE user policy (only 1 user per node). Single core jobs can end up on the same node, but are accounted on a job basis.
GPU nodes: SHARED user policy, multiple users per node are allowed.
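The job ID printed by qsub carries the server name after the first dot, while the qdel example uses only the numeric part. A minimal sketch of extracting that part is given below; the helper name is hypothetical, and it only handles plain job IDs of the form shown on the slide (not job arrays).

```python
def short_job_id(qsub_output):
    """Return the numeric part of a job ID such as
    '50001435.tier2-p-moab-2.icts.hpc.kuleuven.be'."""
    job_id = qsub_output.strip().split(".", 1)[0]
    if not job_id.isdigit():
        raise ValueError(f"unexpected job ID format: {qsub_output!r}")
    return job_id


# Example with the job ID shown on the slide.
print(short_job_id("50001435.tier2-p-moab-2.icts.hpc.kuleuven.be"))
```

The returned string can be passed on to qdel as in the example above.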

Moab Allocation Manager
#credits = 0.000278 × walltime × nodes × f_type
(0.000278 ≈ 1/3600, with the walltime in seconds)
Project credits are valid for all Tier-2 clusters (ThinKing, Cerebro, GPU, Genius) after the pilot phase.
f_type:
  ThinKing IvyBridge: 4.76
  ThinKing Haswell: 6.68
  ThinKing GPU: 2.86
  Cerebro: 3.45
  Genius CPU: 10
  Genius GPU (full node, 4x P100): 20
Example:
  -l nodes=1:ppn=1,walltime=1:00:00
  #credits = (0.000278 × 3600 × 1) × 10 ≈ 10
  -l nodes=1:ppn=36,walltime=1:00:00
  #credits = (0.000278 × 3600 × 1) × 10 ≈ 10
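The charging rule can be tried out with a few lines of Python. This is only a sketch of the formula as it appears on the slide: the pairing of each f_type value with a cluster name follows the order in which the values and names are listed and is therefore an assumption (only the Genius CPU value of 10 is confirmed by the worked example), and the function name is made up.

```python
CREDITS_PER_SECOND = 0.000278  # approximately 1/3600, as quoted on the slide

# Assumed pairing of f_type values and node types (listing order on the slide).
F_TYPE = {
    "ThinKing IvyBridge": 4.76,
    "ThinKing Haswell": 6.68,
    "ThinKing GPU": 2.86,
    "Cerebro": 3.45,
    "Genius CPU": 10,
    "Genius GPU": 20,
}


def credits(walltime_seconds, nodes, node_type):
    """#credits = 0.000278 x walltime (s) x nodes x f_type."""
    return CREDITS_PER_SECOND * walltime_seconds * nodes * F_TYPE[node_type]


# One Genius CPU node for one hour, as in the slide example.
print(round(credits(3600, 1, "Genius CPU"), 2))
```

Note that ppn does not appear in the formula, which is why the ppn=1 and ppn=36 examples on the slide cost the same.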

BeeOND
BeeOND ("BeeGFS On Demand") was developed to enable easy creation of one or multiple BeeGFS instances on the fly. BeeOND is typically used to aggregate the performance and capacity of internal SSDs or hard disks in compute nodes for the duration of a compute job. This provides additional performance and a very elegant way of burst buffering.
Temporary fast storage (during the job execution)
Dedicated to the user (not shared)
SSDs: fast for I/O operations
Schedule a job with a BeeOND file system:
$ qsub -l nodes=2:ppn=36:beeond

Single island
The compute nodes are bundled into several domains (islands). Within one island, the network topology is a 'fat tree' topology for highly efficient communication. The connection between the islands is much weaker.
You can request that a job runs within one island (maximum number of nodes = 24):
$ qsub -l nodes=24:ppn=36:singleisland

Queues
The currently available queues on Genius are: q1h, q24h, q72h and q7d. There will be no 21 day queue during the pilot phase.
As before, we strongly recommend that, instead of specifying queue names in the batch scripts, you use the PBS -l option to define your needs.
Some useful -l options for resource usage:
  -l walltime=4:30:00 (job will last 4 h 30 min)
  -l nodes=2:ppn=36 (job needs 2 nodes and 36 cores per node)
  -l pmem=5gb (job requests 5 GB of memory per core, which is the default for the thin nodes)
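To see why a walltime request is enough to route a job, the sketch below converts a walltime string to seconds and looks up the shortest queue that would fit. The queue limits are inferred purely from the queue names (1 hour, 24 hours, 72 hours, 7 days); that inference, the function names and the restriction to the [[HH:]MM:]SS form are assumptions, and the actual routing is done by the scheduler.

```python
# Limits in seconds, inferred from the queue names only (an assumption).
QUEUE_LIMITS = (
    ("q1h", 3600),
    ("q24h", 24 * 3600),
    ("q72h", 72 * 3600),
    ("q7d", 7 * 24 * 3600),
)


def walltime_seconds(walltime):
    """Convert 'SS', 'MM:SS' or 'HH:MM:SS' to seconds."""
    parts = [int(p) for p in walltime.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported walltime format: {walltime!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


def queue_for(walltime):
    """Name of the shortest queue whose assumed limit covers the walltime,
    or None if the request exceeds the longest queue."""
    requested = walltime_seconds(walltime)
    for name, limit in QUEUE_LIMITS:
        if requested <= limit:
            return name
    return None


print(walltime_seconds("4:30:00"), queue_for("4:30:00"))
```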

Extra submission options
GPUs:
$ qsub -l nodes=1:ppn=1:gpus=1 -l partition=gpu
$ qsub -l nodes=1:ppn=36:gpus=4 -l partition=gpu
Large memory nodes:
$ qsub -l partition=bigmem
Debugging nodes:
$ qsub -l qos=debugging -l partition=gpu
$ qsub -l nodes=1:ppn=36 -l walltime=30:00 \
    -l qos=debugging -l partition=gpu -A lpt2_pilot_2018 \
    myprogram.pbs
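When jobs are submitted from a script, the options above can be assembled programmatically. The following is a minimal sketch, not an official tool: the function name and its parameters are hypothetical, and it only uses qsub options that appear on these slides (-l nodes/ppn/gpus, walltime, qos, partition, node features such as beeond or singleisland, and -A).

```python
import shlex


def qsub_command(script, nodes=1, ppn=1, gpus=0, features=(),
                 walltime=None, qos=None, partition=None, account=None):
    """Build the qsub argument list for the options shown on the slides."""
    node_spec = f"nodes={nodes}:ppn={ppn}"
    if gpus:
        node_spec += f":gpus={gpus}"
    for feature in features:  # e.g. "beeond" or "singleisland"
        node_spec += f":{feature}"
    args = ["qsub", "-l", node_spec]
    if walltime:
        args += ["-l", f"walltime={walltime}"]
    if qos:
        args += ["-l", f"qos={qos}"]
    if partition:
        args += ["-l", f"partition={partition}"]
    if account:
        args += ["-A", account]
    args.append(script)
    return args


# Reproduces the debugging example from the slide as a single command line.
debug_job = qsub_command("myprogram.pbs", nodes=1, ppn=36, walltime="30:00",
                         qos="debugging", partition="gpu",
                         account="lpt2_pilot_2018")
print(shlex.join(debug_job))
```

The list form can be handed to `subprocess.run` on a login node; printing it first is a cheap way to check the request before spending credits.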

Credits (after pilot phase)
Credit card concept:
  Preauthorization: the balance is held as unavailable until the merchant clears the transaction.
  Balance held as unavailable: based on the requested resources (walltime, nodes).
  Actual charge: based on what was really used, i.e. the used walltime (you pay only for what you use, e.g. when a job crashes).
See the output file:
  Resource List: neednodes=2:ppn=6,nodes=2:ppn=6,pmem=1gb,walltime=01:00:00
  Resources Used: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:02
How to check available credits? (no module for accounting)
  $ mam-balance
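The difference between the held balance and the actual charge comes down to the requested versus the used walltime, both of which appear in the job output file. The sketch below parses the two example lines from the slide; the helper names are hypothetical and the parser assumes the exact "Label: key=value,key=value" layout shown there.

```python
def parse_resources(line):
    """Parse 'Resources Used: cput=00:00:00,mem=0kb,...' into a dict."""
    _, _, body = line.partition(":")
    resources = {}
    for item in body.strip().split(","):
        key, _, value = item.partition("=")
        resources[key] = value
    return resources


def hms_to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(p) for p in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


requested = parse_resources(
    "Resource List: neednodes=2:ppn=6,nodes=2:ppn=6,pmem=1gb,walltime=01:00:00")
used = parse_resources(
    "Resources Used: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:02")

held = hms_to_seconds(requested["walltime"])
charged = hms_to_seconds(used["walltime"])
print(f"held for {held} s of walltime, charged for {charged} s")
```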

Viewpoint portal
Ease-of-use job submission and management
Viewpoint is a rich, easy-to-use portal for end-users and administrators, designed to increase productivity through its visual web-based interface, powerful job management features, and other workload functions.
Speeds up the submission process and reduces errors by automating best practices.
Expands the HPC user base to include users without IT skills.
Helps administrators gain insight into workload and resource utilization for better management and troubleshooting.

Viewpoint portal
Who should use it?
Researchers who like a GUI or are not very familiar with the Linux command line
Researchers who work in the NX environment
Group administrators who can create templates/workflows for the whole group
Group members who share data
Researchers for whom a template exists (but defining your own templates is also possible)

Viewpoint portal
Interested? Contact us for the initial login procedure.
Setup: access from ThinKing NX (Firefox); later to be moved outside HPC.
http://tier2-p-viewpoint-1.icts.hpc.kuleuven.be:8081

Viewpoint portal
[Screenshots of the portal: Home, Workload, Templates (contact us for help), File Manager, Create Job, R with Worker, Free form]

Software
Operating system: CentOS 7.4.1708, 64 bit
Kernel: 3.10.0-693.17.1.el7.x86_64
Applications
For development:
  Compilers & basic libraries (toolchains)
  Libraries
  Tools: debuggers, profilers
Use modules. Different from the ThinKing modules!

Available toolchains
 | intel toolchain | foss toolchain
Name | intel | foss
Version | 2018a | 2018a
Compilers | Intel compilers (v 2018.1.163): icc, icpc, ifort | GNU compilers (v 6.4.0-2.28): gcc, g++, gfortran
MPI library | Intel MPI | OpenMPI
Math libraries | Intel MKL | OpenBLAS, LAPACK, FFTW, ScaLAPACK

Software
The most commonly used software is installed.
Your own builds need to be rebuilt for Genius.
If something is missing, please contact us!

Software
By default, the 2018a software is listed ($ module available).
The module software manager is now Lmod. Lmod is a Lua-based module system, but it is fully compatible with the TCL modulefiles we've used in the past. All the module commands that you are used to will work, but Lmod is somewhat faster and adds a few additional features on top of the old implementation.
To (re)compile, ask for an interactive job.
The default version of a module is the default at the time of loading and is subject to change.

Modules
$ module available or module av R
  Lists all installed software packages
$ module av |& grep -i python
  Shows only the modules that have the string 'python' in their name, regardless of case
$ module load foss
  Loads the foss toolchain and adds its commands to your PATH
$ module list
  Lists all loaded modules in the current session
$ module unload R/3.4.4-intel-2018a-X11-20180131
  Removes only the selected module; the modules loaded as its dependencies stay loaded
$ module purge
  Removes all loaded modules from your environment

Modules
$ module swap foss intel
  Equivalent to: module unload foss; module load intel
$ module try-load packagexyz
  Tries to load a module, with no error message if it does not exist
$ module keyword word1 word2 ...
  Keyword search tool: searches any help message or whatis description for the word(s) given on the command line
$ module help foss
  Prints the help message from the modulefile
$ module spider foss
  Describes the module

Modules
ml: a convenient tool
$ ml            = module list
$ ml foss       = module load foss
$ ml -foss      = module unload foss (not purge!)
$ ml show foss    Info about the module
It is possible to create user collections:
  module save <collection-name>
  module restore <collection-name>
  module describe <collection-name>
  module savelist
  module disable <collection-name>
More info: http://lmod.readthedocs.io/en/latest/010_user.html

Questions
Now
Helpdesk: hpcinfo@icts.kuleuven.be or https://admin.kuleuven.be/icts/hpcinfo_form/hpc-info-formulier
VSC web site: http://www.vscentrum.be/
  VSC documentation
  Genius Quick Start Guide: https://www.vscentrum.be/assets/1355
  Slides from the session are available on the session webpage
  VSC agenda: training sessions, events
Systems status page: http://status.kuleuven.be/hpc or https://www.vscentrum.be/en/user-portal/system-status