Monitoring and Troubleshooting on BioHPC


1 Monitoring and Troubleshooting on BioHPC [web] portal.biohpc.swmed.edu 1 Updated for

2 Why Monitoring & Troubleshooting Monitor jobs (data and code) running on the cluster; understand how current HPC resources are used; optimize usage to make the most of available capacity. 2

3 Why Monitoring & Troubleshooting Try to understand whether the job is: CPU intensive, memory intensive, I/O intensive, or a combination of the above. Try to figure out: where the bottlenecks are, and how to boost computational efficiency - completing more tasks during the available time window, or running an analysis on a larger data set in the same amount of time. 3

4 What to Monitor Start by profiling the application on an interactive node. CPU usage: lscpu, pstree, top. Memory usage: free, vmstat. I/O usage: iostat. Network/bandwidth: ifstat. 4

5 CPU Usage How do we achieve speedup on HPC? Through increased clock frequencies and increased scalability (more cores and nodes). lscpu: display information about the CPU architecture. 5
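
For example, on an interactive node you might run (these are standard lscpu output fields; the values depend on the node type):

  lscpu | grep -E 'Model name|^CPU\(s\)|Thread|Socket'
  # CPU(s), Socket(s) and Thread(s) per core tell you how many cores a
  # multithreaded job can realistically use on that node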

6 CPU Usage: command line tools Example job running on a compute node: astrocyte_cli test <workflow>, which launches align-bowtie-se.sh and bowtie on the sample files. pstree: display a tree of processes. * You may also use the top and pstree commands to verify whether your job is running across multiple nodes. 6
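
A quick way to inspect your own processes on a compute node (standard pstree options; the process names will be whatever your job launched):

  pstree -ap $USER
  # -a shows command-line arguments, -p shows PIDs, so you can see how many
  # worker processes or threads your job actually spawned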

7 CPU Usage: command line tools top: displays Linux tasks and provides a dynamic, real-time view of the running system. 7
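
Typical interactive usage (standard top key bindings):

  top -u $USER
  # press 1 to show per-core utilization, P to sort by CPU, M to sort by memory;
  # a multithreaded process can show up to N x 100% CPU for N threads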

8 Memory Usage: The Memory Hierarchy 8

9 Memory Usage: command line tools free: displays the total amount of free and used physical and swap memory in the system, as well as the buffers used by the kernel. Mem (RAM): memory that can be used by currently running processes. Swap (virtual memory): used when physical memory (RAM) is full; constant swapping should be avoided. buffers: file system metadata. cached: pages holding the actual contents of files for faster future access, not currently used memory. 9
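
For example (the -h flag and the "available" column are standard in recent procps versions):

  free -h
  # "available" estimates how much memory new processes can use without
  # swapping; if swap usage keeps growing, the job does not fit in RAM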

10 Memory Usage: command line tools vmstat (Virtual Memory Statistics): outputs instantaneous reports about your system's processes, memory, paging, block I/O, interrupts and CPU activity. 10
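
For example, sampling every 5 seconds for one minute:

  vmstat 5 12
  # si/so > 0 on most samples means the job is swapping; a large "wa" value in
  # the cpu columns means processes are stuck waiting on I/O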

11 Disk Usage & I/O Parallel filesystems on BioHPC. Advantages: scalability, and the capability to distribute large files across multiple nodes. Issues: inadequate I/O capability can severely degrade overall cluster performance. 11

12 Disk Usage & I/O: command line tools iostat: generates reports that can be used to change the system configuration to better balance the input/output load between physical disks. %iowait is the percentage of time your processors spend waiting on the disk. 12
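
For example (iostat comes from the sysstat package; -x adds extended per-device statistics, -m reports MB/s):

  iostat -xm 5
  # sustained high %iowait together with high %util on a device suggests the
  # job is I/O bound rather than CPU bound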

13 Network/Bandwidth Usage Minimizing communication 13

14 Network/Bandwidth Usage: command line tools ifstat: reports the network bandwidth in a batch style mode 14
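
For example (the interface name ib0 is an assumption for an InfiniBand port; list your interfaces with "ip link"):

  ifstat -i ib0 5
  # prints KB/s in and out every 5 seconds, showing whether the job is
  # saturating the network link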

15 All-in-One Tools Too many tools? All-in-one tools: Dstat, Linux Collectl Profiler, HPCTools. 15

16 Dstat: versatile resource statistics tool. dstat: a versatile replacement for vmstat, iostat, netstat and ifstat. 16
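
For example, combining the separate tools into one report:

  dstat -cdnm 5
  # CPU, disk, network and memory on one line every 5 seconds; add
  # --output run.csv to also log the samples to a CSV file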

17 Linux Collectl Profiler Information from monitoring an application can help the user run it optimally. Collectl is a tool that monitors a broad set of subsystems of a server while a user application is running on it. It is helpful to know your application's usage of CPU, memory, disk, etc. to determine whether system resources are being stressed or over-utilized. Many subsystems can be monitored in summary or detail; those of initial interest to a user running an application are: CPU, memory, disk, Lustre, InfiniBand, NFS usage, TCP summary. 17

18 collectl --showsubsys Shows ALL subsystems for which data can be collected and plotted in summary plots:
b - buddy info (memory fragmentation)
c - cpu
d - disk
f - nfs
i - inodes
j - interrupts by CPU
m - memory
n - network
s - sockets
t - tcp
x - interconnect (currently supported: OFED/InfiniBand)
y - slabs
18

19 collectl --showsubsys Shows all subsystems for which collected data can be shown in detailed plots:
C - individual CPUs, including interrupts if -sj or -sJ
D - individual disks
E - environmental (fan, power, temp) [requires ipmitool]
F - nfs data
J - interrupts by CPU by interrupt number
M - memory numa/node
N - individual networks
T - tcp details (lots of data!)
X - interconnect ports/rails (InfiniBand/Quadrics)
Y - slabs/slubs
Z - processes
L - lustre
19

20 Why Monitoring - Linux Collectl Profiler: Getting LUSTRE metrics. In the script that you sbatch to run a job, launch collectl in the background:

#!/bin/bash
module add collectl/4.1.2
cd /project/biohpcadmin/s175049
mkdir test
collectl -sclmx -P -f /project/biohpcadmin/s175049/test &>/dev/null &
dd if=/dev/zero of=stripe4 bs=4M count=4096
kill %1

Data is collected for the subsystems listed in the -s option (here: cpu, lustre, memory, interconnect); the collectl data files are written to the user's test directory above. 20

21 Why Monitoring - Linux Colplot Visualizer. View the data with Gnuplot either while the job is running or after it has finished:

% colplot -dir /project/biohpcadmin/s175049/test -plot cpu,mem,inter,cltdet

% colplot -showplot shows ALL the different arguments to -plot so you can display the plots you want. You may need to refine the timeline by specifying a specific timeframe to view:

% colplot -dir /project/biohpcadmin/s175049/test -plot cpu,mem,inter,cltdet -time 08:20-08:30

21

22 Why Monitoring - Linux Collectl & Colplot Documentation with examples and tutorials: collectl.sourceforge.net/documentation.html colplot.sourceforge.net/documentation.html Collectl and colplot man pages: linux.die.net/man/1/collectl collectl-utils.sourceforge.net/coplot.html 22

23 What's next 23

24 Optimization: Use appropriate compiler options Intel Math Kernel Library (MKL): a library of optimized math routines for science, engineering and financial applications - Basic Linear Algebra Subroutines (BLAS), LAPACK, Fast Fourier Transform (FFT), Vector Math Library. Built-in OpenMP multithreading (set OMP_NUM_THREADS > 1). Modules with MKL on BioHPC: R (Intel and gcc+MKL builds such as R/3.3.2-gccmkl), julia/0.4.6, JAGS/4.2.0. Compile your own code against MKL using the -mkl compiler option (for detailed options refer to: 24
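
A minimal sketch of building against MKL with the classic Intel compiler and controlling its threading (the module name, source file and thread count are placeholders, not BioHPC specifics):

  module load intel          # assumed module name
  icc -O2 -mkl my_solver.c -o my_solver
  export OMP_NUM_THREADS=8   # let MKL/OpenMP use 8 threads
  ./my_solver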

25 Optimization: Load big data into memory to reduce I/O. (Figure: the same workload on an 8 GB RAM machine vs. a 256 GB RAM node - keeping the data in memory significantly reduces I/O.) 25

26 Optimization: Single-Instruction, Multiple-Data (SIMD) Vector Processing Unit. Scalar loop: for (i = 0; i < n; i++) A[i] = A[i] + B[i]; SIMD loop: for (i = 0; i < n; i += 8) A[i:(i+8)] = A[i:(i+8)] + B[i:(i+8)]; * Each SIMD addition acts on 8 numbers at a time. Intel AVX registers allow packing of up to 32 elements if bytes are used; the number of elements depends on the element type: 8 single-precision or 4 double-precision floating-point values. Another example of this model is the GPU. 26
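
In practice the SIMD loop usually comes from the compiler's auto-vectorizer rather than hand-written code; a sketch with GCC (the source file name is a placeholder):

  gcc -O3 -march=native -fopt-info-vec vec_add.c -o vec_add
  # -march=native targets the vector width (e.g. AVX) of the node you compile on;
  # -fopt-info-vec reports which loops were vectorized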

27 Optimization: GNU Parallel If all jobs are independent of each other... A shell tool for executing jobs in parallel using one or more computers. Make the best use of CPU resources with a balanced job load: predefine the job pool to match the total number of cores; a new process is spawned as soon as one finishes, keeping the CPUs active and thus saving time. module load parallel 27
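
A minimal sketch (the input files are placeholders; 48 matches the per-node core count mentioned later in these slides):

  module load parallel
  ls sample_*.fastq | parallel -j 48 "gzip {}"
  # keeps up to 48 independent gzip processes running; a new one starts as
  # soon as a core becomes free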

28 Optimization: Multithreading If communication between tasks is needed... Shared memory. Advantages: user-friendly programming; fast data sharing between tasks. Disadvantage: it is the programmer's responsibility to add the synchronization constructs that ensure correct access to shared memory. Libraries: pthread, OpenMP. Tools: phenix, bowtie2. 28
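
How a multithreaded tool is typically given its core count (OMP_NUM_THREADS applies to OpenMP codes; -p is bowtie2's thread option; the index and file names are placeholders):

  export OMP_NUM_THREADS=16
  bowtie2 -p 16 -x genome_index -U reads.fastq -S aligned.sam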

29 Optimization: Shared Memory Concurrent read: maybe; concurrent write: no. (Diagram omitted, modified from Figure 1 of an external reference; it illustrates the possible bottleneck.) 29

30 Optimization: Message Passing Interface If communication between tasks is needed... e.g. an MPI job across multiple nodes: a master node coordinating slave nodes 1, 2 and 3 (figure). 30

31 Optimization: Message Passing Interface Possible bottlenecks: communication cost; unbalanced load. What is the maximum speed-up you could achieve? Decompose the dataset in a smart way to: minimize the overlaps (proportional to the communication cost) and balance the data between nodes. Example: the METIS graph partitioning tool. 31

32 Optimization: Multithreading & Message Passing MPI + pthread. If you run a RELION job across 2 nodes on the 256GB partition, you have 48 * 2 = 96 cores to split between MPI processes and threads: No. of MPI processes x No. of threads per process = 96. Q: Which combination has the shortest computation time? 32
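
A hedged sketch of one such combination as a Slurm batch script - the module name and RELION arguments are illustrative assumptions, not a complete or recommended RELION command line:

  #!/bin/bash
  #SBATCH --partition=256GB
  #SBATCH --nodes=2
  #SBATCH --ntasks=12            # MPI processes
  #SBATCH --cpus-per-task=8      # threads per process: 12 * 8 = 96 cores
  module load relion             # assumed module name
  mpirun -np $SLURM_NTASKS relion_refine_mpi --j $SLURM_CPUS_PER_TASK \
      --i particles.star --o Refine3D/run1   # plus the usual refinement options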

33 Demo: Project Gutenberg big data reader Data: books; Size: 10 GB; Type: plain text. Count the number of occurrences of the words: dog, cat, boy, girl. Goal: complete as fast as possible by reducing bottlenecks and inefficiencies. 33

34 Demo: Project Gutenberg big data reader: Solution I (single processor, many files). Each file (file_00.txt, file_01.txt, file_02.txt, ...) is read in turn from LUSTRE into node RAM, and CPU_00 counts the keywords - one file at a time. 34

35 Demo: Project Gutenberg big data reader: Solution II (multi-processor, partitioned file set). The files (file_00.txt ... file_11.txt) are partitioned across four processors; each processor (CPU_00 ... CPU_03) reads its files line by line from LUSTRE into node RAM and counts the keywords in its own subset. 35

36 Demo: Project Gutenberg big data reader: Solution III (single processor, one large file, chunked). All text from all books is concatenated into one large file (large_txt.bin) on LUSTRE; the file is distributed to RAM and fed to CPU_00 in limited chunks (chunk_00, chunk_01, chunk_02, ...), which counts the keywords chunk by chunk.

37 Demo: Project Gutenberg big data reader: Solution IV (multiple processors, one large file, chunked). All text from all books is in one large file (large_txt.bin) on LUSTRE; all text is loaded into node memory, the memory is partitioned into chunks across all processors, and CPU_00 through CPU_03 each count the keywords in their own chunks.

38 Demo: Project Gutenberg big data reader: Results
time python inefficient_reader.py (Solution I): 7.2 min
time python multithreaded_inefficient_reader.py (Solution II): 2.0 min
time python efficient_reader.py (Solution III): 3.5 min
time python multithreaded_efficient_reader.py (Solution IV): 0.7 min
38
