Introduction to High Performance Computing

1 Introduction to High Performance Computing By Pier-Luc St-Onge September 11, 2014

2 Objectives Familiarize new users with the concepts of High Performance Computing (HPC). Outline how to use the HPC infrastructure. Learn how to make use of the available support: the analysts are your best asset. Please contact them! Site names: {briaree, colosse, cottos, guillimin, hades, mammouth, psi} 2

3 Outline Distinction between HPC and desktop computing Understanding your applications Understanding the infrastructure Understanding the batch queue systems Registration, access to resources and usage policies Using the infrastructure Best practices A few getting-started exercises 3

4 Distinction between HPC and desktop computing 4

5 Definitions Building Blocks A compute cluster is composed of multiple servers, also known as compute nodes. [Diagram: compute nodes connected by a network and grouped into compute racks] 5

6 Definitions Building Blocks The login node permits users to interact with the cluster: they can compile, test, transfer files, etc. This node is used by multiple users at the same time and is a shared resource. [Diagram: a login node in front of several compute nodes] 6

7 Definitions Building Blocks A compute node is a server similar to an office computer. We shall see what sets them apart and how to choose between them. [Diagram: processors, memory, I/O controller, network and disk inside a compute node] 7

8 Definitions Building Blocks A processor is composed of multiple independent compute cores. It also contains a memory cache that is smaller but faster than the main system memory. [Diagram: a processor with several compute cores and a memory cache, connected to main memory] 8

9 Definitions Building Blocks Each core is composed of processing units and registers. Registers are small but very fast memory spaces. Their number and characteristics vary between systems. [Diagram: processing units and registers inside a core, with cache and main memory outside] 9

10 Definitions Units The base unit is the bit, noted «b». A bit has two possible values: 0 or 1. Computers never manipulate the value of a single bit. Here are several examples of commonly used units: Byte (octet): noted «B», composed of 8 bits. Character: generally composed of 1 byte, e.g. 01100001 in ASCII is «a». Integer: generally composed of 32 bits (4 bytes), e.g. 00000000 00000000 00000000 01001101 represents 77. 10

11 Definitions Units The binary base is a power of 2 (1+1), not of 10 (9+1). The units frequently used are: 8 b = 1 Byte (register unit); 1024 B = 1 kB (L1/L2 cache unit); 1024 kB = 1 MB (L3 cache unit); 1024 MB = 1 GB (memory unit); 1024 GB = 1 TB (hard drive unit); 1024 TB = 1 PB (cluster storage unit). Caution! According to the international standard, these binary units should rather be noted with an «i», e.g. kB becomes KiB. 11

12 Definitions Bandwidth Bandwidth is a measure of the quantity of information that can be transferred per unit of time. This measure is meaningful when the quantity of data being transferred is large. Example: node 1 sends 1 GB through the network to node 2 in 48 seconds: 1024 MB / 48 s ≈ 21.3 MB/s. 12

13 Definitions Latency Latency corresponds to the minimum communication time. It is measured as the time it takes to transfer a very small quantity of data. Example: node 1 sends 1 B through the network to node 2 and it arrives 7 seconds later: latency = 7 seconds. 13

14 Characteristics Networking The servers used for HPC are characterized by high-performance networks. Here are some examples of networks and their characteristics:
Type | Latency (µs) | Bandwidth (Gb/s)
Ethernet 100 Mb/s | … | 0.1
Ethernet 1 Gb/s | 30 | 1
Ethernet 10 Gb/s | … | 10
InfiniBand SDR | ~2 | 10
InfiniBand DDR | ~2 | 20
InfiniBand QDR | ~2 | 40
NUMAlink 4 | ~… | …

15 Characteristics Storage The storage and file systems differ greatly from one site to another. HPC centres use storage arrays with parallel file systems.
Type | Latency (ms), sync / async | Bandwidth (MB/s), 1 file, sync / async | Capacity (TB)
SATA - theoretical | 1 | ~120 | 6
SSD - theoretical | 0.1 | ~500 | 1
SATA (ext3) | 1 / … | … / … | …
Mp2 (Lustre) | 75 / … | … / … | …
Briaree (GPFS) | 15 / … | … / … | …
Colosse (Lustre) | 100 / … | … / … | …
Guillimin (GPFS) | 0.5 / … | … / … | …
These measurements were made on systems in production. The performance varies greatly as a function of time.

16 Characteristics Size Colosse (Univ. Laval) Guillimin (McGill/ÉTS) Briarée (Univ. de Montréal) Mammouth (Univ. de Sherbrooke) 16

17 Characteristics Shared Resources A queuing system permits the sharing of resources and the application of usage policies. We describe queuing systems in more detail in another section. Example: on the compute cluster, the job starts with a delay of 00:40 and terminates at 01:30; on an office computer it starts immediately but only terminates at 03:00. 17

18 Section Summary HPC resources give users access to high-performance, shared computers and storage for small and large scale scientific calculation needs. Users also benefit from highly qualified support. 18

19 Understanding your application 19

20 Performance Compute Intensive The performance of compute cores and of memory accesses is described in terms of cycles. For example, 3 GHz means the core performs 3 billion (3×10⁹) cycles per second. Processors work on a stream of instructions, and each instruction requires a different number of cycles (depending upon the processor). For example, for 32-bit reals on a Sandy Bridge processor: + takes 4 cycles and * takes 6 cycles, while /, sqrt() and sin() take considerably more.

21 Performance Compute Intensive Modern processors (cores) divide the work into steps, as on an assembly line. This feature, called a pipeline, increases the rate at which instructions complete. For example, to add a=1.0 and b=2.0, the processor can use the following steps: decode the instruction; obtain the registers of a and b; add a and b; place the result in a register. 21

22 Performance Compute Intensive Therefore, if we do c1 = a1+b1, c2 = a2+b2 and c3 = a3+b3, the pipeline functions as follows (DI i: decode instruction i; OR i: obtain registers; SR i: save result register): while instruction 1 is being added, the registers of instruction 2 are being fetched and instruction 3 is being decoded, so successive additions overlap in time instead of waiting for one another. The number of stages that can overlap is the pipeline depth. 22

23 Performance Compute Intensive Another important feature of modern processors is vectorization: it combines several data values and performs a single operation on all of them. Example: we want to add 1.0 to four values. Conventional: r5 = r1+1.0, r6 = r2+1.0, r7 = r3+1.0, r8 = r4+1.0 (4 instructions). Vectorized, with x1 = {1.0, 1.0, 1.0, 1.0}: x2 = x1 + {1.0, 1.0, 1.0, 1.0} (1 instruction). 23

24 Performance Memory Access The organization of memory accesses strongly affects application performance. Typical access times: registers, 3 cycles; cache, 15 cycles; RAM, 146 cycles. The closer the memory is to the processor, the smaller and faster it is. 24

25 Serial Computations A serial computation is a sequence of instructions executed one after the other. Example with arrays A, B and C: an initialisation loop over the index i (from 1 to 10) sets A_i = 1 and B_i = i; then a sum loop over the index j (from 1 to 10) computes C_j = A_j + B_j. In time, the iterations run one after another: i=1, i=2, ..., i=10, then j=1, j=2, ..., j=10. 25

26 Parallel Computations A parallel computation is a set of instructions executed at the same time. Example with arrays A, B and C: an initialisation loop over the index i (1 to 10) sets A_i = 1 and B_i = i; then the sum is split into two loops that run concurrently: loop j (1 to 5) computes C_j = A_j + B_j while loop k (6 to 10) computes C_k = A_k + B_k. In time: i=1, i=2, ..., i=10, then j=1..5 and k=6..10 in parallel. 26

27 Parallel Computations Why? Processor clock frequencies have not increased in the last 10 years! Therefore, if we want more compute power, it is necessary to parallelize. Also, the memory available on a single server can be insufficient, in which case it is necessary to use more compute nodes and distribute the data and work across them. 27

28 Parallel Computing - Implications Parallelizing an application is not easy; there are several possible difficulties: algorithms that perform well in serial computations are generally not the best ones in parallel; the organization of the data and work is not simple; the memory is not necessarily accessible from all the child processes; the network now affects the performance. 28

29 Parallelism and Memory When all processors have access to the same memory, the memory is said to be shared. Conversely, if each processor sees only a portion of the memory, the memory is said to be distributed. Nowadays, almost all systems have a shared-memory component. 29

30 Parallelism and Communications In a distributed-memory application, communications are needed to transfer data between the processing threads. The organization of these communications is important for performance. Here is an example, mail delivery: one service can bring 10 letters in 10 minutes (latency = 10 minutes, bandwidth = 0.02 letters/second); another can bring 1 million letters in 60 minutes (latency = 60 minutes, bandwidth = 300 letters/second). 30

31 Parallelism and Communications So if I have one letter to send: the first service takes 1 trip, therefore 10 minutes; the second takes 1 trip, therefore 60 minutes. And if I have 10 000 letters to send: the first service takes 1000 trips, therefore almost 7 days; the second takes 1 trip, therefore 60 minutes. 31

32 Difficulties of Parallelism Certain algorithms cannot be parallelized, or are not efficient in parallel. When that is the case, it is necessary to approach the problem with a different method. Example (dependencies): loop over i: a_i = a_(i-1) + a_(i-2). Example (too little work): loop over i from 1 to 10: a_i = a_i + …

33 Difficulties of Parallelism Two execution threads can access the same memory at almost the same time; in this case, one can have a race condition. There are methods to synchronize these accesses, but they degrade performance. Example: in a sequential section, a=12; then, in a parallel section, one thread executes «if a > 10 : a = 0» while another executes «if a > 10 : a = 1»; in the following sequential section, is a = 0 or 1? 33

34 Difficulties of Parallelism During the synchronization of accesses, it can happen that all the execution threads wait for an event that none of them can create. This problem is named deadlock. Example: in a sequential section, a=0 and b=0; then, in a parallel section, one thread runs an infinite loop «if a = 1: b = b+1» while another runs an infinite loop «if b = 1: a = a+1»; neither condition can ever become true. 34

35 Difficulties of Parallelism In parallel code, it is in general impossible to determine the order of execution of the instructions. Consequently, if one repeats the same calculation multiple times, one can find differences in the numerical errors: in single precision, for example, the result of a sum depends on the order in which the terms are added, so (a + b) + c can differ slightly from a + (b + c).

36 Difficulties in Parallelism The distribution of the work performed in parallel is important for performance and is sometimes difficult to optimise: with a perfect distribution all workers finish at the same time, while an out-of-balance distribution leaves some workers idle. 36

37 Performance Disk Access How a file is written is very important for software performance. HPC storage often performs better for large files. [Graphs: bandwidth (MB/s) as a function of file size (kB) for raid + gpfs and sata + ext3] 37

38 Performance Disk Access > Different sites organise their disks differently. > Guillimin example: there is a global GPFS filesystem and localscratch space on the compute nodes. > Three partitions in the global file system: /sb (home directories and old project spaces), /gs (new project spaces and user scratch spaces), /lb (special project spaces, large block allocations). > Localscratch on compute nodes: limited and shared, used by running jobs. 38

39 Disk on Calcul Québec sites Different sites have different disk usage policies; it is good to familiarize yourself with each. Ref: 39

40 Compute Resources Ref: 40

41 Understanding the queuing systems 41

42 Queuing Systems Why? To maximize the usage of available resources; to prevent jobs from interfering with each other; to moderate the usage of resources according to defined policies and allocations. 42

43 Queuing Systems - Parameters Job Submission System: User interface for job submission to the cluster. Ref: Each cluster possesses a group of queues with different properties (number of concurrent jobs, duration of jobs, maximum number of processors, etc.) 43

44 Queuing Systems - Priority The scheduler establishes the priority of jobs so that the target allocation of resources can be reached. Therefore, if the recent usage by a group is less than the target, then the job priority increases; otherwise, the priority decreases. Factors that determine the job priority : time waiting in the queue; recent group utilisation (including decay factors as a function of time); the resource allocation of the group. See the link below for more information on Moab/Torque 44

45 Queuing Systems - Parameters When submitting a job it is important to specify: the total memory and/or the memory per task/process, the number of cores, the duration, and the desired queue. Specifying an accurate duration permits short jobs to pass into «holes» in the schedule!!! Each cluster possesses a group of queues with different properties (number of concurrent jobs, duration of jobs, maximum number of processors, etc.). To learn the details, see the documentation on our web sites or contact our analysts. 45
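For example, these requirements can be passed directly on the qsub command line. This is only a sketch: the script name my_job.sh and the queue courte are placeholders, and the exact options accepted differ between clusters.
qsub -l walltime=01:00:00 -l nodes=1:ppn=4 -l mem=8gb -q courte my_job.sh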

46 Queuing Systems [Diagram: jobs being scheduled on the cluster's resources over time] 46

47 Registration, access to resources and policies 47

48 Registration with Compute Canada Compute Canada is the organization which federates the regional HPC consortia, of which Calcul Québec is one. The first step to use the resources of Calcul Québec is to register with Compute Canada. This step must first be completed by the professor who leads a group, and then by each sponsored member (students, postdocs, researchers, external collaborators). Each user must be registered in the database. 48

49 Registration with Calcul Québec Sites Currently, registration is done through the CCDB, with options for the Calcul Québec consortium. Sites in Calcul Québec are: U. Montréal: Briarée, Cottos, Psi; U. Laval: Colosse, Helios; U. McGill: Guillimin; U. Sherbrooke: MpII and MsII. More information at 49

50 Acceptable Use Policy By obtaining an account with Compute Canada, one agrees to abide by the following policies: 1. An account holder is responsible for all activity associated with their account. 2. An account holder must not share their account with others or try to access another user's account. Access credentials must be kept private and secure. 3. Compute Canada resources must only be used for the projects/programs for which they have been duly allocated. 4. Compute Canada resources must be used in an efficient and considerate fashion. 50

51 Acceptable Use Policy 5. Compute Canada resources must not be used for illegal purposes. 6. An account holder must respect the privacy of their own data, of other users' data and of the underlying systems' data. 7. An account holder must provide reporting information in a timely manner and cite Compute Canada in all publications that result from work undertaken with Compute Canada resources. 8. An account holder must observe the computing policies in effect at the relevant centre and at their home institution. 9. An account holder may lose access if any of these policies are transgressed. 51

52 Use of Resources 52

53 Obtaining SSH Available through Linux/Unix and Mac OS X terminals. Windows: Cygwin (for graphics, you will need an X-emulator or X-server such as Xming X Server), PuTTY (putty.exe), Tunnelier; see 53

54 Using SSH Connection in a terminal: ssh bob@guillimin.hpc.mcgill.ca (the login node name is obtained following the activation of your account, or via the support webpages). Transferring files: scp local_file bob@guillimin.hpc.mcgill.ca:destination, or sftp bob@guillimin.hpc.mcgill.ca. There are other methods of file transfer: bbcp, Globus Connect, ... Security: ssh keys (ssh-keygen) must be protected with a passphrase. Important: never use passphrase-less ssh keys! A configuration file (.ssh/config) can simplify connections. 54
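A minimal sketch of setting up an SSH key pair, reusing the example user bob from above (ssh-copy-id may not be available on every workstation; the public key can also be appended to ~/.ssh/authorized_keys on the login node by hand):
ssh-keygen -t rsa -b 4096                 # choose a non-empty passphrase when prompted
ssh-copy-id bob@guillimin.hpc.mcgill.ca   # install the public key on the login node
ssh bob@guillimin.hpc.mcgill.ca           # later logins use the key plus its passphrase
An entry in .ssh/config can shorten the connection command (the alias guillimin is arbitrary):
Host guillimin
    HostName guillimin.hpc.mcgill.ca
    User bob
With this entry, typing "ssh guillimin" is enough.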

55 Software Software used by more than one user is generally installed centrally on each system by the analysts. Versioning and dependencies are handled by a tool called module. A module contains information that permits the modification of a user's environment so as to use a given version of the software. List the modules currently loaded: module list. List the modules currently available: module av. Add (remove) a module from your environment: module add <module_name>, module rm <module_name>. 55
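A short example session (the module name gcc is a placeholder; run module av to see what is actually installed on your cluster):
module av            # list the available modules
module add gcc       # load a module (and its dependencies) into the environment
module list          # confirm which modules are loaded
module rm gcc        # unload it again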

56 Software > "module av" shows the installed modules. > User-specific software can be installed in the project space. > This is usually done by the users themselves, but you can ask the analysts for help. 56

57 Storage Utilization The environment variable $HOME refers to the default directory of each user; this is where an ssh connection lands. This directory is sometimes protected via regular back-ups. A directory named $SCRATCH is available for short-term storage: it is not backed up, has a large capacity and offers high performance. Project space is provided for longer-term storage, and certain projects have larger allocations of it. 57

58 Storage Utilization MpII, MsII, Guillimin, Cottos, Briarée and Psi: the variable $SCRATCH indicates the location of the scratch space; $HOME is backed up. Guillimin only: /gs/scratch/username is the scratch space of each user; /sb/project/rap_id or /gs/project/rap_id provide persistent space per group. Colosse: typing colosse-info in a terminal will tell the user their RAP_ID; /scratch/rap_id/ is the scratch space of the project; $HOME is not backed up. 58

59 Job Submission Set the options for the job. Write a script to run. Submit the script. In the next cycle of resource allocation, the scheduler determines the job priority. The jobs with the highest priority are executed first if the requested resources are available; otherwise the jobs remain queued, and their calculated priority increases with the time spent waiting. Job execution. Return of the standard output and standard error of the job. 59
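As a rough sketch of this cycle on a Torque/Moab cluster (job.sh is a placeholder script; output-file naming can differ between sites):
qsub job.sh       # submission prints a job ID, e.g. 123456
qstat -u $USER    # the job appears as queued (Q), then running (R)
# after completion, the standard output and error are returned as files in the
# submission directory, typically job.sh.o123456 and job.sh.e123456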

60 Job Submission - Briarée
Queue name | Maximum duration (h) | Constraints / notes
(all queues) | | 2520 cores per group
normale | … | … jobs max / user, 1416 cores max / user, 4 nodes max / job
courte | … | … jobs max / user, 4 nodes max / job
hp | … | … jobs max / user, 2052 cores max / user
hpcourte | … | … nodes max / job
longue | … | … cores max / user, 60 nodes available
test | 1 | 4 nodes available
60

61 Job Submission - Briarée
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=4
#PBS -l mem=14gb
cd $SCRATCH/my_directory
module load module_used
./execution
The node obtained is reserved for the user; the next job does the same if possible. One can add #PBS -q courte, but by default the submission system chooses the queue based upon what is requested.
qsub script : submit the script
qstat -u user_name : see the status of my jobs
61

62 Job Submission - Guillimin
Job type | Maximum duration (h) | Constraints / notes
Serial (less than 11 cores) | 720 | SW and SW2 nodes: 2:1 blocking network, 2.7 GB per core, ~350 nodes
Serial-short (less than 11 cores) | 36 | SW, SW2 and AW2 nodes: 2:1 blocking network, 2.7 GB per core, ~400 nodes
Parallel | 720 | Blocking (SW, SW2) and non-blocking networks; 24, 36, 48, 72 and 128 GB of memory per node, ~1200 nodes
debug | 2 | Maximum of 1560 cores per job (default)
There is a per-group maximum on the number of core*seconds of running jobs, which allows flexibility between many short-duration jobs and fewer longer-duration jobs.
62

63 Job Submission - Guillimin
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=4
#PBS -l pmem=5700m
#PBS -q sw
cd $SCRATCH/my_directory
module load module_needed
./execution
One can specify the queue to which the job is submitted based upon the memory and other requirements. The default queue is called 'metaq'.
qsub -q queue_name script : submit the job
qstat -u user_name : see the state of my jobs
showq -u user_name : see the state of my jobs (delayed)
checkjob -v jobid : see detailed job information
63

64 Job Submission - Colosse
Queue name | Maximum duration (h) | Constraints / notes
short | … | … cores maximum
med | … | … cores maximum
long | 168 |
test | ¼ | 16 cores maximum
64

65 Job Submission - Colosse
#!/bin/bash
#$ -l h_rt=7200
#$ -pe default 8
#$ -P abc
cd $SCRATCH/my_directory
module load module_needed
./execution
A job obtains a complete node, so the number of requested cores is a multiple of 8. One can add #$ -q short, but by default the submission system chooses the queue based upon the requested resources.
colosse-info : obtain your project code (abc)
qsub script : submit the script
qstat -u user_name : see the status of my jobs
65

66 Job Submission - MpII
Queue name | Maximum duration (h) | Constraints / notes
qwork | 120 |
qfbb | 120 | portion with non-blocking network
qfat… | … | … nodes available (48 cores per node)
qfat… | … | … nodes available (48 cores per node)
The size of the jobs that can be executed depends on the allocation and on the other tasks in the queue. Example: there are 2400 cores available and 3 jobs in the queue. Group 1 (allocation = 100) can use 1200 cores; group 2 (allocation = 50) can use 600 cores; group 3 (allocation = 50) can use 600 cores.
66

67 Job Submission - MpII
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1
#PBS -l mem=14gb
#PBS -q qwork@mp2
cd $SCRATCH/my_directory
module load module_needed
./execution
A job obtains a complete node; request whole nodes (here nodes=1) rather than individual cores.
qsub script : submit the job script
qstat -u user_name : see the status of my jobs
67

68 Best Practices 68

69 Grouping Tasks It is sometimes inefficient to launch many jobs one by one: for i from 1 to 100 : qsub -l nodes=1:ppn=1 -l walltime=... This approach is potentially inefficient and should be avoided: certain systems limit the number of jobs, and certain systems allocate whole nodes to jobs. Several executions can instead be grouped into a single job:
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1
#PBS -l mem=14gb
module load module_needed
cd $SCRATCH/my_directory1
./execution &
cd $SCRATCH/my_directory2
./execution &
cd $SCRATCH/my_directory3
./execution &
wait
69

70 Grouping Tasks Methods and tools are available to automatically adjust job parameters when running programs repeatedly, and to help optimize the submission of related jobs. On Guillimin, Colosse and Briarée: the submission or scheduler systems support job arrays, which simplify the submission of identical workloads that operate on different sets of parameters or data. On MpII: the grouping of jobs can be automated through the use of bqtools. 70
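A minimal job-array sketch for the Torque-based clusters (the directory layout and module name are placeholders; on Colosse's Grid Engine scheduler the equivalent is #$ -t 1-100 together with $SGE_TASK_ID):
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1
#PBS -t 1-100                             # 100 array copies, numbered 1 to 100
cd $SCRATCH/my_directory_${PBS_ARRAYID}   # each copy works on its own data set
module load module_needed
./execution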

71 Job Duration Estimate your job execution time: a job that requests less time waits less time in the queue! The shorter queues generally allow the user to run more jobs, and there is less risk of suffering the effects of a job failure. If you do not know how to estimate the duration, the analysts can help you. 71

72 Check Pointing Checkpointing saves the state of a running job to a file, so that the previous job state can later be recovered from that file. If possible, write your codes with checkpointing capabilities. Advantages: you are able to restart your jobs after a node crash, a sudden machine outage or a planned downtime. 72
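A sketch of a restartable submission script, assuming the application periodically writes checkpoint.dat and accepts a --restart option (both names are hypothetical; the mechanism depends entirely on your code):
#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=1
cd $SCRATCH/my_directory
module load module_needed
if [ -f checkpoint.dat ]; then
    ./execution --restart checkpoint.dat   # resume from the last saved state
else
    ./execution                            # first run: start from the beginning
fi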

73 Storage For handling large files, use the scratch space on the systems. It is generally preferable to use large block reads and writes. Sometimes it is useful to use the disks local to the compute nodes where the jobs are running: on Guillimin, $LSCRATCH is a temporary folder created for each job in the localscratch space of the reserved compute nodes. Contact the analysts to obtain advice! 73
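A sketch of staging files through $LSCRATCH on Guillimin (input.dat and output.dat are placeholder names; since $LSCRATCH is a temporary per-job folder, results must be copied back before the job ends):
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1
module load module_needed
cp $SCRATCH/my_directory/input.dat $LSCRATCH/    # stage input onto the node-local disk
cd $LSCRATCH
./execution
cp $LSCRATCH/output.dat $SCRATCH/my_directory/   # copy results back to the parallel file system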

74 Hands-on Exercises 74

75 Login to Guillimin Workstation login: Username: csuser<number> Password: <enter the given password> Guillimin login: ssh <class-username>@guillimin.hpc.mcgill.ca Password: <enter the given-class-acc-pass> 75

76 Explore the login node top : see the running processes. df : see the file systems. mkdir workdir : create a work directory. cd workdir ; ls -la. module av : see the available software. Never run your codes on the login nodes; submit your code to the scheduler using qsub. 76

77 Submitting a Job Copy two files, job_serial.sh and job_parallel.sh, from /sb/software/workshop/intro-hpc/ to <workdir>. Submit a serial job: qsub -q class job_serial.sh. Exercise the job-monitoring commands: qstat -u <user>, qstat -f <job_id> and qdel <job_id>; the corresponding Moab commands are showq -u <user>, checkjob <job_id> and canceljob <job_id>. Submit a parallel job: qsub -q class job_parallel.sh. Do checkjob <job-id> and see the output files. 77
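The copy step might look like this (assuming the workdir created in the previous exercise sits in your home directory):
cp /sb/software/workshop/intro-hpc/job_serial.sh /sb/software/workshop/intro-hpc/job_parallel.sh ~/workdir/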

78 Conclusion - How To Get Started > Know your PI (Principal Investigator / Supervisor). > Get an account through the CCDB. > Request an account with an HPC site of your choice. > You will be ready to log in once you have a confirmation. 78

79 Useful Documentation and Support Combined documentation for all Calcul Québec sites: Documentation specific to Guillimin can be seen here: 79

80 Useful Documentation and Support Support questions: General: Guillimin specific: Be sure to include: » Cluster name » User name » JobID » Exact error messages » Full path to submission scripts and output files 80
