Using SDSC Systems (part 2)


1 Using SDSC Systems (part 2)
Running vSMP jobs, Data Transfer, I/O
SDSC Summer Institute, August
Mahidhar Tatineni, San Diego Supercomputer Center

2 vSMP Runtime Guidelines: Overview
- Identify the type of job: serial (large memory), threaded (Pthreads, OpenMP), or MPI.
- The workshop directory has examples for the different scenarios. The hands-on section (today and during the ScaleMP session later) will walk through the different types.
- Use affinity in conjunction with the automatic process placement utility (numabind).
- Optimized MPI (MPICH2 tuned for vSMP) is available.

3 vSMP Guidelines for Threaded Codes

4 Compiling OpenMP Example
Change to the workshop directory:
    cd ~/SI12_basics/GORDON_PART2
Compile using the OpenMP flag:
    ifort -o hello_vsmp -openmp hello_vsmp.f90
Verify that the executable was created:
    ls -lt hello_vsmp
    -rwxr-xr-x 1 train61 gue May 9 10:31 hello_vsmp

5 Hello World on vSMP node (using OpenMP)
hello_vsmp.cmd:
    #!/bin/bash
    #PBS -q vsmp
    #PBS -N hello_vsmp
    #PBS -l nodes=1:ppn=16:vsmp
    #PBS -l walltime=0:10:00
    #PBS -o hello_vsmp.out
    #PBS -e hello_vsmp.err
    #PBS -V
    #PBS -M username@xyz123.edu
    #PBS -m abe
    #PBS -A gue998
    cd ~/SI12_basics/GORDON_PART2
    # ScaleMP preload library that throttles down unnecessary system calls.
    export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
    # PATH to numabind.
    export PATH="/opt/ScaleMP/numabind/bin:$PATH"
    # numabind supplies the starting CPU offset for the 8 threads.
    export KMP_AFFINITY=compact,verbose,0,`numabind --offset 8`
    export OMP_NUM_THREADS=8
    ./hello_vsmp
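The affinity setting in hello_vsmp.cmd depends on numabind being on the PATH. The following is a minimal sketch of a guard around that call. The helper name `kmp_affinity_value` and the fallback offset of 0 are assumptions for illustration and are not part of the workshop materials.

```shell
#!/bin/bash
# Hypothetical helper: build the KMP_AFFINITY value used in hello_vsmp.cmd.
# $1 = number of threads, passed to `numabind --offset` as on the slide.
kmp_affinity_value() {
    local nthreads="$1" offset
    if command -v numabind >/dev/null 2>&1; then
        offset="$(numabind --offset "$nthreads")"
    else
        # Assumption: fall back to CPU 0 when numabind is not available.
        offset=0
    fi
    printf 'compact,verbose,0,%s\n' "$offset"
}

export OMP_NUM_THREADS=8
export KMP_AFFINITY="$(kmp_affinity_value "$OMP_NUM_THREADS")"
echo "KMP_AFFINITY=$KMP_AFFINITY"
```

Keeping the thread count in one variable avoids the two values drifting apart when the job is resized.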

6 Hello World on vSMP node (using OpenMP)
Code written using OpenMP:
          PROGRAM OMPHELLO
          INTEGER TNUMBER
          INTEGER OMP_GET_THREAD_NUM
    !$OMP PARALLEL DEFAULT(PRIVATE)
          TNUMBER = OMP_GET_THREAD_NUM()
          PRINT *, 'HELLO FROM THREAD NUMBER = ', TNUMBER
    !$OMP END PARALLEL
          STOP
          END

7 vSMP OpenMP binding info (from hello_vsmp.err file)
    OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {504}
    OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {505}
    OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {506}
    OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {507}
    OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {508}
    OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {509}
    OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {511}
    OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {510}
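The binding messages can be checked mechanically instead of by eye. The sketch below extracts the thread-to-processor pairs from the verbose KMP_AFFINITY output and tests whether the processors are consecutive. The function names `binding_pairs` and `bindings_contiguous` are hypothetical, and the sample input is copied from the messages shown on this slide.

```shell
#!/bin/bash
# Print "thread proc" pairs, sorted by thread number, from the verbose
# KMP_AFFINITY messages in the file given as $1.
binding_pairs() {
    sed -n 's/.*Internal thread \([0-9][0-9]*\) bound to OS proc set {\([0-9][0-9]*\)}.*/\1 \2/p' "$1" | sort -n
}

# Succeed only if successive threads are bound to consecutive OS procs.
bindings_contiguous() {
    binding_pairs "$1" |
        awk 'NR > 1 && $2 != prev + 1 { bad = 1 } { prev = $2 } END { exit bad }'
}

# Sample input: the lines from hello_vsmp.err shown above.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {504}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {505}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {506}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {507}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {508}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {509}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {511}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {510}
EOF

binding_pairs "$sample"
bindings_contiguous "$sample" && echo "threads are bound to consecutive procs"
```

Sorting by thread number first matters because the runtime prints the messages in whatever order the threads report, as the swapped lines for threads 6 and 7 show.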

8 Hello World (OpenMP version) Output
    HELLO FROM THREAD NUMBER = 1
    HELLO FROM THREAD NUMBER = 6
    HELLO FROM THREAD NUMBER = 5
    HELLO FROM THREAD NUMBER = 4
    HELLO FROM THREAD NUMBER = 3
    HELLO FROM THREAD NUMBER = 2
    HELLO FROM THREAD NUMBER = 0
    HELLO FROM THREAD NUMBER = 7
    Nodes: gcn-3-11

9 vSMP Pthreads Example
    cd ~/SI12_basics/GORDON_PART2
    # PATH to numabind
    export PATH=/opt/ScaleMP/numabind/bin:$PATH
    # ScaleMP preload library that throttles down unnecessary system calls.
    export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
    # Specify sleep duration for each pthread. Default = 60 sec if not set.
    export SLEEP_TIME=30
    # 16 pthreads will be created.
    NP=16
    log=log-$NP-`date +%s`.txt
    ./ptest $NP >> $log 2>&1 &
    # Wait 15 seconds for all the threads to start.
    sleep 15
    echo "ptest threads affinity before numabind" >> $log 2>&1
    ps -eLo pid,lwp,time,ucmd,psr | grep ptest >> $log 2>&1
    # Start numabind with a config file that has a rule for pthread,
    # which places all threads on consecutive CPUs.
    numabind --config myconfig >> $log 2>&1
    echo "ptest threads affinity after numabind" >> $log 2>&1
    ps -eLo pid,lwp,time,ucmd,psr | grep ptest >> $log 2>&1
    sleep 300
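One way to compare the "before" and "after" listings is to count how many distinct CPUs the ptest threads occupy. With a placement rule that puts all threads on consecutive CPUs, one would expect the count after numabind to equal the number of threads. The helper name `distinct_cpus` is hypothetical, and the sketch assumes the thread-listing form `ps -eLo pid,lwp,time,ucmd,psr`, in which the CPU (psr) is the last column.

```shell
#!/bin/bash
# Hypothetical helper: count the distinct CPUs (last column, psr) in
# `ps -eLo pid,lwp,time,ucmd,psr` lines read from standard input.
distinct_cpus() {
    awk '{ seen[$NF] = 1 } END { n = 0; for (k in seen) n++; print n }'
}

# Usage on the node while ptest is running:
#   ps -eLo pid,lwp,time,ucmd,psr | grep ptest | distinct_cpus
```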

10 Data Transfer (scp, globus-url-copy)
- scp is fine for simple transfers of small files (<1 GB). Example:
    $ scp w.txt train40@gordon.sdsc.edu:/home/train40/
    w.txt 100% 15KB 14.6KB/s 00:00
- globus-url-copy is for large-scale data transfers between XD resources (and local machines with a Globus client).
  - Uses your XSEDE-wide username and password.
  - Retrieves your certificate proxies from the central server.
  - Gives the highest performance between XSEDE sites; uses striping across multiple servers and multiple threads on each server.
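The size guideline above can be written as a small decision helper. This is a sketch only: the function names are hypothetical, and taking 1 GB as 10^9 bytes is an assumption, since the slide does not say which unit is meant.

```shell
#!/bin/bash
# Threshold from the slide: scp for files under 1 GB.
# Assumption: 1 GB is taken as 10^9 bytes.
SCP_LIMIT_BYTES=1000000000

# Hypothetical helper: print the suggested tool for a size given in bytes.
transfer_tool_for_size() {
    if [ "$1" -lt "$SCP_LIMIT_BYTES" ]; then
        echo scp
    else
        echo globus-url-copy
    fi
}

# Hypothetical helper: the same decision for an existing file.
transfer_tool_for_file() {
    transfer_tool_for_size "$(wc -c < "$1" | tr -d ' ')"
}

transfer_tool_for_size 15000        # a 15 KB file, as in the scp example
transfer_tool_for_size 5000000000   # a 5 GB file
```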

11 Data Transfer: globus-url-copy
Step 1: Retrieve certificate proxies:
    $ module load globus
    $ myproxy-logon -l xsedeusername
    Enter MyProxy pass phrase:
    A credential has been received for user xsedeusername in /tmp/x509up_u
Step 2: Initiate globus-url-copy:
    $ globus-url-copy -vb -stripe -tcp-bs 16m -p 4 gsiftp://gridftp.ranger.tacc.teragrid.org:2811///scratch/00342/username/test.tar gsiftp://trestles-dm2.sdsc.xsede.org:2811///oasis/scratch/username/temp_project/test-gordon.tar
    Source: gsiftp://gridftp.ranger.tacc.teragrid.org:2811///scratch/00342/username/
    Dest: gsiftp://trestles-dm2.sdsc.xsede.org:2811///oasis/scratch/username/temp_project/
    test.tar -> test-gordon.tar

12 Data Transfer: Globus Online
- Works from Windows/Linux/Mac via the Globus Online website.
- Gordon and Trestles endpoints already exist. Authentication can be done using your XSEDE-wide username and password.
- The Globus Connect application (available for Windows/Linux/Mac) can turn your laptop/desktop into an endpoint.

13 Data Transfer: Globus Online
Step 1: Create a Globus Online account.

14 Data Transfer: Globus Online
Step 2: Set up the local machine as an endpoint using Globus Connect.

15 Data Transfer: Globus Online

16 Data Transfer: Globus Online
Step 3: Pick endpoints and initiate transfers.

17 Data Transfer: Globus Online

18 Gordon: Filesystems
- Lustre filesystems: good for scalable large-block I/O.
  - Accessible from both native and vSMP nodes.
  - /oasis/scratch/gordon: 1.6 PB, peak measured performance ~50 GB/s on reads and writes.
  - /oasis/projects: ~400 TB.
- SSD filesystems:
  - /scratch local to each native compute node: 300 GB each.
  - /scratch on vSMP node: 4.8 TB of SSD-based filesystem.
- NFS filesystems (/home).

19 Gordon Network Architecture
[Figure: Gordon network architecture. Labeled components: XSEDE and R&E networks; SDSC network; data movers (4x); management nodes (2x); management and public edge and core Ethernet; login nodes (4x, round-robin login); NFS servers (2x, mirrored NFS); redundant front end; 1,024 compute nodes; 64 I/O nodes; Data Oasis Lustre PFS (4 PB). Labeled links: dual-rail QDR InfiniBand (40 Gb/s) 3D torus (rail 1 and rail 2); dual 10GbE storage; GbE management; GbE public.]


20 Gordon 3D Torus Interconnect Fabric: 4x4x4 3D Torus Topology
- 4x4x4 mesh; the ends are folded on all three dimensions to form a 3D torus.
- Dual-rail network for increased bandwidth and redundancy.
- Single connection to each network.
[Figure: 16 compute nodes (CN) and 2 I/O nodes (IO) attached to two 36-port fabric switches; each switch is labeled with 18 x 4X IB network connections and 48GB/sec.]


21 Data Oasis Heterogeneous Architecture: Lustre-based Parallel File System
[Figure: Data Oasis architecture. Labeled components:
- 3 distinct network architectures: TRESTLES IB cluster, GORDON IB cluster, TRITON Myrinet cluster.
- Mellanox 5020 bridge (12 GB/s); 64 Lustre LNET routers (100 GB/s); Myrinet 10G switch (25 GB/s).
- Two Arista 7508 10G switches: redundant switches for reliability and performance.
- MDS: metadata servers.
- 64 OSS (Object Storage Servers), each labeled 72TB: provide 100 GB/s performance and >4 PB raw capacity.
- JBODs (Just a Bunch Of Disks), each labeled 90TB: provide capacity scale-out to an additional 5.8 PB.]

22 Data Oasis from Gordon: It's the Routers!
- Gordon has 64 I/O nodes, which host the flash and also serve as routers for the Lustre filesystems.
- Lustre clients are configured to use the local I/O node if available. This maximizes the overall write performance on the system.
- Reads round-robin over the available routers.
- Workshop examples illustrate the locality of the write operations.

23 Lustre Examples
Two example scripts in the ~/SI12_basics/GORDON_PART2 directory:
- IOR_lustre_0_hops.cmd: runs jobs with all nodes on one switch.
- IOR_lustre_4_hops.cmd: runs jobs with nodes up to 4 hops away.
Example output:
- ior_maxhops0.out: all nodes were on the same switch and hence used only *one* router. Max Write MB/s.
- ior_maxhops4.out: the nodes ended up on two switches, and hence two routers were in play during the write. Max Write MB/s.

24 Data Oasis Performance

25 Model A: One SSD per Compute Node (only 4 of 16 compute nodes shown)
- One 300 GB flash drive exported to each compute node appears as a local file system.
- The Lustre parallel file system is mounted identically on all nodes.
- Use cases: applications that need local, temporary scratch (Gaussian, Abaqus, Hadoop).
- Logical view: the file system appears as /scratch/$USER/$PBS_JOBID.

26 Using SSD Scratch (Native Nodes)
    #!/bin/bash
    #PBS -q normal
    #PBS -N ior_native
    #PBS -l nodes=1:ppn=16:native
    #PBS -l walltime=00:25:00
    #PBS -o ior_scratch_native.out
    #PBS -e ior_scratch_native.err
    #PBS -V
    #PBS -M username@xyz123.edu
    #PBS -m abe
    #PBS -A gue998

    # Run from the node-local SSD scratch directory.
    cd /scratch/$USER/$PBS_JOBID

    mpirun_rsh -hostfile $PBS_NODEFILE -np 4 ~/SI12_basics/GORDON_PART2/IOR-gordon -i 1 -F -b 16g -t 1m -v -v > IOR_native_scratch.log

    # Copy the log back before the job ends.
    cp /scratch/$USER/$PBS_JOBID/IOR_native_scratch.log ~/SI12_basics/GORDON_PART2/
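Because the copy-back is the last line of the script, it is skipped if an earlier command aborts the job script. A possible variation, sketched here under stated assumptions, registers the copy with `trap ... EXIT` so it runs whenever the script exits. The helper name `stage_out` is hypothetical, and the demo uses temporary directories in place of the real scratch and workshop paths.

```shell
#!/bin/bash
# Hypothetical helper: copy one result file from node-local scratch to a
# permanent directory; returns non-zero if the copy fails.
stage_out() {
    local src="$1" dest="$2"
    mkdir -p "$dest" && cp -p "$src" "$dest"/
}

# In a job script (variables set by PBS on the compute node) one might write:
#   cd "/scratch/$USER/$PBS_JOBID"
#   trap 'stage_out IOR_native_scratch.log "$HOME/SI12_basics/GORDON_PART2"' EXIT

# Stand-alone demo with temporary directories standing in for the real paths.
scratch_demo="$(mktemp -d)"   # stands in for /scratch/$USER/$PBS_JOBID
results_demo="$(mktemp -d)"   # stands in for ~/SI12_basics/GORDON_PART2
echo "placeholder log" > "$scratch_demo/IOR_native_scratch.log"
stage_out "$scratch_demo/IOR_native_scratch.log" "$results_demo"
ls "$results_demo"
```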

27 Using SSD Scratch (Native Nodes)
Snapshot on the node during the run:
    $ pwd
    /scratch/mahidhar/72251.gordon-fe2.local
    $ ls -lt
    total
    -rw-r--r-- 1 mahidhar hpss May 15 23:48 testfile
    -rw-r--r-- 1 mahidhar hpss May 15 23:48 testfile
    -rw-r--r-- 1 mahidhar hpss May 15 23:48 testfile
    -rw-r--r-- 1 mahidhar hpss May 15 23:48 testfile
    -rw-r--r-- 1 mahidhar hpss 1101 May 15 23:48 IOR_native_scratch.log
Performance from a single node (in the log file copied back):
    Max Write: MiB/sec ( MB/sec)
    Max Read: MiB/sec ( MB/sec)

28 IOPS: SSD vs Lustre
- FIO benchmark used to measure random I/O performance.
- Sample scripts:
  - scratch_native_fio.cmd (uses SSDs)
  - lustre_native_fio.cmd. Note: we will not run this today! It will overload the metadata server if there are too many simultaneous jobs with lots of random I/O requests. Output from a test run is in ior_lustre_native_fio.out to illustrate the low IOPS.
- Sample performance numbers:
  - SSD: random write iops=4782, random read iops=13738
  - Lustre: random write iops=671, random read iops=101
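The gap is easier to appreciate as a ratio. A short sketch using the sample numbers above (the helper name `iops_ratio` is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: ratio of two IOPS figures, to one decimal place.
iops_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "random write: SSD/Lustre = $(iops_ratio 4782 671)x"    # about 7x
echo "random read:  SSD/Lustre = $(iops_ratio 13738 101)x"   # about 136x
```

On these sample figures the SSD advantage is far larger for random reads than for random writes.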

29 Which I/O system is right for my application?
Flash-based I/O nodes
- Performance: SSDs support low-latency I/O, high IOPS, and high bandwidth. One SSD can deliver 37K IOPS. Flash resources are dedicated to the user, and performance is largely independent of what other users are doing on the system.
- Infrastructure: SSDs are deployed in I/O nodes using iSER, an RDMA protocol that is accessed over the InfiniBand network.
- Persistence: Data is generally removed at the end of a run so the resource can be made available to the next job.
- Capacity: Up to 4.8 TB per user, depending on configuration.
- Use cases: Local application scratch (Abaqus, Gaussian); data mining platform (e.g., Hadoop); graph problems.
Lustre
- Performance: Lustre is ubiquitous in HPC. It does well for sequential I/O and for I/O to a few files from many cores simultaneously. Random I/O is a Lustre killer. Lustre is a shared resource, and performance will vary depending on what other users are doing.
- Infrastructure: 64 OSSs; distinct file systems and metadata servers; accessed over a 10GbE network via the I/O nodes. Hundreds of HDDs/spindles.
- Persistence: Most is deployed as scratch and is purgeable by policy (not necessarily at the end of the job). Some is deployed as a persistent project storage resource.
- Capacity: No specific limits or quotas imposed on scratch. File system is ~2 PB.
- Use cases: Traditional HPC I/O associated with MPI applications. Prestaging of data that will be pulled into flash.

30 Model B: 16 SSDs for 1 Compute Node
- 16 SSDs in a RAID0 appear as a single 4.8 TB file system to the compute node.
- Flash I/O and Lustre traffic use Rail 1 of the torus.
- Use cases: database, data mining, Gaussian.
- Logical view: the file system appears as /scratch/$USER/$PBS_JOBID.
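The 4.8 TB figure is consistent with sixteen drives striped with no capacity lost to redundancy. A quick check, assuming decimal units and the 300 GB per-SSD size quoted for native-node scratch:

```shell
#!/bin/bash
ssd_count=16       # SSDs striped together in Model B
ssd_size_gb=300    # assumption: per-SSD size, as quoted for native-node scratch
total_gb=$((ssd_count * ssd_size_gb))
total_tb="$(awk -v g="$total_gb" 'BEGIN { printf "%.1f", g / 1000 }')"
echo "${total_gb} GB = ${total_tb} TB"
```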

31 Model B: 16 SSDs for 1 Compute Node
- We have 4 nodes in rack 18 set up under this model: gcn-18-11, gcn-18-31, gcn-18-51, and gcn
- We have reserved nodes gcn-18-51, gcn for Summer Institute users who wish to use this model. Users can directly request the nodes (example below).
    #!/bin/bash
    #PBS -q normal
    #PBS -N ior_native
    #PBS -l nodes=gcn-18-51:ppn=16:native
    #PBS -l walltime=00:25:00
    #PBS -o ior_scratch_native.out
    #PBS -e ior_scratch_native.err
    #PBS -V
    #PBS -M mahidhar@sdsc.edu
    #PBS -m abe
    #PBS -A use300
    cd /scratch/$USER/$PBS_JOBID

32 Model C: 16 SSDs within a vSMP Supernode
- 16-node virtual compute image (1 TB); Lustre is not part of the supernode.
- 4.8 TB of flash as a single XFS file system.
- Flash I/O uses both rail 0 and rail 1.
- Use cases: serial and threaded applications that need large memory and local disk, e.g. Abaqus, genomics (Velvet, Allpaths, etc.).
- Logical view: the file system appears as /scratch1/$USER/$PBS_JOBID (/scratch2 is available if using a 32-node supernode).

33 Model C: 16 SSDs within a vSMP Supernode
We have reserved vSMP nodes for Summer Institute users who wish to use this model. Users can directly request the nodes (example below).
    #!/bin/bash
    #PBS -q vsmp
    #PBS -N ior_vsmp
    #PBS -l nodes=1:ppn=16:vsmp
    #PBS -l walltime=00:25:00
    #PBS -o ior_scratch_vsmp.out
    #PBS -e ior_scratch_vsmp.err
    #PBS -V
    #PBS -M username@xyz123.edu
    #PBS -m abe
    #PBS -A gue998

    cd /scratch1/$USER/$PBS_JOBID
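The scratch location differs between native nodes (/scratch) and the vSMP supernode (/scratch1). A script shared between both node types could select the path with a small helper. The function name `job_scratch_dir` is hypothetical, and the job ID in the demo is the one from the native-node snapshot, reused purely for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the job scratch directory for a node type.
# $1 = "native" or "vsmp"; $2 = user name; $3 = PBS job ID.
job_scratch_dir() {
    case "$1" in
        native) printf '/scratch/%s/%s\n' "$2" "$3" ;;
        vsmp)   printf '/scratch1/%s/%s\n' "$2" "$3" ;;
        *)      echo "unknown node type: $1" >&2; return 1 ;;
    esac
}

# In a job script one might write:
#   cd "$(job_scratch_dir vsmp "$USER" "$PBS_JOBID")"
job_scratch_dir native mahidhar 72251.gordon-fe2.local
job_scratch_dir vsmp mahidhar 72251.gordon-fe2.local
```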

34 Summary, Q/A
- Follow the guidelines for serial, OpenMP, Pthreads, and MPI jobs on the vSMP nodes.
- Access options: ssh clients, XSEDE User Portal.
- Data transfer options: scp, globus-url-copy (GridFTP), Globus Online, and XSEDE User Portal File Manager.
- Lustre is routed over the I/O nodes. Write performance is determined by the number of routers used by a job.
- Use SSD local scratch where possible. It is excellent for codes like Gaussian and Abaqus.
