NCAR Computation and Information Systems Laboratory (CISL) Facilities and Support Overview
NCAR ASP 2008 Summer Colloquium on Numerical Techniques for Global Atmospheric Models
June 2, 2008
Mike Page, NCAR/CISL/HSS/CSG (Consulting Services Group)
CISL's Mission for User Support
"CISL will provide a balanced set of services to enable researchers to securely, easily, and effectively utilize community resources." — CISL Strategic Plan (2005-2009)
CISL also supports special colloquia, workshops, and computational campaigns, giving these groups of users special privileges and access to facilities and services above and beyond normal service levels.
CISL Computing Systems
At NCAR/CISL you'll find world-class facilities supporting leading-edge science through high-performance computing. Navigating and using the facilities requires basic familiarity with several functional aspects of the facility:
- Computing systems
- Allocations
- Usage (batch and interactive)
- Security
- Data archival (MSS)
- User support
CISL Computing Systems: Bluevista
- IBM eServer p575; #98 on the Top 500 list (Nov. 2005)
- 624 Power5 processors with 1.9-GHz clock, DCMs
- Four floating-point operations per cycle
- 4.74 TFLOPS peak processing
- SMT technology
- 72 8-way batch nodes
- 16 GB shared memory on each node
- AIX operating system
- IBM XL Compiler Suite; TotalView debugger
- LSF batch system
- 3 GB $HOME quota; 240 GB /ptmp quota
Allocations
Allocations are granted in General Accounting Units (GAUs). Each of the colloquium modeling groups has a project number for which 6000 GAUs are available.
Monitor GAU usage through the CISL portal: https://portal.scd.ucar.edu:8443/scd-portal (requires UCAS password). Charges are assessed overnight and will be available for review for runs that complete by midnight.
GAUs charged = wallclock hours used × number of nodes used × number of processors per node × computer factor × queue charging factor
The computer factor for bluevista is 0.87; the queue charging factor for dycore is 1.0.
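As a worked example of the charging formula above (the job size is hypothetical, chosen for illustration): a 2-hour run on 3 nodes of 8 processors in the dycore queue on bluevista would be charged as follows.

```shell
# Estimate GAU charges for a hypothetical bluevista run in the dycore queue:
# 2.0 wallclock hours on 3 nodes x 8 processors per node.
wallclock_hours=2.0
nodes=3
procs_per_node=8
computer_factor=0.87   # bluevista
queue_factor=1.0       # dycore queue

awk -v h="$wallclock_hours" -v n="$nodes" -v p="$procs_per_node" \
    -v cf="$computer_factor" -v qf="$queue_factor" \
    'BEGIN { printf "GAUs charged: %.2f\n", h * n * p * cf * qf }'
# Prints: GAUs charged: 41.76
```

Note that the charge is based on whole nodes, so a job that uses only some processors on a node is still charged for all of them.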
Batch and Interactive Usage
Batch usage:
- LSF (Load Sharing Facility) with a fair-share scheduler
- dycore queue for batch runs
- share queue for analysis runs
- Interactive batch is seldom used
Interactive use is through Unix shell commands.
LSF Batch Submission
Job submission:
- bsub < script — submits the file script to LSF
Monitoring jobs:
- bjobs — shows jobs you have running and pending in the system
- bjobs -u all
- bjobs -q dycore -u all
- bhist -n 3 -a — shows jobs submitted and completed over the last few days
System batch load:
- batchview — shows all jobs for all users
LSF Example

#!/bin/ksh
#
# LSF batch script to run an MPI application
#
#BSUB -n 24                  # number of MPI tasks
#BSUB -R "span[ptile=8]"     # run 8 tasks per node (non-SMT)
##BSUB -R "span[ptile=16]"   # run 16 tasks per node (SMT)
#BSUB -P xxxxxxxx            # project xxxxxxxx
#BSUB -J mpilsf.test         # job name
#BSUB -o mpilsf.%J.out       # output filename
#BSUB -e mpilsf.%J.err       # error filename
#BSUB -W 0:10                # 10 minutes wallclock time
#BSUB -q dycore              # queue

# Fortran example
mpxlf_r -o mpi_samp_f mpi.f
mpirun.lsf ./mpi_samp_f

More examples in /usr/local/examples/lsf/batch
Useful Utilities
Change
  mpirun.lsf ./mpi_samp_f
to
  timex mpirun.lsf ./mpi_samp_f
for information on execution time, or to
  export MP_LABELIO=yes
  mpirun.lsf /contrib/bin/job_memusage.exe ./mpi_samp_f
for information on memory usage. (This will help you decide whether or not you can use SMT.)
Security
CISL firewall:
- Enter through roy.ucar.edu
- ssh only; telnet is not allowed
- ssh, ssh -X, ssh -Y
Cryptocard:
- You must have one to access bluevista
- Usage and resynchronization
Complete information is on the CISL web pages: http://www.cisl.ucar.edu
NCAR Mass Store Subsystem (MSS)
- Currently stores 5 petabytes of data
- Library of Congress (printed collection): 10 terabytes = 0.01 petabytes, so the Mass Store holds 500 × the Library of Congress
- Growing by 2-6 terabytes of new data per day
- Data holdings are increasing exponentially:
  1986 - 2 TB
  1997 - 100 TB
  2002 - 1000 TB
  2004 - 2000 TB
  2008 - 5000 TB
Charges for MSS Usage
The current MSS charging formula is:
  GAUs charged = 0.0837·R + 0.0012·A + N·(0.1195·W + 0.205·S)
where:
  R = gigabytes read
  W = gigabytes created or written
  A = number of disk drive or tape cartridge accesses
  S = data stored, in gigabyte-years
  N = number of copies of the file: 1 if economy reliability is selected, 2 if standard reliability is selected
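To make the formula concrete, here is a worked example with hypothetical usage figures (the numbers are illustrative, not typical values):

```shell
# Estimate MSS GAU charges using the charging formula above.
# Hypothetical usage: read 10 GB, write 20 GB, 5 drive/cartridge
# accesses, 2 GB-years stored, standard reliability (N = 2).
R=10; W=20; A=5; S=2; N=2

awk -v R="$R" -v W="$W" -v A="$A" -v S="$S" -v N="$N" \
    'BEGIN { printf "MSS GAUs: %.3f\n",
             0.0837*R + 0.0012*A + N*(0.1195*W + 0.205*S) }'
# Prints: MSS GAUs: 6.443
```

Most of the charge in this example comes from the N·(…) term, which is why choosing economy reliability (N = 1) for reproducible data roughly halves the write-and-store cost.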
NCAR Mass Store Subsystem: Keys to Efficient Usage
- Be selective in the data you retain on the MSS
- Avoid repetitive reads/writes of the same file
- Choose class of service and retention periods according to the value of the data
- Recommended file sizes: transfer a few large files rather than a large number of small files (maximum file size is 12 GB)
- Use tar to collect small files for a single transfer
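Collecting small files with tar before transfer, as recommended above, can be sketched like this (the directory, file names, and MSS path are hypothetical; the msrcp line is shown commented out since it requires MSS access):

```shell
# Bundle many small output files into a single archive so one MSS
# transfer (and one charged access) replaces many small ones.
mkdir -p run_output
for i in 1 2 3; do
    echo "data $i" > run_output/step_$i.dat   # stand-in output files
done

# One tar file instead of many small files:
tar -cf run_output.tar run_output

# Single MSS write (hypothetical path; requires MSS access):
# msrcp run_output.tar mss:/pel/asp2008/run_output.tar
```

Keep individual archives under the 12 GB maximum file size noted above.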
Mass Store Usage
Will NCAR maintain my data indefinitely?
- If you don't retain your account: 1 year
- If you retain your account: ongoing
Using the Mass Store is expensive (GAUs), so consider offloading data:
- Create your own transportable media, e.g. DVD (except in extreme cases)
- Use scp/sftp for data accessible from an SCD supercomputer, including divisional filesystems
- Use the MSS ftp server: http://www.scd.ucar.edu/docs/mss/ftp.html
File Purge Policies
If:
- you are no longer active on any project
- your project closes
- you are no longer employed by UCAR/NCAR
- you are doing periodic maintenance of your MSS files
you can:
- delete your MSS files yourself
- request that CISL delete your files
- change the project number to an active account
- transfer ownership of the files to another user
- transfer the data between the MSS and other media (1 terabyte > 300 DVDs)
- transfer the data to another network location
Mass Store Access: Command Line and Batch Scripts
http://www.cisl.ucar.edu/main/mss.html

msrcp [-a[sync]] [-cl[ass] cos] [-n[oreplace]] [-pe[riod] n] \
      [-pr[oject] proj_num] [-rpwd rpass] [-wpwd wpass] [-R] [-V[ersion]] \
      source_file [source_file ...] target

msls [-project proj] [-class cos] [-full] \
     [-CFPRSTVacdflpqrtuxz1] [path]

msmv [-project proj] [-f] [-period ret_period] password-options(1) \
     file1 file2
msmv [-project proj] [-f] [-period ret_period] password-options(1) \
     directory1 directory2
msmv [-project proj] [-f] [-period ret_period] password-options(1) \
     path [path] ... directory

where the password options (1) are:
     [-rpwd read_password] [-wpwd write_password] \
     [-newr read_password] [-neww write_password]
File Naming Convention
Any file read from or written to the Mass Store needs to have the prefix mss:/pel/asp2008/, e.g.:
  msrcp bluevista_filename \
        mss:/pel/asp2008/model/test_case/horizontal_resolution/mss_filename
Good Practices: Mixing LSF and MSS Usage
Multistep applications (see /usr/local/examples/lsf/multistep) with asynchronous reads/writes:
- Step 1: read data from the MSS (share queue)
- Step 2: run the model (dycore queue)
- Step 3: write data to the MSS (share queue)
This saves GAUs by reducing the processor count charged for the I/O steps. Pre- and/or post-processing can follow the same outline as this example.
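The three-step outline above can be sketched as chained LSF submissions, where each step waits for the previous one via a -w "done(...)" dependency. Job names, scripts, and MSS file names here are hypothetical, and the bsub commands are printed rather than submitted so the chain can be inspected anywhere:

```shell
# Sketch of the share/dycore/share pattern as chained LSF jobs.
# submit() echoes the bsub command instead of running it, so this
# sketch works without an LSF installation; drop the echo to submit.
submit() { echo "bsub $*"; }

# Step 1: read input from the MSS in the cheap share queue.
submit -q share  -J get_data  -P xxxxxxxx "msrcp mss:/pel/asp2008/input.tar ."

# Step 2: run the model in dycore, only after the read finishes.
submit -q dycore -J run_model -P xxxxxxxx -w "done(get_data)"  "./run_model.ksh"

# Step 3: write output back to the MSS, again in the share queue.
submit -q share  -J put_data  -P xxxxxxxx -w "done(run_model)" "msrcp output.tar mss:/pel/asp2008/output.tar"
```

This way the expensive dycore nodes are never held idle waiting on MSS transfers.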
User Support
- ASP Wiki: https://www.wiki.ucar.edu/display/dycores/home
- CISL homepage: http://www.cisl.ucar.edu/
  (High-End Computing, Mass Storage System, Data Support Section, VisLab, Community Data Portal, ESMF)
- ExtraView home page: https://cislcustomersupport.ucar.edu/evj/extraview
- ASP liaison: Mike Page, 303-944-8291 / 303-497-2464, mpage@ucar.edu
Questions?