Introduction to High Performance Computing at Case Western Reserve University. KSL Data Center

Introduction to High Performance Computing at Case Western Reserve University
Research Computing and CyberInfrastructure team, KSL Data Center
Presenters: Emily Dragowsky, Daniel Balagué Guardia, Hadrian Djohari, Sanjaya Gajurel

Bootcamp Outline
- Who we are
- Case HPC resources
- Working with the Cluster
- Basic Linux
- Job Scripting
- Open Discussion/Q&A

Who we are
Research Computing and CyberInfrastructure Team (RCCI): 5th floor of [U]TECH, overlooking Euclid.
University staff with academic ties: CWRU grads, research group members, skilled practitioners.
Strong collaboration with the Network, Servers and Storage teams.

RCCI Services
- Cyberinfrastructure: High Performance Computing; Research Networking services; Research Storage and Archival solutions; Secure Research Environment for computing on regulated data
- Support: Education and Awareness; Consultation and pre-award support; Database Design; Visualization; Programming Services
- Public Cloud and Off-Premise Services: concierge for off-premise services (XSEDE, OSC, AWS)

CASE HPC Cluster
- Designed for computationally intensive jobs: long-running, number crunching
- Optimized for batch jobs: combine resources as needed (CPU, memory, GPU)
- Supports interactive/graphically intensive jobs
- OS version emphasizes stability: Linux (Red Hat Enterprise Linux 6.8)
- Accessible from Linux, Mac and Windows
- Some level of Linux expertise is needed, which is why we're here today
- Clusters: redcat (SLURM) and hadoop

HPC Cluster Glossary
- Head Nodes: development, analysis, job submission
- Compute Nodes: computational computers
- Panasas: engineered file system, fastest storage
- Dell Fluid File System: value storage
- Data Transfer Nodes: hpctransfer, dtn1
- Science DMZ: lowest-resistance data pathway
- SLURM: cluster workload manager (job scheduler)

HPC Cluster Components (diagram)
redcat.case.edu resource manager; University firewall and Science DMZ; head nodes; SLURM master; admin nodes; data transfer nodes; Panasas and Dell FFS storage; batch, GPU, and SMP compute nodes.

Working on the Cluster - How To:
~ access the cluster
~ get my data onto the cluster
~ establish interactive sessions
<break>
~ submit jobs through the scheduler
~ monitor jobs, a.k.a. why is my job not running??
~ work with others within the cluster

You can login from anywhere. You will need:
- An approved cluster account: enter your CaseID and the Single Sign-On password
- An ssh (secure shell) utility [detailed instructions for all platforms]: we recommend the x2go client; PuTTY or cygwin (Windows) and Terminal (Mac/Linux) will work for non-graphical sessions
- If connecting from an off-campus location, connect through the VPN, using two-factor authentication (Case Guest wireless == off-campus)
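A minimal login sketch (redcat.case.edu is the cluster name used elsewhere in these slides; replace <caseid> with your own ID):
ssh <caseid>@redcat.case.edu      # log in with your Single Sign-On password
ssh -X <caseid>@redcat.case.edu   # add -X to forward X11 if you need graphical output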

HPC Environment - Your Full Cluster Resources
Your HPC account, sponsored by your PI, provides:
- Group affiliation: resources shared amongst group members
- Storage: /home (permanent storage, replicated & snapshot protected); /scratch/pbsjobs (up to 1 TB temporary storage); /scratch/users (small-scale temporary storage); exceeding quota(s) will prevent using the account
- Cores: member groups' allocation of 32+ for an 8-share
- Wall-time: 320-hour limit for member shares (32 hours for guest shares)
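A quick, unofficial way to see how much of your /home allocation you are using (a sketch with standard Linux tools; the cluster may also provide its own quota commands):
du -sh $HOME                # total size of your home directory
du -sh $HOME/* | sort -h    # per-subdirectory usage, largest last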

HPC Environment - Your /home
Allocated storage space in the HPC filesystem for your work. Create subdirectories underneath /home/<caseid>; ideally each job has its own subdirectory.
cd - Linux command to change the current directory. Examples to change to home:
cd /home/<caseid>
cd ~<CaseID>
cd $HOME
$HOME is an environment variable that points to /home/<caseid>.
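A minimal sketch of the one-subdirectory-per-job pattern suggested above (the name myjob01 is only an example):
mkdir -p $HOME/myjob01   # create a directory for this job
cd $HOME/myjob01         # run the job and keep its files from here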

You are not alone. > ls /home

HPC Environment - Beyond /home
Linux systems have a hierarchical directory structure:
- User files: /home
- System files: /bin, /dev, /etc, /log, /opt, /var
- Application files: /usr/local/<module>/<version>
Consider Python: 4 versions installed
- /bin/python: 2.6.6
- /usr/local/python/: 2.7.8, 2.7.10, 3.5.2

HPC Environment - Environment Variables: Keeping organized
echo $PATH
/home/mrd20/bin/grom5/bin:/home/mrd20/bin:/usr/local/i/1.0.0/bin:/usr/local/openmpi/1.8.8/bin:/usr/local/intel/2015/composer_xe_2015.3.187/bin/intel64:/usr/local/munge/bin:/usr/local/slurm/bin:/usr/local/slurm/sbin:/usr/lib64/qt-3.3/bin:/usr/local/emflex/1-j.11/wai/flex/programs:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/dell/srvadmin/bin
echo $LD_LIBRARY_PATH
/home/mrd20/bin/grom5/lib64:/usr/local/openmpi/1.8.8/lib:/usr/local/intel/2015/composer_xe_2015.3.187/mkl/lib/intel64:/usr/local/intel/2015/composer_xe_2015.3.187/compiler/lib/intel64:/usr/local/munge/lib:/usr/local/slurm/lib:/usr/lib:/usr/lib64:/usr/local/lib
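These colon-separated lists are easier to inspect one entry per line; a small sketch using standard Linux tools:
echo $PATH | tr ':' '\n'               # print each directory on its own line, in search order
echo $LD_LIBRARY_PATH | tr ':' '\n'    # same for the shared-library search path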

Modules and Environment
Module command: avail, list, load, unload
Modules manage the environment necessary to run your applications (binaries, libraries, shortcuts). Using the module commands will set or remove the corresponding environment variables:
>> module avail (or module avail python)
>> module list (shows modules loaded in your environment)
>> module load python (loads default version)
>> module load python/3.5.2 (loads specific version)
>> module unload python/3.5.2 (unloads specific version)
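A typical session, sketched here: load one of the installed versions (2.7.8, whose modulefile appears on the next slide) and confirm that your environment now points at it:
module load python/2.7.8
which python                 # should now resolve to /usr/local/python/2.7.8/bin/python
python --version             # Python 2.7.8
module unload python/2.7.8   # remove it again when finished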

Modules and Environment - Module command: list & display
[mrd20@hpc2 ~]$ module list
Currently Loaded Modules:
  1) intel/2015   2) openmpi/1.8.8   3) i/1.0.0   4) StdEnv   5) python/2.7.8
[mrd20@hpc2 ~]$ module display python
----------------------------------------------------------------------------
/usr/local/share/modulefiles/python/2.7.8:
----------------------------------------------------------------------------
whatis("A powerful high-level programming language")
prepend_path("PATH","/usr/local/python/2.7.8/bin")
prepend_path("CPLUS_INCLUDE_PATH","/usr/local/python/2.7.8/include")
prepend_path("C_INCLUDE_PATH","/usr/local/python/2.7.8/include")
prepend_path("LD_LIBRARY_PATH","/usr/local/python/2.7.8/lib")
prepend_path("LIBRARY_PATH","/usr/local/python/2.7.8/lib")
prepend_path("PKG_CONFIG_PATH","/usr/local/python/2.7.8/lib/pkgconfig")

Data Transfer - scp command
scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file] [-l limit] [-o ssh_option] [-P port] [-S program] [[user@]host1:]file1 ... [[user@]host2:]file2
Copy from HPC to your local PC:
scp -r mrd20@redcat.case.edu:/home/mrd20/data/vlos.dat .
(the trailing full stop "." means this directory)
From your PC to HPC:
scp orange.py mrd20@redcat:
(the colon denotes the end of the hostname; with no path after it, the file lands in the remote home directory)
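Going the other way with a whole directory, a sketch that copies a local results/ directory into a job subdirectory on the cluster (results/ and myjob01 are example names; replace <caseid> with your own ID):
scp -r results/ <caseid>@redcat.case.edu:/home/<caseid>/myjob01/   # recursive copy of a directory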

Data Transfer - GLOBUS
Setup instructions: https://sites.google.com/a/case.edu/hpc-upgraded-cluster/home/important-notes-for-new-users/transferring-files

Start an Interactive GUI Session
Create a session on a compute node, not on the head node.
srun - create a job allocation (if needed) and launch a job step:
srun --x11 [-p batch -n 4 -t 1:00:00] --pty /bin/bash
--x11 invokes X-forwarding
--pty pseudoterminal, type of shell = bash
-p partition (batch, gpufermi, gpuk40, smp)
-n number of tasks
-t duration of resource allocation

Examples: Interactive GUI Session
Accepting the defaults:
srun --x11 --pty /bin/bash
More tasks (default 1 cpu-per-task):
srun --x11 -p batch -n 4 -t 1:00:00 --pty /bin/bash
Graphically intensive session (default duration 10 hours):
srun --x11 -p gpufermi --gres=gpu:2 -n 12 --pty /bin/bash
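Once the interactive shell starts you are on a compute node; a quick sketch for confirming what you were allocated (these environment variables are set by Slurm inside the allocation):
hostname                    # the compute node you landed on
echo $SLURM_JOB_ID          # job ID of this interactive allocation
echo $SLURM_CPUS_ON_NODE    # CPUs allocated to you on this node
exit                        # leave the shell and release the allocation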

Now Let's Take Time for Reflection: beverages, stretching the legs, washing of hands, booking a flight, checking email, quiet contemplation, talking with our neighbors.

Working Big on the CWRU HPC Cluster
Many people at once; many jobs running, and queued awaiting resources.
The Slurm workload manager software has three key functions:
- Allocates access to resources (compute nodes) to users for some duration of time so they can perform work
- Provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes
- Arbitrates contention for resources by managing a queue of pending work

Monitor Cluster Status
Workload management for collective benefit of HPC community.
sinfo - view information about Slurm nodes and partitions:
sinfo [flags]
-n nodes by name
-o format output:
sinfo -o "%10P %.3a %.10l %.4D %.8t %.14C %N"
PARTITION AVA TIMELIMIT NODE STATE CPUS(A/I/O/T) NODELIST
si - script invoking sinfo with a set of standard flags
Exercise: > less `which si`, examine bash script contents
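To watch a single partition, the same format string can be combined with -p; a sketch for the batch partition:
sinfo -p batch -o "%10P %.3a %.10l %.4D %.8t %.14C %N"   # limit output to the batch partition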

Submit a Job through the Scheduler
Workload management for collective benefit of HPC community.
sbatch - create a resource allocation request to launch a job step:
sbatch [-p batch -N 1 -t 2-1:00:00] script
script - a bash shell script
-p partition (batch, gpufermi, gpuk40, smp)
-N nodes
-t duration of resource allocation [dd-hh:mm:ss]
Other common flags: -A, --ntasks, --cpus-per-task, --mem-per-cpu

Example Job Script: hexacarbonyl-16.slurm
#!/bin/bash
#SBATCH --time=4:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=2
#SBATCH --job-name=hexacarbonyl-16_job
# Load the Gaussian module
module load gaussian/16-sse
# Run Gaussian
srun g16 hexacarbonyl-16.com
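To use a script like this, you would submit it with sbatch and then watch it in the queue; a short sketch (the job ID printed will differ):
sbatch hexacarbonyl-16.slurm   # prints something like: Submitted batch job <jobid>
squeue -u <caseid>             # check its state (PENDING, RUNNING, ...)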

Checking Job Status (I)
squeue - view information about jobs in the scheduling queue:
squeue [options]
-u <caseid>
-A <PI caseid>
-l standard long output fields
-o select fields for output (~90 fields exist)
--start show estimated start times for pending jobs
Full documentation: slurm.schedmd.com/squeue.html
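The two invocations you will likely use most, sketched here ($USER expands to your own CaseID on the cluster):
squeue -u $USER -l        # long listing of your own jobs
squeue -u $USER --start   # estimated start times for your pending jobs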

Checking Job Status (II)
scontrol - view and modify Slurm configuration and state (most functionality reserved for system administrators):
scontrol [options] [commands]
scontrol show job <jobid>
scontrol show node <nodename> (refer to HPC Resource View)

Working within Group Allocations
Group Name / ID: tas35 / 10085 (guest)
Resources: CPUs, RAM; max duration: 1-12:00:00
Checking group usage with squeue:
squeue -o "%A %C %e %E %g %l %m %N %T %u" | awk 'NR==1 || /eecs600/'
JOBID   CPUS  END_TIME             DEPENDENCY  GROUP    TIME_LIMIT  MIN_MEMORY  NODELIST  STATE    USER
148137  1     2016-01-26T16:54:22              eecs600  2:00:00     1900        comp145t  RUNNING  aar93
148146  1     2016-01-27T01:14:27              eecs600  10:00:00    1900        comp148t  RUNNING  hxs356
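To tally how many CPUs your group currently has in use, the same squeue output can be summed with awk; a sketch assuming the group name eecs600 from the example above:
squeue -o "%C %g" | awk '$2=="eecs600" {cpus+=$1} END {print cpus, "CPUs in use"}'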

SLURM Resources - Reading List
- Case HPC SLURM command summary
- CPU Management User and Administrator Guide: http://slurm.schedmd.com/cpu_management.html
- Support for Multi-core/Multi-Thread Architectures: http://slurm.schedmd.com/mc_support.html
- Slides from Tutorial for Beginners: http://www.schedmd.com/cray/tutorial.begin.pdf
- SLURM manual pages: http://slurm.schedmd.com/<command>.html

Case Cluster: How to Learn
Web search: CWRU HPC
https://sites.google.com/a/case.edu/hpc-upgraded-cluster/home
hpc-support@case.edu

Summary
- Head nodes reserved for organizing work
- Compute nodes meant for performing work
- Low-impedance network for large-scale data transfer
- SLURM workload manager & scheduler
- RCCI staff on hand for aid
- Jump in and learn: hpc-support@case.edu
RCCI Team: Roger Bielefeld, Mike Warfe, Hadrian Djohari, Daniel Balagué, Brian Christian, Emily Dragowsky, Jeremy Fondran, Sanjaya Gajurel, Matt Garvey, Theresa Griegger, Cindy Martin, Lee Zickel