Viglen NPACI Rocks: Getting Started and FAQ


Table of Contents

Getting Started
    Powering up the machines
    Checking node status
        Through web interface
    Adding users
    Job Submission
FAQ
    Reinstalling a compute node
    Adding applications to the cluster
    Enabling web access to the cluster
    Synchronising files across the cluster

Getting Started

Powering up the machines:
Power on the headnode first and allow the node to boot to the login screen. The cluster compute nodes rely on the headnode when booting client daemons, such as the queue client. Once the headnode is up, power on the rest of the compute nodes.

Checking node status
Once all the nodes are powered on, you can check for any dead nodes using the CLI or the web interface.

Through CLI:
Log in as root and perform the following:

[root@hostname:~]$ tentakel uptime
### compute-0-0 (stat:0, dur(s): 0.32) down

Any nodes that failed to boot will be reported as down.

Through web interface:
The Ganglia scalable distributed monitoring system is configured on the cluster and can be used to discover dead nodes (along with cluster usage metrics). To access the Ganglia web interface, launch the Firefox browser from the headnode. (Please note that external web access is disabled by default through the iptables configuration on the headnode; only ssh access is permitted.) When logging into the headnode, make sure to specify the -X argument for ssh to enable X forwarding:

[user@host:~]$ ssh -X user@headnode
[user@headnode:~]$ firefox

Point the browser to http://localhost and you should see a welcome screen as below.

Click on the Cluster Status link at the right-hand side of the screen to display the Ganglia monitoring page.

Any nodes that are down or in use are reported through this interface. Please note that a node reported as dead here is not necessarily dead; it could just mean that the Ganglia monitoring daemon on that node has died. It is recommended that users confirm the node status using tools such as ssh and ping.
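A quick way to carry out that check from the headnode is a small shell loop over the compute node names (a sketch only; the node names are examples and should be adjusted to match your cluster):

#!/bin/bash
# Check each compute node first with ping, then with ssh, to distinguish
# a genuinely dead node from one whose ganglia daemon has simply died.
for node in compute-0-0 compute-0-1; do
    if ! ping -c 1 -W 2 "$node" > /dev/null 2>&1; then
        echo "$node: no ping response - possibly down"
    elif ssh -o ConnectTimeout=5 "$node" uptime > /dev/null 2>&1; then
        echo "$node: up (ping and ssh OK)"
    else
        echo "$node: responds to ping but ssh failed"
    fi
done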

Adding users:
To add a user, log in as root, issue the useradd command, set a password and synchronise using the rocks sync users command.

[root@headnode:~]$ useradd <username>
[root@headnode:~]$ passwd <username>
[root@headnode:~]$ rocks sync users

It is also recommended that the cluster administrator log in as the new user and set up the ssh keys to allow passwordless access across the cluster. Whenever a user account is created, this initial login will prompt the user for the path to the rsa key pair and an ssh passphrase. Leave all values blank for normal cluster operation.
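Where several accounts need to be created at once, the same three steps can be wrapped in a loop; a minimal sketch, assuming the CentOS-style passwd that accepts --stdin (the usernames and temporary password are placeholders):

#!/bin/bash
# Create a batch of accounts with a temporary password, then push the
# account database out to the compute nodes in one go.
for user in user1 user2 user3; do
    useradd "$user"
    echo "ChangeMe123" | passwd --stdin "$user"   # temporary password - users should change it
done
rocks sync users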

Job Submission:
Jobs are submitted to the queue in the form of a (bash) script. This script should contain the list of commands and arguments that you want to run, just as if you were running them from the command line.

Sample Job Submit Script

#!/bin/bash
#
# SGE/PBS options can be specified here
# e.g. set the job to run in the current working directory
#$ -cwd

$HOME/myjobdir/myjobexecutable arg1 arg2

Save this file as ~/myjob.sh. To submit the job to the queue, use the qsub command, passing the job script file as an argument, e.g.

[user@headnode:~]$ qsub myjob.sh

Job status can be checked using qstat and jobs can be removed using qdel <jobid>. For more detailed usage, please refer to the SGE/PBS manuals supplied with the cluster.
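If the cluster is running SGE, a few more #$ directives at the top of the script are often useful; a sketch (the job name and output filename are examples) that names the job and merges its output into a single file:

#!/bin/bash
# Run the job from the directory it was submitted from
#$ -cwd
# Name the job as it will appear in qstat output
#$ -N myjob
# Merge stderr into stdout and write everything to one file
#$ -j y
#$ -o myjob.out

$HOME/myjobdir/myjobexecutable arg1 arg2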

Sample MPICH Job (64 core MPICH job):

#!/bin/bash
#PBS -S /bin/bash
#PBS -V
# Request the job to run on 8 nodes with 8 processes per node
#PBS -l nodes=8:ppn=8

# Change to the working directory
cd ${PBS_O_WORKDIR}

# Set global mem size
export P4_GLOBMEMSIZE=64000000

# USERS MODIFY HERE
# Set file name and arguments
MPIEXEC=/opt/mpiexec/bin/mpiexec
EXECUTABLE=/opt/hpl/mpich-hpl/bin/xhpl
ARGS="-v"

# Load the mpich modules environment
module load mpich/1.2.7

##########################################
#                                        #
# Output some useful job information.    #
#                                        #
##########################################
NPROCS=`wc -l < $PBS_NODEFILE`
echo ------------------------------------------------------
echo 'This job is allocated on '${NPROCS}' cpu(s)'
echo 'Job is running on node(s): '
cat $PBS_NODEFILE
echo ------------------------------------------------------
echo PBS: qsub is running on $PBS_O_HOST
echo PBS: originating queue is $PBS_O_QUEUE
echo PBS: executing queue is $PBS_QUEUE
echo PBS: working directory is $PBS_O_WORKDIR
echo PBS: execution mode is $PBS_ENVIRONMENT
echo PBS: job identifier is $PBS_JOBID
echo PBS: job name is $PBS_JOBNAME
echo PBS: node file is $PBS_NODEFILE
echo PBS: current home directory is $PBS_O_HOME
echo PBS: PATH = $PBS_O_PATH
echo ------------------------------------------------------

# Launch the job
${MPIEXEC} ${ARGS} ${EXECUTABLE}

Viglen provide sample job submission scripts for all of the above jobs in ~/qsub-scripts/. This directory, containing the sample job submission scripts, is created whenever a new user is added. Feel free to edit these for use with local applications/codes.
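Assuming the script above has been saved as ~/qsub-scripts/mpich-job.sh (a hypothetical filename), it is submitted and monitored in the usual way; under Torque/PBS, qstat -n also lists the nodes allocated to each job:

[user@headnode:~]$ qsub ~/qsub-scripts/mpich-job.sh
[user@headnode:~]$ qstat
[user@headnode:~]$ qstat -n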

FAQ

Reinstalling a compute node
The recommended way of dealing with any node failure is to reinstall the problematic compute node (unless a hardware fault is present). The compute nodes are all set to boot from the network first, then from the local disk. The headnode can be configured to force any compute node to reinstall by setting the pxeboot flag. To see which pxeboot flags are currently set on your cluster, use the rocks command:

[root@headnode:~]$ rocks list host pxeboot
HOST          ACTION
headnode:     ------
compute-0-0:  os
compute-0-1:  os
...

To flag a node for reinstallation, use the rocks command to set the pxeboot flag to install:

[root@headnode:~]$ rocks set host pxeboot compute-0-0 action=install

Then power cycle the node (or reboot it if the node is still alive). The node can also be set back to booting from the local disk by setting the pxeboot flag back to os:

[root@headnode:~]$ rocks set host pxeboot compute-0-0 action=os

Power cycle the node:

If IPMI modules are installed:
ipmitool -U [user] -P [passwd] -H compute-0-0 chassis power off
ipmitool -U [user] -P [passwd] -H compute-0-0 chassis power on

If APC PDUs are used:
apc off compute-0-0
apc on compute-0-0

Note: If power cycling using the APC PDU, the BIOS power setting (Advanced Boot Features > Restore on AC Power Off) needs to be set to "Last state" on the compute node.
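If several nodes need rebuilding at once, the reinstall flag and the IPMI power cycle can be combined in a loop; a sketch only, assuming the IPMI interfaces answer on the node names shown (the node list and credentials are placeholders):

#!/bin/bash
# Flag each node for reinstallation, then power cycle it over IPMI
# so that it PXE boots into the installer.
for node in compute-0-0 compute-0-1 compute-0-2; do
    rocks set host pxeboot "$node" action=install
    ipmitool -U [user] -P [passwd] -H "$node" chassis power cycle
done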

Adding applications to the cluster
This topic is covered in detail in the Rocks user guide and online at http://www.rocksclusters.org/rolldocumentation/base/5.1/customization-adding-packages.html

Enabling web access to the cluster
By default, the firewall on the headnode will only allow ssh access. If you want to enable web access to the cluster (e.g. for monitoring through Ganglia), you can allow this through the iptables configuration file /etc/sysconfig/iptables. In that file, lines 18 and 19 are commented out; to enable web access, uncomment them and restart iptables (/etc/init.d/iptables restart).

# Uncomment the lines below to activate web access to the cluster.
#-A INPUT -m state --state NEW -p tcp --dport https -j ACCEPT
#-A INPUT -m state --state NEW -p tcp --dport www -j ACCEPT
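After restarting iptables, the open web ports can be confirmed by listing the INPUT chain; the grep pattern below assumes the standard port numbers 80 and 443:

[root@headnode:~]$ /etc/init.d/iptables restart
[root@headnode:~]$ iptables -L INPUT -n | grep -E 'dpt:(80|443)'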

Synchronising files across the cluster
If there is a configuration file in /etc (or anywhere that is not an NFS share) that you want to synchronise across all nodes, there are two ways to do this.

Option 1: Through the extend-compute.xml file
(*note* changes to the file on the headnode will not automatically be synchronised; this option is more useful for configurations that need to be consistent on the compute nodes only)

In the <post> section of this file, create a <file> tag:

<post>
<file name="/path/to/file">
# contents of file
</file>
</post>

When finished, verify the additions are syntactically correct:

[user@host:site-profiles]$ xmllint extend-compute.xml

Rebuild the distribution:

[user@host:site-profiles]$ cd /home/install
[user@host:install]$ rocks create distro

Verify the kickstart files are being generated correctly (from the /home/install directory):

[user@host:install]$ ./sbin/kickstart.cgi --client="compute-0-0"

This command should dump the kickstart file to the screen (provided the node has already been installed and exists in the database). If not, repeat the above steps and try again. If the command is successful, set the pxeboot flag for the nodes (see "Reinstalling a compute node") and reboot them.

Option 2: Through the 411 subsystem
Edit the file /var/411/files.mk. The last line should read:

#FILE += /path/to/file

Uncomment this line, update it with the path to the file you want to synchronise, and run the command rocks sync users to apply the change, e.g.:

FILE += /etc/hosts
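Putting the 411 steps together, and checking that the file has actually reached a compute node afterwards, might look like the following sketch (/etc/hosts is simply the example file used above, and compute-0-0 is an example node):

# Added to the end of /var/411/files.mk
FILE += /etc/hosts

# Push the change out, then compare checksums on the headnode and a compute node
[root@headnode:~]$ rocks sync users
[root@headnode:~]$ md5sum /etc/hosts
[root@headnode:~]$ ssh compute-0-0 md5sum /etc/hosts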