ISTeC Cray High-Performance Computing System
Richard Casey, PhD, RMRCE, CSU Center for Bioinformatics


Compute Node Status

Check whether interactive and batch compute nodes are up or down:

    xtprocadmin

    NID  (HEX)  NODENAME     TYPE     STATUS  MODE
    12   0xc    c0-0c0s3n0   compute  up      interactive
    13   0xd    c0-0c0s3n1   compute  up      interactive
    14   0xe    c0-0c0s3n2   compute  up      interactive
    15   0xf    c0-0c0s3n3   compute  up      interactive
    16   0x10   c0-0c0s4n0   compute  up      interactive
    17   0x11   c0-0c0s4n1   compute  up      interactive
    18   0x12   c0-0c0s4n2   compute  up      interactive
    42   0x2a   c0-0c1s2n2   compute  up      batch
    43   0x2b   c0-0c1s2n3   compute  up      batch
    44   0x2c   c0-0c1s3n0   compute  up      batch
    45   0x2d   c0-0c1s3n1   compute  up      batch
    61   0x3d   c0-0c1s7n1   compute  up      batch
    62   0x3e   c0-0c1s7n2   compute  up      batch
    63   0x3f   c0-0c1s7n3   compute  up      batch

Node naming convention: CabinetX-Y, CageX, SlotX, NodeX; e.g., c0-0c0s3n0 = Cabinet0-0, Cage0, Slot3, Node0.

The system currently provides 960 batch compute cores and 288 interactive compute cores.
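As a quick sanity check before submitting work, the xtprocadmin output can be filtered with standard shell tools. A minimal sketch, assuming the column layout shown above (TYPE in column 4, STATUS in column 5, MODE in column 6):

    # Count batch compute nodes that are currently up
    xtprocadmin | awk '$4 == "compute" && $5 == "up" && $6 == "batch"' | wc -l

    # List any nodes that are reported down
    xtprocadmin | awk '$5 == "down"'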

Compute Node Status

Check the state of interactive and batch compute nodes and whether they are already allocated to other users' jobs:

    xtnodestat

    Current Allocation Status at Tue Apr 19 08:15:02 2011

          C0-0
       n3 -------B
       n2 -------B
       n1 --------
     c1n0 --------
       n3 SSSaa;--
       n2    aa;--
       n1    aa;--
     c0n0 SSSaa;--
          s01234567

The display is organized by cabinet ID (C0-0), cage, slot (= blade, the s0-s7 columns), and node (the n0-n3 rows); service nodes and allocated/free batch and interactive compute nodes are marked per the legend.

Legend:
       nonexistent node
    S  service node (login, boot, lustrefs)
    ;  free interactive compute node
    -  free batch compute node
    A  allocated, but idle compute node
    ?  suspect compute node
    X  down compute node
    Y  down or admindown service node
    Z  admindown compute node

    Available compute nodes: 4 interactive, 38 batch
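For scripting, the summary line at the bottom of the display is the easiest thing to use. A minimal sketch, assuming the output format shown above:

    # Print just the availability summary
    xtnodestat | grep "Available compute nodes"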

Batch Jobs

Torque/PBS is the batch queue management system, used for submission and management of jobs in batch queues. Use it for jobs with large resource requirements (long-running, number of cores, memory, etc.).

List all available queues:

    qstat -Q     (brief)
    qstat -Qf    (full)

    rcasey@cray2:~> qstat -Q
    Queue            Max Tot Ena Str Que Run Hld Wat Trn Ext T
    ---------------- --- --- --- --- --- --- --- --- --- --- -
    batch              0   0 yes yes   0   0   0   0   0   0 E

Show the status of jobs in all queues:

    qstat                (all queued jobs)
    qstat -u username    (only queued jobs for "username")

(Note: if there are no jobs running in any of the batch queues, these commands will show nothing and just return the Linux prompt.)

    rcasey@cray2:~/lustrefs/mpi_c> qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    1753.sdb                  mpic.job         rcasey                 0 R batch

Batch Jobs

Common job states:

    Q: job is queued
    R: job is running
    E: job is exiting after having run
    C: job is completed after having run

Submit a job to the default batch queue:

    qsub filename

where "filename" is the name of a file that contains batch queue commands. Command-line directives override batch script directives; e.g., "qsub -N newname script" overrides the "#PBS -N name" directive in the batch script.

Delete a job from the batch queues:

    qdel jobid

where "jobid" is the job ID number as displayed by the qstat command. You must be the owner of the job in order to delete it.
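A typical session tying these commands together (a sketch; the script name mpic.job is just the example used in the qstat output above):

    # Submit and capture the job ID (qsub prints an ID such as 1753.sdb)
    JOBID=$(qsub mpic.job)
    echo "Submitted $JOBID"

    # Check its state (watch the S column: Q, R, E, or C)
    qstat -u $USER

    # Remove it from the queue if it is no longer needed
    qdel $JOBID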

Sample Batch Job Script

    #!/bin/bash
    #PBS -N jobname
    #PBS -j oe
    #PBS -l mppwidth=24
    #PBS -l walltime=1:00:00
    #PBS -q batch
    cd $PBS_O_WORKDIR
    date
    aprun -n 24 executable

PBS directives:

    -N: name of the job
    -j oe: combine standard output and standard error in a single file
    -l mppwidth: number of cores to allocate to the job
    -l walltime: maximum amount of wall clock time for the job to run (hh:mm:ss); default = 5 years
    -q: which queue to submit the job to

Sample Batch Job Script

The PBS_O_WORKDIR environment variable is generated by Torque/PBS. It contains the absolute path to the directory from which you submitted your job, and is required for Torque/PBS to find your executable files. Ordinary Linux commands can be included in the batch job script.

The value of the aprun -n parameter should match the value of the PBS mppwidth directive, e.g.:

    #PBS -l mppwidth=24
    aprun -n 24 exe

Request the proper resources:

    If -n or mppwidth > 960, the job will be held in the queued state for a while and then deleted.
    If mppwidth < -n, the job fails with the error "apsched: claim exceeds reservation's nodecount".
    If mppwidth > -n, the job runs OK.
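One way to keep the two values in sync is to set the core count once on the qsub command line (which overrides the script directive, as noted earlier) and pass it into the job's environment. A sketch, assuming Torque's -v option for exporting variables to the job; the wrapper and script names are hypothetical:

    # submit.sh (hypothetical wrapper, run on the login node)
    CORES=24
    qsub -l mppwidth=$CORES -v CORES jobscript

    # inside jobscript, launch with the same count
    cd $PBS_O_WORKDIR
    aprun -n $CORES ./executable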

Performance Analysis: Overview

The performance analysis process consists of three basic steps:

    1. Instrument your program, to specify what kind of data you want to collect under what conditions.
    2. Execute your instrumented program, to generate and capture the desired data.
    3. Analyze the resulting data.

Performance Analysis: Overview

CrayPat, Perftools: Cray's toolkit for instrumenting executables and producing data from runs. Two basic types of analyses are available:

    Sampling/profiling: samples program counters at fixed intervals
    Tracing: traces function calls

The type of analysis is guided by build options and environment variables. The tools profile/trace function calls and loops, and produce call graphs and execution profiles. Instrumentation adds some overhead to the executable and increases runtime.

Performance Analysis: Overview

CrayPat, Perftools output data in a binary format which can be converted to text-format reports containing statistical information. CrayPat supports many languages and extensions: C, C++, Fortran, MPI, OpenMP.

The use of binary instrumentation means relatively low overhead and no interference with compiler optimizations. Cray performance is dependent on compiler optimizations (loop vectorization especially), so this is a necessity for CrayPat. Sampling instrumentation results in some overhead (< 2-3%). Logfiles from runs are generally compact.

Check "man craypat", "pat_help", and the Craydoc "Using Cray Performance Analysis Tools" for more info.

Performance Analysis: Workflow

Load the Cray, perftools, and craypat modules before compiling:

    module load PrgEnv-cray
    module load perftools
    module load xt-craypat

Compile the code with the Cray compiler wrappers (cc, CC, ftn) and make sure the object files (*.o) are retained:

    C:       cc -c exe.c,    then  cc -o exe exe.o
    C++:     CC -c exe.c,    then  CC -o exe exe.o
    Fortran: ftn -c exe.f90, then  ftn -o exe exe.o

If you use Makefiles, modify them to retain object files.
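Collected into a small build script, the preparation step looks like this (a sketch; the module names are those listed above, and exe.c is the example source used later in these slides):

    #!/bin/bash
    module load PrgEnv-cray
    module load perftools
    module load xt-craypat

    # Two-step build so the object file is kept for pat_build
    cc -c exe.c
    cc -o exe exe.o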

Performance Analysis: Workflow

Generate the instrumented executable:

    pat_build [options] exe

This creates an instrumented executable, exe+pat.

Execute the instrumented code:

    aprun -n 1 exe+pat

This creates the file exe+pat+pid.xf (pid = process ID).

Generate reports:

    pat_report [options] exe+pat+pid.xf

This outputs performance reports (an "rpt" text file).
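Since aprun only runs on compute nodes, the execution step normally goes through the batch system. An end-to-end sketch (craypat.job is a hypothetical job script containing the aprun line for exe+pat):

    # On the login node
    pat_build exe                    # produces exe+pat
    qsub craypat.job                 # job script runs: aprun -n 24 ./exe+pat

    # After the job finishes, an exe+pat+<pid>.xf file appears in the run directory
    pat_report exe+pat+*.xf > rpt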

Performance Analysis: Workflow

pat_build: by default, pat_build instruments code for sampling/profiling. To instrument code for tracing, include one or several of the options -w, -u, -g, -O, -T, -t:

    pat_build -w exe                 (enable tracing)
    pat_build -u exe                 (trace user-defined functions only)
    pat_build -g tracegroup exe      (enable trace groups)
    pat_build -O reports exe         (enable predefined reports)
    pat_build -T funcname exe        (trace a specific function by name)
    pat_build -t funclist exe        (trace a list of functions by name)

Control instrumented program behavior and data collection with the 50+ optional runtime environment variables. For example:

    To generate more detailed reports:   export PAT_RT_SUMMARY=0
    To measure MPI load imbalance:       export PAT_RT_MPI_SYNC=1   for tracing
                                         export PAT_RT_MPI_SYNC=0   for sampling
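The runtime environment variables must be set in the environment of the instrumented run, i.e. in the batch job script before the aprun line. A sketch, reusing the job script conventions shown earlier (the job name and core count are just examples):

    #!/bin/bash
    #PBS -N craypat_run
    #PBS -j oe
    #PBS -l mppwidth=24
    #PBS -l walltime=1:00:00
    cd $PBS_O_WORKDIR

    export PAT_RT_SUMMARY=0       # keep full detail rather than summarized data
    export PAT_RT_MPI_SYNC=1      # measure MPI load imbalance (tracing experiments)

    aprun -n 24 ./exe+pat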

Performance Analysis: Workflow

Trace groups: instrument the code to trace all function references belonging to a specified group. There are 30+ trace groups:

    pat_build -g tracegroup exe

For example, to trace MPI calls, I/O calls, and memory references:

    pat_build -g mpi,io,heap exe

    Trace Group   Description
    mpi           MPI calls
    omp           OpenMP calls
    stdio         Application I/O calls
    sysio         System I/O calls
    io            stdio and sysio
    lustre        Lustre file system calls
    heap          Memory references

Performance Analysis: Workflow

Predefined reports: there are 30+ predefined reports, selected with the pat_report -O option. For example:

    To show data by function name only:   pat_report -O profile exe+pat+pid.xf
    To show the calling tree:             pat_report -O calltree exe+pat+pid.xf
    To show load balance across PEs:      pat_report -O load_balance exe+pat+pid.xf

    Report Option             Description
    profile                   Show function names only
    calltree                  Show calling tree, top-down
    load_balance              Show load balance across PEs
    heap_hiwater              Show max memory usage
    loops                     Show loop counts
    read_stats, write_stats   Show I/O statistics
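Because every report is generated from the same .xf data file, several views can be produced in one pass. A minimal sketch (the .xf filename is hypothetical; yours will carry the actual process ID):

    XF=exe+pat+1234.xf
    for view in profile calltree load_balance heap_hiwater; do
        pat_report -O $view $XF > rpt_$view
    done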

Performance Analysis: Workflow

Predefined experiments: instrument the code using preset environments. There are 9 predefined experiments; choose one by setting the PAT_RT_EXPERIMENT environment variable. For example:

To sample program counters at regular intervals:

    export PAT_RT_EXPERIMENT=samp_pc_time    (default)

The default sampling interval is 10,000 microseconds; change it with PAT_RT_INTERVAL or PAT_RT_INTERVAL_TIMER.

To trace function calls:

    export PAT_RT_EXPERIMENT=trace

One of the pat_build trace options must be specified (-g, -u, -t, -T, -O, -w).
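In the job script, a finer-grained sampling run might look like this (a sketch; the 1,000-microsecond interval is just an illustrative value):

    export PAT_RT_EXPERIMENT=samp_pc_time
    export PAT_RT_INTERVAL=1000      # sample every 1,000 microseconds instead of the default 10,000

    aprun -n 24 ./exe+pat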

Performance Analysis: Workflow

Predefined hardware performance counter groups: build and instrument the code as usual, then set the PAT_RT_HWPC environment variable (e.g., export PAT_RT_HWPC=3). There are 20 predefined groups, covering:

    Summary
    L1, L2, L3 cache data accesses and misses
    Bandwidth info
    Hypertransport info
    Cycles stalled, resources idle/full
    Instructions and branches
    Instruction caches
    Cache hierarchy
    FP operations mix, vectorization, single-precision, double-precision
    Prefetches

See "man hwpc" for the full list and group numbers.

For summary data:

    export PAT_RT_HWPC=0

This shows MFLOPS, MIPS, computational intensity (FP ops / memory access), etc.
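A typical counter run, again set up inside the batch job script (a sketch using the summary group described above):

    # in the job script, before launching the instrumented executable
    export PAT_RT_HWPC=0         # group 0 = summary: MFLOPS, MIPS, computational intensity
    aprun -n 24 ./exe+pat

The counter data then appears in the tables produced by pat_report on the resulting .xf file.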

Performance Analysis: Reports

Example program used for the reports that follow:

    #include <mpi.h>
    #include <stdio.h>

    #define N       10000
    #define LOOPCNT 10000

    void loop(float a[], float b[], float c[]);

    int main(int argc, char *argv[])
    {
        int i, rank;
        float a[N], b[N], c[N];

        for (i = 0; i < N; i++) {
            a[i] = i * 1.0;
            b[i] = i * 1.0;
        }

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < LOOPCNT; i++) {
            loop(a, b, c);
        }

        MPI_Finalize();

        if (rank == 0) {
            for (i = 0; i < N; i++) {
                printf("c[%d]= %f\n", i, c[i]);
            }
        }
        return 0;
    }

    void loop(float a[], float b[], float c[])
    {
        int i, numprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

Performance Analysis: Reports

Default profiling:

    cc -c exe.c ; cc -o exe exe.o ; pat_build exe ; pat_report *.xf > rpt

    CrayPat/X:  Version 5.1 Revision 3746 (xf 3586)  08/20/10 16:46:28
    Number of PEs (MPI ranks):     6
    Numbers of PEs per Node:       6 PEs on 1 Node
    Numbers of Threads per PE:     1 thread on each of 6 PEs
    Number of Cores per Socket:   12
    Execution start time:  Mon Apr 18 13:23:12 2011
    System name, type, and speed:  x86_64 1900 MHz

    Table 2: Profile by Group, Function, and Line

    Samp %   Samp   Imb.   Imb.     Group
                    Samp   Samp %    Function
                                      Source
                                       Line
                                        PE='HIDE'

    100.0%     13     --       --   Total
    --------------------------------------------
    100.0%     13     --       --    USER
                                       subfunc
                                        rcasey/perform/exe_c/exe.c
     15.4%      2   1.83    55.0%        line.45
     84.6%     11   2.33    21.5%        line.46   <= for loop in subfunc function
    ============================================

Performance Analysis: Reports

Profile function calls:

    pat_build exe ; pat_report -O profile *.xf > rpt

    Table 1: Profile by Function Group and Function

    Samp %   Samp   Group
                     Function

    100.0%      2   Total
    ------------------------
     50.0%      1   ETC   vfprintf
     50.0%      1   USER  subfunc
    ========================

Performance Analysis: Reports

Profile user function calls:

    pat_build -u exe ; pat_report *.xf > rpt

    Table 1: Profile by Function Group and Function

    Time %       Time    Calls   Group
                                  Function

    100.0%   0.086681   1004.0   Total
    -------------------------------------
    100.0%   0.086677   1002.0   USER
    ------------------------------------
     76.2%   0.066092      1.0    main
     23.7%   0.020550   1000.0    subfunc
    =====================================

Performance Analysis: Reports

Combine MPI calls, I/O calls, and memory references:

    pat_build -g mpi,io,heap exe ; pat_report *.xf > rpt

    Table 1: Profile by Function Group and Function

    Time %       Time     Calls   Group
                                   Function

    100.0%   0.123657   12005.0   Total
    --------------------------------------
     79.9%   0.098813   10000.0   STDIO  printf
     20.1%   0.024828    1002.0   USER
    -------------------------------------
     16.9%   0.020847    1000.0    subfunc
      3.2%   0.003947       1.0    main
    ======================================

    Table 8: File Output Stats by Filename

       Write   Write MB      Write         Writes    Write   File Name
        Time                  Rate                  B/Call
                            MB/sec

    0.100870   0.203452   2.016974   10000.000000    21.33   Total
    -----------------------------------------------------------------
    0.100870   0.203452   2.016974   10000.000000    21.33   stdout
    =================================================================

    Table 9: Wall Clock Time, Memory High Water Mark

      Process    Process
        Total      HiMem
         Time   (MBytes)

     0.145398     22.160   Total
    ==========================

    Table 2: Load Balance with MPI Message Stats

    Time %       Time   Group

    100.0%   0.126971   Total
    ------------------------
     80.0%   0.101597   STDIO
     19.8%   0.025107   USER
    ========================

Performance Analysis: Reports

Loop statistics:

    cc -c -h profile_generate exe.c ; cc -o exe exe.o ; pat_build exe ; pat_report *.xf > rpt

    Table 1: Loop Stats from -h profile_generate

    Loop     Loop     Loop      Loop    Loop    Loop     Function=/.LOOP\. U.B.
    Time %    Hit    Trips     Trips   Trips    Notes
                       Avg       Min     Max

    100.0%   1003   9991.0      1000   10000    --       Total
    -------------------------------------------------------------------
     82.7%      1  10000.0     10000   10000    vector   main.loop.0.li.22
     82.7%      1   1000.0      1000    1000    novec    main.loop.1.li.30
     82.7%      1  10000.0     10000   10000    novec    main.loop.2.li.36
     17.3%   1000  10000.0     10000   10000    vector   subfunc.loop.0.li.47
    ===================================================================

Performance Analysis: Reports

I/O statistics:

    pat_build -O write_stats exe ; pat_report *.xf > rpt

    Table 1: File Output Stats by Filename

       Write   Write MB      Write         Writes    Write   File Name
        Time                  Rate                  B/Call
                            MB/sec

    0.108173   0.203452   1.880805   10000.000000    21.33   Total
    -----------------------------------------------------------------
    0.108173   0.203452   1.880805   10000.000000    21.33   stdout
    =================================================================