Compiling applications for the Cray XC

Compiler Driver Wrappers (1) All applications that will run in parallel on the Cray XC should be compiled with the standard language wrappers. The compiler drivers for each language are:
cc  - wrapper around the C compiler
CC  - wrapper around the C++ compiler
ftn - wrapper around the Fortran compiler
These scripts automatically choose the required compiler version, target architecture options, scientific libraries and their include files from the module environment. Use them exactly as you would the original compiler, e.g. to compile prog1.f90 run: ftn -c prog1.f90
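For example, a complete compile, link and run sequence with the Fortran wrapper might look like this (a sketch; the object and executable names are hypothetical, and aprun is covered later in these slides):

$ ftn -c prog1.f90            # compile; MPI and library include paths are added by the wrapper
$ ftn -o prog1 prog1.o        # link; MPI and scientific libraries are added automatically
$ aprun -n 24 ./prog1         # run on the compute nodes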

Compiler Driver Wrappers (2) The scripts choose which compiler to use from the PrgEnv module loaded:

PrgEnv         Description                    Real Compilers
PrgEnv-cray    Cray Compilation Environment   crayftn, craycc, crayCC
PrgEnv-intel   Intel Composer Suite           ifort, icc, icpc
PrgEnv-gnu     GNU Compiler Collection        gfortran, gcc, g++
PrgEnv-pgi     Portland Group Compilers       pgf90, pgcc, pgCC

Use module swap to change the PrgEnv, e.g. module swap PrgEnv-cray PrgEnv-intel. PrgEnv-cray is loaded by default at login; this may differ on other Cray systems, so use module list to check what is currently loaded. The Cray MPI module is loaded by default (cray-mpich2). To support SHMEM, load the cray-shmem module.
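A typical session to switch programming environments might look like this (module names as on this slide; the source file name is only an example):

$ module list                              # check what is loaded (PrgEnv-cray by default)
$ module swap PrgEnv-cray PrgEnv-intel     # switch to the Intel compilers
$ module load cray-shmem                   # optional: add SHMEM support
$ ftn -o prog1 prog1.f90                   # same wrapper command, now drives ifort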

Compiler Versions There are usually multiple versions of each compiler available to users. The most recent version is usually the default and will be loaded when swapping PrgEnvs. To change the version of the compiler in use, swap the compiler module, e.g. module swap cce cce/8.1.6

PrgEnv         Compiler Module
PrgEnv-cray    cce
PrgEnv-intel   intel
PrgEnv-gnu     gcc
PrgEnv-pgi     pgi
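To see which versions are installed and pick one (the version number shown is only illustrative):

$ module avail cce             # list the installed Cray compiler versions
$ module swap cce cce/8.1.6    # select a specific version
$ ftn -V                       # confirm the version the wrapper now drives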

EXCEPTION: Cross-Compiling Environment The wrapper scripts ftn, cc and CC create a highly optimised executable tuned for the Cray XC's compute nodes. This executable may not run on the login nodes:
1. Login nodes do not support running distributed-memory applications.
2. Some Cray architectures may have different processors in the login and compute nodes, i.e. cross-compilation.
If you are compiling for the login node you should use the original compiler commands directly, e.g. ifort, icpc, crayftn, gcc, g++ or gfortran. The PATH variable will change with the modules.
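For instance, to build a small serial helper that must run on the login node, call the underlying compiler directly instead of the wrapper (file names here are hypothetical):

$ gcc -O2 -o preprocess preprocess.c    # built for the login node, run it there
$ cc  -O2 -o simulate   simulate.c      # wrapper: tuned for compute nodes, run via aprun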

About the -I, -L and -l flags For libraries and include files provided through module files, you should NOT add anything to your Makefile:
No additional MPI flags are needed (they are included by the wrappers).
You do not need to add any -I, -L or -l flags for the Cray-provided libraries.
If your Makefile needs a value for -L to work correctly, try using "." (the current directory).
If you really, really need a specific path, check module show X for the relevant environment variables.
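Under these assumptions a build line needs nothing beyond the wrapper itself, and module show reveals what a module sets (cray-libsci is used here only as an example module name; the source file is hypothetical):

$ module show cray-libsci          # inspect the paths and variables the module exports
$ ftn -o simulate simulate.f90     # no extra -I, -L or -l flags needed for Cray-provided libraries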

Dynamic vs. static linking Currently static linking is the default. This may change in the future, and has already changed when linking for GPUs (XK6 nodes). To decide how to link, either:
1. set CRAY_LINK_TYPE to static or dynamic, or
2. pass the -static or -dynamic option to the linking compiler.
Features of dynamic linking: smaller executable, automatic use of new library versions; may need a longer startup time to load and find the libraries; the environment (loaded modules) should be the same between your compile setup and your batch script (e.g. when switching to PrgEnv-intel).
Features of static linking: larger executable (usually not a problem); faster startup; the application runs the same code every time it runs, independent of the environment.
If you want to hardcode the rpath into the executable, set CRAY_ADD_RPATH=yes during compilation. The executable will then always load the same version of the libraries at run time, independent of the version loaded by modules.
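A minimal sketch of building a dynamically linked executable with an embedded rpath (the source file name is hypothetical):

$ export CRAY_ADD_RPATH=yes               # record library paths in the executable
$ cc -dynamic -o simulate simulate.c      # or: export CRAY_LINK_TYPE=dynamic
$ ldd ./simulate                          # inspect which shared libraries will be loaded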

OpenMP OpenMP is supported by all of the PrgEnvs. CCE (PrgEnv-cray) recognizes and interprets OpenMP directives by default. If you have OpenMP directives in your application but do not wish to use them, disable OpenMP recognition with -hnoomp.

PrgEnv         Enable OpenMP   Disable OpenMP
PrgEnv-cray    -homp           -hnoomp
PrgEnv-intel   -openmp         (off by default; omit the flag)
PrgEnv-gnu     -fopenmp        (off by default; omit the flag)
PrgEnv-pgi     -mp             (off by default; omit the flag)
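Compiling the same hybrid code under two different PrgEnvs might look like this (the source file name is hypothetical; note that the flag changes with the environment):

$ ftn -homp -o hybrid hybrid.f90       # PrgEnv-cray (OpenMP recognition is on by default anyway)
$ module swap PrgEnv-cray PrgEnv-gnu
$ ftn -fopenmp -o hybrid hybrid.f90    # PrgEnv-gnu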

Compiler man pages For more information on the individual compilers:

PrgEnv         C            C++           Fortran
PrgEnv-cray    man craycc   man crayCC    man crayftn
PrgEnv-intel   man icc      man icpc      man ifort
PrgEnv-gnu     man gcc      man g++       man gfortran
PrgEnv-pgi     man pgcc     man pgCC      man pgf90
Wrappers       man cc       man CC        man ftn

To verify that you are using the correct version of a compiler, use:
the -V option on a cc, CC or ftn command with PGI, Intel and Cray;
the --version option on a cc, CC or ftn command with GNU.
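A quick check, e.g. with PrgEnv-cray loaded:

$ ftn -V          # reports the Cray Fortran compiler version the wrapper drives
$ man crayftn     # full documentation for that compiler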

Running applications on the Cray XC

How applications run on a Cray XC Most Cray XCs are batch systems. Users submit batch job scripts to a scheduler (e.g. PBS, MOAB, SLURM) from a login node for execution at some point in the future. Each job requests resources and predicts how long it will run. The scheduler (running on an external server) chooses which jobs to run and allocates appropriate resources. The batch system then executes the user's job script on a different login or batch MOM node. The scheduler monitors the job and kills any that overrun their runtime prediction. User job scripts typically contain two types of statements:
1. Serial commands, executed by the MOM node, e.g. quick setup and post-processing commands (rm, cd, mkdir, etc.).
2. Parallel executables that run on the compute nodes, launched using the aprun command.
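A minimal sketch of such a script (SLURM syntax, as used later in these slides; the directory, executable name and sizes are hypothetical):

#!/bin/bash
#SBATCH --ntasks=32
cd $WRKDIR/myrun                # serial: runs on the MOM node
mkdir -p output                 # serial: runs on the MOM node
aprun -n 32 ./simulate.exe      # parallel: launched onto the compute nodes
rm -rf tmp                      # serial: cleanup on the MOM node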

The two types of Cray XC nodes
Login or service nodes: this is the type of node you access when you first log in to the system. They run a full version of the CLE operating system (all libraries and tools available). They are used for editing files, compiling code, submitting jobs to the batch queue and other interactive tasks. They are shared resources that may be used concurrently by multiple users. There may be many login nodes in any Cray XC30, and they can be used for various system services (IO routers, daemon servers). They can either be connected to the Cray Aries network (internal login nodes) or act as proxies (external or eslogin nodes).
Compute nodes: these are the nodes on which production jobs are executed. They run Compute Node Linux, a version of the OS optimised for running batch workloads. They can only be accessed by submitting jobs through a batch management system (e.g. PBS Pro, Moab, SLURM). They are exclusive resources that may only be used by a single user. There are many more compute nodes in any Cray XC30 than login or service nodes. They are always directly connected to the Cray Aries network.

Primary file systems on Sisu
Home space ($HOME, Lustre): visible to compute nodes (but read-only); has quotas and is backed up.
Permanent application store ($USERAPPL, Lustre): for application binaries; visible to compute nodes; backed up.
Temporary compile area ($TMPDIR, 2 TB): local filesystem on the login nodes; compile here (NOT on home space).
Work space (768 TB, Lustre, /wrk, $WRKDIR): parallel filesystem optimized for large files and high bandwidth; use it for scientific output, restart files etc.; visible to login nodes and compute nodes; not backed up.
There is no local storage on compute nodes.
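A typical pattern under these conventions (directory and executable names are hypothetical) is to keep the binary in the backed-up application store and stage each run in the work space:

$ mkdir -p $WRKDIR/run01
$ cp $USERAPPL/simulate.exe $WRKDIR/run01/    # binary lives in the backed-up store
$ cd $WRKDIR/run01                            # job output and restart files go here (not backed up)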

Lifecycle of a batch script The example batch job script run.sh is submitted from an eslogin node with sbatch run.sh. The SLURM queue manager passes it to the scheduler, which allocates the requested resources; the script itself is executed on a SLURM MOM node, where its serial commands run, and the aprun line launches the parallel executable onto the Cray XC compute nodes.

#!/bin/bash
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=4
cd $WORKDIR
aprun -n 256 -N 16 simulation.exe
rm -r $WORKDIR/tmp

Running an application on the Cray XC: ALPS + aprun ALPS is the Application Level Placement Scheduler, and aprun is the ALPS application launcher. It must be used to run applications on the XC compute nodes, interactively or in a batch job. If aprun is not used, the application is launched on the MOM node (and will most likely fail). aprun launches groups of Processing Elements (PEs) on the compute nodes (PE == MPI rank == Coarray image == UPC thread). The aprun man page contains several useful examples. The three most important parameters to set are:

Description                                                                      Option
Total number of PEs used by the application                                      -n
Number of PEs per compute node                                                   -N
Number of threads per PE (more precisely, the stride between 2 PEs on a node)    -d

Hyperthreads Intel Hyper-Threading is a method of improving the throughput of a CPU by allowing two independent program threads to share the execution resources of one CPU. When one thread stalls, the processor can execute ready instructions from a second thread instead of sitting idle. Because only the thread context state and a few other resources are replicated (unlike replicating entire processor cores), the throughput improvement depends on whether the shared execution resources are a bottleneck; it is typically much less than 2x with two hyperthreads. With aprun, hyper-threading is controlled with the -j switch:
-j 1 = no hyper-threading (a node is treated as containing 16 cores)
-j 2 = hyper-threading enabled (a node is treated as containing 32 cores)
It may improve or degrade the performance of your application. Try it; if it does not help, turn it off.
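A simple way to test this for your own code (the executable name is hypothetical; compare the measured wall times of the two runs):

$ aprun -n 16 -j 1 ./simulate.exe     # one rank per physical core
$ aprun -n 32 -j 2 ./simulate.exe     # two ranks per physical core, using hyperthreads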

Running applications on the Cray XC30: some basic examples Assuming an XC30 with Sandybridge nodes (32 cores per node with hyperthreading).
Pure MPI application, using all the available HT cores in a node:
$ aprun -n 32 -j 2 ./a.out
Pure MPI application, using only 1 rank per node: 32 MPI tasks on 32 nodes, with 32*32 cores allocated. Can be done to increase the available memory per MPI task.
$ aprun -N 1 -n 32 -d 32 -j 2 ./a.out
Hybrid MPI/OpenMP application, 4 MPI ranks (PEs) per node: 32 MPI tasks with 8 OpenMP threads each; remember to set OMP_NUM_THREADS.
$ export OMP_NUM_THREADS=8
$ aprun -n 32 -N 4 -d $OMP_NUM_THREADS -j 2 ./a.out

Scheduling a batch application with SLURM The number of required nodes and tasks can be specified in the job header. The job is submitted with the sbatch command. At the end of the execution, output and error files are returned to the submission directory. You can check the status of jobs with squeue and delete a job with scancel.
Hybrid MPI + OpenMP example:

#!/bin/bash
#SBATCH --job-name=hybrid
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=4
aprun -n 64 -d 4 -N 4 -j 1 ./a.out
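Submitting and monitoring such a job might look like this (the script name and job ID are only illustrative):

$ sbatch hybrid.sh      # prints e.g. "Submitted batch job 123456"
$ squeue -u $USER       # check the status of your jobs
$ scancel 123456        # delete the job if needed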

Other SLURM options
#SBATCH --output=std.out and #SBATCH --error=std.err : file names for stdout and stderr.
#SBATCH --ntasks=N : start N PEs (tasks).
#SBATCH --ntasks-per-node=M : start M tasks per node; the total number of nodes used is N/M (+1 if mod(N,M) != 0).
#SBATCH --cpus-per-task=8 : number of threads per task; mostly used together with OMP_NUM_THREADS.
#SBATCH --nodes=8 : reserve 8 nodes (128 physical cores) but possibly not use all of them, e.g. to have more memory per core.
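Putting several of these options together in one header (the values are only an illustration of the syntax):

#!/bin/bash
#SBATCH --output=std.out
#SBATCH --error=std.err
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2
aprun -n 64 -N 8 -d 2 ./a.out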

SLURM and aprun

SLURM                       aprun          Description
--ntasks=$PE                -n $PE         number of PEs to start
--cpus-per-task=$threads    -d $threads    number of threads per PE
--ntasks-per-node=$N        -N $N          number of PEs per node

Shortcut: aprun's -B option will automatically use the appropriate SLURM settings for -n, -N, -d and -m, e.g. aprun -B ./a.out
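An alternative sketch that forwards the batch settings explicitly through the environment variables SLURM exports inside the job (check on your system which of these variables are actually set for your submission):

#!/bin/bash
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
aprun -n $SLURM_NTASKS -N $SLURM_NTASKS_PER_NODE -d $SLURM_CPUS_PER_TASK ./a.out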

Watching a launched job on the Cray XC
xtnodestat : shows how XC nodes are allocated and the corresponding aprun commands.
apstat : shows the status of aprun processes. apstat alone gives an overview, apstat -a [apid] gives info about all applications or a specific one, and apstat -n gives info about the status of the nodes.
The batch squeue command shows batch jobs.
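For example, to follow a running job (the application ID shown is illustrative):

$ squeue -u $USER         # is my batch job running yet?
$ xtnodestat | less       # which nodes is it using?
$ apstat -a 204731        # details of a specific aprun invocation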