PROGRAMMING MODEL EXAMPLES

(Cray Inc 2015)

DEMONSTRATION EXAMPLES OF VARIOUS PROGRAMMING MODELS

OVERVIEW

Building an application to use multiple processors (cores, CPUs, nodes) can be done in various ways, and the High Performance Computing (HPC) community in particular has long experience in using multithreaded and multi-process programming models to develop fast, scalable applications on a range of hardware platforms. Typical APIs in use are OpenMP (source-level threading markup) and MPI (message passing). The examples provided and described here can be used to compare these models on a very simple problem, and as a platform for learning how to build and launch parallel applications on the Cray XC supercomputer. The demonstrator problem was chosen for its simplicity and its lack of subtle issues across the range of programming models.

THE SAMPLE PROBLEM AND ALGORITHM

The example we have chosen is to calculate π by piecewise approximation of the area of a circle. You could do this by drawing a quarter circle on graph paper and counting the number of squares inside and outside the circle. This approximates the area within the circle, from which π can be determined. A more complicated example might be to find the area or volume of a fractal structure; this would be much more computationally intensive and analytically complex. The algorithm covers the quarter circle with an N by N grid and tests the centre of each grid square to determine whether that square lies inside the circle (the errors at the boundary should average out somewhat). Each program loops over the i and j indices to obtain the coordinates of the centre of each small square within the unit square. The approximate value of π is then 4 times the number of squares inside the quarter circle divided by the total number of squares. (We could of course do this in a much better way, but not if we want to generalise to something like a fractal shape.)
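As a point of reference, a minimal serial sketch of this counting algorithm in C might look like the following. This is an illustration of the idea only, not the supplied pi_serial.c, and the grid size n used here is an arbitrary choice:

    /* Minimal serial sketch of the quarter-circle counting algorithm.
       Illustrative only; not the supplied pi_serial.c. */
    #include <stdio.h>

    int main(void)
    {
        const int n = 10000;          /* grid resolution (arbitrary choice) */
        const double h = 1.0 / n;     /* width of each small square */
        long inside = 0;

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double x = (i + 0.5) * h;   /* centre of square (i,j) */
                double y = (j + 0.5) * h;
                if (x * x + y * y <= 1.0)   /* centre inside the quarter circle? */
                    inside++;
            }
        }

        printf("pi is approximately %.6f\n",
               4.0 * (double)inside / ((double)n * (double)n));
        return 0;
    }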

CODE EXAMPLES PROVIDED

The algorithm described above has been coded using various programming models. You will need to unpack Cray_pm_examples.tar or access a location provided to you. Once unpacked you will find a directory with a README, sample job scripts, and subdirectories C and Fortran which contain the source for multiple versions of the pi program. The following versions are available:

Programming Model                              Source File
Basic Fortran                                  pi_serial.f90
Fortran using parallel language features       pi_coarray.f90
Basic Fortran with OpenMP threading            pi_openmp.f90
Fortran using MPI                              pi_mpi.f90
Fortran using MPI and OpenMP (hybrid)          pi_mpi_openmp.f90
Fortran using SHMEM                            pi_shmem.f90
Fortran using OpenACC                          pi_openacc.f90
Basic C                                        pi_serial.c
C with pthreads                                pi_pthreads.c
C with OpenMP                                  pi_openmp.c
C with MPI                                     pi_mpi.c
C with MPI and OpenMP                          pi_mpi_openmp.c
C with SHMEM                                   pi_shmem.c
C with OpenSHMEM                               pi_openshmem.c
UPC                                            pi_upc.c
C++ using coarray implementation (Cray)        pi_coarray.cpp

Note that the pthreads version uses OMP_NUM_THREADS to pick up the number of threads to use, so that it can be run from a job set up to run an OpenMP application. Note in particular that no attempt has been made to optimize these examples; the goal was to provide the simplest possible code to illustrate the programming models. In some cases idioms are used that illustrate a feature of the programming model even if you would not normally code that way (for example, the SHMEM version uses atomic update instead of a reduction).

In each language directory (Fortran and C) is a Makefile that will build most of the examples with the Cray CCE compiler. The SHMEM and OpenACC examples need specific modules to be loaded before they will compile. In addition, more complete versions of the sources are supplied (Fortran_timing and C_timing) with timer code added; these are more useful for scaling studies because the internal timing measures the elapsed time of the computation only.

Job scripts are provided in the jobscripts directory which you can use as templates to create your own scripts. They work with one particular installation of PBS, and the PBS headers and directory definition may need to be changed for your environment. The scripts are:

Job Script        Scenario
pi.job            Serial run (use also for OpenACC)
pi_threads.job    Multithreaded run (OMP_NUM_THREADS)
pi_pes.job        Multi-process run (MPI, coarrays, SHMEM)
pi_hybrid.job     Hybrid run (for example MPI/OpenMP)

In the following sections there is more information about the programming environment, which you can use if you need it, along with some very basic information about the programming models covered. Have a look over those sections, or alternatively just dive in and work as suggested by your instructor or guided by your own interests. You might wish to consider the following activities:

- Decide on your area of interest.
- Learn how to compile and run some of the examples. You can choose serial, threaded, many processes or hybrid.
- Compare the source code to get a feel for each programming model, noting that these examples only use the most basic features available.
- Look at scalability as you increase the number of processes or threads (see the timing sketch after this list). Does the code run 2 times faster when you double the resources available? Does it matter how large (or small) a problem size you pick?
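For scaling studies the supplied Fortran_timing and C_timing sources already include their own timer code; the fragment below is only a rough sketch of the idea in C, timing the computation region alone and using a single OpenMP directive on the outer loop. It is not the supplied pi_openmp.c, and the grid size is again an arbitrary choice:

    /* Rough sketch: the threaded (OpenMP) form of the counting loop,
       timed with omp_get_wtime() so only the computation is measured.
       Illustrative only; the supplied *_timing sources have their own
       timer code. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 10000;           /* grid resolution (arbitrary) */
        const double h = 1.0 / n;
        long inside = 0;

        double t0 = omp_get_wtime();   /* start: computation only */
        #pragma omp parallel for reduction(+:inside)
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double x = (i + 0.5) * h;
                double y = (j + 0.5) * h;
                if (x * x + y * y <= 1.0)
                    inside++;
            }
        }
        double t1 = omp_get_wtime();   /* stop */

        printf("pi ~= %.6f with up to %d threads in %.3f s\n",
               4.0 * (double)inside / ((double)n * (double)n),
               omp_get_max_threads(), t1 - t0);
        return 0;
    }

Doubling OMP_NUM_THREADS and comparing the reported times gives a first impression of how well the computation scales.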

BASIC NOTES ON THE PROGRAMMING MODELS

Various programming models are covered by the examples. If you do not have much background in High Performance Computing these could seem daunting. The models are all attempting to provide the HPC programmer with a way to decompose a problem efficiently onto complex distributed hardware, which can range from a multi-core processor to thousands of complex nodes in a supercomputer.

MULTITHREADED MODELS

The model familiar to a C systems programmer would be pthreads, but this does not have much traction in the HPC community because of the need to manage thread creation explicitly and its restriction to C. More prevalent is the approach of adding compiler directives that advise the compiler how to target computation to a set of threads. OpenMP is the example chosen here, and only one OpenMP directive/pragma is used, which parallelizes the outer i loop.

COOPERATING PROCESS MODELS

These form a family where normally the same binary is run multiple times, each instance encapsulated as a Linux process. The model allows the programmer to organize the computation by providing a way to determine how many processes are executing and to enumerate each process instance. There will also be a mechanism to communicate data between processes (message passing), through either point-to-point operations or collective operations (for example a sum over all processes). Traditional message-passing APIs are two-sided (a send on one process is matched by a receive on the other); it is also possible to have single-sided communication, where one process performs a put or get operation without explicit involvement of the other process. The examples here are MPI and SHMEM. If you look at the examples they follow the same pattern: the API is used to determine the number of processes/ranks/PEs (npes, size) and which one is executing (mype, rank); the API is also used to coordinate the processes (barriers and synchronization) and to add up the contribution to the final count from each process. A minimal MPI sketch of this pattern is shown after these notes.

PGAS MODELS

These models are more complex and are typically languages that support direct access to local and remote data (SHMEM was placed in the previous section because it is an explicit API rather than a language). Examples are Fortran coarrays, UPC and Chapel. Fortran coarrays is the simplest; UPC allows data distribution across threads and provides special syntax to operate collectively on distributed (shared) variables; Chapel is the most fully featured, with advanced support for a range of parallel programming paradigms.
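The following is a rough sketch of the cooperating-process pattern just described, written in C with MPI. It is not the supplied pi_mpi.c: the strided split of the i loop and the grid size are simply one possible illustration of the idea.

    /* Rough sketch of the cooperating-process pattern using MPI.
       Each rank handles a strided subset of the i loop and the local
       counts are combined with a collective reduction. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        const int n = 10000;          /* grid resolution (arbitrary) */
        const double h = 1.0 / n;
        long inside = 0, total = 0;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */

        for (int i = rank; i < n; i += size) {  /* my share of the rows */
            for (int j = 0; j < n; j++) {
                double x = (i + 0.5) * h;
                double y = (j + 0.5) * h;
                if (x * x + y * y <= 1.0)
                    inside++;
            }
        }

        /* Collective operation: sum the per-process counts onto rank 0. */
        MPI_Reduce(&inside, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi ~= %.6f using %d processes\n",
                   4.0 * (double)total / ((double)n * (double)n), size);

        MPI_Finalize();
        return 0;
    }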

MANIPULATING YOUR PROGRAMMING ENVIRONMENT

The Cray Programming Environment is designed to simplify the process of building parallel applications for the Cray XC30. It supports multiple underlying tool chains (compilers) from different vendors (Cray, Intel and GNU). Only one of the available tool chains, or Programming Environments (PrgEnvs), can be active at any one time. To see the list of currently loaded modules type:

    module list

A list will be displayed on screen, one entry of which will be of the form PrgEnv-*, e.g.

    20) PrgEnv-cray/5.2.14

To see which other modules are available to load run the command

    module avail

On most systems this may be a very long list; to be more selective run:

    module avail PrgEnv

which will list all available modules that start with PrgEnv. To swap from the current PrgEnv to another PrgEnv (e.g. to select a different tool chain), run:

    module swap PrgEnv-cray PrgEnv-intel

This will unload the Cray PrgEnv and replace it with the Intel PrgEnv.

COMPILING AN APPLICATION

To build applications to run on the XC30, users should always use the Cray compiler driver wrappers ftn (Fortran), cc (C) and CC (C++) to compile their applications. These wrappers interface with the Cray Programming Environment and make sure the correct libraries and flags are supplied to the underlying compiler. When compiling an application, make sure that the appropriate modules are loaded before starting to compile, then modify the build system to call the appropriate wrapper scripts in place of the compilers themselves. If you wish to try other compilers you may need to add compiler flags, for example to enable OpenMP.

SUBMITTING JOBS TO THE BATCH SYSTEM

Nearly all XC30 systems run some form of batch scheduling system to share the nodes of the system fairly between all the users. Therefore, users must submit all their work in the form of batch job scripts for the scheduler to run at some point in the future.

Example scripts for the PBS Pro (version 12+) batch scheduler, plus very simple ECMWF job directives, are provided in the example directory ./jobscripts/. They are:

pi.job          Runs a program in serial on a compute node
pi_pes.job      Runs a non-threaded program on multiple PEs
pi_threads.job  Runs a threaded program on one PE with threads
pi_hybrid.job   Runs a hybrid program on a mixture of PEs and threads

You will need to edit these scripts. They include special #PBS comments which are used to provide information to the batch system about the job resource requirements, e.g.

    #PBS -N pi

specifies the name of the job (displayed in queues etc.),

    #PBS -l EC_nodes=1

specifies the number of nodes that are required (in this case just one node), and

    #PBS -l walltime=0:05:00

specifies the length of time the job will take (no longer than five minutes). Some systems, like ECMWF, may require additional queue information, e.g.

    #PBS -q np

Jobs are then submitted for execution using the qsub command:

    qsub jobscripts/pi.job

which will return a job id that uniquely identifies the job. Any of the #PBS settings contained within a file can be overridden at submission time, e.g.

    qsub -l EC_nodes=2 jobscripts/pi_pes.job

changes the number of nodes requested for the job.

To view the status of all the jobs in the system use the qstat command. To find information about a specific job using the jobid:

    qstat <jobid.sdb>

To find information about jobs belonging to a specific user run:

    qstat -u <username>

By default, when a run has completed, the output from the run is left in the run directory in the form <jobname>.o<jobid> for stdout and <jobname>.e<jobid> for stderr.

LAUNCHING APPLICATIONS ON THE CRAY XC30

Some time after a job is submitted the scheduler will assign the resources the job requested and start the job script running. At this point the supplied script will begin to execute on a PBS MOM node, and it is important to understand that at this stage most of the programs in the script are not yet running in parallel on the Cray XC30 compute nodes. So, for example, a simple statement like this:

    cd ${PBS_O_WORKDIR}

(which changes the current working directory to the directory from which the original qsub command was executed) is actually run in serial on the PBS MOM node. Only programs/commands that are preceded by an aprun statement will run in parallel on the XC30 compute nodes, e.g.

    aprun -n 16 -N 16 ./pi_mpi

Therefore users should take care that long or computationally intensive jobs are only ever run inside an appropriate aprun command. Each batch job may contain multiple programs with aprun statements, as long as:

- no individual parallel program requests more resources than have been allocated by the batch scheduler, and
- the job does not exceed the wall clock time limit specified in the submission script.

See the lecture notes and the manual page (man aprun) for a full explanation of the command-line arguments to aprun.

Launching Hybrid MPI/OpenMP applications

The aprun command cannot launch application threads directly. This is done by the application itself, commonly via the OpenMP runtime; a minimal sketch of this follows at the end of this section. However, aprun can reserve space on the CPUs for the threads to execute on via the -d parameter. For a typical OpenMP application the script should include the following settings:

    export OMP_NUM_THREADS=4
    aprun -n ${NPROC} -N ${NTASK} -d ${OMP_NUM_THREADS} ./pi_mpi_openmp
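As an illustration of this division of labour, the rough sketch below (not the supplied pi_mpi_openmp.c) has each MPI rank simply report how many OpenMP threads its runtime has created. aprun places the ranks and reserves cores with -d, while the threads themselves are spawned inside the program by the OpenMP runtime according to OMP_NUM_THREADS:

    /* Rough sketch: aprun launches the MPI ranks, but the threads are
       created here, by the OpenMP runtime, following OMP_NUM_THREADS.
       Illustrative only; not the supplied pi_mpi_openmp.c. */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int rank, provided;

        /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            #pragma omp master
            printf("rank %d running %d OpenMP threads\n",
                   rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

With the job settings shown above (OMP_NUM_THREADS=4), each rank would normally report 4 threads.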