CSCS Proposal writing webinar Technical review. 12th April 2015 CSCS

Agenda: tips for new applicants; CSCS overview; allocation process; guidelines; basic concepts; performance tools; demo; Q&A / open discussion.

Tips for new applicants

CSCS: Overview (Nov. 2014) http://www.top500.org http://www.green500.org 4

CSCS: Usage statistics (2013) Usage by Institution Usage by Research Field http://www.cscs.ch/publications/annual_reports/ 5

CSCS: Piz Daint. The CSCS petascale system is a hybrid Cray XC30 with 5272 compute nodes connected by the Aries interconnect. Each compute node hosts one Intel Sandy Bridge CPU (8 cores) and one NVIDIA K20X GPU, for a total of 42176 cores and 5272 GPUs and a peak performance of 7.8 PFlops. http://user.cscs.ch/computing_resources/piz_daint_and_piz_daint_extension/

Submitting a project at CSCS http://www.cscs.ch/user_lab/ 7

Guidelines: benchmarking and scaling

Proposal format 10

Benchmarking and Scalability. There are many ways to measure execution time; the simplest is to time the aprun command and report the real (wall-clock) time. It is the applicant's responsibility to show the scalability of their application: a code scales if its execution time decreases when using an increasing number of parallel processing elements (cores, processes, threads, etc.). The idea is to find the scalability limit of your application, i.e. the point at which the execution time stops decreasing.
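
As an illustration, a minimal timing sketch along these lines, run inside a batch or interactive allocation; the executable name, step count and task layout are placeholders, not the exact commands from the slides:

  #!/bin/bash
  # Time the same 50-step run on several node counts and keep the 'real' time,
  # which is what should be reported in the proposal.
  for nodes in 1 2 4 8 16; do
      ntasks=$(( nodes * 8 ))      # assuming 8 MPI tasks per XC30 node
      echo "== ${nodes} compute nodes =="
      { time aprun -n ${ntasks} -N 8 ./myexe --steps 50 ; } 2>&1 | grep real
  done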

Typical user workflow: launching parallel jobs. The batch system used at CSCS is SLURM. Submit your jobscript:
  cd $SCRATCH/
  cp /project/*/$USER/myinput .
  sbatch myjob.slurm
  squeue -u $USER
  scancel myjobid    # if needed
Adapt the jobscript to your needs: aprun [options] myexecutable, where -n is the total number of MPI tasks, -N the number of MPI tasks per compute node (<= 8), and -d the number of OpenMP threads. http://user.cscs.ch/get_started/run_batch_jobs
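
For reference, a minimal jobscript sketch in this spirit; the job name, node count, wall time and output file are illustrative placeholders, not CSCS defaults, and the aprun line matches the BT-MZ example used later in this presentation:

  #!/bin/bash -l
  #SBATCH --job-name=bt-mz          # placeholder job name
  #SBATCH --nodes=8                 # number of XC30 compute nodes
  #SBATCH --time=00:30:00           # requested wall-clock time
  #SBATCH --output=bt-mz.%j.out

  # 4 MPI tasks per node x 2 OpenMP threads per task = 8 cores per node
  export OMP_NUM_THREADS=2
  aprun -n 32 -N 4 -d 2 ./bt-mz_c.32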

Cost study. To run 50 steps, my code takes 1h/6 (~10 min) on 16 nodes, 1h/40 (~2 min) on 128 nodes, and 1h/200 (<1 min) on 1024 nodes. For my production job I must run 2400 steps (48x more). How many compute node hours (CNH) should I ask for?
User type #0: 2048 nodes * (1h/100) * 48 = ~980 CNH! Human time = 30 minutes.
User type #1: 1024 nodes * (1h/200) * 48 = ~246 CNH. Human time < 15 minutes.
User type #2: 16 nodes * (1h/6) * 48 = 128 CNH. Human time = 8 hours (>> 15 min!), BUT I used only half the CNH, so I can submit another 2400-step job.
User type #3: 128 nodes * (1h/40) * 48 = ~154 CNH. Human time = 1 hour 12 min. Faster than 8 hours, and I can still submit another 2400-step job.
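
A hypothetical helper for this arithmetic (the function name and the figures in the usage line are illustrative):

  # CNH = nodes x hours-per-50-steps x (total_steps / 50)
  cnh () {
      awk -v n="$1" -v h="$2" -v s="$3" 'BEGIN { printf "%.0f CNH\n", n * h * s / 50 }'
  }
  cnh 128 0.025 2400    # 128 nodes at 1h/40 per 50 steps -> ~154 CNH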

Benchmarking: MPI+OpenMP code on Piz Daint (small problem size). NAS BT-MZ benchmark, Class C = 480x320x28.

Speedup and Efficiency. It can sometimes be difficult to read a scaling curve, so it is standard to compare execution times against a reference time. Speedup is defined as Speedup_n = t_0 / t_n, where t_0 is the reference execution time and t_n is the execution time on n compute nodes. Linear (ideal) speedup is obtained when Speedup_n = n: doubling the number of processors doubles the speed, which is considered very good scalability. Efficiency is defined as Efficiency_n = Speedup_n / n. Codes with an efficiency > 0.5 are considered scalable.
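
A small sketch of how these quantities can be computed from measured wall-clock times; the timings below are placeholders, not the benchmark results shown in the slides:

  # timings.txt columns: nodes  time_in_seconds (first line = reference measurement)
  printf '1 600\n2 310\n4 170\n8 100\n' > timings.txt
  awk 'NR==1 { n0=$1; t0=$2 }
       { s = t0/$2; e = s/($1/n0);
         printf "%4d nodes  speedup %.2f  efficiency %.2f\n", $1, s, e }' timings.txt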

Scalability quiz: MPI+OpenMP code on Piz Daint (small problem size, C = 480x320x28). Answer: use MPI & OpenMP; the code scales up to 8 compute nodes.

Scalability quiz: MPI+OpenMP code on Piz Daint (medium problem size, D = 1632x1216x34). Answer: use MPI & OpenMP; the code scales up to 64 compute nodes.

How much can you ask? Allocation units are node hours, and projects are always charged full node hours, even if not all the CPU cores are used and even if the GPU is not used. Compare different job sizes (using the time command), report in your proposal the execution timings (without tools) for each job size, and multiply the optimal job size (nodes) by the execution time (hours) to find the amount to request (compute node hours). Performance data must come from jobs run on Piz Daint, for a problem similar to that proposed in the project description: use a problem size that best matches your intended production runs; measure scaling on the overall performance of the application; and compare results from the same machine and the same computational model, with no simplified models or preferential configurations. There are many ways of presenting mediocre performance results in the best possible light: we know them. Contact us if you need help: help@cscs.ch

Guidelines: performance report

Typical user workflow: compilation.
0. Connect to CSCS: ssh -X myusername@ela.cscs.ch, then ssh daint.
1. Set up your programming environment. Four compilers are available (CCE, GNU, Intel, PGI); for example: module swap PrgEnv-cray PrgEnv-gnu. Useful module commands: module list, module avail, module avail cray-netcdf, module show cray-netcdf, module load cray-netcdf/4.3.2, module rm cray-netcdf/4.3.2. Cray specifics: you must use the Cray wrappers to compile: ftn (Fortran), cc (C), CC (C++). The wrappers detect the modules you have loaded and automatically use the compiler and libraries that you need.
2. Compile your code: make clean; make bt-mz MAIN=bt CLASS=C NPROCS=64
3. Submit your job: cd bin; sbatch myjob.slurm
http://user.cscs.ch/compiling_optimizing/compiling_your_code
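
Put together as one session, the steps above might look as follows; the build directory is a placeholder and the module names are those quoted on this slide:

  ssh -X myusername@ela.cscs.ch        # front-end login
  ssh daint                            # from ela to the Piz Daint login node
  module swap PrgEnv-cray PrgEnv-gnu   # choose the GNU environment
  module load cray-netcdf/4.3.2        # optional library module; the wrappers pick it up
  cd $SCRATCH/NPB3.3-MZ-MPI            # placeholder build directory
  make clean
  make bt-mz MAIN=bt CLASS=C NPROCS=64
  cd bin
  sbatch myjob.slurm
  squeue -u $USER                      # check job status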

Typical user workflow: compilation with perftools. perflite is Cray's lightweight performance tool: it provides performance statistics automatically and supports all compilers (CCE, Intel, GNU, PGI).
1. Set up your programming environment: module swap PrgEnv-cray PrgEnv-gnu; module use /project/csstaff/proposals; then load perflite/622 for MPI/OpenMP codes, or craype-accel-nvidia35 plus perflite/622cuda for MPI/OpenMP + CUDA codes, or craype-accel-nvidia35 plus perflite/622openacc for MPI/OpenMP + OpenACC codes.
2. Recompile your code with the tool (no need to modify the source code, just recompile): cd $SCRATCH/proposals.git/vihps/NPB3.3-MZ-MPI; make clean; make bt-mz MAIN=bt CLASS=C NPROCS=64
3. Submit your job (no need to modify the jobscript, just rerun): cd bin; sbatch myjob.slurm
4. Read the performance report printed at the end of the job (cat *.rpt) and attach the full report to your proposal.
http://user.cscs.ch/compiling_optimizing/performance_report
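
Condensed into one runnable sequence (module names and versions are those quoted on this slide and may have changed since):

  module swap PrgEnv-cray PrgEnv-gnu
  module use /project/csstaff/proposals

  # Pick ONE of the following, depending on the code:
  module load perflite/622                                  # MPI/OpenMP codes
  # module load craype-accel-nvidia35 perflite/622cuda      # MPI/OpenMP + CUDA codes
  # module load craype-accel-nvidia35 perflite/622openacc   # MPI/OpenMP + OpenACC codes

  # Recompile (no source changes needed) and rerun with the unchanged jobscript:
  cd $SCRATCH/proposals.git/vihps/NPB3.3-MZ-MPI
  make clean && make bt-mz MAIN=bt CLASS=C NPROCS=64
  cd bin && sbatch myjob.slurm
  cat *.rpt    # performance report, produced once the job has finished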

Performance: MPI+OpenMP code on Piz Daint (BT, CLASS=C, 50 steps). The code stops scaling above 8 nodes. Why? What can we learn from the performance reports? Instrumented runs: 8 CN: aprun -n32 -N4 -d2 ./bt-mz_c.32+pat622; 16 CN: aprun -n64 -N4 -d2 ./bt-mz_c.64+pat622; 32 CN: aprun -n64 -N2 -d4 ./bt-mz_c.64+pat622. (Setup as on the previous slide: swap to PrgEnv-gnu, module use /project/csstaff/proposals, module load perflite/622, recompile, resubmit, then cat *.rpt.)

Demo: MPI+OpenMP / 32CN

GPU codes

Scalability: MPI+OpenMP+CUDA code on Piz Daint. The code scales up to fewer than 256 compute nodes. What can we learn from the tool?

Performance: MPI+OpenMP+CUDA code on Piz Daint (CP2K). Instrumented runs: 128 CN: aprun -n128 -N1 -d8 cp2k+pat622; 512 CN: aprun -n512 -N1 -d8 cp2k+pat622.

Performance: MPI+OpenMP+CUDA code on Piz Daint (same runs as above). To allow multiple CPU cores to simultaneously utilize a single GPU, the CUDA proxy (MPS) must be enabled: CRAY_CUDA_MPS=1. Performance tools are only supported with the CUDA proxy disabled: CRAY_CUDA_MPS=0.
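
A minimal sketch of where these settings could go in a jobscript; the SLURM directives and the node/thread counts simply mirror the CP2K example above and are not prescriptive:

  #!/bin/bash -l
  #SBATCH --nodes=128
  #SBATCH --time=01:00:00

  export OMP_NUM_THREADS=8

  # Production run: enable the CUDA proxy so that several CPU cores/ranks
  # can share a single GPU per node.
  export CRAY_CUDA_MPS=1
  aprun -n 128 -N 1 -d 8 ./cp2k

  # Instrumented run for the performance report: the tool requires the proxy off.
  # export CRAY_CUDA_MPS=0
  # aprun -n 128 -N 1 -d 8 ./cp2k+pat622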

Conclusion

Recipe for a successful technical review. If the answer to any of the following questions is 'No', there is a risk that your proposal will not be accepted: Does your code run on Piz Daint? Did you provide scalability data from Piz Daint? Did you attach the .rpt performance report file(s)? Is your resource request consistent with your scientific goals? Does the project plan fit the allocated time frame?

Future events (2015): Orientation course (EPFL, new); Summer School (Ticino); PASC'15 (ETHZ); HPC Visualisation using Python (new); EuroHACKaton; Orientation course (ETHZ, new); C++ for HPC (new); Introduction to CUDA/OpenACC. Online registration: http://www.cscs.ch/events

Thank you for your attention. Questions?