LAB. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

LAB Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld
dan/lars/bbarth/milfeld@tacc.utexas.edu
XSEDE 12, July 16, 2012

Discovery System Login & Setup

Log in to our Discovery system (train2## is your assigned login):
ssh train2xx@login2.discovery.tacc.utexas.edu

Untar the lab files to your directory:
tar xvf ~milfeld/xsede_12.tar

cd to the appropriate directory and read the instructions here and/or in the info file:

0.) How to use the SLURM batch system (slurm)

OpenMP Experiments
1.) Overhead in OpenMP (omp_overhead)
2.) Setting Affinity in code (affinity)
3.) QR factorization (caqr)

Vectorization Experiments
1.) Vectorization (vector)
2.) Alignment (align)
3.) TBD ()

( ) = directory

SLURM: Batch Utility

slurm is the batch system. Details at:
https://computing.llnl.gov/linux/slurm/documentation.html

sinfo          -- show current partitions (queues)
squeue         -- show your queued jobs
sbatch myjob   -- submit batch job
scancel <jid>  -- delete job with id <jid>

Login Node: login2.discovery.tacc.utexas.edu
  E5-2680 Sandy Bridge 2.7 GHz (Hyper-Threading)
  2 CPUs x 8 cores each
  32KB L1, 256KB L2, 20MB L3

Queue (Partition): xsede

Compute Nodes: c3-401 through c3-420
  E5-2680 Sandy Bridge 2.7 GHz (Turbo on)
  2 CPUs x 8 cores each
  32KB L1, 256KB L2, 20MB L3

SLURM: sinfo, jobscript, sbatch

login2$ sinfo
PARTITION AVAIL TIMELIMIT   NODES STATE NODELIST
xsede     up    2-00:00:00      1 down* c3-412
xsede     up    2-00:00:00     23 idle  c3-[401-411,413-424]

If you submit a job, you will get one of these nodes.

login2$ cat job
#!/bin/bash
#SBATCH --time=5       # Time (minutes)
#SBATCH -p xsede       # Partition (queue)
#SBATCH -w "c3-401"    # Specific host -- you won't need to use this.
module load mkl        # If you use MKL, load the module.
export OMP_NUM_THREADS=16
./a.out

login2$ sbatch job
Output: slurm-<jobid>.out

SLURM: sbatch, squeue, srun

EXAMPLE:
login2$ sbatch job
Submitted batch job 2058
login2$ squeue
JOBID PARTITION NAME USER    ST TIME NODES NODELIST(REASON)
2058  xsede     job  milfeld R  0:09     1 c3-401
login2$ squeue
JOBID PARTITION NAME USER    ST TIME NODES NODELIST(REASON)
login2$ ls overhead
slurm-2058.out job info Makefile overhead.c scan_compact

You can obtain interactive access to a compute node through slurm with the srun command. srun submits an interactive job to the queue, and gives you an interactive prompt when you get a node. Example:

login2$ srun -p xsede --pty /bin/bash -l
c3-401$

Advise us if you want to use this.

SLURM: Experiment

Go to the slurm directory.
Look over the job script and submit it with sbatch.
Monitor the progress of your job with squeue.
Kill your job with scancel if it hasn't finished.
List details about all queues with the sinfo command.
Note that the output is returned in a file with the jobid in its name.

How many nodes were in the alloc state when you ran sinfo?
Read the output file. What is the PWD directory for batch execution?
What option would you use to request a specific node?
How many CPUS/CORES are in a node? ____ cores ____ cpus
What size is the L3 cache? ____ MB
What is the nominal chip frequency? ____ GHz

Vector: Experiment

General: The vector code calculates a simple Riemann sum on a unit cube. It consists of a triply-nested loop. The grid is meshed equally in each direction, so that a single array can be used to hold incremental values and remain in the L1 caches.

About the experiment: You will measure the number of clock periods (CPs) it takes to execute a 3-D Riemann sum that is vectorized. You will then determine the time (in CPs) to do the calculation without vectorization. When porting to Sandy Bridge (SNB) and MIC it is always good to determine how well a code vectorizes, to get an estimate of what to expect when running on a system with larger vector units. Remember, the SNB can perform 8 FLOPS/CP (4 MULTs + 4 ADDs), while the MIC can do double that!

The function "f" (in f.h) uses the default integrand: x*x + y*y + z*z

Follow the instructions in the info file.
What is the vectorization speed-up for the x*x + y*y + z*z integrand? ____ speedup
What is the vectorization speed-up for the x + pow(y,2) + sin( PI*z ) integrand? ____ speedup
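
As a rough illustration only (this is not the lab's vector code; the grid size N and the midpoint rule below are assumptions), the kernel being timed looks something like the triply nested loop in this sketch, where the compiler is expected to vectorize the innermost loop:

/* Sketch of a 3-D Riemann sum on the unit cube (not the lab's vector.c). */
#include <stdio.h>

#define N 64                         /* hypothetical grid size            */

static double f(double x, double y, double z) {
    return x*x + y*y + z*z;          /* default integrand, as in f.h      */
}

int main(void) {
    const double h = 1.0 / N;        /* equal mesh spacing in x, y, z     */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)          /* innermost loop vectorizes */
                sum += f((i + 0.5) * h, (j + 0.5) * h, (k + 0.5) * h);
    printf("integral ~ %f\n", sum * h * h * h);
    return 0;
}

Building it once normally and once with vectorization disabled (e.g. the Intel compiler's -no-vec flag) gives the two timings whose ratio is the speed-up asked for above.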

Align: Experiment

General: For subroutines/functions compilers have limited information, and cannot make adjustments for misaligned data, because the routines may be in other program units and/or cannot be inlined. Because this is the usual case, our experiment code is within a Fortran subroutine. In a multi-dimensioned Fortran array, if the first dimension is not a multiple of the alignment, the performance is diminished. E.g. for a real*8 a(15,16,16) array, 15*8 bytes is not a multiple of the 16 bytes needed for Westmere, or the 32 bytes needed for Sandy Bridge.

General: The stencil update arrays in sub5.f90 have dimensions of 15x16x16. By padding the first dimension to 16, better performance can be obtained. In the experiment we make sure the arrays are within the cache for the timings. The timer captures time in machine clock periods from the rdtsc instruction. The way it is used here, it is accurate to within +/- 80 CPs. Large timing variations are not due to the timer, but to conditions we cannot control. You will measure the performance with and without alignment in the first dimension of a Fortran multi-dimensional array. You will also see how inter-procedural optimization can be as effective as padding.

Align: Experiment

About the experiment: You will change the leading dimension of a 3-D array used in a Fortran subroutine, so that the 1st element of the leading dimension is always aligned on 32 bytes, no matter what indices are in the other dimensions.

Follow the instructions in the info file.
What is the performance improvement when the array is aligned? ____ %
Can the -ipo option allow the compiler to work around the apparent alignment problem? ____ %
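
The lab's code is Fortran (sub5.f90); purely as a hedged C analogue of the same padding idea (the names and sizes here are illustrative, not the lab's), note that C's fastest-varying (last) dimension plays the role of Fortran's leading dimension:

/* C analogue of padding the leading (fastest-varying) dimension.
 * With an extent of 15 doubles (15*8 = 120 bytes), consecutive
 * "columns" do not all start on 32-byte boundaries; padding the
 * extent to 16 (128 bytes) keeps every column 32-byte aligned.   */
#include <stdio.h>

#define NPAD 16                      /* padded extent                */
#define NUSE 15                      /* extent actually used         */

static double a[16][16][NPAD] __attribute__((aligned(32)));

int main(void) {
    for (int k = 0; k < 16; k++)
        for (int j = 0; j < 16; j++)
            for (int i = 0; i < NUSE; i++)   /* last element is padding */
                a[k][j][i] = 1.0;
    /* 0 means the second "column" starts on a 32-byte boundary. */
    printf("&a[0][1][0] %% 32 = %lu\n", (unsigned long)&a[0][1][0] % 32);
    return 0;
}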

OMP-overhead: Experiment

General: Often developers have no idea of the overhead/costs for creating a parallel region (for the first time), re-forming subsequent parallel regions, and synchronizing on a barrier.

Review the overhead.c code. It has timers around two separate parallel regions. The first region forks a set of threads (creation), and the other reuses (revives) the threads. A subsequent parallel region measures the cost of a barrier. For timing a barrier, we just average over what each thread sees as the time (this is good enough for semi-quantitative work). The code uses the rdtsc hardware clock counter, good to around 25 Clock Periods (CPs), to obtain an accurate time for the OpenMP operations. Run the experiment several times so that you can see the variability. The experiment will execute and collect the times for 1 through 16 threads, sleeping 1 second to let the system become quiescent.

OMP-overhead: Experiment

About the experiment: You will measure the costs for forking, reviving, and synchronizing (barrier) for 1 through 16 threads. Use a compute node for this (submit the job script).

Follow the instructions in the info file.
After removing the outliers, determine the range for each type of OpenMP operation:
Thread Creation: ____ to ____ milliseconds
Thread Revival : ____ to ____ microseconds
Thread Barrier : ____ to ____ microseconds

Note, the first time you use a parallel region, there is a large cost to "fork" the threads.
Note, a revival is only about 1/2 the cost of a barrier for a large number of threads.

[Figure: "OpenMP Overhead" -- time (sec.) versus number of threads (1-16) for Create Team, Revive Team, and Barrier.]
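
A minimal sketch of the idea behind overhead.c (the lab code times with the rdtsc counter; this sketch uses omp_get_wtime() instead, so it is only an illustration):

/* Time thread creation, revival, and a barrier, in seconds. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double t0, t_create, t_revive, t_barrier;

    t0 = omp_get_wtime();
    #pragma omp parallel           /* first region: threads are forked   */
    { }
    t_create = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel           /* second region: threads are revived */
    { }
    t_revive = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel           /* third region: revival + barrier    */
    {
        #pragma omp barrier
    }
    t_barrier = omp_get_wtime() - t0;

    printf("create %g s, revive %g s, revive+barrier %g s\n",
           t_create, t_revive, t_barrier);
    return 0;
}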

Affinity: Experiment

General: For some applications, developers want full control over the cores on which their threads are placed. If there are two sockets on a node, each with 8 cores, a developer may want to increase the number of threads on a single socket from 1 to 8 (with no threads on the other socket). This can be done by mapping the thread id to the logical cpu (core id). Thread assignments to cores are done with sched_setaffinity(). Intel numbers the cores on a two-socket system in a round-robin fashion. Core ids on socket 0 are 0,2,4,6,8,10,12,14, and on socket 1 they are 1,3,5,7,9,11,13,15. In the affinity code, a modulo expression is used to give the numbers 0,2,4,6,8,10,12,14,1,3,5,7,9,11,13,15 for the thread ids (linear sequence) 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15.

General: Hence, for a thread team of n, the thread ids 0,1,2,...,n-1 can easily be mapped to 0,2,4,...,13,15, so as to fill up one socket and then the other as the number of threads is increased.
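
A sketch of that mapping (not the lab's affinity code; the exact modulo expression used there may differ, and a 16-core node is assumed): thread id t is pinned to core (2*t) % 16 + (2*t) / 16, which produces 0,2,4,...,14,1,3,...,15 and fills socket 0 before socket 1.

/* Pin each OpenMP thread to a core, filling socket 0 first.        */
#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int t    = omp_get_thread_num();
        int core = (2 * t) % 16 + (2 * t) / 16;    /* 0,2,...,14,1,3,...,15 */

        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core, &mask);
        sched_setaffinity(0, sizeof(mask), &mask); /* 0 = calling thread    */

        #pragma omp critical
        printf("thread %2d -> core %2d\n", t, core);
    }
    return 0;
}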

QR: Experiment

QR Factorization (Communication Avoiding)
http://www.eecs.berkeley.edu/pubs/techrpts/2010/eecs-2010-131.pdf

Objective: Evaluate the scalability of the caqr code David Hudak ported to the KNF. (Thank you David!)

Go to the caqr OMP directory:  cd caqr
Load up MKL in your environment:  module load mkl
Create the caqr executable:  make
Run a single execution with the job script:  sbatch job

Modify job to run from 1-16 threads with KMP_AFFINITY set to scatter and compact. Which one wins, scatter or compact?

export KMP_AFFINITY=scatter
for i in `seq 1 16`; do
  export OMP_NUM_THREADS=$i
  echo "team: $i threads"
  ./caqr 2048 2048 10000 0
done

QR: Experiment -- KMP_AFFINITY

Get some basic information about threads using KMP_AFFINITY verbose. Submit a job with:

export KMP_AFFINITY=verbose,none
export OMP_NUM_THREADS=2
./caqr 2048 2048 10000 0

OUTPUT:
OMP: Info #156: KMP_AFFINITY: 80 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 8 cores/pkg x 1 threads/core (16 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4, 14,15}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4, 14,15}

Include verbose,none in KMP_AFFINITY to get basic information and the default mask. This mask means a thread can execute on any of the 16 cores.

QR: Experiment -- Socket (package) to OS proc map (from the KMP_AFFINITY verbose output)

Socket       OS proc numbers (hardware threads)
package 0    0  2  4  6  8 10 12 14
package 1    1  3  5  7  9 11 13 15
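
A quick way to see where each thread actually landed (a sketch, not part of the lab code) is to print sched_getcpu() from every thread and compare against the map above:

/* Print the OS proc each OpenMP thread is running on. */
#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %2d on OS proc %2d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}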

QR: Experiment

Now run a job with KMP_AFFINITY=verbose,compact and KMP_AFFINITY=verbose,scatter for 8 threads. Determine how the threads are distributed on the sockets for scatter and compact affinity, using the table on the previous page.

For the more adventurous: Change the scheduling in QR_matrix.cpp to runtime, i.e. include the clause schedule(runtime) in the #pragma omp parallel for loops, and experiment with different scheduling by setting the OMP_SCHEDULE environment variable (e.g. export OMP_SCHEDULE="dynamic,#", where # is a chunk size number).
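
For reference, this is what a schedule(runtime) loop looks like (a generic sketch, not the actual loop in QR_matrix.cpp); the schedule and chunk size are then chosen at run time through OMP_SCHEDULE:

/* The loop schedule is taken from OMP_SCHEDULE at run time. */
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    #pragma omp parallel for schedule(runtime) reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += (double)i;
    printf("sum = %f\n", sum);
    return 0;
}

Run it with, e.g., export OMP_SCHEDULE="dynamic,64" to try a dynamic schedule with chunk size 64.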