Advanced Threading and Optimization

Mikko Byckling, CSC
Michael Klemm, Intel

Advanced Threading and Optimization
February 24-26, 2015
PRACE Advanced Training Centre
CSC - IT Center for Science Ltd, Finland

Code excerpt from the title slide (fixed-form Fortran with OpenMP; the closing end do statements are not shown on the slide):

!$omp parallel do collapse(3)
      do p4=1,p4d
       do p5=1,p5d
        do p6=1,p6d
         do h1=1,h1d
          do h7=1,h7d
!dec$ vector aligned
           do h2h3=1,h2d*h3d
            triplesx(h2h3,h1,p6,p5,p4)=triplesx(h2h3,h1,p6,p5,p4)
     1        - t2sub(h7,p4,p5,h1)*v2sub(h2h3,p6,h7)
!$omp end parallel do

Agenda

Tuesday 24th
  9:00-10:00  Course introduction
  10:00-10:15 Coffee Break
  10:15-11:00 Exercises
  11:00-12:00 SIMD optimization
  12:00-12:45 Lunch break
  12:45-13:45 Exercises
  13:45-14:45 Performance analysis methodology
  14:45-15:00 Coffee Break
  15:00-16:00 Exercises
  16:00-18:00 Code optimization workshop (voluntary)

Wednesday 25th
  9:00-10:00  Memory optimization
  10:00-10:15 Coffee Break
  10:15-11:00 Exercises
  11:00-12:00 Microarchitectural analysis
  12:00-12:45 Lunch break
  12:45-13:45 Exercises
  13:45-14:45 Advanced OpenMP
  14:45-15:00 Coffee Break
  15:00-16:00 Exercises
  16:00-18:00 Code optimization workshop (voluntary)

Thursday 26th
  9:00-10:00  OpenMP optimization
  10:00-10:15 Coffee Break
  10:15-11:00 Exercises
  11:00-12:00 OpenMP target extensions
  12:00-12:45 Lunch break
  12:45-13:45 Exercises
  13:45-14:45 OpenMP and MPI
  14:45-15:00 Coffee Break
  15:00-16:00 Exercises
  16:00-16:15 Course wrap-up

Reference material

Title: Optimizing HPC Applications with Intel Cluster Tools
Authors: Alexander Supalov, Andrey Semin, Michael Klemm, Chris Dahnken
Table of Contents:
  Foreword by Bronis de Supinski (CTO, LLNL)
  Preface
  Chapter 1: No Time to Read this Book?
  Chapter 2: Overview of Platform Architectures
  Chapter 3: Top-Down Software Optimization
  Chapter 4: Addressing System Bottlenecks
  Chapter 5: Addressing Application Bottlenecks: Distributed Memory
  Chapter 6: Addressing Application Bottlenecks: Shared Memory
  Chapter 7: Addressing Microarchitecture Bottlenecks
  Chapter 8: Application Design Implications
ISBN-13 (pbk): 978-1-4302-6496-5
ISBN-13 (electronic): 978-1-4302-6497-2

Advanced Threading and Optimization exercises

Hardware

The CSC cluster Taito (taito.csc.fi) has 10 full nodes reserved for the course from both the Haswell partition (two Intel Xeon E5-2690 v3 per node) and the Xeon Phi partition (two Intel Xeon E5-2620 v2 and two Intel Xeon Phi 7120X per node). The names of the reservations are advtao_hsw (Taito) and advtao_mic (Taito Xeon Phi), respectively. Note that only the course training accounts are listed in the reservation, i.e., you cannot use your own CSC user account in conjunction with the reservation.

Batch job queue system

All jobs must be submitted through the Slurm batch job queuing system. All access (including batch jobs) to the Xeon Phi partition goes via node m1, accessible with ssh from a regular Taito login node.

To submit a single-node job to the Taito Haswell partition (ten-minute runtime at most):

srun -N1 -n1 -c24 --cpu_bind=none --reservation=advtao_hsw -p parallel -t00:10:00 ./executable

To submit a single-node job with two Xeon Phi cards to the Taito Xeon Phi partition (ten-minute runtime at most):

srun -N1 -n1 -c12 --cpu_bind=none --reservation=advtao_mic --gres=mic:2 -p mic -t00:10:00 ./executable

For benchmarking purposes, it is preferable to use the Slurm option --exclusive.

To open an interactive shell session for 30 minutes in the Taito Haswell partition with X11 forwarding enabled:

srun -N1 -n1 -c1 --reservation=advtao_hsw -p parallel -t00:30:00 --x11 --pty $SHELL

Software

The latest Intel compiler stack is installed on Taito. To use it (the software environment defaults to an older compiler), load the modules intel/15.0.2, intelmpi/5.0.2 and mkl/11.2.2. With the Intel software stack, it is recommended not to use the CPU binding provided by Slurm, i.e., pass the option --cpu_bind=none at job launch time. When using the Intel compilers, threads should instead be bound to cores with the OpenMP and MPI affinity settings provided by the Intel software stack.
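As a concrete starting point, the lines below show one way to combine the module loads, the Intel affinity settings and an srun launch as described above. This is only a sketch: the thread count and affinity type (compact, scatter, balanced) depend on the exercise at hand, and the values shown are examples rather than prescribed settings.

module load intel/15.0.2 intelmpi/5.0.2 mkl/11.2.2

# Bind threads with the Intel OpenMP runtime instead of Slurm
export OMP_NUM_THREADS=24
export KMP_AFFINITY=granularity=fine,compact

# For hybrid MPI+OpenMP runs, let Intel MPI pin one domain per rank
export I_MPI_PIN_DOMAIN=omp

srun -N1 -n1 -c24 --cpu_bind=none --reservation=advtao_hsw -p parallel -t00:10:00 ./executable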

1. Remote desktop environment (ex1)

By following the instructions given in nxkajaani_howto.txt, set up a remote desktop connection to CSC's datacenter in Kajaani. The connection will be used later during the course for profiling applications with VTune and other Intel tools.

2. SIMD vectorization (ex2)

The file vectorization_lab_linux.pdf contains instructions for a step-by-step SIMD optimization of a simple matrix-vector kernel. The example is credited to Georg Zitzlsberger from Intel. There is an additional example concerning SIMD functions in the workbook as well. (A minimal sketch of such a kernel is shown after exercise 4 below.)

3. Performance analysis methodology (ex3)

The file finding_hotspots_linux.pdf contains instructions for a step-by-step profiling of a ray tracing application. Note that in order to use VTune, the Slurm reservation must have the X11 plugin enabled (--x11 parameter) and the VTune module (intelvtune/15.1.1) must be loaded into the environment.

4. Memory optimization (ex4)

A Xeon Phi is able to attain about 170 GB/s of bandwidth in the STREAM triad benchmark, which computes a daxpy-like operation on two vectors. For the rules and results, see http://www.cs.virginia.edu/stream/. The benchmark was created by Dr. John D. McCalpin.

a) The file stream_ref(.c|.f90) contains a reference version of the benchmark. Compile the benchmark (stream_ref) for the language of your choice with the given makefile_(c|ftn). Run the benchmark on a Xeon Phi (native mode) to replicate the STREAM triad results. It is recommended to use 60 cores and scatter thread affinity for the task.

b) The file stream(.c|.f90) contains a slightly modified version of the STREAM benchmark. Compile the benchmark (stream) for the language of your choice with the given makefile_(c|ftn). Run the benchmark on a Xeon Phi (native mode) and compare the results to the previously computed reference results. By modifying only the code (look for the TODO tags), try to match the results of the STREAM triad portion of the reference benchmark.
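Referring back to exercise 2 (and profiled again in exercise 5 a), the matrix-vector kernel being optimized has roughly the shape sketched below. This is an illustration only, not the course's matvec source: the function and variable names are assumptions. The inner loop runs with unit stride over a matrix row, which is the property the compiler needs in order to vectorize it; the compiler's vectorization report (e.g. -qopt-report with recent Intel compilers) shows whether it succeeded.

/* Minimal dense matrix-vector kernel of the kind optimized in exercise 2
 * and profiled in exercise 5 a) (illustrative only).                    */
void matvec(int rows, int cols,
            const double *restrict mat,   /* rows x cols, row-major */
            const double *restrict x,
            double *restrict y)
{
    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
        for (int j = 0; j < cols; j++)    /* unit-stride SIMD candidate */
            sum += mat[(long)i * cols + j] * x[j];
        y[i] = sum;
    }
}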

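For exercise 4, the files stream_ref and stream are the codes to run and modify; purely as orientation, the sketch below shows what the triad portion of the benchmark measures and how a bandwidth figure is obtained from it. The array size, the first-touch initialization and the output format are assumptions of this sketch, not those of the course sources. Built natively for the coprocessor (e.g. icc -qopenmp -mmic), it would be run on the Xeon Phi as described above.

/* STREAM-triad-style sketch (illustrative only).  The arrays are
 * initialized in the same parallel loop structure that later touches
 * them, so their pages are distributed across the threads (first touch). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (64 * 1024 * 1024)              /* illustrative array length */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) {        /* first-touch initialization */
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.5;
    }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)          /* triad: a = b + scalar*c */
        a[i] = b[i] + scalar * c[i];
    t = omp_get_wtime() - t;

    /* three arrays of N doubles are moved per sweep */
    printf("Triad bandwidth: %.1f GB/s\n", 3.0 * N * 8.0 / t / 1.0e9);

    free(a); free(b); free(c);
    return 0;
}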
5. Microarchitectural analysis (ex5)

The purpose of this exercise is to perform a microarchitectural analysis with VTune of the kernels from the previous exercises, i.e., 2. SIMD vectorization and 4. Memory optimization.

a) The directory matvec/ contains unoptimized and optimized C (matvec/c/) and Fortran (matvec/fortran/) versions of the matrix-vector multiply kernel of exercise 2. Compile the code for the language of your choice with the given Makefile to generate the matvec (unoptimized) and matvec_opt (optimized) binaries. Using VTune, run a Microarchitectural Analysis with General Exploration for both binaries. Compare the resulting number of instructions retired, the CPI rate and other indicators between the unoptimized and optimized versions of the code. Then investigate and compare the most time-consuming hotspot of the code. Does a low CPI value correlate with good performance? Finally, using the assembly view functionality of VTune, try to determine whether the hotspot vectorized or not.

b) The directory stream/ contains unoptimized and optimized C (stream/c/) and Fortran (stream/fortran/) versions of the STREAM benchmark of exercise 4. Compile the code for the language of your choice with the given Makefile to generate the stream (unoptimized) and stream_ref (optimized) binaries. Using VTune, run a Microarchitectural Analysis with Bandwidth for both binaries. Compare the resulting number of instructions retired, the CPI rate, LLC misses and other indicators between the unoptimized and optimized versions of the code. How does the sampled bandwidth measured by VTune compare to the actual bandwidth measured by the STREAM benchmark?

6. Advanced OpenMP (ex6)

The file nbody(.c|.f90) implements a simple N-body iteration. Compile the code for the language of your choice with the given Makefile (by typing make c or make fortran). The resulting binary nbody takes the number of bodies to simulate as its first argument. Try running a simulation with 262144 bodies in total.

Using VTune or the compiler vectorization reports, investigate whether the code vectorizes well. Try to improve the vectorization properties and to speed up the code by using the OpenMP SIMD construct and any other optimization techniques you are aware of. (A minimal sketch of the SIMD construct is shown after exercise 7 below.)

7. Optimization for threads (ex7)

The code implements a very simplistic matrix-vector multiplication that uses OpenMP for multi-threading. Compile the code (by typing make) and two binaries will be created: mxv.seqinit and mxv.parinit. Although they are named differently, each of the binaries allocates and initializes the matrix and the vector sequentially and thus suffers from a NUMA problem. Run the codes with make runs; each binary will run with different thread distributions and different thread numbers. What average runtimes do you observe? As the next step, find the initialization of the matrix and the vector, and parallelize it so that it matches the parallelization of the matrix-vector multiplication. Run the binaries again (make runs) and check the performance. Do you spot a difference?
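Returning to exercise 6: the nbody source in the exercise files is the code to modify; the sketch below only illustrates how the OpenMP SIMD construct mentioned above can be applied to a structure-of-arrays force loop with a reduction. All names, the unit masses and the softening constant are assumptions of this sketch.

#include <math.h>

/* Illustrative OpenMP SIMD usage on an N-body style force loop. */
void accelerations(int n,
                   const float *restrict x, const float *restrict y,
                   const float *restrict z,
                   float *restrict ax, float *restrict ay, float *restrict az)
{
    const float eps2 = 1.0e-6f;            /* softening, avoids r = 0 */

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float axi = 0.0f, ayi = 0.0f, azi = 0.0f;

        /* Vectorize the inner loop; the partial sums become SIMD reductions */
        #pragma omp simd reduction(+:axi,ayi,azi)
        for (int j = 0; j < n; j++) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float r2 = dx*dx + dy*dy + dz*dz + eps2;
            float inv_r3 = 1.0f / (r2 * sqrtf(r2));
            axi += dx * inv_r3;
            ayi += dy * inv_r3;
            azi += dz * inv_r3;
        }
        ax[i] = axi;  ay[i] = ayi;  az[i] = azi;
    }
}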

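For exercise 7, the essential change is to initialize the matrix and the vector with the same parallel loop structure and schedule that the multiplication itself uses, so that each thread first touches the memory it will later work on. The sketch below only shows the idea, with assumed names; the mxv sources in the exercise directory are the code to modify.

/* First-touch sketch: initialization and multiplication share the same
 * static distribution of rows over threads, keeping the data local.    */
void init_and_multiply(int n, double *restrict a,   /* n x n, row-major */
                       double *restrict x, double *restrict y)
{
    /* Parallel initialization distributes the matrix pages over the NUMA nodes */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        x[i] = 1.0;
        for (int j = 0; j < n; j++)
            a[(long)i * n + j] = (double)(i + j);
    }

    /* Multiplication with the same distribution: each thread reads its own rows */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += a[(long)i * n + j] * x[j];
        y[i] = sum;
    }
}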
8. OpenMP target features (ex8)

The file ex_jacobi(.c|.f90) implements a simple Jacobi iteration.

a) Add OpenMP directives to the code in such a way that the computational work is done in parallel.

b) Modify the OpenMP implementation in such a way that the computation is done entirely on the device, i.e., no additional data transfers take place between the host and the device. (A minimal sketch of the directive structure is shown after exercise 9 below.)

c) (Bonus *) Optimize the performance of the implementation.

9. OpenMP and MPI (ex9)

Benchmark hybrid MPI+OpenMP kernels using EPCC's OpenMP/MPI Mixed-Mode Microbenchmarks. Both C and Fortran versions of the source code have already been downloaded to the directories epcc_mixedmode_c/ and EPCC_mixedMode_Fortran/, respectively. Compile the benchmarks (for the source language of your choice) with an Intel MPI compiler. Instructions on the compilation and use of the benchmarks are available at http://www2.epcc.ed.ac.uk/~markb/mpiopenmpbench/intro.html. A sample input file, multipingpong.bench, performs a two-node multi-pingpong benchmark and can be used to replicate the results given in the course handouts.
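For exercise 8 b), the goal is that the grids stay resident on the device across iterations. The sketch below shows one possible directive structure using the OpenMP 4.0 target data and target constructs; the array names, grid size, fixed iteration count and the absence of a convergence test are assumptions of this sketch, and ex_jacobi remains the code to modify.

#define N     1024                     /* assumed grid dimension   */
#define ITERS 1000                     /* assumed iteration count  */

/* Jacobi sweeps kept entirely on the device inside a target data region */
void jacobi(double *restrict u, double *restrict unew)   /* N*N grids */
{
    /* Map both grids once; no per-iteration host<->device transfers */
    #pragma omp target data map(tofrom: u[0:N*N]) map(alloc: unew[0:N*N])
    {
        for (int it = 0; it < ITERS; it++) {
            /* One sweep, executed on the device */
            #pragma omp target
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    unew[i*N + j] = 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j]
                                          + u[i*N + j - 1] + u[i*N + j + 1]);

            /* Write the update back into u, still on the device */
            #pragma omp target
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    u[i*N + j] = unew[i*N + j];
        }
    }
}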