Research on Programming Models to foster Programmer Productivity


Christian Terboven <terboven@itc.rwth-aachen.de>, April 5th, 2017

Where is Aachen?

Agenda
- Our Research Activities
- Some Thoughts on Productivity
- Example 1: Thread Affinity
- Example 2: Transactional Memory
- Correctness Checking
- Summary

Research Activities in HPC
Focus on efficient parallel programming for HPC. Topics:
- Parallel Programming Paradigms (OpenMP and others): affinity, tasking, nesting, NUMA, object-oriented parallel programming; member of the OpenMP Language Committee and ARB
- Correctness Checking (MPI, MPI+OpenMP and other paradigms)
- Total Cost of Ownership (energy efficiency, programmability, performance)
- Analysis of parallel architectures: member of SPEC; large shared-memory machines; programming for accelerators (GPUs, Intel MIC, prototype architectures)
http://www.rwth-aachen.de

Some Thoughts on Productivity

Case Study: KegelSpan
- 3D simulation of the bevel gear cutting process [1]
- Computes key values (among others, chip thickness) to analyze tool load and tool wear
- Fortran code (chip thickness computation): loop nest with dependencies in the inner loop (minimum computation)
Implementation variants (basis: serial Fortran code; source: BMW, ZF, Klingelnberg):
- OpenMP-simp: straightforward OpenMP parallelization (no code tuning), data affinity
- OpenMP-vec: restructuring for a good data access pattern (SoA), vectorization, alignment to vector registers, loop interchanges, inlining, data affinity
- OpenMP+LEO: OpenMP-vec (adapted to KNC), LEO directives for offloading kernels
- OpenACC: restructuring for a good data access pattern (SoA), coalescing
- OpenCL: restructuring for a good data access pattern (SoA), coalescing, shared memory
[1] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pages 1381-1384, Düsseldorf, 2010. VDI Verlag.

KegelSpan Effort & Performance
[Bar charts: modified lines of code, total runtime [s], and effort [days] for the variants OpenMP-simp (SNB), OpenMP-vec (SNB), OpenMP+LEO (Phi), OpenACC (GPU), and OpenCL (GPU).]
Systems: Intel Sandy Bridge 16-core node (2x Intel E5-2650 @ 2.0 GHz, Scientific Linux 6.3) for the OpenMP variants; NVIDIA Tesla C2050 (ECC on, CUDA Toolkit 5.0/4.1) in an Intel Westmere 4-core host (1x Intel E5620 @ 2.4 GHz, Scientific Linux 6.3) for the GPU variants. Compilers: Intel 13.0.1 and PGI 12.9.

What is Productivity?
- Productivity = Value / Cost = Amount of Science / Cost = #app. runs / TCO
- The view on productivity might differ between a scientist and an HPC provider
- TCO: topic of active research
We believe: Abstractions can foster Programmer Productivity.
- Several studies showed that using pragmas (e.g. OpenMP) is more productive than using lower-level APIs (e.g. POSIX threads): less programming effort, easier to learn and to grasp the important concepts

Example: Thread Affinity

Motivation
- 2004: Sun Fire E25k server: 144 Sun UltraSPARC IV cores, max. memory bandwidth ca. 170 GB/s
- 2012: Bull BCS system (in our cluster): 128 Intel Nehalem-EX cores, max. memory bandwidth ca. 230 GB/s, 6 to 10 systems per rack possible
- 2016: Intel Broadwell with Cluster-on-Die
The memory hierarchy becomes more and more complex: at least two NUMA levels. This is a challenge to program for!

The OpenMP Places concept
- The specification of thread affinity has to happen within the machine abstraction.
- Consider the following system: 2 sockets, 4 cores per socket, 4 hyper-threads per core (cores c0 ... c7).
- Place: set of execution units
- Place list: (ordered) list of places
- The OpenMP place list is defined by the OMP_PLACES environment variable, either as an explicit list expression or as an abstract name:
  - threads: one place per hyper-thread
  - cores: one place per core (contains multiple hyper-threads)
  - sockets: one place per socket (contains multiple cores)
- This reduces a complex architecture to the relevant performance-critical properties.
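As a small illustration (not from the slides; a minimal sketch assuming an OpenMP 4.5 compiler), the place list can be set via OMP_PLACES at launch time and inspected from within a program:

    /* Minimal sketch: querying the OpenMP place list from a program.
     * Run e.g. with
     *   OMP_PLACES=cores OMP_NUM_THREADS=8 ./a.out
     * or with an explicit list such as OMP_PLACES="{0:4},{4:4}". */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int nplaces = omp_get_num_places();   /* size of the place list */
        printf("place list has %d places\n", nplaces);

        #pragma omp parallel
        {
            /* Each thread reports the place it is currently bound to. */
            int p = omp_get_place_num();
            printf("thread %d runs on place %d (%d procs)\n",
                   omp_get_thread_num(), p, omp_get_place_num_procs(p));
        }
        return 0;
    }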

Illustration of Thread Affinity
- Selection of an application-specific strategy:
  - spread: separation of threads within the place list
  - close: placement of threads closely together
  - master: co-location of threads on a single place
- Example (nested parallelism): separation at the outer level, nearness at the inner level, with places p0 ... p7:
  OMP_PLACES="{0,1,2,3},{4,5,6,7},..." = "{0:4}:8:4" = "cores"
  #pragma omp parallel proc_bind(spread) num_threads(4)
  #pragma omp parallel proc_bind(close) num_threads(4)
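A minimal sketch of this nested spread/close pattern (illustrative only; assumes a place list such as OMP_PLACES=cores and at least two active parallel levels):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_max_active_levels(2);          /* allow nested parallelism */

        /* Outer team: 4 threads spread across the place list. */
        #pragma omp parallel proc_bind(spread) num_threads(4)
        {
            int outer = omp_get_thread_num();

            /* Inner teams: 4 threads each, kept close to their master. */
            #pragma omp parallel proc_bind(close) num_threads(4)
            {
                printf("outer %d / inner %d on place %d\n",
                       outer, omp_get_thread_num(), omp_get_place_num());
            }
        }
        return 0;
    }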

Analysis for a SpMXV kernel
- Absolute performance: ca. 38 GFlops; roofline model: 39.4 GFlops as the upper limit
- Application of the OpenMP thread affinity model
- NUMA-specific memory management with a C++ allocator
- Integration via the expression template mechanism and the adapter pattern
- Exploitation of the matrix structure for load balancing and data placement
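The C++ allocator itself is not shown on the slide; the sketch below only illustrates the underlying first-touch idea in plain C (names and the CRS storage layout are assumptions): the data is initialized with the same static schedule and binding that the SpMV kernel later uses, so pages end up in the NUMA domain of the thread that accesses them.

    #include <omp.h>

    /* First-touch placement: values and column indices are touched row by
     * row with the same distribution as the SpMV kernel below. */
    void spmv_init(int n, const int *rowptr, int *col, double *val, double *x)
    {
        #pragma omp parallel for schedule(static) proc_bind(spread)
        for (int i = 0; i < n; ++i) {
            x[i] = 0.0;                          /* first touch of x */
            for (int j = rowptr[i]; j < rowptr[i+1]; ++j) {
                val[j] = 0.0;                    /* first touch of values */
                col[j] = 0;                      /* and column indices    */
            }
        }
    }

    void spmv(int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static) proc_bind(spread)
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
                sum += val[j] * x[col[j]];
            y[i] = sum;
        }
    }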

Example: Transactional Memory

Motivation
- Processor-level hardware support for speculative lock elision has been introduced by IBM, Intel, and others:
  - potential for significant performance improvements if used in the right way
  - danger of tremendous penalties if used inappropriately
- There is no standardized way in OpenMP to select a lock implementation:
  - vendor-specific approaches are neither portable nor satisfying
  - a global setting, such as an environment variable, is not sensible
- This work proposes to extend the OpenMP locking API and the critical construct:
  - to support the selection of lock implementations on a per-lock basis
  - to offer backwards compatibility for existing application codes

Extended Locking API /1
- Fundamental requirement: do not break any existing code; new functionality is introduced as hints.
- Three options were considered:
  - pragmas to prefix existing lock routines with the desired hint
  - a complete set of new locking routines and lock types
  - new lock initialization routines to use with the existing lock API (minimal code modification, allows for incremental code adoption)
- OpenMP lock review:
  - a lock is a variable of type omp_lock_t or omp_nest_lock_t
  - it must be initialized before first use with omp_init[_nest]_lock()
  - routines exist to initialize, set, unset, and test a lock, and finally to destroy it

Extended Locking API /2
- Two new lock initialization functions provide hints to the runtime system:
  void omp_init[_nest]_lock_hinted( omp[_nest]_lock_t*, omp_lock_hint )
- The omp_lock_hint type lists high-level optimization criteria:
  - omp_lock_hint_none
  - omp_lock_hint_uncontended: optimize for an uncontended lock
  - omp_lock_hint_contended: optimize for a contended lock
  - omp_lock_hint_nonspeculative: do not use hardware speculation
  - omp_lock_hint_speculative: do use hardware speculation
  - omp_lock_hint_adaptive: adaptively use hardware speculation
  - plus room for vendor-specific extensions
- Similarly: an extended critical construct
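This proposal was adopted (with a slightly different spelling) into OpenMP 4.5 as omp_init_lock_with_hint. A minimal usage sketch with a speculative hint follows; the hint is only advice, so a runtime may fall back to a regular lock if hardware speculation is unavailable:

    #include <omp.h>

    static omp_lock_t lock;
    static double sum = 0.0;

    /* Contrived accumulation just to show the hinted lock; a reduction
     * clause would normally be the better choice for this pattern. */
    void accumulate(const double *v, int n)
    {
        omp_init_lock_with_hint(&lock, omp_lock_hint_speculative);

        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            omp_set_lock(&lock);     /* may be elided via HW speculation */
            sum += v[i];
            omp_unset_lock(&lock);
        }

        omp_destroy_lock(&lock);
    }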

Evaluation with NPB UA
- Naive use of HLE locks is not successful.
- The more threads are used, the more profitable the clever use of HLE locks becomes.
System: Intel Xeon E5-2697v3 (code-name Haswell), 2.6 GHz (no turbo), single socket, Red Hat* Enterprise Linux* 7.0 (kernel 3.10.0-123.el7.x86_64), Intel Composer XE for C/C++ 2013 SP1 2.144 with -O3 optimization. * Other names and brands may be the property of others.

Correctness Checking

How many errors can you spot in this tiny example?

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char** argv)
    {
        int rank, size, buf[8];
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);

        MPI_Datatype type;
        MPI_Type_contiguous (2, MPI_INTEGER, &type);

        MPI_Recv (buf, 2, MPI_INT, size - rank, 123, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send (buf, 2, type, size - rank, 123, MPI_COMM_WORLD);

        printf ("Hello, I am rank %d of %d.\n", rank, size);
        return 0;
    }

At least 8 issues in this code example.

The same example, annotated:
- No MPI_Init before the first MPI call
- Fortran type (MPI_INTEGER) used in C
- Recv-recv deadlock
- Rank 0: src = size (out of range)
- Type not committed before use
- Type not freed before the end of main
- Send of 4 ints, receive of 2 ints: truncation
- No MPI_Finalize before the end of main
MUST detects these issues and pinpoints you to the source.
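For reference, one possible corrected version of the example (a sketch, not the canonical fix from the talk; the deadlock is resolved here with MPI_Sendrecv and a partner rank that stays inside the communicator):

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char** argv)
    {
        int rank, size, buf[8];
        MPI_Init (&argc, &argv);                      /* was missing */
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);

        for (int i = 0; i < 8; ++i) buf[i] = rank;    /* defined send data */

        MPI_Datatype type;
        MPI_Type_contiguous (2, MPI_INT, &type);      /* C type, not MPI_INTEGER */
        MPI_Type_commit (&type);                      /* commit before use */

        int partner = size - 1 - rank;                /* stays in 0..size-1 */

        /* Sendrecv avoids the receive-receive deadlock; both sides transfer
         * 2 elements of 'type' (4 ints), which fit into the 8-int buffer. */
        MPI_Sendrecv (buf, 2, type, partner, 123,
                      buf + 4, 2, type, partner, 123,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf ("Hello, I am rank %d of %d.\n", rank, size);

        MPI_Type_free (&type);                        /* free the datatype */
        MPI_Finalize ();                              /* was missing */
        return 0;
    }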

Now what about accelerated systems?
- Hybrid parallel programming (MPI + OpenMP) is even more complex.
- Including accelerators (e.g. OpenMP target, OpenACC) adds yet more complexity.
- Recent work made MUST support threading and offloading.

Example: Race between Host and Device
- The result is not deterministic.
- Race detection is only possible with memory tracing (pintool).
- OMPT mapping information is required.

    double result = 0;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections
        {
            #pragma omp section
            #pragma omp target map(tofrom:result)
            { result += compute(); }

            #pragma omp section
            { result += compute(); }
        }
    }
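One possible way to remove this race (a sketch, not from the slides): give the host and device contributions their own variables and combine them only after the sections construct has completed; compute() is a stand-in for the slide's kernel.

    #include <omp.h>
    #include <stdio.h>

    #pragma omp declare target
    double compute(void) { return 1.0; }   /* stand-in for the real kernel */
    #pragma omp end declare target

    int main(void)
    {
        double result_host = 0.0, result_dev = 0.0;

        #pragma omp parallel num_threads(2)
        {
            #pragma omp sections
            {
                #pragma omp section
                #pragma omp target map(tofrom: result_dev)
                { result_dev += compute(); }

                #pragma omp section
                { result_host += compute(); }
            }   /* implicit barrier: both contributions are complete here */
        }

        double result = result_host + result_dev;
        printf("result = %f\n", result);
        return 0;
    }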

Status of Correctness Checking
Comparison of correctness checking capabilities across Insp(clang), Insp(Phi), and MUST:
- FK1 data_missing_accelerator: x
- FK1 data_missing_host: (x), x*
- FK1 data_outdated_accelerator: x
- FK1 data_outdated_host: x*
- FK2 datarace_inside_devkernel: x
- FK2 datarace_across_devkernels: -
- FK3 race_between_host_and_device: -
- FK4 only_some_thread_pass_barrier: x
- FK4 deadlock_with_locks: x
- FK4 simd_misalign: x**
- FK5 thread_pass_different_barriers: x
- FK5 uninitialized_locks: x
- FK6 dev_allocation_fails: x, x
* Check directly implemented in pintool
** Only in a specialized version for x86

Summary

Influence on OpenMP
[Timeline 1998-2015: OpenMP versions 1.0, 2.0, 2.5, 3.0, 3.1, 4.0, 4.5; from loop-level parallelization via tasking to heterogeneous architectures]
- OpenMP 3.0 and 3.1: C++
  - Extension of the canonical form of parallelizable loops to iterator loops
  - Definition of object behavior in the context of data scoping
- OpenMP 4.0: Thread Affinity
  - Integration of the OpenMP thread affinity model, support for nested parallelism
- OpenMP 4.5:
  - Taskloop construct: loop parallelization by means of tasks (composability); see the sketch below
  - Locks with hints: support for different lock types, e.g. for transactional memory
- OpenMP TR5 / 5.0:
  - Memory management
  - OpenMP tools and debugging interfaces
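A minimal taskloop sketch (illustrative only, not from the slides; the grain size is an arbitrary choice):

    #include <omp.h>

    /* The loop is cut into tasks of roughly 1024 iterations each, which
     * composes naturally with other task-based code in the same program. */
    void scale(double *a, int n, double s)
    {
        #pragma omp parallel
        #pragma omp single
        #pragma omp taskloop grainsize(1024)
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }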

Summary
- Research interests of our group in Aachen:
  - Parallel Programming Paradigms
  - Correctness Checking
  - Total Cost of Ownership
  - Analysis of parallel architectures
- Abstractions can foster Programmer Productivity.
- The development of programming languages has to go along with the development of tools: focus not only on performance, but also on correctness.

Thank you for your attention. Christian Terboven <terboven@itc.rwth-aachen.de>, Matthias Müller <mueller@itc.rwth-aachen.de>