Research on Programming Models to foster Programmer Productivity


1 Research on Programming Models to foster Programmer Productivity
Christian Terboven, April 5th, 2017

2 Where is Aachen?

3 Where is Aachen?

4 Where is Aachen?

5 Agenda
- Our Research Activities
- Some Thoughts on Productivity
- Example 1: Thread Affinity
- Example 2: Transactional Memory
- Correctness Checking
- Summary

6 Research Activities in HPC
- Focus on Efficient Parallel Programming for HPC
- Topics:
  - Parallel Programming Paradigms (OpenMP and others)
    - Affinity, tasking, nesting, NUMA, Object-oriented Parallel Prog.
    - Member of the OpenMP Language Committee and ARB
  - Correctness Checking (MPI, MPI+OpenMP and other paradigms)
  - Total Cost of Ownership (Energy Efficiency, Programmability, Performance)
  - Analysis of parallel architectures
    - Member of SPEC
    - Large Shared Memory machines
    - Programming for Accelerators (GPUs, Intel MIC, Prototype Arch.)

7 Some Thoughts on Productivity

8 Case Study: KegelSpan
- 3D simulation of bevel gear cutting process [1]
- Compute key values (among others chip thickness) to analyze tool load and tool wear
- Fortran code (chip thickness computation)
  - Loop nest
  - Dependencies in inner loop (minimum computation)
Source: BMW, ZF, Klingelnberg

Implementation
- Basis: serial Fortran code
- OpenMP-simp: straightforward OpenMP parallelization (no code tuning), data affinity
- OpenMP-vec: restructuring for good data access pattern (SoA), vectorization, alignment to vector registers, loop interchanges, inlining, data affinity
- OpenMP+LEO: OpenMP-vec (adapted to KNC), LEO directives for offloading kernels
- OpenACC: restructuring for good data access pattern (SoA), coalescing
- OpenCL: restructuring for good data access pattern (SoA), coalescing, shared memory

[1] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume of VDI-Berichte, pages , Düsseldorf, VDI Verlag.

9 KegelSpan Effort & Performance
[Figure: modified lines of code, total runtime [s] and effort [days] for the versions Serial, OpenMP-simp (SNB), OpenMP-vec (SNB), OpenMP+LEO (Phi), OpenACC (GPU) and OpenCL (GPU)]
Test systems:
- OpenMP, Serial: host Intel Sandy Bridge 16-core processor (2x Intel GHz), Scientific Linux 6.3; no GPU; compiler Intel
- OpenCL, OpenACC: host Intel Westmere 4-core processor (1x Intel GHz), Scientific Linux 6.3; GPU NVIDIA Tesla C2050, ECC on, CUDA Toolkit 5.0/4.1; compiler Intel / PGI 12.9

10 What is Productivity?
- $\text{Productivity} = \frac{\text{value}}{\text{cost}} = \dots = \frac{\#\,\text{app. runs}}{\text{TCO}}$
  - View on productivity might differ between scientist and HPC provider
  - TCO: topic of active research
- We believe: Abstractions can foster Programmer Productivity
  - Several studies showed: using pragmas (e.g. OpenMP) is more productive than using lower-level APIs (e.g. POSIX threads)
  - less programming effort
  - easier to learn and grasp important concepts
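
The ratio of application runs to TCO can be made concrete with a small calculation. The following is only a sketch: the cost split into hardware, programming effort and energy, as well as all function names, are assumptions made for illustration and are not taken from the slides.

```c
#include <assert.h>

/* Number of application runs that fit into the usable machine time. */
double app_runs(double usable_seconds, double runtime_per_run_s) {
    return usable_seconds / runtime_per_run_s;
}

/* Assumed (simplified) TCO: hardware + programming effort + energy. */
double tco(double hardware_cost, double effort_days,
           double cost_per_day, double energy_cost) {
    return hardware_cost + effort_days * cost_per_day + energy_cost;
}

/* Productivity as on the slide: #app. runs divided by TCO. */
double productivity(double runs, double total_cost) {
    return runs / total_cost;
}
```

With purely hypothetical inputs (3600 s of machine time, 10 s per run, hardware 1000, 2 days of effort at 400 per day, energy 200) this gives 360 runs for a TCO of 2000, i.e. 0.18 runs per cost unit. A variant that halves the runtime but triples the effort can then be compared on the same scale.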

11 Example: Thread Affinity

12 Motivation
- 2004: Sun Fire E25k server
  - 144 Sun UltraSPARC IV cores
  - Max. memory bandwidth: ca. 170 GB/s
- 2012: Bull BCS System (in our cluster)
  - 128 Intel Nehalem-EX cores
  - Max. memory bandwidth: ca. 230 GB/s
  - 6 to 10 systems per rack possible
- 2016: Intel Broadwell w/ Cluster-on-Die
The memory hierarchy becomes more and more complex: with at least two NUMA levels, this is a challenge to program for!

13 The OpenMP Places concept
- Specification of Thread Affinity has to happen within the machine abstraction
- Considering the following system: c0 c1 c2 c3 c4 c5 c6 c7
  - 2 sockets, 4 cores per socket, 4 hyper-threads per core
- Place: set of execution units
- Place List: (ordered) list of places
- The OpenMP place list is defined by the OMP_PLACES environment variable:
  - Specification of an explicit list of places (interval notation), or
  - Specification of an abstract name, such as:
    - threads: one place per hyper-thread
    - cores: one place per core (contains multiple hyper-threads)
    - sockets: one place per socket (contains multiple cores)
  - Reduction of complex architecture to relevant performance-critical properties
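
How the abstract names map to place lists can be sketched in plain C. The helper below is hypothetical (it is not part of OpenMP) and assumes that hardware threads are numbered consecutively within a core and cores consecutively within a socket; it prints the place list in the `{first:length}` brace notation accepted by OMP_PLACES.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { PLACES_THREADS, PLACES_CORES, PLACES_SOCKETS } place_kind;

/* Hypothetical helper: writes the explicit place list for an abstract
   name into out and returns the number of places (-1 if out is too small). */
int build_place_list(place_kind kind, int sockets, int cores_per_socket,
                     int threads_per_core, char *out, size_t cap) {
    int total = sockets * cores_per_socket * threads_per_core;
    int len = 1;                                   /* threads: 1 HW thread */
    if (kind == PLACES_CORES)   len = threads_per_core;
    if (kind == PLACES_SOCKETS) len = cores_per_socket * threads_per_core;

    size_t used = 0;
    int places = 0;
    if (cap == 0) return -1;
    out[0] = '\0';
    for (int first = 0; first < total; first += len, ++places) {
        int n = snprintf(out + used, cap - used, "%s{%d:%d}",
                         places ? "," : "", first, len);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return places;
}
```

For the system on the slide (2 sockets, 4 cores per socket, 4 hyper-threads per core) this yields 32 places for threads, 8 for cores and 2 for sockets; the sockets list is `{0:16},{16:16}`.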

14 Illustration of Thread Affinity
- Selection of an application-specific strategy:
  - spread: separation of threads within the place list
  - close: placement of threads closely together
  - master: co-location of threads on single place
- Example (nested par.): separation in outer loop, nearness in inner loop:

    OMP_PLACES=(0,1,2,3), (4,5,6,7),... = (0-3):8:4 = cores
    #pragma omp parallel proc_bind(spread) num_threads(4)
    #pragma omp parallel proc_bind(close) num_threads(4)

[Figure: places p0 to p7, shown for the initial place list, after the spread binding and after the close binding]
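
The effect of the two policies in the nested example can be traced with a small model. This is a simplified sketch, not the runtime's implementation: it assumes the parent thread sits on the first place of its partition and that thread and place counts divide evenly; `partition_t`, `bind_spread` and `bind_close` are hypothetical names.

```c
#include <assert.h>

/* A contiguous slice of the place list: places [first, first + count). */
typedef struct { int first; int count; } partition_t;

/* spread: split the partition into nthreads equal sub-partitions;
   thread tid runs on the first place of its own sub-partition. */
partition_t bind_spread(partition_t part, int nthreads, int tid) {
    assert(nthreads <= part.count && part.count % nthreads == 0);
    int width = part.count / nthreads;
    partition_t sub = { part.first + tid * width, width };
    return sub;
}

/* close: threads fill consecutive places; if there are more threads
   than places, equally sized groups of threads share one place. */
int bind_close(partition_t part, int nthreads, int tid) {
    if (nthreads <= part.count)
        return part.first + tid;
    assert(nthreads % part.count == 0);
    return part.first + tid / (nthreads / part.count);
}
```

With 8 places, the outer spread binding of 4 threads puts them on p0, p2, p4 and p6, each owning a sub-partition of two places. The inner close binding of 4 threads then stays inside that sub-partition, for example two threads on p2 and two on p3 for the second outer thread.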

15 Analysis for a SpMXV kernel
Absolute performance: ca. 38 GFlops
Roofline model: 39.4 GFlops as upper limit
- Application of the OpenMP Thread Affinity model
- NUMA-specific memory management with C++ allocator
- Integration via Template Expression mechanism and Adapter pattern
- Exploitation of matrix structure for load balancing and data placement

16 Example: Transactional Memory

17 Motivation
- Processor-level hardware support for speculative lock elision has been introduced by IBM, Intel and others
  - potential for significant performance improvement if used in the right way
  - danger of tremendous penalties if used inappropriately
- No standardized way in OpenMP to select a lock implementation
  - vendor-specific approaches are neither portable nor satisfying
  - a global setting, such as an environment variable, is not sensible
- This work proposes an extended OpenMP API for locks and an extension of the critical construct
  - to support the selection of lock implementations on a per-lock basis
  - to offer backwards compatibility for existing application codes

18 Extended Locking API /1
- Fundamental requirement: do not break any existing code
  - new functionality is introduced as hints
- Three options were considered
  - pragmas to prefix existing lock routines with the desired hint
  - complete set of new locking routines and lock types
  - new lock initialization routines to use with the existing lock API
    - minimal code modification, allows for incremental code adoption
- OpenMP lock review
  - variable of type omp_lock_t or omp_nest_lock_t
  - must be initialized before first use with omp_init[_nest]_lock()
  - routines to initialize, set, unset, and test a lock and finally to destroy it

19 Extended Locking API /2
- Two new lock init functions provide hints to the runtime system
  - void omp_init[_nest]_lock_hinted( omp[_nest]_lock_t*, omp_lock_hint )
- The omp_lock_hint type lists high-level optimization criteria:
  - omp_lock_hint_none
  - omp_lock_hint_uncontended: optimize for an uncontended lock
  - omp_lock_hint_contended: optimize for a contended lock
  - omp_lock_hint_nonspeculative: do not use hardware speculation
  - omp_lock_hint_speculative: do use hardware speculation
  - omp_lock_hint_adaptive: adaptively use hw speculation
  - plus room for vendor-specific extensions
- Similarly: Extended Critical construct
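
A per-lock hint could be used roughly as follows. Note that this sketch does not use the proposed routine name from the slide: it relies on the form standardized in OpenMP 4.5, `omp_init_lock_with_hint()` with `omp_lock_hint_speculative`, and falls back to a plain lock on older runtimes. The helper names `init_speculative_lock` and `locked_sum` and the serial stand-ins are assumptions for illustration.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without OpenMP support. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Request hardware speculation where the runtime implements the
   OpenMP 4.5 hint API (_OPENMP >= 201511); otherwise use a plain lock. */
static void init_speculative_lock(omp_lock_t *lock) {
#if defined(_OPENMP) && _OPENMP >= 201511
    omp_init_lock_with_hint(lock, omp_lock_hint_speculative);
#else
    omp_init_lock(lock);   /* the hint is only an optimization */
#endif
}

/* Sum 1..n with every update protected by the hinted lock. */
long locked_sum(int n) {
    long sum = 0;
    omp_lock_t lock;
    init_speculative_lock(&lock);
    #pragma omp parallel for
    for (int i = 1; i <= n; ++i) {
        omp_set_lock(&lock);
        sum += i;
        omp_unset_lock(&lock);
    }
    omp_destroy_lock(&lock);
    return sum;
}
```

Only the initialization call changes; set, unset and destroy are the existing lock routines, which is the incremental adoption path the slides argue for. The result is the same with or without the hint, since a hint may change performance but not semantics.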

20 Evaluation with NPB UA
- Naive use of HLE locks is not successful
- The more threads are used, the more profitable is the clever use of HLE locks
Test system: Intel Xeon E5-2697v3 (code-name Haswell), 2.6 GHz (no turbo), single socket, Red Hat* Enterprise Linux* 7.0 (kernel 123-el7.x86_64), Intel Composer XE for C/C++ SP with O3 optimization.
* Other names and brands may be the property of others.
Christian Terboven, Matthias S. Müller, IT Center der RWTH Aachen University

21 Correctness Checking

22 How many errors can you spot in this tiny example?

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char** argv)
    {
        int rank, size, buf[8];

        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);

        MPI_Datatype type;
        MPI_Type_contiguous (2, MPI_INTEGER, &type);

        MPI_Recv (buf, 2, MPI_INT, size - rank, 123, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send (buf, 2, type, size - rank, 123, MPI_COMM_WORLD);

        printf ("Hello, I am rank %d of %d.\n", rank, size);

        return 0;
    }

At least 8 issues in this code example

23 How many errors can you spot in this tiny example?
Issues in the code of the previous slide:
- No MPI_Init before first MPI call
- Fortran type in C
- Recv-recv deadlock
- Rank 0: src=size (out of range)
- Type not committed before use
- Type not freed before end of main
- Send 4 int, recv 2 int: truncation
- No MPI_Finalize before end of main
MUST detects these issues and points you to the source
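
For comparison, one possible corrected version is sketched below. It is only one way to resolve the eight issues, not the authors' solution: the partner rank `size - 1 - rank`, the use of MPI_Sendrecv to avoid the deadlock, and the separate send and receive buffers are choices made here. It requires an MPI installation (e.g. compile with mpicc).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int sendbuf[4] = {0, 1, 2, 3};
    int recvbuf[4];

    MPI_Init(&argc, &argv);                    /* initialize before any MPI call */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Datatype type;
    MPI_Type_contiguous(2, MPI_INT, &type);    /* C type instead of MPI_INTEGER */
    MPI_Type_commit(&type);                    /* commit before first use */

    /* Mirror rank: always within 0..size-1 (assumed pairing). */
    int partner = size - 1 - rank;

    /* Combined send/receive avoids the recv-recv deadlock.
       2 x (2 contiguous ints) are sent, 4 ints are received: sizes match. */
    MPI_Sendrecv(sendbuf, 2, type, partner, 123,
                 recvbuf, 4, MPI_INT, partner, 123,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&type);                      /* free the derived type */

    printf("Hello, I am rank %d of %d.\n", rank, size);

    MPI_Finalize();                            /* finalize before leaving main */
    return 0;
}
```

With an odd number of ranks the middle rank exchanges the message with itself, which MPI_Sendrecv permits.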

24 Now what about accelerated systems?
- Hybrid parallel programming (MPI + OpenMP) is even more complex
- Including accelerators (e.g. OpenMP target, OpenACC) even more so
- Recent work made MUST support
  - Threading
  - Offloading

25 Example: Race between Host + Device
- Result not deterministic
- Race detection only possible with memory tracing (pintool)
- OMPT mapping information required

    double result = 0;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections
        {
            #pragma omp section
            #pragma omp target map(tofrom:result)
            { result += compute(); }

            #pragma omp section
            { result += compute(); }
        }
    }
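
One way to remove this race is to give the host and the device their own accumulator and to combine them after the parallel region. The sketch below assumes a trivial stand-in for `compute()` (the slide does not show it) and the hypothetical function name `race_free_sum`; without OpenMP support the pragmas are ignored and the code runs serially.

```c
#include <assert.h>

#pragma omp declare target
/* Hypothetical stand-in for the compute() routine on the slide. */
static double compute(void) { return 21.0; }
#pragma omp end declare target

double race_free_sum(void) {
    double host_part = 0.0;   /* written only by the host section */
    double dev_part  = 0.0;   /* mapped to and written only by the device */

    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                #pragma omp target map(tofrom: dev_part)
                { dev_part += compute(); }
            }

            #pragma omp section
            { host_part += compute(); }
        }
    }
    /* Both sections are complete here, so combining is safe. */
    return host_part + dev_part;
}
```

Each variable now has exactly one writer, so the copy-back of the mapped variable at the end of the target region can no longer overwrite or lose the host's update.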

26 Status Correctness Checking
Comparison of Correctness Checking Capabilities
Columns: Category | Error class | Insp(clang) | Insp(Phi) | MUST
(marks are listed in source order; their column positions are not recoverable)
FK1 | data_missing_accelerator | x
FK1 | data_missing_host | (x) x*
FK1 | data_outdated_accelerator | x
FK1 | data_outdated_host | x*
FK2 | datarace_inside_devkernel | x
FK2 | datarace_across_devkernels |
FK3 | race_between_host_and_device |
FK4 | only_some_thread_pass_barrier | x
FK4 | deadlock_with_locks | x
FK4 | simd_misalign | x**
FK5 | thread_pass_different_barriers | x
FK5 | uninitialized_locks | x
FK6 | dev_allocation_fails | x x
* Check directly implemented in pintool
** Only in specialized version for x86

27 Summary

28 Influence on OpenMP
[Figure labels: Loop-level Parallelization, Tasking, Heterog. Arch.]
- OpenMP 3.0 and 3.1: C++
  - Extension of the canonical form of parallelizable loops + iterator loops
  - Definition of object behavior in the context of data scoping
- OpenMP 4.0: Thread Affinity
  - Integration of the OpenMP thread affinity model, support for nested par.
- OpenMP 4.5:
  - Taskloop construct: loop parallelization by means of tasks (composability)
  - Locks with hints: Support for different lock types, like for transactional memory
- OpenMP TR5 / 5.0:
  - Memory management
  - OpenMP Tools and Debugging Interface
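
The taskloop construct mentioned for OpenMP 4.5 can be illustrated with a minimal sketch. The function name `scale` and the array operation are chosen here for illustration only; without OpenMP support the pragmas are ignored and the loop runs serially.

```c
#include <assert.h>

/* Multiply every element of a by factor, parallelized with tasks. */
void scale(double *a, int n, double factor) {
    #pragma omp parallel
    #pragma omp single
    {
        /* One thread creates the tasks; the loop iterations are split
           into chunks that any thread of the team may execute. */
        #pragma omp taskloop
        for (int i = 0; i < n; ++i)
            a[i] *= factor;
    }
}
```

Because the iterations become tasks rather than a worksharing loop, such a loop composes with other task-based code running in the same parallel region.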

29 Summary
- Research interests of our group in Aachen
  - Parallel Programming Paradigms
  - Correctness Checking
  - Total Cost of Ownership
  - Analysis of parallel architectures
- Abstractions can foster Programmer Productivity
- Development of Programming Languages has to go along with Development of Tools
  - Focus not only on Performance, but also on Correctness

30 Thank you for your attention.
Christian Terboven, Matthias Müller
