Abstractions to foster Programmer Productivity
Christian Terboven <terboven@itc.rwth-aachen.de>
April 5th, 2017
Where is Aachen?
Agenda
- Our Research Activities
- Some Thoughts on Productivity
- Example 1: Thread Affinity
- Example 2: Transactional Memory
- Correctness Checking
- Summary
Research Activities in HPC
Focus on efficient parallel programming for HPC. Topics:
- Parallel programming paradigms (OpenMP and others)
  - Affinity, tasking, nesting, NUMA, object-oriented parallel programming
  - Member of the OpenMP Language Committee and ARB
- Correctness checking (MPI, MPI+OpenMP and other paradigms)
- Total cost of ownership (energy efficiency, programmability, performance)
- Analysis of parallel architectures
  - Member of SPEC
  - Large shared-memory machines
  - Programming for accelerators (GPUs, Intel MIC, prototype architectures)
http://www.rwth-aachen.de
Some Thoughts on Productivity
Case Study: KegelSpan
- 3D simulation of the bevel gear cutting process [1]
- Compute key values (i.a. chip thickness) to analyze tool load and tool wear
- Fortran code (chip thickness computation): loop nest, with dependencies in the inner loop (minimum computation)
Implementation variants:
- Basis: serial Fortran code
- OpenMP-simp: straightforward OpenMP parallelization (no code tuning), data affinity
- OpenMP-vec: restructuring for a good data access pattern (SoA), vectorization, alignment to vector registers, loop interchanges, inlining, data affinity
- OpenMP+LEO: OpenMP-vec (adapted to KNC), LEO directives for offloading kernels
- OpenACC: restructuring for a good data access pattern (SoA), coalescing
- OpenCL: restructuring for a good data access pattern (SoA), coalescing, shared memory
Source: BMW, ZF, Klingelnberg
[1] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pages 1381-1384, Düsseldorf, 2010. VDI Verlag.
KegelSpan Effort & Performance
[Bar chart: modified lines of code, total runtime [s], and effort [days] for the variants OpenMP-simp (SNB), OpenMP-vec (SNB), OpenMP+LEO (Phi), OpenACC (GPU), OpenCL (GPU)]
Systems: Intel Sandy Bridge 16-core host (2x Intel E5-2650 @ 2.0 GHz), Scientific Linux 6.3; NVIDIA Tesla C2050 (ECC on), CUDA Toolkit 5.0/4.1; Intel Westmere 4-core host (1x Intel E5620 @ 2.4 GHz), Scientific Linux 6.3. Compilers: Intel 13.0.1 (OpenMP, serial); Intel 13.0.1 / PGI 12.9 (OpenCL, OpenACC on GPU).
What is Productivity?
- Productivity = value / costs = amount of science / costs = #app. runs / TCO
  - The view on productivity may differ between scientist and HPC provider
  - TCO: topic of active research
- We believe: abstractions can foster programmer productivity
  - Several studies showed that using pragmas (i.e. OpenMP) is more productive than using lower-level APIs (i.e. POSIX Threads)
  - Less programming effort
  - Easier to learn and to grasp important concepts
Example: Thread Affinity
Motivation
- 2004: Sun Fire E25k server
  - 144 Sun UltraSPARC IV cores
  - Max. memory bandwidth: ca. 170 GB/s
- 2012: Bull BCS system (in our cluster)
  - 128 Intel Nehalem-EX cores
  - Max. memory bandwidth: ca. 230 GB/s
  - 6 to 10 systems per rack possible
- 2016: Intel Broadwell with Cluster-on-Die
The memory hierarchy becomes more and more complex: at least two NUMA levels, and this is a challenge to program for!
The OpenMP Places Concept
- Specification of thread affinity has to happen within the machine abstraction
- Consider the following system: 2 sockets, 4 cores per socket, 4 hyper-threads per core (cores c0 ... c7)
- Place: a set of execution units
- Place list: an (ordered) list of places
- The OpenMP place list is defined by the OMP_PLACES environment variable:
  - Specification of a regular expression, or
  - Specification of an abstract name, such as:
    - threads: one place per hyper-thread
    - cores: one place per core (contains multiple hyper-threads)
    - sockets: one place per socket (contains multiple cores)
- Reduction of the complex architecture to its relevant performance-critical properties
Illustration of Thread Affinity
- Selection of an application-specific strategy:
  - spread: separation of threads within the place list
  - close: placement of threads closely together
  - master: co-location of threads on a single place
- Example (nested parallelism): separation in the outer region, nearness in the inner region:
  OMP_PLACES="{0,1,2,3},{4,5,6,7},..." = "{0:4}:8:4" = cores
  #pragma omp parallel proc_bind(spread) num_threads(4)
  #pragma omp parallel proc_bind(close) num_threads(4)
[Diagram: resulting thread placement over the places p0 ... p7 for each clause]
Analysis for a SpMxV Kernel
- Absolute performance: ca. 38 GFlops; the roofline model gives 39.4 GFlops as the upper limit
- Application of the OpenMP thread affinity model
- NUMA-specific memory management with a C++ allocator
- Integration via the expression template mechanism and the adapter pattern
- Exploitation of the matrix structure for load balancing and data placement
Example: Transactional Memory
Motivation
- Processor-level hardware support for speculative lock elision has been introduced by IBM, Intel and others
  - Potential for significant performance improvement if used in the right way
  - Danger of tremendous penalties if used inappropriately
- No standardized way in OpenMP to select a lock implementation
  - Vendor-specific approaches are neither portable nor satisfying
  - A global setting, such as an environment variable, is not sensible
- This work proposes an extended OpenMP API for locks and an extension of the critical construct
  - To support the selection of lock implementations on a per-lock basis
  - To offer backwards compatibility for existing application codes
Extended Locking API /1
- Fundamental requirement: do not break any existing code; the new functionality is introduced as hints
- Three options were considered:
  - Pragmas to prefix existing lock routines with the desired hint
  - A complete set of new locking routines and lock types
  - New lock initialization routines to use with the existing lock API: minimal code modification, allows for incremental code adoption
- OpenMP lock review:
  - Variable of type omp_lock_t or omp_nest_lock_t
  - Must be initialized before first use with omp_init[_nest]_lock()
  - Routines to initialize, set, unset, and test a lock, and finally to destroy it
Extended Locking API /2
- Two new lock init functions provide hints to the runtime system:
  void omp_init[_nest]_lock_hinted(omp[_nest]_lock_t *, omp_lock_hint)
- The omp_lock_hint type lists high-level optimization criteria:
  - omp_lock_hint_none
  - omp_lock_hint_uncontended: optimize for an uncontended lock
  - omp_lock_hint_contended: optimize for a contended lock
  - omp_lock_hint_nonspeculative: do not use hardware speculation
  - omp_lock_hint_speculative: do use hardware speculation
  - omp_lock_hint_adaptive: adaptively use hardware speculation
  - Plus room for vendor-specific extensions
- Similarly: extended critical construct
Evaluation with NPB UA
- Naive use of HLE locks is not successful
- The more threads are used, the more profitable the clever use of HLE locks becomes
System: Intel Xeon E5-2697v3 (code-name "Haswell"), 2.6 GHz (no turbo), single socket, Red Hat* Enterprise Linux* 7.0 (kernel 3.10.0-123.el7.x86_64), Intel Composer XE for C/C++ 2013 SP1 2.144 with -O3 optimization.
* Other names and brands may be the property of others.
Correctness Checking
How many errors can you spot in this tiny example?

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char** argv) {
        int rank, size, buf[8];
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);
        MPI_Datatype type;
        MPI_Type_contiguous (2, MPI_INTEGER, &type);
        MPI_Recv (buf, 2, MPI_INT, size - rank, 123,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send (buf, 2, type, size - rank, 123, MPI_COMM_WORLD);
        printf ("Hello, I am rank %d of %d.\n", rank, size);
        return 0;
    }

At least 8 issues in this code example.
How many errors can you spot in this tiny example? (Answers)
- No MPI_Init before the first MPI call
- Fortran type (MPI_INTEGER) used in C
- Recv-recv deadlock: all ranks receive first
- Rank 0: src = size (out of range)
- Type not committed before use
- Type not freed before the end of main
- Send of 4 ints, receive of 2 ints: truncation
- No MPI_Finalize before the end of main
MUST detects these issues and pinpoints you to the source.
Now what about accelerated systems?
- Hybrid parallel programming (MPI + OpenMP) is even more complex
- Including accelerators (i.e. OpenMP target, OpenACC) is more complex still
- Recent work made MUST support threading and offloading
Example: Race between Host and Device
- Result is not deterministic
- Race detection is only possible with memory tracing (pintool)
- OMPT mapping information is required

    double result = 0;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections
        {
            #pragma omp section
            #pragma omp target map(tofrom:result)
            {
                result += compute();
            }
            #pragma omp section
            {
                result += compute();
            }
        }
    }
Status Correctness Checking
Comparison of correctness checking capabilities; tools compared: Insp(clang), Insp(Phi), MUST.
- FK1 data_missing_accelerator: x
- FK1 data_missing_host: (x), x*
- FK1 data_outdated_accelerator: x
- FK1 data_outdated_host: x*
- FK2 datarace_inside_devkernel: x
- FK2 datarace_across_devkernels: not detected
- FK3 race_between_host_and_device: not detected
- FK4 only_some_thread_pass_barrier: x
- FK4 deadlock_with_locks: x
- FK4 simd_misalign: x**
- FK5 thread_pass_different_barriers: x
- FK5 uninitialized_locks: x
- FK6 dev_allocation_fails: x, x
* Check directly implemented in the pintool
** Only in a specialized version for x86
Summary
Influence on OpenMP
Timeline: OpenMP 1.0 (1998), 2.0, 2.5, 3.0, 3.1, 4.0, 4.5 (2015); from loop-level parallelization via tasking to heterogeneous architectures.
- OpenMP 3.0 and 3.1: C++
  - Extension of the canonical form of parallelizable loops to iterator loops
  - Definition of object behavior in the context of data scoping
- OpenMP 4.0: thread affinity
  - Integration of the OpenMP thread affinity model, support for nested parallelism
- OpenMP 4.5:
  - Taskloop construct: loop parallelization by means of tasks (composability)
  - Locks with hints: support for different lock types, e.g. for transactional memory
- OpenMP TR5 / 5.0:
  - Memory management
  - OpenMP tools and debugging interface
Summary
- Research interests of our group in Aachen:
  - Parallel programming paradigms
  - Correctness checking
  - Total cost of ownership
  - Analysis of parallel architectures
- Abstractions can foster programmer productivity
- The development of programming languages has to go along with the development of tools
  - Focus not only on performance, but also on correctness
Thank you for your attention.
Christian Terboven <terboven@itc.rwth-aachen.de>
Matthias Müller <mueller@itc.rwth-aachen.de>