Parallel Programming: The Ultimate Road to Performance. April 16, 2013. Werner Krotz-Vogel


1 Parallel Programming: The Ultimate Road to Performance
April 16, 2013
Werner Krotz-Vogel

2 Getting started with parallel algorithms
Concurrency is a general concept: multiple activities that can occur and make progress at the same time.
A parallel algorithm is any algorithm that uses concurrency to solve a problem of a given size in less time.
Scientific programmers have been working with parallelism since the early '80s. Hence we have almost 30 years of experience to draw on to help us understand parallel algorithms.
Mathew J. Sottile, Timothy G. Mattson, and Craig E

3 Enabling & Advancing Parallelism
High Performance Parallel Programming
Intel tools, libraries and parallel models extend to multicore, many-core and heterogeneous computing.
[Figure: Compiler, Libraries and Parallel Models feed the same Code to four targets: Multicore (multicore CPU), Many-core (multicore CPU plus Intel Xeon Phi coprocessor), Cluster (multicore cluster), and Multicore & Many-core Cluster.]
Develop & Parallelize Today for Maximum Performance
Use One Software Architecture Today. Scale Forward Tomorrow.

4 Intel Software Development Products Deliver Application Performance
Advanced Performance / Cluster Performance
- Intel C/C++ and Fortran Compilers w/ OpenMP
- Intel MKL, Intel Cilk Plus, Intel TBB Library, Intel IPP Library
- Intel Inspector XE, Intel VTune Amplifier XE, Intel Advisor
- Intel MPI Library
- Intel Trace Analyzer and Collector
Intel Parallel Studio XE
Foundation of Performance, Productivity, and Standards


5 A Family of Parallel Programming Models: Developer Choice
- Intel Cilk Plus: C/C++ language extensions to simplify parallelism. Open sourced; also an Intel product.
- Intel Threading Building Blocks: Widely used C++ template library for parallelism. Open sourced; also an Intel product.
- Domain-Specific Libraries: Intel Integrated Performance Primitives, Intel Math Kernel Library
- Established Standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*
- Research and Development: Intel Concurrent Collections, Offload Extensions, Intel SPMD Parallel Compiler
Choice of high-performance parallel programming models
Applicable to Multicore and Many-core Programming


6 Invest in Common Tools and Programming Models
- Multicore: Intel Xeon processors are designed for intelligent performance and smart energy efficiency.
- Many-core: Intel Xeon Phi coprocessors are ideal for highly parallel computing applications.
- Continuing to advance Intel Xeon processor family and instruction set (e.g., Intel AVX, etc.)
- Software development platforms ramping now
Use One Software Architecture Today. Scale Forward Tomorrow.


7 Go Parallel with High Performance Math Kernel Library
Intel Math Kernel Library (Intel MKL)
    void foo() /* Intel Math Kernel Library */
    {
        float *A, *B, *C; /* Matrices */
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
    }
Implicit automatic offloading requires no code changes; simply link with the offload MKL Library.
[Figure: the same call runs on the Intel Xeon processor and on the Intel Xeon Phi coprocessor.]
Intel High Performance Math Kernel Library is Applicable to Multicore and Many-core Programming
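
To make the sgemm call above concrete, here is a minimal sketch of the arithmetic it stands for: C = alpha * A * B + beta * C. It is plain, unoptimized C++ and not MKL code. The name sgemm_reference is hypothetical, and the sketch assumes square N x N matrices in column-major storage with no transposition (the transa and transb arguments of the slide are not modelled).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not part of MKL): computes
// C = alpha * A * B + beta * C for square n x n matrices stored in
// column-major order, i.e. element (row, col) lives at index row + col * n.
// Only the non-transposed case is covered.
void sgemm_reference(int n, float alpha, const std::vector<float>& a,
                     const std::vector<float>& b, float beta,
                     std::vector<float>& c) {
    const std::size_t size = static_cast<std::size_t>(n) * n;
    assert(a.size() == size && b.size() == size && c.size() == size);
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < n; ++row) {
            float dot = 0.0f;  // inner product of row `row` of A and column `col` of B
            for (int k = 0; k < n; ++k) {
                dot += a[row + k * n] * b[k + col * n];
            }
            c[row + col * n] = alpha * dot + beta * c[row + col * n];
        }
    }
}
```

A tuned library performs the same computation with blocking, vectorization and threading, which is why the slide can move the call between processor and coprocessor without touching the source.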

8 Go Parallel with Intel Cilk Plus
Proven Cilk parallel model, teachable in one minute
Parallelism in Three Key Words: cilk_spawn, cilk_sync, cilk_for
Cilk Plus: an open specification. Recently placed into open source by Intel for the advancement of parallel programming.
    // Parallel function invocation, in C
    cilk_for (int i=0; i<n; ++i){
        Foo(a[i]);
    }
    // Parallel spawn in a recursive fibonacci
    // computation, in C
    int fib (int n) {
        if (n < 2) return 1;
        else {
            int x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return x + y;
        }
    }
Intel Cilk Plus is Applicable to Multicore and Many-core Programming
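
For readers without a Cilk Plus compiler, the spawn/sync pattern of the fib example can be approximated in standard C++ with std::async. This is only a sketch of the control flow: Cilk uses a work-stealing scheduler, whereas std::launch::async typically starts one thread per task, so the hypothetical spawn_depth parameter (not a Cilk concept) is added to cap the number of tasks. The name fib_async is also hypothetical.

```cpp
#include <cassert>
#include <future>

// Same recurrence as the slide (fib(0) == fib(1) == 1), written with
// standard C++ futures instead of Cilk Plus keywords.
// `spawn_depth` limits how many levels of the recursion start a new task.
int fib_async(int n, int spawn_depth = 3) {
    assert(n >= 0);
    if (n < 2) return 1;
    if (spawn_depth <= 0) {
        // Below the cutoff, recurse serially to avoid creating too many threads.
        return fib_async(n - 1, 0) + fib_async(n - 2, 0);
    }
    // Roughly "x = cilk_spawn fib(n-1)": this branch may run on another thread.
    std::future<int> x =
        std::async(std::launch::async, fib_async, n - 1, spawn_depth - 1);
    // The caller continues with the second branch, as on the slide.
    const int y = fib_async(n - 2, spawn_depth - 1);
    // Roughly "cilk_sync": wait for the spawned branch before combining.
    return x.get() + y;
}
```

With the slide's base case, fib_async(10) returns 89.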


9 Go Parallel with Intel Cilk Plus
Data and Task Parallelism as first class citizens in C and C++
Vectorization via intuitive notations that automatically span MMX, SSE, AVX, and wider widths in the future, including those in the Intel Xeon Phi coprocessors:
- array notations
- #pragma SIMD controls
- elemental functions
    // Simplify operation using
    // array notations in C/C++:
    a[:] = b[:] + c[:];
    // Elemental functions, in C,
    // using Cilk Plus:
    __declspec (vector)
    void saxpy(float a, float x, float &y) { y += a * x; }
    // pragma SIMD: User-mandated
    // vectorization
    #pragma simd
    for (i=0; i<n; i++) {
        A[i] = A[i] + B[i] + C[i];
    }
Intel Cilk Plus is Applicable to Multicore and Many-core Programming
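
The array notation and the elemental function above both describe element-wise work that the compiler is free to vectorize. As a rough illustration of their meaning only, the following standard C++ sketch spells out the equivalent scalar loops. The helper names add_arrays and saxpy_all are hypothetical, and nothing here forces vectorization the way the Cilk Plus constructs do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar kernel, mirroring the elemental function on the slide.
inline void saxpy(float a, float x, float& y) { y += a * x; }

// Explicit-loop meaning of the array notation a[:] = b[:] + c[:].
std::vector<float> add_arrays(const std::vector<float>& b,
                              const std::vector<float>& c) {
    assert(b.size() == c.size());
    std::vector<float> a(b.size());
    for (std::size_t i = 0; i < b.size(); ++i) a[i] = b[i] + c[i];
    return a;
}

// Applies the scalar kernel to every element pair; an elemental function
// lets the compiler generate a vector version of exactly this loop body.
void saxpy_all(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) saxpy(a, x[i], y[i]);
}
```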

10 Go Parallel with Intel Threading Building Blocks (Intel TBB)
A popular parallel abstraction for C++ developers
- A C++ template library
- Scalable memory allocation
- Load-balancing
- Work-stealing task scheduling
- Thread-safe pipeline
- Concurrent containers
- High-level parallel algorithms
- Numerous synchronization primitives
    // Parallel function invocation example, in C++,
    // using TBB:
    parallel_for (0, n, [=](int i) {
        Foo(a[i]);
    });
Intel remains a leading participant and contributor in the TBB open source project as well as a leading supplier of TBB support and supporting tools.
Intel TBB is Applicable to Multicore and Many-core Programming
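
To show what a call like parallel_for(0, n, body) has to do underneath, here is a deliberately simple sketch built on std::thread. It is not the TBB implementation or API: simple_parallel_for is a hypothetical helper that splits the range into fixed contiguous chunks, one per thread, while TBB uses recursive range splitting and work stealing for load balancing.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper (not the TBB API): calls body(i) for every i in
// [first, last), using contiguous static chunks, one chunk per thread.
// `body` must be safe to call concurrently for different i.
template <typename Body>
void simple_parallel_for(int first, int last, int num_threads, Body body) {
    assert(num_threads >= 1);
    if (last <= first) return;
    const int total = last - first;
    num_threads = std::min(num_threads, total);
    const int chunk = (total + num_threads - 1) / num_threads;  // ceiling division
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        const int begin = first + t * chunk;
        const int end = std::min(last, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([begin, end, &body] {
            for (int i = begin; i < end; ++i) body(i);
        });
    }
    // Wait for every chunk before returning, like parallel_for does.
    for (std::thread& w : workers) w.join();
}
```

A call such as simple_parallel_for(0, n, 4, [&](int i) { Foo(a[i]); }) would mirror the slide's example, assuming Foo and a are defined by the caller.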


11 Intel Threading Building Blocks
- Generic Parallel Algorithms: Efficient, scalable way to exploit the power of multi-core without having to start from scratch
- Task scheduler: The engine that empowers parallel algorithms; it employs task-stealing to maximize concurrency
- Concurrent Containers: Concurrent access, and a scalable alternative to containers that are externally locked for thread-safety
- TBB flow graph
- Thread Local Storage: Scalable implementation of thread-local data that supports an infinite number of TLS
- Synchronization Primitives: User-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
- Memory Allocation: Per-thread scalable memory manager and false-sharing free allocators
- Threads: OS API wrappers
- Miscellaneous: Thread-safe timers


12 TBB Flow Graph Dependence Example
[Figure: dependence graph with nodes A, B, C, D and E, each running f().]
    #include <cstdio>
    #include <string>
    #include "tbb/flow_graph.h"
    using namespace tbb::flow;
    struct body {
        std::string my_name;
        body( const char *name ) : my_name(name) {}
        void operator()( continue_msg ) const {
            printf("%s\n", my_name.c_str());
        }
    };
    int main() {
        graph g;
        // broadcast_node has no default constructor; it needs the graph.
        broadcast_node< continue_msg > start( g );
        continue_node< continue_msg > a( g, body("a") );
        continue_node< continue_msg > b( g, body("b") );
        continue_node< continue_msg > c( g, body("c") );
        continue_node< continue_msg > d( g, body("d") );
        continue_node< continue_msg > e( g, body("e") );
        make_edge( start, a );
        make_edge( start, b );
        make_edge( a, c );
        make_edge( b, c );
        make_edge( c, d );
        make_edge( a, e );
        for (int i = 0; i < 3; ++i ) {
            start.try_put( continue_msg() );
            g.wait_for_all();
        }
        return 0;
    }
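
The edges in the example define which bodies may run before which others: a node fires only after all of its predecessors have signalled. The sketch below models that rule serially with a topological sort (Kahn's algorithm). It is an illustration, not TBB code: dependence_order is a hypothetical function, and where the real flow graph may run ready nodes concurrently and in any order, this model simply processes them first-in, first-out.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Serial model of a dependence graph: a node may run once all of its
// predecessors have run. Returns one valid execution order.
std::vector<std::string> dependence_order(
    const std::vector<std::pair<std::string, std::string>>& edges) {
    std::map<std::string, std::vector<std::string>> successors;
    std::map<std::string, int> pending;  // predecessors that have not run yet
    for (const auto& e : edges) {
        successors[e.first].push_back(e.second);
        pending[e.first] += 0;  // registers source nodes with a count of zero
        pending[e.second] += 1;
    }
    std::queue<std::string> ready;
    for (const auto& p : pending)
        if (p.second == 0) ready.push(p.first);
    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::string node = ready.front();
        ready.pop();
        order.push_back(node);
        for (const std::string& next : successors[node])
            if (--pending[next] == 0) ready.push(next);
    }
    assert(order.size() == pending.size());  // fails if the graph has a cycle
    return order;
}
```

Fed the six edges of the slide, the model reports that c can only follow both a and b, d can only follow c, and e needs only a.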


13 Go Parallel with Message Passing Interface (MPI)
Intel Message Passing Interface (Intel MPI)
Extend your cluster solutions to the Intel Xeon Phi coprocessor.
E.g., Intel Xeon Phi coprocessor in every node of the cluster, using Intel MPI and Intel Threading Building Blocks and/or Intel Cilk Plus on nodes.
Same model as an Intel Xeon processor based cluster.
[Figure: Multicore Cluster; Clusters with Multicore and Many-core.]
Intel is a leading vendor of MPI implementations and tools.
MPI is applicable to Multicore and Many-core Programming

14 Go Parallel with Coarray Fortran
Intel Fortran Compiler
    ! Sum in Fortran, using co-array feature:
    REAL SUM[*]
    CALL SYNC_ALL( WAIT=1 )
    DO IMG = 2, NUM_IMAGES()
      IF (IMG == THIS_IMAGE()) THEN
        SUM = SUM + SUM[IMG-1]
      ENDIF
      CALL SYNC_ALL( WAIT=IMG )
    ENDDO
A standard, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax.
For parallel programming on both shared memory and distributed memory systems.
Coarray Fortran is Applicable to Multicore and Many-core Programming
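
The coarray loop above passes a running total from image to image: in step IMG, only image IMG adds the value held by its left neighbour. A serial C++ model of that data flow may help readers who do not know Fortran. It is a sketch under the assumption that each image starts with its own local value in SUM; the function name coarray_running_sum is hypothetical, and the synchronization between steps is implicit in the serial loop.

```cpp
#include <cstddef>
#include <vector>

// Serial model of the coarray loop: images[k] stands for SUM on image k + 1.
// In step img, only that image updates its value by reading its left
// neighbour's copy, which corresponds to SUM = SUM + SUM[IMG-1].
std::vector<float> coarray_running_sum(std::vector<float> images) {
    for (std::size_t img = 1; img < images.size(); ++img) {
        images[img] += images[img - 1];
    }
    return images;
}
```

After the loop, each image holds the sum of its own value and all values to its left, so the last image holds the total.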

15 Go Parallel with OpenMP*
Intel C/C++ and Fortran Compilers (C Example)
    #include <stdio.h>
    int main() {
        double pi = 0.0;
        long i, n = 1000000; /* interval count; value assumed, not shown on the slide */
        #pragma offload target (mic)
        #pragma omp parallel for reduction(+:pi)
        for (i=0; i<n; i++) {
            double t = (double)((i+0.5)/n);
            pi += 4.0/(1.0+t*t);
        }
        printf("pi = %f\n", pi/n);
        return 0;
    }
One Line Change to Offload to the Intel Xeon Phi coprocessor
[Figure: Intel Xeon processor and Intel Xeon Phi coprocessor.]
OpenMP* is Applicable to Multicore and Many-core Programming
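
The reduction(+:pi) clause is what makes the loop above safe to run in parallel: each thread accumulates into a private copy and the copies are added at the end. The following sketch writes that reduction out by hand with std::thread, assuming nothing beyond the standard library. The function name estimate_pi and the cyclic distribution of iterations are choices made for this illustration, not what an OpenMP runtime necessarily does.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Midpoint-rule estimate of pi as the integral of 4 / (1 + t^2) over [0, 1],
// with a hand-written reduction: each thread accumulates a private partial
// sum, and the partial sums are combined after all threads have joined.
double estimate_pi(long n, int num_threads) {
    assert(n > 0 && num_threads >= 1);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([t, n, num_threads, &partial] {
            double local = 0.0;
            // Cyclic distribution: thread t handles i = t, t + T, t + 2T, ...
            for (long i = t; i < n; i += num_threads) {
                const double x = (i + 0.5) / n;
                local += 4.0 / (1.0 + x * x);
            }
            partial[t] = local;  // single write per thread, after its loop
        });
    }
    for (std::thread& w : workers) w.join();
    double sum = 0.0;
    for (double p : partial) sum += p;
    return sum / n;
}
```

Keeping the accumulator in a local variable and writing it out once also avoids threads repeatedly touching neighbouring array slots during the loop.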

16 Go Parallel with OpenMP*
Intel C/C++ and Fortran Compilers (Fortran Example)
    !dir$ omp offload target(mic)
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do
One Line Change to Offload to the Intel Xeon Phi coprocessor
[Figure: Intel Xeon processor and Intel Xeon Phi coprocessor.]
OpenMP* is Applicable to Multicore and Many-core Programming

17 Go Parallel with C/C++ Language Extensions
Simple Keyword Language Extensions to control offloading to Intel Xeon Phi coprocessor
C/C++ Language Extensions
    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();
C/C++ Language Extensions are Applicable to Multicore and Many-core Programming

18 Use the Same Code for Execution on Intel Xeon Phi coprocessors by Offloading
C/C++ Offload Pragma
    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<count; i++) {
        float t = (float)((i+0.5)/count);
        pi += 4.0/(1.0+t*t);
    }
    pi /= count;
Fortran Offload Directive
    !dir$ omp offload target(mic)
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do
MKL Implicit Offload
    // MKL implicit offload requires no source code changes,
    // simply link with the offload MKL Library.
MKL Explicit Offload
    #pragma offload target (mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        in(C:length(matrix_elements)) \
        out(C:length(matrix_elements) alloc_if(0))
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
C/C++ Language Extensions
    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();

19 Parallelism with OpenCL*
Intel OpenCL SDK
OpenCL* is a framework for writing programs that execute across heterogeneous platforms (e.g., CPUs, GPUs, many-core).
    // Simple per element multiplication using OpenCL*:
    kernel void dotprod( global const float *a,
                         global const float *b,
                         global float *c)
    {
        int myid = get_global_id(0);
        c[myid] = a[myid] * b[myid];
    }
Intel is a leading participant in the OpenCL* standard efforts, and a vendor of solutions and related tools with early implementations available today.
OpenCL* addresses the needs of customers in specific segments.
OpenCL* is applicable to multicore and many-core programming


20 Running your Application
Execution on the host and Intel Xeon Phi coprocessor
Without: Intel Xeon Phi coprocessor(s) are absent
- Application starts and executes on host
- At each offload, the construct runs on host cores/threads
- Normal program termination on host
With: Intel Xeon Phi coprocessor(s) are present
- Application starts on host and executes portions on Intel Xeon Phi coprocessor(s)
- At runtime, if Intel Xeon Phi coprocessor(s) are available, the target binary is loaded
- At each offload, the construct runs on the Intel Xeon Phi coprocessor(s)
- At program termination, target binary is unloaded
[Figure: execution flow. Intel host processor (multicore): your application with identified compute-intensive kernels, host offload library, message library. Intel Xeon Phi coprocessor(s) (many-core): your application with identified compute-intensive kernels, target offload library, message library.]

21 Intel MPI/Thread Environment Support
The execution command mpirun of Intel MPI reads argument sets from the command line:
- Sections between ":" define an argument set (alternatively, a line in a configfile specifies a set)
- Host, number of nodes, but also environment can be set independently in each argument set
    # mpirun -env I_MPI_PIN_DOMAIN 4 -host myxeon ... : -env I_MPI_PIN_DOMAIN 16 -host mymic
Adapt the important environment variables to the architecture:
- OMP_NUM_THREADS, KMP_AFFINITY for OpenMP
- CILK_NWORKERS for Intel Cilk Plus
* Although locality issues apply as well, multicore threading runtimes are by far more expressive, richer, and with lower overhead.


22 Analyzing your Application
Performance Analysis Tools
Intel VTune Amplifier XE performance profiler
- Analyze your multicore and many-core performance
- Analyze performance of the application in offload mode
Support for Intel Xeon Phi coprocessors includes:
- A Linux* hosted command line tool that collects events
- The VTune Amplifier XE graphical user interface to display the results collected in the previous step, highlighting bottlenecks, time spent and other details of performance

23 GDB* on Intel Xeon Phi Coprocessor
- GDB* supports Intel Xeon Phi Coprocessor
- Intel upstreams features and capabilities to GNU* community
- Broad enabling of developers and software tools ecosystem
- Available from Intel

24 The GNU* Project Debugger and Intel Xeon Phi Coprocessor
- Native and cross-debugger versions of GDB* exist for the Intel Xeon Phi coprocessor
- It is part of the Intel Manycore Platform Software Stack (Intel MPSS)
- You can debug with it as either root or a user
Intel Confidential NDA presentation

25 Native debugging on the Intel Xeon Phi Coprocessor with GDB*
Run GDB* on the Intel Xeon Phi Coprocessor:
    ssh -t mic0 /usr/bin/gdb
To attach to a running application via the process-id:
    (gdb) shell pidof my_application
    42
    (gdb) attach 42
To run an application directly from GDB*:
    (gdb) file /target/path/to/application
    (gdb) start

26 Remote debugging with GDB* for Intel Xeon Phi Coprocessor
Run GDB* on your localhost:
    /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
Start gdbserver on the Intel Xeon Phi Coprocessor.
To remote debug using ssh:
    (gdb) target extended-remote ssh -T mic0 gdbserver --multi IP:port
To remote debug using stdio:
    (gdb) target extended-remote | ssh -T mic0 gdbserver --multi -
To attach to a running application via the process-id (pid):
    (gdb) file /local/path/to/application
    (gdb) attach <remote-pid>
To run an application directly from GDB*:
    (gdb) file /local/path/to/application
    (gdb) set remote exec-file /target/path/to/application

27 Explore Intel Xeon Phi Coprocessor Architecture Features
List all new vector and mask registers:
    (gdb) info registers zmm
    k0 0x0 0
    zmm31 {v16_float = {0x0 <repeats 16 times>},
      v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
      v64_int8 = {0x0 <repeats 64 times>},
      v32_int16 = {0x0 <repeats 32 times>},
      v16_int32 = {0x0 <repeats 16 times>},
      v8_int64 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
      v4_uint128 = {0x0, 0x0, 0x0, 0x0}}
Disassemble instructions:
    (gdb) disassemble $pc, +10
    Dump of assembler code from 0x11 to 0x24:
    0x <foobar+17>: vpackstorelps %zmm0,-0x10(%rbp){%k1}
    0x <foobar+24>: vbroadcastss -0x10(%rbp),%zmm0


28 Intel Software Tools Roadmap
[Figure: roadmap timeline covering Q3 '12, Q4 '12, Q1 '13, Q2 '13 and Q3 '13. Legend: Alpha, Beta, Gold release; Beta release window for Microsoft Windows*; Release window.]
Data Center Tools: High Performance Computing / Enterprise
- Intel Parallel Studio XE 2013: Support for Intel Xeon Phi Coprocessors (Linux)
- Intel Parallel Studio XE NEXT
- Intel Cluster Studio XE 2012
- Intel Cluster Studio XE 2013: Support for Intel Xeon Phi Coprocessors (Linux)
- Intel Cluster Studio XE NEXT
Many-Core
- Intel Xeon Phi Coprocessor Support for Windows* (Alpha)
- Intel Xeon Phi Coprocessor Support for Windows* (Beta)


29 Preserve Your Development Investment
Common Tools and Programming Models for Parallelism
[Figure: C/C++ (Intel C/C++ Compiler): Intel Cilk Plus, OpenCL*, OpenMP*, Intel TBB, Offload Pragmas. Fortran (Intel Fortran Compiler): Coarray, Offload Directives, OpenMP*. Shared: Intel MKL, Intel MPI. Targets: Multicore, Many-core, Heterogeneous Computing.]
Develop Using Parallel Models that Support Heterogeneous Computing

30 Conclusion
There are many parallel programming models in existence, but only a small number are actually used and standardized across platforms:
- OpenMP
- MPI
- TBB
- Cilk
- Pthreads
- OpenCL
Everything you do to make applications run well on Intel Xeon Phi coprocessors (vectorization, parallelization) can be done with the models above (OpenMP, MPI, etc.). The same work also applies to Intel Xeon processors, and typically improves performance there too.

31 Call to Action
Evaluate the Intel Software Development Products, including the family of Parallel Programming Models, for your High Performance needs:
Note: The Intel Parallel Studio XE 2013 and Intel Cluster Studio XE 2013 products include support for Intel Xeon Phi coprocessors prior to the coprocessors being generally available.
For product information see:


33 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright 2012, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, Xeon Phi, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #
