Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core
Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

This document contains information on products in the design phase of development. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel product plans in this presentation do not constitute Intel plan-of-record product roadmaps. Please contact your Intel representative to obtain Intel's current plan-of-record product roadmaps.
Intel, VTune, Cilk, Xeon and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2012 Intel Corporation. All rights reserved.
Table of Contents
- Intel in HPC
- Intel HPC Software Development Products
- High Performance Parallel Programming with Intel's architectures
  - Features and benefits
  - Investment protection
  - Better performance and efficiency
- Call to Action and Summary
Intel in High-Performance Computing
- Dedicated, renowned applications expertise
- Large-scale clusters for test and optimization
- Tera-scale research and exa-scale labs
- Broad software tools portfolio
- Defined HPC application platform and platform building blocks
- Manufacturing process technologies
- Leading-performance, energy-efficient Many Integrated Core architecture
A long-term commitment to the HPC market segment.
Intel Technology is Changing HPC: Performance, Energy Efficiency, Reliability, TCO
- Processors (Xeon and MIC): scalable performance and energy efficiency, multi- and many-core
- Solid state disk: optimized performance for I/O-intensive applications and boot-drive replacement
- 10GbE: bridging the gap between 1GbE and InfiniBand*, with RDMA and unified networking
A platform approach to high performance.
The Majority of All HPC Systems Are Clusters (Source: IDC)
Parallelism appears at three levels:
- Message passing between and inside nodes over the interconnect
- Multi-threading within each (SMP) node, across its cores and any co-processors
- Vectorization (SIMD) within each core
Software Development and System Environment
- Established HPC operating system: Linux*
- The same comprehensive set of software tools targets both the Intel Xeon processor and the Intel Many Integrated Core architecture
- The same application source code builds for either target with a compiler switch
Scaling Performance Forward: Software Tools Vision
Employ versatile and common development tools across all IA architectures:
- A single, portable software stack
- Flexible programmability
- Scalable performance across data-parallelism, thread-parallelism, and messaging
High Performance Parallel Programming Features and Benefits: Details
Enabling & Advancing Parallelism: High Performance Parallel Programming
Compilers, libraries, and parallel models share one code base across multicore, many-core, and cluster targets. Develop and parallelize today for maximum performance on:
- Multicore CPUs
- Multicore CPUs plus Intel MIC architecture co-processors
- Multicore clusters
- Multicore and many-core clusters
Use one software architecture today. Scale forward tomorrow.
More cores. Wider vectors. Co-processors. Tools need to address all three dimensions to deliver performance. (Images do not reflect actual die sizes.)

  Processor                           Cores  Threads  SIMD Width  ISA
  Intel Xeon processor, 64-bit          1       2        128      SSE2
  Intel Xeon processor 5100 series      2       2        128      SSSE3
  Intel Xeon processor 5500 series      4       8        128      SSE4.2
  Intel Xeon processor 5600 series      6      12        128      SSE4.2
  Xeon code-named Sandy Bridge          8      16        256      AVX
  Xeon code-named Ivy Bridge            -       -        256      AVX
  Xeon code-named Haswell               -       -        256      AVX2 (FMA3)
  Intel MIC code-named Knights Ferry   32     128        512      -
  Intel MIC code-named Knights Corner >50    >200       512      -

Software challenge: develop scalable software.
High Performance Software Products Supporting Multicore and Many-core Development
Intel Parallel Studio XE* (advanced performance) and Intel Cluster Studio XE* (distributed performance) include:
- Intel C/C++ and Fortran Compilers with OpenMP*
- Intel MKL, Intel Cilk Plus, Intel TBB Library, Intel ArBB Library, and Intel IPP Library
- Intel Inspector XE, Intel VTune Amplifier XE, and Intel Advisor
- Intel MPI Library and Intel Trace Analyzer and Collector (Cluster Studio XE)
Proven performance. Scale forward.
Invest in Common Tools and Programming Models
- Multicore: Intel Xeon processors are designed for intelligent performance and smart energy efficiency; Intel continues to advance the Xeon processor family and instruction set (e.g., Intel AVX).
- Many-core: Intel MIC architecture co-processors are ideal for highly parallel computing applications; software development platforms are ramping now.
The same code serves both. Use one software architecture today. Scale forward tomorrow.
Optimized Intel Libraries

Intel MKL (Math Kernel Library)
- Oriented toward science, engineering, and financial applications
- Includes BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math

Intel IPP (Integrated Performance Primitives)
- Oriented toward multimedia, data processing, and communications applications
- Includes cryptography and string processing
Go Parallel with the High Performance Math Kernel Library (Intel MKL)

    /* Calling Intel MKL's sgemm (single-precision matrix multiply) */
    void foo()
    {
        float *A, *B, *C;           /* matrices, allocated and initialized elsewhere */
        char transa = 'N', transb = 'N';
        int N = 1000;
        float alpha = 1.0f, beta = 0.0f;
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N,
              B, &N, &beta, C, &N);
    }

Implicit automatic offloading requires no code changes: simply link with the offload MKL library, and the computation can run on the Intel Xeon processor, the Intel MIC co-processor, or both.
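The sgemm call above computes C = alpha*A*B + beta*C. As an illustration of those semantics only (this is not MKL and the name sgemm_ref is hypothetical), the sketch below implements the square, non-transposed case in plain C++, using BLAS-style column-major storage:

```cpp
#include <vector>

// Hypothetical reference implementation (not MKL) of what the sgemm call
// computes for transa = transb = 'N' and square N x N matrices:
//   C = alpha * A * B + beta * C
// Matrices are stored column-major with leading dimension N, as in BLAS.
void sgemm_ref(int N, float alpha, const float *A, const float *B,
               float beta, float *C) {
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < N; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[i + k * N] * B[k + j * N];
            C[i + j * N] = alpha * acc + beta * C[i + j * N];
        }
    }
}
```

With A the 2x2 identity, alpha = 1, and beta = 0, the call reproduces B in C, which is an easy way to sanity-check the column-major indexing.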
Go Parallel with Intel Cilk Plus

The proven Cilk parallel model, teachable in one minute. Parallelism in three key words: cilk_spawn, cilk_sync, cilk_for. Cilk Plus is an open specification, recently placed into open source by Intel for the advancement of parallel programming.

    // Parallel function invocation, in C
    cilk_for (int i = 0; i < n; ++i) {
        Foo(a[i]);
    }

    // Parallel spawn in a recursive Fibonacci computation, in C
    int fib(int n)
    {
        if (n < 2) return 1;
        else {
            int x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return x + y;
        }
    }

Learn more at http://cilkplus.org
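For readers without a Cilk-enabled compiler, the spawn/sync pattern in fib can be sketched with standard C++ futures. This is an analogy only: std::async stands in for cilk_spawn and future::get for cilk_sync, whereas the real Cilk runtime schedules lightweight work-stealing tasks rather than launching one OS thread per spawn.

```cpp
#include <future>

// Hedged analogy to the Cilk example above, not the Cilk runtime itself.
// Launching a real thread per spawn makes this practical only for small n.
int fib(int n) {
    if (n < 2) return 1;                                 // same base case as the slide
    std::future<int> x =
        std::async(std::launch::async, fib, n - 1);      // "cilk_spawn"
    int y = fib(n - 2);                                  // parent keeps working
    return x.get() + y;                                  // "cilk_sync" for this child
}
```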
Go Parallel with Intel Cilk Plus

Data and task parallelism as first-class citizens in C and C++. Vectorization via intuitive notations (array notations, elemental functions, and #pragma simd) that automatically span MMX, SSE, AVX, and wider widths in the future, including those in Intel MIC co-processors.

    // Simplify operations using array notations in C/C++:
    a[:] = b[:] + c[:];

    // Elemental function, in C++, using Cilk Plus:
    __declspec(vector)
    void saxpy(float a, float x, float &y)
    {
        y += a * x;
    }

    // #pragma simd: user-mandated vectorization
    #pragma simd
    for (i = 0; i < n; i++) {
        A[i] = A[i] + B[i] + C[i];
    }

Learn more at http://cilkplus.org
Go Parallel with Intel Threading Building Blocks (Intel TBB)

A popular parallel abstraction for C++ developers, delivered as a C++ template library:
- Scalable memory allocation
- Load balancing via work-stealing task scheduling
- Thread-safe pipeline and concurrent containers
- High-level parallel algorithms
- Numerous synchronization primitives

    // Parallel function invocation example, in C++, using TBB:
    parallel_for(0, n, [=](int i) {
        Foo(a[i]);
    });

Learn more at http://threadingbuildingblocks.org

Intel remains a leading participant and contributor in the TBB open source project, as well as a leading supplier of TBB support and supporting tools.
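To make the division of work concrete, here is a minimal sketch of a parallel_for built only on std::thread with static chunking. This parallel_for is a local stand-in, not the TBB API, and it deliberately omits what TBB actually does: recursive range splitting with work-stealing for load balance.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper illustrating the parallel_for pattern: the index
// range [first, last) is split into contiguous chunks, one per worker
// thread, and each worker applies `body` to every index in its chunk.
void parallel_for(int first, int last, const std::function<void(int)> &body) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    int n = last - first;
    int chunk = (n + static_cast<int>(workers) - 1) / static_cast<int>(workers);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers && chunk > 0; ++w) {
        int lo = first + static_cast<int>(w) * chunk;
        int hi = std::min(last, lo + chunk);
        if (lo >= hi) break;                       // no work left for this worker
        pool.emplace_back([lo, hi, &body] {        // each worker owns one chunk
            for (int i = lo; i < hi; ++i) body(i);
        });
    }
    for (std::thread &t : pool) t.join();          // wait for all chunks
}
```

Each index is visited by exactly one worker, so the body may write disjoint array elements without locks, as in the TBB example above.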
Go Parallel with the Message Passing Interface (Intel MPI Library)

Extend your cluster solutions to the Intel MIC architecture:
- Multicore clusters, and clusters with both multicore and many-core nodes
- E.g., Intel MIC in every node of the cluster, using Intel MPI across nodes and Intel Parallel Building Blocks within nodes
- Same model as an Intel Xeon processor based cluster

Intel is a leading vendor of MPI implementations and tools.

Learn more at http://intel.com/go/mpi
Go Parallel with Coarray Fortran (Intel Fortran Compiler)

A standard, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax, for parallel programming on both shared memory and distributed memory systems.

    ! Sum in Fortran, using the coarray feature:
    REAL SUM[*]
    CALL SYNC_ALL( WAIT=1 )
    DO IMG = 2, NUM_IMAGES()
       IF (IMG == THIS_IMAGE()) THEN
          SUM = SUM + SUM[IMG-1]
       ENDIF
       CALL SYNC_ALL( WAIT=IMG )
    ENDDO

Learn more at http://intel.com/software/products
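The coarray loop above performs a barrier-separated running accumulation: on each pass, image IMG adds image IMG-1's partial sum, so afterwards each image holds the sum over images 1..IMG and the last image holds the global total. A serial C++ model of that data flow (coarray_sum is a hypothetical helper; each vector element stands in for one image's SUM coarray):

```cpp
#include <vector>

// Serial model of the coarray loop's data flow, not an Intel API.
// Element img of the vector plays the role of image img+1's SUM; each
// step mirrors "SUM = SUM + SUM[IMG-1]" executed by one image per
// barrier-separated iteration.
std::vector<double> coarray_sum(std::vector<double> sum) {
    for (std::size_t img = 1; img < sum.size(); ++img)
        sum[img] += sum[img - 1];  // image img adds its left neighbor's total
    return sum;                    // the last element is the global sum
}
```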
Go Parallel with OpenMP* (Intel C/C++ and Fortran Compilers)

A flexible interface for developing parallel applications and an abstraction for multi-threaded solutions. OpenMP* is a standard used by many parallel applications and is supported by every major compiler for Fortran, C, and C++.

    // C/C++ OpenMP* pragma
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        float t = (float)((i + 0.5) / count);
        pi += 4.0 / (1.0 + t*t);
    }
    pi /= count;

    ! Fortran OpenMP*
    !$omp parallel do
    do i = 1, 10
       A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do

Learn more at http://openmp.org
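The C/C++ fragment above is a midpoint-rule approximation of the integral of 4/(1+t^2) over [0,1], which equals pi. A self-contained version (compute_pi is a hypothetical helper name): built with OpenMP enabled, the iterations run in parallel and the reduction combines the per-thread partial sums; without OpenMP the pragma is simply ignored and the same result is computed serially.

```cpp
// Midpoint-rule approximation of pi = integral of 4/(1+t^2) over [0,1].
// With OpenMP (-fopenmp / -qopenmp) the loop runs in parallel with a
// reduction on pi; otherwise the pragma is ignored and the loop is serial.
double compute_pi(int count) {
    double pi = 0.0;
    #pragma omp parallel for reduction(+:pi)
    for (int i = 0; i < count; i++) {
        double t = (i + 0.5) / count;   // midpoint of subinterval i
        pi += 4.0 / (1.0 + t * t);
    }
    return pi / count;                  // scale by interval width 1/count
}
```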
Go Parallel with OpenMP* (Intel C/C++ and Fortran Compilers)
One line change to offload to the Intel MIC co-processor:

    int main()
    {
        double pi = 0.0;
        long i;
        #pragma offload target (mic)
        #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i + 0.5) / n);
            pi += 4.0 / (1.0 + t*t);
        }
        printf("pi = %f\n", pi/n);
    }

The same source serves the Intel Xeon processor host and the Intel MIC co-processor.
Go Parallel with C/C++ Language Extensions

Simple keyword language extensions to control offloading to the MIC co-processor:

    // C/C++ language extensions
    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();
Use the Same Code for Execution on Intel MIC Architecture by Offloading

    // C/C++ offload pragma
    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        float t = (float)((i + 0.5) / count);
        pi += 4.0 / (1.0 + t*t);
    }
    pi /= count;

    ! Fortran offload directive
    !dir$ omp offload target(mic)
    !$omp parallel do
    do i = 1, 10
       A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do

    // MKL implicit offload: requires no source code changes,
    // simply link with the offload MKL library.

    // MKL explicit offload
    #pragma offload target (mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        in(C:length(matrix_elements)) \
        out(C:length(matrix_elements) alloc_if(0))
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N,
          B, &N, &beta, C, &N);

    // C/C++ language extensions
    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();
Parallelism with OpenCL* (Intel OpenCL SDK)

OpenCL* is a framework for writing programs that execute across heterogeneous platforms (e.g., CPUs, GPUs, many-core).

    // Simple per-element multiplication using OpenCL*:
    kernel void dotprod(global const float *a,
                        global const float *b,
                        global float *c)
    {
        int myid = get_global_id(0);
        c[myid] = a[myid] * b[myid];
    }

Intel is a leading participant in the OpenCL* standard efforts, and a vendor of solutions and related tools, with early implementations available today. OpenCL* addresses the needs of customers in specific segments.

Learn more at http://intel.com/go/opencl
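The OpenCL runtime executes the kernel body once per work-item, with get_global_id(0) ranging over the launch size. A host-side model of those semantics in plain C++ (dotprod_emulated is a hypothetical helper; no device or OpenCL runtime is involved):

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the kernel above: a plain loop over the global id
// computes the same element-wise product the runtime would produce by
// running the kernel body once per work-item.
std::vector<float> dotprod_emulated(const std::vector<float> &a,
                                    const std::vector<float> &b) {
    std::vector<float> c(a.size());
    for (std::size_t id = 0; id < a.size(); ++id)  // one "work-item" per id
        c[id] = a[id] * b[id];
    return c;
}
```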
Running Your Application: Execution on the Host and Intel MIC Co-processor(s)

Without Intel MIC co-processor(s) present:
- The application starts and executes on the host
- At each offload, the construct runs on host cores/threads
- Normal program termination on the host

With Intel MIC co-processor(s) present:
- The application starts on the host and executes portions on the Intel MIC co-processor(s)
- At runtime, if co-processor(s) are available, the target binary is loaded
- At each offload, the construct runs on the Intel MIC co-processor(s)
- At program termination, the target binary is unloaded

(The host program uses the host offload and message libraries; the target program uses the target offload and message libraries.)
Using the Intel Debugger: Overview
- Debugging of host and target simultaneously: if the host application is being debugged, the target application is also debugged automatically
- The debugger runs on the host for both the host and target programs; the target side connects through an Intel Debug Server
- The debugger halts and resumes both the host and target program synchronously
- Full C, C++, and Fortran support on both sides
- Future: the debugger presents a view of one virtual application inside a single GUI, extensible to cover more than one offload card
Analyzing Your Application: Performance Analysis Tools

Intel VTune Amplifier XE performance profiler:
- Analyze your multicore and many-core performance
- Analyze performance of the application in offload mode
- Support for Intel MIC co-processors includes a Linux*-hosted command line tool that collects events, and the VTune Amplifier XE graphical user interface to display the collected results, highlighting bottlenecks, time spent, and other details of performance
Preserve Your Development Investment: Common Tools and Programming Models for Parallelism
- C/C++ (Intel C/C++ Compiler): Intel Cilk Plus, Intel TBB, OpenMP*, OpenCL*, offload pragmas
- Fortran (Intel Fortran Compiler): Coarray Fortran, OpenMP*, offload directives
- Libraries: Intel MKL, Intel MPI
These models span multicore, many-core, and heterogeneous computing. Develop using parallel models that support heterogeneous computing.
Call to Action

Evaluate the Intel Software Development Products, including the family of parallel programming models, for your high performance computing needs: http://www.intel.com/software/products/eval

For product information, see http://www.intel.com/software/products