Intel Software Development Products for High Performance Computing and Parallel Programming

Multicore development tools with extensions to many-core

Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. This document contains information on products in the design phase of development. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.
Intel, VTune, Cilk, Xeon and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2012 Intel Corporation. All rights reserved.

Table of Contents
  Intel in HPC
  Intel HPC Software Development Products
  High Performance Parallel Programming with Intel's architectures
  Features and benefits
  Investment protection
  Better performance & efficiency
  Call to Action and Summary

Intel in High-Performance Computing
  Dedicated, Renowned Applications Expertise
  Large Scale Clusters for Test & Optimization
  Tera-Scale Research
  Exa-Scale Labs
  Broad Software Tools Portfolio
  Defined HPC Application Platform
  Platform Building Blocks
  Manufacturing Process Technologies
  Leading Performance, Energy Efficient
  Many Integrated Core Architecture
A long-term commitment to the HPC market segment

Intel Technology is Changing HPC
Performance, Energy Efficiency, Reliability, TCO
  PROCESSORS (Xeon, MIC): Scalable Performance and Energy Efficiency, Multi- and Many-Core
  SOLID STATE DISK: Optimize Performance for I/O Intensive Apps and Boot Drive Replacement
  10GbE: Bridging the Gap Between 1GbE and InfiniBand*, with RDMA, Unified Networking
A platform approach to high performance

The Majority of all HPC-Systems are Clusters (Source: IDC)
[Figure: cluster nodes (labels CORE, P, M, I/O; e.g. a CO-PROCESSOR) connected by an Interconnect]
  Message Passing between/inside Nodes
  Multi-Threading within each (SMP) Node
  Vectorization (SIMD) within each Core

Software Development and System Environment
  Intel Xeon Processor and Intel Many Integrated Core Architecture (die sizes not to scale)
  Same Comprehensive Set of SW Tools: Application Source Code Builds with a Compiler Switch
  Established HPC Operating System: Linux*
Software & Services Group, Developer Products Division

Scaling Performance Forward
Software Tools Vision: Employ versatile and common development tools across all IA architectures
  Single Portable Software Stack
  Processor: Flexible Programmability, Scalable Performance
  Data-Parallelism, Thread-Parallelism, Messaging...

High Performance Parallel Programming Features and Benefits: Details

Enabling & Advancing Parallelism
High Performance Parallel Programming
  Compiler, Libraries, Parallel Models
  Code: Multicore, Many-core, Cluster
  Develop & Parallelize Today for Maximum Performance
  Targets: Multicore CPU; Multicore CPU with Intel MIC Architecture Co-processor; Multicore Cluster; Multicore & Many-core Cluster
Use One Software Architecture Today. Scale Forward Tomorrow.

More cores. Wider vectors. Co-Processors.
Tools need to access all three dimensions to deliver performance. (Images do not reflect actual die sizes.)
  Processor                                        Core(s)  Threads  SIMD Width  Instruction set
  Intel Xeon processor 64-bit                      1        2        128         SSE2
  Intel Xeon processor 5100 series                 2        2        128         SSSE3
  Intel Xeon processor 5500 series                 4        8        128         SSE4.2
  Intel Xeon processor 5600 series                 6        12       128         SSE4.2
  Intel Xeon processor code-named Sandy Bridge     8        16       256         AVX
  Intel Xeon processor code-named Ivy Bridge       -        -        256         AVX
  Intel Xeon processor code-named Haswell          -        -        256         AVX2, FMA3
  Intel MIC coprocessor code-named Knights Ferry   32       128      512         -
  Intel MIC coprocessor code-named Knights Corner  >50      >200     512         -
  (- = no value given on the slide)
Software challenge: Develop scalable software
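To make the "wider vectors" dimension concrete, the following minimal C++ sketch converts a SIMD register width in bits (128, 256 or 512 on the slide) into the number of single-precision values processed per vector instruction, and multiplies that by a core count. The function names `float_lanes` and `peak_float_lanes` are illustrative and not part of any Intel tool; the sketch assumes a 32-bit `float`.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Number of single-precision elements that fit in one SIMD register of the
// given width in bits (hypothetical helper, for illustration only).
constexpr std::size_t float_lanes(std::size_t simd_width_bits) {
    return simd_width_bits / (8 * sizeof(float));
}

// Float elements that can be processed at once if every core executes one
// full-width vector instruction: cores * lanes. This ignores clock rate,
// hardware threads and instruction mix, so it is only a rough scaling figure.
constexpr std::size_t peak_float_lanes(std::size_t cores,
                                       std::size_t simd_width_bits) {
    return cores * float_lanes(simd_width_bits);
}
```

Under these assumptions a 128-bit register holds 4 floats, a 256-bit register 8, and a 512-bit register 16, so code that is both threaded and vectorized scales with the product of the two numbers.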

High Performance Software Products Supporting Multicore and Many-core Development
  Intel Parallel Studio XE*: Advanced Performance
  Intel Cluster Studio XE*: Distributed Performance
  Intel C/C++ and Fortran Compilers w/ OpenMP*
  Intel MKL, Intel Cilk Plus, Intel TBB Library, Intel ArBB Library, Intel IPP Library
  Intel Inspector XE, Intel VTune Amplifier XE, Intel Advisor
  Intel MPI Library, Intel Trace Analyzer and Collector
  Intel Parallel Studio XE
Performance. Scale Forward. Proven.

Invest in Common Tools and Programming Models
  Multicore: Intel Xeon processors are designed for intelligent performance and smart energy efficiency
  Many-core: Intel MIC Architecture coprocessors are ideal for highly parallel computing applications
  Continuing to advance the Intel Xeon processor family and instruction set (e.g., Intel AVX)
  Software development platforms ramping now
Use One Software Architecture Today. Scale Forward Tomorrow.

Optimized Intel Libraries
  Intel MKL (Math Kernel Library)
    Oriented toward science, engineering and financial applications
    Incl. BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, Vector Math
  Intel IPP (Integrated Performance Primitives)
    Oriented toward multimedia, data processing, and communications applications
    Cryptography and String Processing

Go Parallel with High Performance Math Kernel Library
Intel Math Kernel Library (Intel MKL)
    void foo() /* Intel Math Kernel Library */
    {
        float *A, *B, *C; /* Matrices */
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
    }
Implicit automatic offloading requires no code changes; simply link with the offload MKL Library.
Targets: Intel Xeon processor, Intel MIC co-processor
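For readers unfamiliar with the BLAS routine called on this slide, `sgemm` computes C = alpha * op(A) * op(B) + beta * C in single precision on column-major matrices. The sketch below is a plain, unoptimized C++ reference for the square, non-transposed case that the call with `N, N, N` corresponds to; `sgemm_reference` is a hypothetical name, and the code is not how Intel MKL implements the routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference (unoptimized) C = alpha * A * B + beta * C for n x n matrices
// stored column-major, i.e. element (i, j) lives at index i + j * n.
// Illustration only; a tuned library blocks, vectorizes and threads this.
void sgemm_reference(std::size_t n, float alpha,
                     const std::vector<float>& A,
                     const std::vector<float>& B,
                     float beta, std::vector<float>& C) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += A[i + k * n] * B[k + j * n];
            }
            C[i + j * n] = alpha * acc + beta * C[i + j * n];
        }
    }
}
```

The triple loop performs on the order of n^3 multiply-adds over n^2 data, which is why this routine is a natural candidate for the automatic offload the slide describes.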

Go Parallel with Intel Cilk Plus
  Proven Cilk parallel model, teachable in one minute
  Parallelism in three keywords: cilk_spawn, cilk_sync, cilk_for
  Cilk Plus: an open specification
  Recently placed into open source by Intel for the advancement of parallel programming
    // Parallel function invocation, in C
    cilk_for (int i=0; i<n; ++i){
        Foo(a[i]);
    }
    // Parallel spawn in a recursive fibonacci
    // computation, in C
    int fib (int n) {
        if (n < 2) return 1;
        else {
            int x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return x + y;
        }
    }
Learn more at http://cilkplus.org
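The spawn/sync pattern in the Fibonacci example can be approximated in standard C++ without Cilk Plus keywords. The sketch below swaps in `std::async` and `std::future` for `cilk_spawn` and `cilk_sync`; this is a different mechanism (one operating-system thread per spawn rather than a work-stealing scheduler), so the hypothetical `spawn_depth` parameter limits how many levels of the recursion create threads. It keeps the slide's convention that fib(0) = fib(1) = 1.

```cpp
#include <cassert>
#include <future>

// Serial version with the same base case as the slide: fib(0) = fib(1) = 1.
long fib_serial(int n) {
    return n < 2 ? 1 : fib_serial(n - 1) + fib_serial(n - 2);
}

// Spawn/sync sketch using std::async (an assumption of this example, not
// Cilk Plus). Only the top `spawn_depth` recursion levels start new threads.
long fib_async(int n, int spawn_depth = 3) {
    assert(n >= 0);
    if (n < 2) return 1;
    if (spawn_depth <= 0) return fib_serial(n);
    // "spawn": fib(n-1) may run concurrently with the code below.
    std::future<long> x =
        std::async(std::launch::async, fib_async, n - 1, spawn_depth - 1);
    // The continuation computes fib(n-2) on the current thread.
    long y = fib_async(n - 2, spawn_depth - 1);
    // "sync": wait for the spawned call before combining the results.
    return x.get() + y;
}
```

A call such as `fib_async(20)` returns the same value as `fib_serial(20)`; the depth limit is a design choice of this sketch, since creating a thread per call would be far more expensive than a Cilk spawn.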

Go Parallel with Intel Cilk Plus
  Data and Task Parallelism as first class citizens in C and C++
  Vectorization via intuitive notations that automatically span MMX, SSE, AVX, and wider widths in the future, including those in MIC co-processors:
    array notations
    #pragma simd controls
    elemental functions
    // Simplify operation using
    // array notations in C/C++:
    a[:] = b[:] + c[:];
    // Elemental functions, in C++
    // (reference parameter), using Cilk Plus:
    __declspec(vector)
    void saxpy(float a, float x, float &y) {
        y += a * x;
    }
    // #pragma simd: User-mandated
    // vectorization
    #pragma simd
    for (i=0; i<n; i++) {
        A[i] = A[i] + B[i] + C[i];
    }
Learn more at http://cilkplus.org
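As a point of comparison, the array-notation statement and the elemental function on this slide correspond to the ordinary loops below. This is a plain standard C++ sketch with illustrative names (`add_arrays`, `saxpy_all`); it shows the element-wise meaning only, and whether a given compiler vectorizes these loops is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop form of the array-notation statement a[:] = b[:] + c[:].
void add_arrays(std::vector<float>& a, const std::vector<float>& b,
                const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + c[i];
    }
}

// Scalar function with the same body as the slide's elemental saxpy.
inline void saxpy(float a, float x, float& y) { y += a * x; }

// Applying the scalar function to every element; an elemental function lets
// the compiler generate a vector version of this call pattern.
void saxpy_all(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        saxpy(a, x[i], y[i]);
    }
}
```

Each iteration is independent of the others, which is the property that makes these operations suitable for SIMD execution.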

Go Parallel with Intel Threading Building Blocks (Intel TBB)
A popular parallel abstraction for C++ developers
  A C++ template library
  Scalable memory allocation
  Load-balancing
  Work-stealing task scheduling
  Thread-safe pipeline
  Concurrent containers
  High-level parallel algorithms
  Numerous synchronization primitives
    // Parallel function invocation example, in C++,
    // using TBB:
    parallel_for (0, n, [=](int i) {
        Foo(a[i]);
    });
Learn more at http://threadingbuildingblocks.org
Intel remains a leading participant and contributor in the TBB open source project as well as a leading supplier of TBB support and supporting tools.
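To illustrate what a call of the form `parallel_for(0, n, body)` asks for, the following sketch splits an index range into contiguous chunks and runs each chunk on its own `std::thread`. It is not TBB and does not reproduce its work-stealing scheduler or load balancing; `parallel_for_sketch` and its `num_threads` parameter are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Simplified stand-in for a parallel_for over [first, last): static chunking
// with one thread per chunk. The body must be safe to call concurrently for
// different indices.
template <typename Body>
void parallel_for_sketch(int first, int last, Body body,
                         unsigned num_threads = 4) {
    assert(num_threads > 0);
    if (last <= first) return;
    const int total = last - first;
    const int workers_wanted = static_cast<int>(num_threads);
    const int chunk = (total + workers_wanted - 1) / workers_wanted;
    std::vector<std::thread> workers;
    for (int begin = first; begin < last; begin += chunk) {
        const int end = std::min(begin + chunk, last);
        workers.emplace_back([begin, end, &body] {
            for (int i = begin; i < end; ++i) body(i);
        });
    }
    // Wait for every chunk to finish before returning to the caller.
    for (auto& w : workers) w.join();
}
```

A caller could write `parallel_for_sketch(0, n, [&](int i) { out[i] = 2 * i; });` as long as each index writes to a distinct element. Static chunking performs poorly when iterations have uneven cost, which is the case a work-stealing scheduler is designed to handle.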

Go Parallel with Message Passing Interface
Intel Message Passing Interface (Intel MPI)
  Extend your cluster solutions to the Intel MIC Architecture
  Multicore Cluster
  Clusters with Multicore and Many-core, e.g., Intel MIC in every node of the cluster, using Intel MPI and Intel Parallel Building Blocks on nodes
  Same model as an Intel Xeon processor based cluster
  Intel is a leading vendor of MPI implementations and tools
Learn more at http://intel.com/go/mpi

Go Parallel with Coarray Fortran
Intel Fortran Compiler
  A standard, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax.
  For parallel programming on both shared memory and distributed memory systems
    ! Sum in Fortran, using co-array feature:
    REAL SUM[*]
    CALL SYNC_ALL( WAIT=1 )
    DO IMG= 2,NUM_IMAGES()
      IF (IMG==THIS_IMAGE()) THEN
        SUM = SUM + SUM[IMG-1]
      ENDIF
      CALL SYNC_ALL( WAIT=IMG )
    ENDDO
Learn more at http://intel.com/software/products
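The data flow of the coarray loop can be traced with a small serial simulation: image IMG adds the value held by image IMG-1 to its own, one image at a time, so image k ends up holding the sum of the values from images 1 through k. The C++ sketch below models each image's `SUM` as one vector element; `chained_sum` is a hypothetical name, and the code simulates the result only, not the parallel execution or the synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// sum_on_image[k] stands in for SUM on image k + 1 (Fortran images are
// numbered from 1). Each step mirrors "SUM = SUM + SUM[IMG-1]" executed by
// image IMG, in the order enforced by the synchronization on the slide.
void chained_sum(std::vector<double>& sum_on_image) {
    for (std::size_t img = 1; img < sum_on_image.size(); ++img) {
        sum_on_image[img] += sum_on_image[img - 1];
    }
}
```

With starting values 1, 2, 3, 4 on four images the result is 1, 3, 6, 10, so the last image holds the total. The chain is inherently sequential, which is why the original code synchronizes after every step.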

Go Parallel with OpenMP*
Intel C/C++ and Fortran Compilers
  A flexible interface for developing parallel applications
  An abstraction for multi-threaded solutions
  OpenMP* is a standard used by many parallel applications
  Supported by every major compiler for Fortran, C, and C++
    // C/C++ OpenMP* Pragma
    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<count; i++) {
        float t = (float)((i+0.5)/count);
        pi += 4.0/(1.0+t*t);
    }
    pi /= count;
    ! Fortran OpenMP*
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do
Learn more at http://openmp.org
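The C/C++ fragment on this slide approximates pi as the midpoint-rule integral of 4/(1+t^2) over [0, 1]. A self-contained version might look like the sketch below; the function name `estimate_pi` is an assumption of this example, and it uses `double` throughout rather than the slide's `float`. If the compiler is not invoked with OpenMP enabled, the pragma is ignored and the loop runs serially with the same result up to rounding.

```cpp
#include <cassert>
#include <cmath>

// Midpoint-rule estimate of pi = integral from 0 to 1 of 4 / (1 + t^2) dt.
// The reduction clause gives each thread a private partial sum and adds the
// partial sums together at the end of the loop.
double estimate_pi(long count) {
    assert(count > 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < count; ++i) {
        const double t = (i + 0.5) / count;
        sum += 4.0 / (1.0 + t * t);
    }
    return sum / count;
}
```

Without the reduction clause, all threads would update the shared accumulator at once, which is a data race; the clause is what makes the one-line parallelization correct.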

Go Parallel with OpenMP*
Intel C/C++ and Fortran Compilers
    main() {
        double pi = 0.0f; long i;
    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
        for (i=0; i<n; i++) {
            double t = (double)((i+0.5)/n);
            pi += 4.0/(1.0+t*t);
        }
        printf("pi = %f\n",pi/n);
    }
One Line Change to Offload to MIC Co-Processor
Targets: Intel Xeon processor, Intel MIC co-processor
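The `#pragma offload target (mic)` line is specific to the Intel compilers of that period. As a rough present-day analogue, and explicitly a swapped-in technique rather than the one on the slide, later OpenMP versions (4.0 and up) standardized a `target` construct that expresses the same idea. The sketch below assumes a compiler with OpenMP 4.5 support; the function name `estimate_pi_target` is hypothetical. When no device or no OpenMP support is available, the loop is expected to run on the host.

```cpp
#include <cassert>
#include <cmath>

// Same pi computation, written with the standard OpenMP target construct.
// map(tofrom: sum) copies the accumulator to the device and back; the
// reduction combines partial sums from all teams and threads.
double estimate_pi_target(long n) {
    assert(n > 0);
    double sum = 0.0;
    #pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        const double t = (i + 0.5) / n;
        sum += 4.0 / (1.0 + t * t);
    }
    return sum / n;
}
```

As on the slide, the loop body is unchanged between the host and offloaded versions; only the directive in front of it differs.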

Go Parallel with C/C++ Language Extensions
Simple keyword language extensions to control offloading to the MIC coprocessor
    // C/C++ Language Extensions
    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();

Use the Same Code for Execution on Intel MIC Architecture by Offloading
C/C++ Offload Pragma
    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<count; i++) {
        float t = (float)((i+0.5)/count);
        pi += 4.0/(1.0+t*t);
    }
    pi /= count;
Fortran Offload Directive
    !dir$ omp offload target(mic)
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do
MKL Implicit Offload
    // MKL implicit offload requires no source code changes,
    // simply link with the offload MKL Library.
MKL Explicit Offload
    #pragma offload target (mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        in(C:length(matrix_elements)) \
        out(C:length(matrix_elements) alloc_if(0))
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
C/C++ Language Extensions
    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();

Parallelism with OpenCL*
Intel OpenCL SDK
OpenCL* is a framework for writing programs that execute across heterogeneous platforms (e.g., CPUs, GPUs, many-core)
    // Simple per element multiplication using OpenCL*:
    kernel void dotprod( global const float *a,
                         global const float *b,
                         global float *c)
    {
        int myid = get_global_id(0);
        c[myid] = a[myid] * b[myid];
    }
Intel is a leading participant in the OpenCL* standard efforts, and a vendor of solutions and related tools with early implementations available today.
OpenCL* addresses the needs of customers in specific segments
Learn more at http://intel.com/go/opencl
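The execution model behind this kernel can be pictured on the host as follows: the runtime invokes the kernel body once per work-item, and `get_global_id(0)` tells each invocation which element it owns. The C++ sketch below is a serial emulation with hypothetical names (`multiply_work_item`, `run_over_range`); it does not use the OpenCL API, and a real device would execute the work-items in parallel.

```cpp
#include <cassert>
#include <cstddef>

// Body of the slide's kernel for a single work-item. `gid` plays the role of
// the value returned by get_global_id(0).
inline void multiply_work_item(std::size_t gid, const float* a,
                               const float* b, float* c) {
    c[gid] = a[gid] * b[gid];
}

// Serial stand-in for launching the kernel over a one-dimensional range of
// `global_size` work-items.
void run_over_range(std::size_t global_size, const float* a, const float* b,
                    float* c) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (std::size_t gid = 0; gid < global_size; ++gid) {
        multiply_work_item(gid, a, b, c);
    }
}
```

Despite its name, the slide's `dotprod` kernel produces an element-wise product; a dot product would additionally need a reduction over the elements of `c`.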

Running your Application
Execution on the host and Intel MIC Co-processor(s)
  Without: Intel MIC Co-processor(s) are absent
    Application starts and executes on host
    At each offload, the construct runs on host cores/threads
    Normal program termination on host
  With: Intel MIC Co-processor(s) are present
    Application starts on host and executes portions on Intel MIC Co-processor(s)
    At runtime, if Intel MIC Co-processor(s) are available, the target binary is loaded
    At each offload, the construct runs on the Intel MIC Co-processor(s)
    At program termination, target binary is unloaded
[Figure: Host Program, Host Offload Library and Message Library on the host; Target Program, Target Offload Library and Message Library on the target]

Using the Intel Debugger: Overview
[Figure: Host Program and Intel Debugger on the host; Target Programs, each with an Intel Debug Server, on the targets]
  Debugging of host and target simultaneously
  If host application is being debugged, target application is also debugged automatically
  Debugger runs on host for both host and target program
  Debugger halts and resumes both host and target program synchronously
  Full C, C++ and Fortran support on both sides
  Future: debugger presents view of one virtual application inside a single GUI
  Extensible to cover more than one offload card

Analyzing your Application Performance
Analysis Tools: Intel VTune Amplifier XE performance profiler
  Analyze your multicore and many-core performance
  Analyze performance of the application in offload mode
  Support for Intel MIC Co-processors includes:
    A Linux* hosted command line tool that collects events
    The VTune Amplifier XE graphical user interface to display the results collected in the previous step, highlighting bottlenecks, time spent and other details of performance

Preserve Your Development Investment
Common Tools and Programming Models for Parallelism
  C/C++ (Intel C/C++ Compiler): OpenCL*, Intel Cilk Plus, Intel TBB, OpenMP*, Offload Pragmas
  Fortran (Intel Fortran Compiler): Coarray, Offload Directives, OpenMP*
  Intel MKL
  Intel MPI
  Targets: Multicore, Many-core, Heterogeneous Computing
Develop Using Parallel Models that Support Heterogeneous Computing

Call to Action
Evaluate the Intel Software Development Products, including the family of Parallel Programming Models, for your High Performance needs: http://www.intel.com/software/products/eval
For product information see: http://www.intel.com/software/products

Copyright 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Optimization Notice