IXPUG 16. Dmitry Durnov, Intel MPI team



Agenda
- Intel MPI 2017 Beta U1 product availability
- New features overview
- Competitive results
- Useful links
- Q/A

Intel MPI 2017 Beta U1 is available!
Key features:
- Topology aware SHM collectives
- Intel Xeon processor E5-2600 v4 product family + Intel Omni-Path Fabric tuning
- Intel Xeon Phi Processor codenamed Knights Landing (KNL) tuning (node level)
- Memory binding management features
- Asynchronous progress control
- Enhanced OpenFabrics Interfaces (OFI) support
- Process deployment enhancements
- Intel MPI benchmark improvements

Intel MPI 2017 Beta U1 is available!
Join the Intel Parallel Studio XE 2017 Beta program: https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-beta
The beta program officially ends June 28th, 2016. The beta license provided will expire October 7th, 2016.

Topology aware SHM collectives
Provide very low latency for collective operations.
Available for the following collective operations:
- MPI_Barrier
- MPI_Bcast
- MPI_Reduce
- MPI_Allreduce

Topology aware SHM collectives
Implemented as a set of new collective operations and available via the I_MPI_ADJUST family of controls:
I_MPI_ADJUST_BARRIER=<7|8|9>
I_MPI_ADJUST_BCAST=<9|10|11>
I_MPI_ADJUST_REDUCE=<8|9|10>
I_MPI_ADJUST_ALLREDUCE=<10|11|12>
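The I_MPI_ADJUST settings above are ordinary environment variables, so a launcher script can prepare them before starting the MPI job. The following is a minimal Python sketch under that assumption. The names `SHM_ALGORITHMS` and `shm_collective_env` are hypothetical helpers introduced here for illustration, not part of Intel MPI; only the variable names and the algorithm ids come from the slide.

```python
import os

# Algorithm ids listed on the slide for the topology aware SHM collectives.
SHM_ALGORITHMS = {
    "BARRIER": (7, 8, 9),
    "BCAST": (9, 10, 11),
    "REDUCE": (8, 9, 10),
    "ALLREDUCE": (10, 11, 12),
}


def shm_collective_env(choices, base_env=None):
    """Return a copy of base_env with I_MPI_ADJUST_* set for each choice.

    choices maps a collective name (for example "bcast") to one of the
    ids listed above. This helper is hypothetical, not an Intel MPI API.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, algorithm in choices.items():
        key = name.upper()
        if key not in SHM_ALGORITHMS:
            raise ValueError(f"no SHM algorithm listed for {name!r}")
        if algorithm not in SHM_ALGORITHMS[key]:
            raise ValueError(
                f"{algorithm} is not one of {SHM_ALGORITHMS[key]} for {key}"
            )
        env[f"I_MPI_ADJUST_{key}"] = str(algorithm)
    return env


if __name__ == "__main__":
    # Start from an empty environment so only the new settings are shown.
    env = shm_collective_env({"barrier": 7, "allreduce": 10}, base_env={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dictionary could then be passed as the `env` argument of `subprocess.run` when invoking the MPI launcher; the launcher command line itself depends on the installation and is not shown.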

Topology aware SHM collectives. Xeon. Intranode
[Figure: four panels (MPI_Barrier, MPI_Bcast, MPI_Reduce, MPI_Allreduce); y-axis: ratio. Higher is better.]
Note: IMB-MPI1 4.1.1. N1P44. Intel Xeon E5-2699 v4 @ 2.20GHz.

Topology aware SHM collectives. Xeon Phi. Intranode
[Figure: four panels (MPI_Barrier, MPI_Bcast, MPI_Reduce, MPI_Allreduce); y-axis: ratio. Higher is better.]
Note: IMB-MPI1 4.1.1. N1P64. Intel Xeon Phi (KNL). Results were obtained with pre-release HW. Final results may vary.

Memory binding management feature
- Provides a user-friendly interface for memory allocation control
- General NUMA awareness
- HBM/MCDRAM awareness (Xeon Phi specific)
- Available via the following environment variables:
  - I_MPI_BIND_NUMA, I_MPI_BIND_ORDER
  - I_MPI_BIND_WIN_ALLOCATE
  - I_MPI_HBW_POLICY
- Fine-grained control for MPI_Win_allocate_shared via the MPI_Info mechanism

Memory binding management feature. I_MPI_HBW_POLICY example.
There are 3 kinds of MPI process memory we can control:
- Application buffers
- Internal MPI buffers
- Application buffers allocated for MPI_Win_allocate_shared/MPI_Win_allocate

I_MPI_HBW_POLICY=<user buffers policy>[,[mpi buffers policy][,win_allocate policy]]

The following values are available:

Value            Note
hbw_preferred    Try to allocate MCDRAM first. If not available, allocate DDR.
hbw_bind         Try to allocate MCDRAM. If not available, fail.
hbw_interleave   MCDRAM/DDR interleaved allocation.
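The optional parts of the I_MPI_HBW_POLICY syntax above can be easy to misplace, in particular the empty middle field when only the win_allocate policy is given. The Python sketch below composes the value string from the three policies. The function `hbw_policy` and the tuple `HBW_POLICIES` are hypothetical names used for illustration; the accepted values and the field order are taken from the slide.

```python
# Policy values listed on the slide.
HBW_POLICIES = ("hbw_preferred", "hbw_bind", "hbw_interleave")


def hbw_policy(user, mpi=None, win_allocate=None):
    """Compose a value for I_MPI_HBW_POLICY (hypothetical helper).

    Follows the slide's syntax:
    <user buffers policy>[,[mpi buffers policy][,win_allocate policy]]
    """
    if user not in HBW_POLICIES:
        raise ValueError(f"unknown user buffers policy: {user!r}")
    for label, value in (("mpi", mpi), ("win_allocate", win_allocate)):
        if value is not None and value not in HBW_POLICIES:
            raise ValueError(f"unknown {label} policy: {value!r}")

    if mpi is None and win_allocate is None:
        return user
    if win_allocate is None:
        return f"{user},{mpi}"
    # Keep the middle field, possibly empty, so the third field stays third.
    return f"{user},{mpi or ''},{win_allocate}"


if __name__ == "__main__":
    print("I_MPI_HBW_POLICY=" + hbw_policy("hbw_preferred"))
    print("I_MPI_HBW_POLICY=" + hbw_policy("hbw_bind", "hbw_preferred"))
    print("I_MPI_HBW_POLICY=" + hbw_policy("hbw_bind", win_allocate="hbw_interleave"))
```

Rejecting unknown values up front is a design choice of this sketch; how the library itself reacts to an unrecognized value is not stated on the slide.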

Intel Xeon processor E5-2600 v4 product family + Intel Omni-Path Fabric tuning
Superior Performance with Intel MPI Library 2017 Beta U1
2304 Processes, 64 nodes (Omni-Path), Linux* 64
Relative (Geomean w/o vector ops) MPI Latency Benchmarks (Higher is Better)

Speedup (times)
Message size   OpenMPI-1.10.2   IntelMPI 2017 Beta Update 1
4 bytes        1                1.34
512 bytes      1                1.48
16 Kbytes      1                1.52
128 Kbytes     1                1.89

Configuration Info:
Hardware: CPU: Intel Xeon E5-2697 v4 @ 2.30GHz; 128 GB RAM. Interconnect: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10)
Software: RHEL* 6.7; IFS 10.0.1.0.50; Libfabric 1.3.0; Intel MPI Library 2017 Beta Update 1; Intel MPI Benchmarks 4.1.1 (built with Intel C++ Compiler XE 17.0.0 Beta for Linux*)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
* Other brands and names are the property of their respective owners.
Benchmark Source: Intel Corporation.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
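The chart reports a relative geomean over latency benchmarks. As a rough illustration of that kind of figure, the Python sketch below computes a geometric-mean speedup from two lists of latencies. It assumes the speedup for each benchmark is the baseline latency divided by the candidate latency; the slide does not spell out the exact procedure, and the latencies in the example are made-up placeholders, not measurements.

```python
import math


def geomean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values:
        raise ValueError("geomean needs at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geomean is defined here for positive values only")
    # Averaging logarithms avoids overflow in the running product.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def relative_speedup(baseline_latencies, candidate_latencies):
    """Geomean of per-benchmark ratios baseline / candidate.

    Assumption: lower latency is better, so a ratio above 1 means the
    candidate is faster than the baseline on that benchmark.
    """
    if len(baseline_latencies) != len(candidate_latencies):
        raise ValueError("both lists must cover the same benchmarks")
    ratios = [b / c for b, c in zip(baseline_latencies, candidate_latencies)]
    return geomean(ratios)


if __name__ == "__main__":
    # Hypothetical latencies in microseconds for two benchmarks.
    baseline = [2.0, 8.0]
    candidate = [1.0, 2.0]
    print(f"relative speedup: {relative_speedup(baseline, candidate):.3f}")
```

A geometric mean is the usual choice for averaging ratios because a 2x gain on one benchmark and a 2x loss on another cancel out to 1, which an arithmetic mean would not reflect.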

Links/Contacts
- https://software.intel.com/en-us/intel-mpi-library
- https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-beta
- mail: dmitry.durnov@intel.com

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804