Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.


Transcription:

Agenda
VTune Amplifier XE OpenMP* Analysis: answering customers' questions about performance in the same language the program was written in
Concepts, metrics, and technology inside
VTune Amplifier XE OpenMP Analysis workflow
OpenMP analysis for hybrid MPI + OpenMP applications
Summary

Typical customer questions on parallelization efficiency of OpenMP* applications
"I added the pragmas, so why is my speedup far from linear?" (parallelization inefficiency)
"I ran my app on a system with more cores, but it does not run as efficiently as on fewer cores." (scalability issues)
Decomposing the questions:
Is the serial time of my application significant enough to prevent scaling?
How efficient is my OpenMP parallelization?
If it is not efficient, how much could I gain by investing in removing the inefficiencies?
Which OpenMP regions, loops, and barriers are worth tuning?
What are the particular problems with them?
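The serial-time question above can be made concrete with Amdahl's law, which the slides do not name but which gives the standard upper bound on speedup for a fixed serial fraction. The sketch below is a minimal illustration; the 5% serial fraction and the thread counts are hypothetical values chosen for the example, not measurements from the talk.

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the run time stays serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Serial part is unchanged; only the parallel part shrinks with thread count.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


# Hypothetical application with 5% serial time on growing core counts.
for n in (4, 16, 64, 256):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

With a 5% serial fraction the bound never reaches 20x regardless of core count, which is why serial time is the first thing to check before tuning the parallel regions.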

If performance information is OpenMP-unaware
In the OpenMP-unaware views of VTune Amplifier XE, problems are difficult to detect. Customers might even blame the runtime when they see CPU time consumed there, not understanding that it results from parallelization inefficiency.
The questions are tied to the OpenMP program structure (#pragmas), so the answers should be given in the same terms to be understandable and actionable.

VTune Amplifier XE OpenMP* Analysis: answering customers' questions about performance in the same language the program was written in
Overview on the summary pane:
Is the serial time of my application significant enough to prevent scaling?
How efficient is my parallelization compared with ideal parallel execution?
How much theoretical gain can I get if I invest in tuning?
Which regions are the most promising to invest in?
Links to the grid view give more details on inefficiency.

The key to OpenMP awareness in VTune: region-based views and metrics
Definition of a region
[Figure: a parallel region from Fork to Join, annotated with Actual Parallel Region Elapsed Time and Potential Gain (elapsed time metric)]
Time categories within a region, with the collection method for each:
Effective time (sampling)
Lock spinning (sampling)
Imbalance (tracing)
Scheduling (sampling)
Work forking (sampling)
Atomics (sampling)
Potential Gain is the sum of the inefficiencies normalized by the number of threads.
Estimated Ideal Time = Effective time / Number of threads
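The two formulas above can be worked through with a small sketch. It assumes the per-category times are CPU seconds summed over all threads of one region; the function names, the dictionary keys, and every number are hypothetical and do not come from VTune output.

```python
def estimated_ideal_time(effective_cpu_time: float, threads: int) -> float:
    """Estimated Ideal Time = Effective time / Number of threads."""
    return effective_cpu_time / threads


def potential_gain(inefficiency_cpu_time: dict, threads: int) -> float:
    """Sum of the inefficiency categories, normalized by the number of threads."""
    return sum(inefficiency_cpu_time.values()) / threads


# Hypothetical region: CPU seconds summed over 4 threads.
threads = 4
effective = 32.0
inefficiencies = {
    "lock_spinning": 1.0,
    "imbalance": 6.0,
    "scheduling": 0.5,
    "work_forking": 0.25,
    "atomics": 0.25,
}

print(estimated_ideal_time(effective, threads))  # 8.0
print(potential_gain(inefficiencies, threads))   # 2.0
```

In this made-up region the elapsed time could shrink by about 2 seconds toward the 8-second ideal if all inefficiencies were removed, and imbalance is the category that dominates.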

Technology under VTune Amplifier XE OpenMP Analysis
Tracing of OpenMP constructs provides region and work-sharing context and precise imbalance on barriers. The trace data is provided to VTune by the Intel OpenMP runtime under profiling:
Fork-join points of parallel regions with the number of working threads (Intel Compiler 14 and later)
OpenMP construct barrier points with imbalance information and OpenMP loop metadata
The -parallel-source-info=2 compiler option embeds the source file name in the region name
A transition to OMPT is under consideration, working with John M.-C. on interface enrichments for low-overhead analysis
Sampling is used to define and classify CPU time as the user's code or OpenMP RTL work. Classification: locking, scheduling, work forking.

VTune Amplifier XE OpenMP Analysis Workflow
Start with the HPC Performance Characterization analysis.
Explore the CPU Utilization aspect: metrics related to OpenMP in the summary, grid, and source views.
Command line: amplxe-cl -collect hpc-performance <my_app>

Per-region details in the grid view: inefficiencies in wall time, with classification and issue highlighting
Imbalance on a loop barrier
Dynamic scheduling overhead on a parallel loop
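To illustrate what imbalance on a barrier means, the sketch below computes how long each thread waits for the slowest one. It is a simplified model, not the way VTune derives the metric: the function name and the arrival times are hypothetical.

```python
def barrier_imbalance(arrival_times):
    """Return per-thread wait times at a barrier and the wait normalized by thread count.

    Each thread waits from its own arrival until the last thread arrives.
    """
    last_arrival = max(arrival_times)
    waits = [last_arrival - t for t in arrival_times]
    return waits, sum(waits) / len(arrival_times)


# Hypothetical arrival times (seconds) of 4 threads at the barrier ending a loop.
waits, normalized = barrier_imbalance([2.0, 4.0, 4.0, 6.0])
print(waits)       # [4.0, 2.0, 2.0, 0.0]
print(normalized)  # 2.0
```

The normalized value is in wall-time terms, which matches how the grid view reports inefficiencies per region.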

Details in the grid view: serial time hotspots
Serial hotspots appear under Master Thread Time.
Use a filter to exclude the initialization phase.

Details on the scalable timeline
Super-tiny timeline display mode: a bird's-eye view showing all data without scrolling
Region frames on the ruler
The more green, the more efficient the multithreaded execution
Intel Xeon Phi profiling result with 288 threads

Details for a region at the source file level

Summary
VTune Amplifier XE OpenMP analysis answers customers' questions about performance in the language of OpenMP constructs.
The analysis scales well on many-core systems thanks to a good balance of tracing and sampling collection technologies.
The OpenMP analysis is MPI-aware, which helps when tuning hybrid MPI + OpenMP applications within a node.
The full feature set is available in VTune Amplifier XE 2018 with the Intel OpenMP and Intel MPI runtimes, as part of Intel Parallel Studio XE 2018.

Back-up

A use case: NPB CG imbalance improvement
Step 1: Profile the original application, NPB CG (Class B).
One region shows a promising potential gain; go to the grid view for more details.

Step 1 (continued): Profile the original application, NPB CG (Class B).
There are barriers in the region; use the experimental /OpenMP Region/OpenMP Barrier.. grouping.
The imbalance is on the omp loop in cg.f, lines 572-580, where the schedule is static.

Step 2: Try dynamic scheduling with omp do schedule(dynamic).
Elapsed time increased, so there is no improvement. Go to the grid view for details.

Step 2 (continued): Try dynamic scheduling with omp do schedule(dynamic).
The default chunk size is 1, which led to scheduling overhead. Let's try a bigger chunk size.

Step 3: Try dynamic scheduling with a chunk size of 20.
This improved the original elapsed time by ~15% and eliminated the imbalance.
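The trade-off seen across the three steps can be reproduced with a toy model. The sketch below is not NPB CG and does not call OpenMP; it simulates a loop whose later iterations cost more, and charges a fixed cost each time a thread fetches a chunk. The iteration costs, the thread count, and the dispatch cost are hypothetical values picked to make the effect visible.

```python
import heapq


def static_elapsed(costs, threads):
    """Elapsed time when iterations are split into contiguous, near-equal blocks."""
    base, extra = divmod(len(costs), threads)
    finish_times = []
    start = 0
    for t in range(threads):
        size = base + (1 if t < extra else 0)
        finish_times.append(sum(costs[start:start + size]))
        start += size
    return max(finish_times)


def dynamic_elapsed(costs, threads, chunk, dispatch_cost):
    """Elapsed time when the next free thread takes the next chunk of iterations."""
    free_at = [0.0] * threads  # min-heap of times at which each thread becomes free
    heapq.heapify(free_at)
    for start in range(0, len(costs), chunk):
        t = heapq.heappop(free_at)
        # Every chunk fetch pays the dispatch cost on top of the useful work.
        t += dispatch_cost + sum(costs[start:start + chunk])
        heapq.heappush(free_at, t)
    return max(free_at)


# Hypothetical loop: the last quarter of the iterations is twice as expensive.
costs = [1.0] * 300 + [2.0] * 100
static_time = static_elapsed(costs, threads=4)
dynamic_1 = dynamic_elapsed(costs, threads=4, chunk=1, dispatch_cost=1.0)
dynamic_20 = dynamic_elapsed(costs, threads=4, chunk=20, dispatch_cost=1.0)
print(static_time, dynamic_1, dynamic_20)
```

In this model the static schedule leaves one thread with the expensive block, chunk size 1 balances the work but pays the dispatch cost on every iteration, and chunk size 20 keeps the balance while paying that cost far less often.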


Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804