Performance Profiler. Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava,

Similar documents
Intel VTune Amplifier XE. Dr. Michael Klemm Software and Services Group Developer Relations Division

Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Using Intel VTune Amplifier XE and Inspector XE in.net environment

Intel VTune Amplifier XE

Revealing the performance aspects in your code

Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.

Memory & Thread Debugger

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant

Microarchitectural Analysis with Intel VTune Amplifier XE

Simplified and Effective Serial and Parallel Performance Optimization

Jackson Marusarz Software Technical Consulting Engineer

Bei Wang, Dmitry Prohorov and Carlos Rosales

Jackson Marusarz Intel Corporation

Performance Analysis using Intel VTune Amplifier XE

Intel Xeon Phi Coprocessor Performance Analysis

Using Intel VTune Amplifier XE for High Performance Computing

Tutorial: Finding Hotspots with Intel VTune Amplifier - Linux* Intel VTune Amplifier Legal Information

Graphics Performance Analyzer for Android

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013

Vectorization Advisor: getting started

Eliminate Threading Errors to Improve Program Stability

Intel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python

Eliminate Threading Errors to Improve Program Stability

Performance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino

Crosstalk between VMs. Alexander Komarov, Application Engineer Software and Services Group Developer Relations Division EMEA

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature

What s P. Thierry

Getting Started with Intel SDK for OpenCL Applications

Tuning Python Applications Can Dramatically Increase Performance

Becca Paren Cluster Systems Engineer Software and Services Group. May 2017

This guide will show you how to use Intel Inspector XE to identify and fix resource leak errors in your programs before they start causing problems.

OpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel

Intel Software Development Products Licensing & Programs Channel EMEA

Eliminate Memory Errors to Improve Program Stability

IXPUG 16. Dmitry Durnov, Intel MPI team

Intel Parallel Studio XE 2015

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE

Intel Cluster Checker 3.0 webinar

Kevin O Leary, Intel Technical Consulting Engineer

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017

What's new in VTune Amplifier XE

Profiling: Understand Your Application

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Eliminate Memory Errors to Improve Program Stability

Efficiently Introduce Threading using Intel TBB

Getting Started Tutorial: Finding Hotspots

What s New August 2015

H.J. Lu, Sunil K Pandey. Intel. November, 2018

Intel Advisor XE. Vectorization Optimization. Optimization Notice

Achieving Peak Performance on Intel Hardware. Intel Software Developer Conference London, 2017

Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation

Getting Started Tutorial: Finding Hotspots

Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth

Jim Cownie, Johnny Peyton with help from Nitya Hariharan and Doug Jacobsen

Expressing and Analyzing Dependencies in your C++ Application

Jomar Silva Technical Evangelist

Installation Guide and Release Notes

Using the Intel VTune Amplifier 2013 on Embedded Platforms

Installation Guide and Release Notes

More performance options

CERN IT Technical Forum

A Simple Path to Parallelism with Intel Cilk Plus

Alexei Katranov. IWOCL '16, April 21, 2016, Vienna, Austria

Using Intel Inspector XE 2011 with Fortran Applications

Intel Parallel Studio XE 2011 for Windows* Installation Guide and Release Notes

Stanislav Bratanov; Roman Belenov; Ludmila Pakhomova 4/27/2015

Intel Many Integrated Core (MIC) Architecture

Optimizing Film, Media with OpenCL & Intel Quick Sync Video

Code modernization and optimization for improved performance using the OpenMP* programming model for threading and SIMD parallelism.

LIBXSMM Library for small matrix multiplications. Intel High Performance and Throughput Computing (EMEA) Hans Pabst, March 12 th 2015

Intel Xeon Phi Coprocessor. Technical Resources. Intel Xeon Phi Coprocessor Workshop Pawsey Centre & CSIRO, Aug Intel Xeon Phi Coprocessor

The Transition to PCI Express* for Client SSDs

MICHAL MROZEK ZBIGNIEW ZDANOWICZ

Oracle Developer Studio Performance Analyzer

Повышение энергоэффективности мобильных приложений путем их распараллеливания. Примеры. Владимир Полин

Kirill Rogozhin. Intel

Software Optimization Case Study. Yu-Ping Zhao

Overview of Intel Parallel Studio XE

Intel Parallel Amplifier Sample Code Guide

Intel VTune Performance Analyzer 9.1 for Windows* In-Depth

Diego Caballero and Vectorizer Team, Intel Corporation. April 16 th, 2018 Euro LLVM Developers Meeting. Bristol, UK.

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Getting Started Tutorial: Finding Hotspots

Programming for the Intel Many Integrated Core Architecture By James Reinders. The Architecture for Discovery. PowerPoint Title

Get an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,

Intel Xeon Phi Coprocessor

Collecting OpenCL*-related Metrics with Intel Graphics Performance Analyzers

Intel Threading Tools

Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel Xeon Phi Processor

Real World Development examples of systems / iot

IN-PERSISTENT-MEMORY COMPUTING WITH JAVA ERIC KACZMAREK INTEL CORPORATION

Knights Corner: Your Path to Knights Landing

Introduction to Performance Tuning & Optimization Tools

Sayantan Sur, Intel. SEA Symposium on Overlapping Computation and Communication. April 4 th, 2018

Intel Parallel Studio XE 2015 Composer Edition for Linux* Installation Guide and Release Notes

Message Passing Interface (MPI) on Intel Xeon Phi coprocessor

Transcription:

Performance Profiler Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava, 08-09-2016

Faster, Scalable Code, Faster Intel VTune Amplifier Performance Profiler Get Faster Code Faster With Accurate Data & Meaningful Analysis Accurate CPU, GPU and threading data OpenMP region efficiency analysis Powerful data analysis & filtering Data displayed on the source code Easy set-up, no special compiles Last week, Intel VTune Amplifier helped us find almost 3X performance improvement. This week it helped us improve the performance another 3X. Claire Cates Principal Developer SAS Institute Inc. For Windows* and Linux* From $899 (UI only now available on OS X*) http://intel.ly/vtune-amplifier-xe 2

Intel VTune Amplifier Tune Applications for Scalable Multicore Performance Agenda Data Collection Rich set of performance data Data Analysis - Find answers fast Flexible workflow User i/f and command line Compare results Remote collection New for 2017! Summary 3

Two Great Ways to Collect Data Intel VTune Amplifier Software Collector Hardware Collector Uses OS interrupts Uses the on chip Performance Monitoring Unit (PMU) Collects from a single process tree Collect system wide or from a single process tree. ~10ms default resolution ~1ms default resolution (finer granularity - finds small functions) Either an Intel or a compatible processor Requires a genuine Intel processor for collection Call stacks show calling sequence Optionally collect call stacks Works in virtual environments Works in a VM only when supported by the VM (e.g., vsphere*, KVM) No driver required Requires a driver - Easy to install on Windows - Linux requires root (or use default perf driver) No special recompiles - C, C++, C#, Fortran, Java, Assembly 4

Fewer Driver Hassles on Linux* Intel VTune Amplifier Previously added in 2015: Auto-rebuild Intel EBS driver Does advanced analysis stop working when an OS update is installed? Do you have to ask IT to rebuild the driver? No longer! Just setup the driver to auto-rebuild when the OS is updated. Auto-disable NMI watchdog Tired of turning off NMI watchdog to run advanced EBS profiling? Now you don t have to. We turn it off, then put it back the way it was. Since 2016! Perf can now collect stacks! Linux* kernel 2.6.32 or higher Use pre-installed perf driver IT won t install the Intel driver? Use perf! Intel EBS driver provides additional features not available in perf: Uncore events Multiple precise events New events for the latest processors, even on an older OS Easier access to the on-chip PMU for advanced performance profiling 5

A Rich Set of Performance Data Intel VTune Amplifier Software Collector Basic Hotspots Which functions use the most time? Concurrency Tune parallelism. Colors show number of cores used. Locks and Waits Tune the #1 cause of slow threaded performance: waiting with idle cores. Any IA86 processor, any VM, no driver Hardware Collector Advanced Hotspots Which functions use the most time? Where to inline? Statistical call counts General Exploration Where is the biggest opportunity? Cache misses? Branch mispredictions? Advanced Analysis Dig deep to tune access contention, etc. Higher res., lower overhead, system wide No special recompiles - C, C++, C#, Fortran, Java, Assembly 6

Intel VTune Amplifier Tune Applications for Scalable Multicore Performance Agenda Data Collection Rich set of performance data Data Analysis - Find answers fast Flexible workflow User i/f and command line Compare results Remote collection New for 2017! Summary 7

Find Answers Fast Intel VTune Amplifier Adjust Data Grouping (Partial list shown) Double Click Function to View Source Click [+] for Call Stack Filter by Timeline Selection (or by Grid Selection) Filter by Process & Other Controls Tuning Opportunities Shown in Pink. Hover for Tips 8

See Profile Data On Source / Asm Double Click from Grid or Timeline View Source / Asm or both CPU Time Right click for instruction reference manual Quick Asm navigation: Select source to highlight Asm Scroll Bar Heat Map is an overview of hot spots Click jump to scroll Asm 9

Caller / Callee Analyze Parent/Child Intel VTune Amplifier Click on a function in the left pane to see data for the callers and callees Quickly Find The Bottleneck In Time Critical Call Paths 10

Timeline Visualizes Thread Behavior Intel VTune Amplifier Transitions Locks & Waits CPU Time Basic Hotspots Advanced Hotspots Hovers: Optional: Use API to mark frames and user tasks Optional: Add a mark during collection 11

Visualize Parallel Performance Issues Look for Common Patterns Coarse Grain Locks High Lock Contention Low Concurrency Load Imbalance 12

More Big Picture Support Super Tiny bird s-eye view Helps recognize application phases and behavioral patterns Use context menu to select 13

Get The Data You Need Tune OpenMP for Efficiency and Scalability Typical Questions: Q: I put in pragmas, but why is my speed up far from linear? A: Parallelization inefficiency Q: I ran my app on a system with more cores but why does it run less efficiently than on the system with fewer cores? A: Scalability issues Data Needed: 1) Is the serial time of my application significant enough to prevent scaling? 2) How much gain can be achieved by tuning OpenMP? 3) Which OpenMP regions / loops / barriers will benefit most from tuning? 4) What are the inefficiencies with each region? 14

Tune OpenMP for Efficiency and Scalability Fast Answers: Is My OpenMP Scalable? How Much Faster Could It Be? New! 1) 2) 3) 4) The summary view shown above gives fast answers to four important OpenMP tuning questions: 1) Is the serial time of my application significant enough to prevent scaling? 2) How much performance can be gained by tuning OpenMP? 3) Which OpenMP regions / loops / barriers will benefit most from tuning? 4) What are the inefficiencies with each region? (click the link to see details) 15

Tune OpenMP for Efficiency and Scalability See the wall clock impact of inefficiencies, identify their cause Focus On What s Important What region is inefficient? Is the potential gain worth it? Why is it inefficient? Imbalance? Scheduling? Lock spinning? Intel Xeon Phi systems supported Imbalance Lock Fork Scheduling Potential Gain Fork Actual Elapsed Time Ideal Time Join Potential Gain New! 16

What is Hindering Parallel Performance? VTune Amplifier Identifies Parallel Region Inefficiencies New! Imbalance Likely culprit: Dynamic scheduling overhead 17

Jump to Parallel Region Source Code Find Answers Faster View data specific to the region at the source code level New! Tip: use -parallel-source-info=2 compiler option to embed source file name in region name (enables drill down to source file) 18

Tune OpenMP for Efficiency and Scalability See inside each parallel region Understand the cause of inefficiency New! Detailed Barrier to Barrier Analysis Tune each segment separately Easier to see tuning opportunities #pragma omp parallel { #pragma omp barrier #pragma omp single { { #pragma omp for { Fork Barrier-to-Barrier Segment 1 User Barrier Omp Single Parallel Region Barrier-to-Barrier Segment 2 Single Barrier Omp For Barrier-to-Barrier Segment 3 Omp For Barrier Join 19

Tune OpenMP for Efficiency and Scalability More precision where you need it most New! Profile Small Region Instances Working thread imbalance is a major performance issue Tune using precise imbalance measurement (trace-based) 20

Easier Multi-Rank Analysis of MPI + OpenMP Tune hybrid parallelism using ITAC + VTune Amplifier New! Intel Trace Analyzer & Collector (ITAC) - MPI Analyzer and Profiler 1) ITAC finds ranks with low MPI communication spin time These will benefit most from better OpenMP performance 2) Select these ranks for OpenMP profiling in VTune Amplifier 21

Easier Multi-Rank Analysis of MPI + OpenMP Tune hybrid parallelism using ITAC + VTune Amplifier Tune OpenMP performance of high impact ranks in VTune Amplifier New! Ranks sorted by OpenMP tuning impact on overall performance Process names link to OpenMP metrics Per-rank OpenMP Potential Gain and Serial Time metrics Detailed OpenMP metrics 22

Memory Access Analysis New! Intel VTune Amplifier 2016 New! Tune data structures for better performance Attribute cache misses to data structures Better Bandwidth Analysis for Non-Uniform Memory See Read & Write contributions to Total Bandwidth Easier tuning of multi-socket bandwidth Seeing total bandwidth can suggest data blocking opportunities to change a bandwidth bound app into a compute bound app. 23

Intel VTune Amplifier Tune Applications for Scalable Multicore Performance Agenda Data Collection Rich set of performance data Data Analysis - Find answers fast Flexible workflow User i/f and command line Compare results Remote collection New for 2017! Summary 24

Command Line Interface Automate analysis amplxe-cl is the command line: Windows: C:\Program Files (x86)\intel\vtune Amplifier XE \bin[32 64]\amplxe-cl.exe Linux: /opt/intel/vtune_amplifier_xe/bin[32 64]/amplxe-cl Help: amplxe-cl help Use UI to setup 1) Configure analysis in UI 2) Press Command Line button 3) Copy & paste command Great for regression analysis send results file to developer Command line results can also be opened in the UI 25

Interactive Remote Data Collection Performance analysis of remote systems just got a lot easier Interactive analysis 1) Configure SSH to a remote Linux* target 2) Choose and run analysis with the UI Command line analysis 1) Run command line remotely on Windows* or Linux* target 2) Copy results back to host and open in UI Conveniently use your local UI to analyze remote systems 26

Compare Results Quickly - Sort By Difference Intel VTune Amplifier Quickly identify cause of regressions. Run a command line analysis daily Identify the function responsible so you know who to alert Compare 2 optimizations What improved? Compare 2 systems What didn t speed up as much? 27

OS X* Host Support Intel VTune Amplifier Host Runs on OS X* Analyze data from Linux* Analyze data from Windows* No local OS X data collection No Extra Charge Separate download Uses your Windows or Linux license Easy Remote Collection SSH to remote Linux 28

Profile Python & Go! And Mixed Python / C++ / Fortran Low Overhead Sampling Accurate performance data without high overhead instrumentation Launch application or attach to a running process Precise Line Level Details No guessing, see source line level detail New! Mixed Python / native C, C++, Fortran Optimize native code driven by Python 32

Intel VTune Amplifier Tunes Knights Landing Processors 4 Critical Optimizations for Intel Xeon Phi Processors 1) High Bandwidth Memory Decide which data structures to place in MCDRAM See performance problems by memory hierarchy Measure DRAM and MCDRAM bandwidth 2) Scalability of MPI and OpenMP Serial vs. Parallel time Imbalance, overhead cost, parallel loop parameters 3) Micro Architecture Efficiency See the efficiency of your code in the core pipeline Zero in on details with custom PMU events 4) Vectorization Efficiency Use Intel Advisor Optimize for AVX-512 with or without AVX-512 hardware New! 33

Three Keys to HPC Performance: Threading, Memory Access, Vectorization Intel VTune Amplifier New! Threading: CPU Utilization Serial vs. Parallel time Top OpenMP regions by potential gain Tip: Use hotspot OpenMP region analysis for more detail Memory Access Efficiency Stalls by memory hierarchy Bandwidth utilization Tip: Use Memory Access analysis Vectorization: FPU Utilization FLOPS estimates from sampling Tip: Use Intel Advisor for precise metrics and vectorization optimization For 3rd, 5th, 6th Generation Intel Core processors and second generation Intel Xeon Phi processor code named Knights Landing. 34

Optimize Memory Access Memory Access Analysis - Intel VTune Amplifier 2017 Improved! Tune data structures for performance Attribute cache misses to data structures (not just the code causing the miss) Support for custom memory allocators Optimize NUMA latency & scalability True & false sharing optimization Auto detect max system bandwidth Easier tuning of inter-socket bandwidth Easier install, Latest processors No special drivers required on Linux* Intel Xeon Phi processor MCDRAM (high bandwidth memory) analysis 35

Storage Device Analysis (HDD, SATA or NVMe SSD) Intel VTune Amplifier Are You I/O Bound or CPU Bound? Explore imbalance between I/O operations (async & sync) and compute Storage accesses mapped to the source code See when CPU is waiting for I/O Measure bus bandwidth to storage New! Sliders set thresholds for I/O Queue Depth Slow task with I/O Wait Latency analysis Tune storage accesses with latency histogram Distribution of I/O over multiple devices 36

Intel Performance Snapshots Three Fast Ways to Discover Untapped Performance Is your application making good use of modern computer hardware? Run a test case during your coffee break. High level summary shows which apps can benefit most from code modernization and faster storage. Pick a Performance Snapshot: Application for non-mpi apps MPI for MPI apps Storage for systems. Servers and workstations with directly attached storage. Free download: http://www.intel.com/performance-snapshot Also included with Intel Parallel Studio and Intel VTune Amplifier products. New! New! 37

Application Performance Snapshot Discover opportunities for better performance with vectorization & threading Objectives Simple enough to run during a coffee break Highlight where code modernization can help Users Performance teams fast prioritization of which apps will benefit most All Developers size the potential performance gain from code modernization Non-Objectives Actionable tuning data that is another tool. Snapshot is just a fast health check. Free download: http://www.intel.com/performance-snapshot Also included with Intel Parallel Studio and Intel VTune Amplifier products. Preview! 38

Storage Performance Snapshot Discover if faster storage can improve server/workstation performance Learn It On One Coffee Break Easy setup Quickly see meaningful data System view of workload Any architecture Targeted Systems Servers & workstations with directly attached storage Not scale out storage clusters Linux kernel 2.6 or newer dstat 0.7 or newer Windows Server 2012, Windows 8 or newer Windows OS Preview! Free download: http://www.intel.com/performance-snapshot Also included with Intel Parallel Studio and Intel VTune Amplifier products. 40

Intel VTune Amplifier Tune Applications for Scalable Multicore Performance Agenda Data Collection Rich set of performance data Data Analysis - Find answers fast Flexible workflow User i/f and command line Compare results Remote collection New for 2016! Summary 41

Intel VTune Amplifier Faster, Scaleable Code, Faster Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits Analysis Cache miss, Bandwidth analysis 1 GPU Offload and OpenCL Kernel Tracing Find Answers Fast View Results on the Source / Assembly OpenMP Scalability Analysis, Graphical Frame Analysis Filter Out Extraneous Data Organize Data with Viewpoints Visualize Thread & Task Activity on the Timeline Easy to Use No Special Compiles C, C++, C#, Fortran, Java, ASM Visual Studio* Integration or Stand Alone Graphical Interface & Command Line Local & Remote Data Collection Analyze Windows* & Linux* data on OS X* 2 1 Events vary by processor. 2 No data collection on OS X* Quickly Find Tuning Opportunities See Results On The Source Code Tune OpenMP Scalability Visualize & Filter Data 42

Additional Material Intel VTune Amplifier Performance Profiler: Product page overview, features, FAQs Training materials movies, tech briefs, documentation, evaluation guides Reviews Support forums, secure support Additional Analysis Tools: Intel Inspector memory and thread checker/ debugger Intel Advisor thread prototyping tool for architects Additional Development Products: Intel Software Development Products 43

Legal Disclaimer & INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Intel s compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 45