What's new in VTune Amplifier XE

Naftaly Shalev
Software and Services Group, Developer Products Division

Agenda
- What's New?
- Using VTune Amplifier XE 2013 on Xeon Phi coprocessors
- New and Experimental Features in Detail
- Summary

What's New?
- General Exploration and Bandwidth analysis for the Intel Xeon Phi coprocessor
- Event-based sampling analysis for OpenCL* applications on the Intel Xeon Phi coprocessor (JIT collection)
- Support for upcoming 4th generation Intel Core processors, code named Haswell
- General Exploration viewpoint for Intel microarchitecture code named Ivy Bridge
- Frame analysis for OpenMP* parallel regions
- Attaching to Java* processes for hardware event-based sampling analysis types
- Loop mode analysis
- And many usability improvements

Intel VTune Amplifier XE Analysis Types
- Hotspot Analysis
- Concurrency Analysis
- Locks and Waits Analysis
- Hardware Event-based Sampling
  - Lightweight Hotspot (pre-defined)
  - Advanced Analysis Types (pre-defined): General Exploration, Memory Access, Bandwidth
  - Custom Analysis Types (created by a user)

Analysis types for Xeon Phi coprocessors

Configuring User-defined analyses

Native Launch configuration
Application settings:
- Application: ssh
- Parameters: mic0 <app startup>
- Working directory: usually does not matter
Don't forget to set search directories under All files.

Hardware based sampling results
Lightweight Hotspot analysis:
- Elapsed Time and CPI
- Top hotspots
- Average and Target Concurrency
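CPI (cycles per instruction) in the summary is the ratio of clock cycles to retired instructions. The following is a minimal sketch of that arithmetic only; the function name and the event counts are hypothetical and are not VTune output.

```python
def cycles_per_instruction(clockticks: int, instructions_retired: int) -> float:
    """Return CPI = clockticks / instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


# Hypothetical event counts per function (illustrative, not measured data).
samples = {
    "compute_kernel": (8_000_000, 2_000_000),
    "pack_buffers": (1_500_000, 1_000_000),
}

for name, (ticks, insts) in samples.items():
    print(f"{name}: CPI = {cycles_per_instruction(ticks, insts):.2f}")
```

A function with a high CPI spends many cycles per retired instruction, which is usually a reason to look at it more closely with the hardware event data.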

Example #1: GUI is not appropriate for benchmarking
Problem:
- It is hard to perform repetitive tasks with the GUI
- The GUI environment is convenient to use but puts higher pressure on the system
Recommendation:
- Use the command line to run repeatable experiments
- Use the Get Command Line dialog to get the command line

Recommendation #1: Command line collection with VTune Amplifier XE
- Choose the analysis type to use:
    amplxe-cl -collect knc-lightweight-hotspots --search-dir all:p=/lib/firmware/mic -- ssh mic0 /home/levent/sp.a.x
- Make sure MIC symbols can be found on the host
- Use SSH to launch the collection; set up SSH keys for password-less access
- Collect on one or several cards
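For repeatable experiments on several cards, the command line from this slide can be generated rather than typed. The sketch below only builds and prints the commands and uses no options beyond those shown on the slide; the helper name and the second card name (mic1) are assumptions for illustration.

```python
import shlex
from typing import List


def build_collect_command(card: str, app: str,
                          analysis: str = "knc-lightweight-hotspots",
                          symbol_dir: str = "/lib/firmware/mic") -> List[str]:
    """Build the amplxe-cl argument list for one coprocessor card."""
    return [
        "amplxe-cl",
        "-collect", analysis,
        "--search-dir", f"all:p={symbol_dir}",  # where MIC symbols live on the host
        "--",
        "ssh", card, app,                       # native launch through SSH
    ]


cards = ["mic0", "mic1"]  # hypothetical card names
for card in cards:
    print(shlex.join(build_collect_command(card, "/home/levent/sp.a.x")))
```

On a host with VTune Amplifier XE installed, each argument list could be passed to `subprocess.run` to start the collection; printing first makes it easy to compare against the output of the Get Command Line dialog.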

New and Experimental features
- Frame analysis for OpenMP parallel regions
- Caller/Callee analysis
- Find text in window
- Loop Analysis and Vectorization Analyses
- Call Stack and Context Switch analysis
- Power analysis support
- Processor Graphics support

Frame analysis for OpenMP parallel regions
Detailed information can be found in the user guide and at: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/win_ug/guid-e188430a-B2F6-4901-83B4-A4355E74C025.htm
Prerequisite: Compile your application using Intel Compiler 13.1 Update 2 or higher. The compiler inserts Frame API calls and emits notifications at fork and join points.

Viewing analysis results
- Summary: Identify the most time-consuming OpenMP functions; use the Frame Rate histograms to identify parallel regions with the highest number of slow frames
- Bottom-up: Select the Frame Domain grouping level and analyze the CPU time spent in OpenMP frame domains (Frame time) and how many times each region was executed (Frame count)
- Tasks and Frames: Correlate thread activity with the frame rate of each OpenMP region to identify functions with a low frame rate
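The Frame count, Frame time, and slow-frame figures described above can be illustrated with a small aggregation. This is a sketch under stated assumptions: the input format, the region names, and the fixed slow-frame threshold are all hypothetical, and it does not reproduce how VTune Amplifier XE classifies frames.

```python
from collections import defaultdict


def summarize_frames(frames, slow_threshold_s):
    """Aggregate (domain, seconds) frame records per frame domain.

    Returns {domain: {"count": int, "time": float, "slow": int}} where a
    frame counts as slow if it took longer than slow_threshold_s.
    """
    stats = defaultdict(lambda: {"count": 0, "time": 0.0, "slow": 0})
    for domain, seconds in frames:
        entry = stats[domain]
        entry["count"] += 1
        entry["time"] += seconds
        if seconds > slow_threshold_s:
            entry["slow"] += 1
    return dict(stats)


# Hypothetical frames: one record per execution of an OpenMP parallel region.
frames = [
    ("region_a", 0.010),
    ("region_a", 0.050),
    ("region_b", 0.020),
    ("region_a", 0.012),
]

for domain, s in summarize_frames(frames, slow_threshold_s=0.030).items():
    print(f"{domain}: count={s['count']} time={s['time']:.3f}s slow={s['slow']}")
```

Sorting the result by the slow count gives the same kind of ranking the Summary histograms are used for: the regions that most often miss the target frame time.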

Intel VTune Amplifier XE Caller/Callee Analysis

Find text in window

Intel VTune Amplifier XE Loop Analysis
VTune Amplifier XE has a loop analysis feature with three modes:
- Functions only (default): the usual way of having only functions in the stacks
- Functions and loops: show the hierarchy of loops and functions in the same stack
- Loops only: show the structure of loops, hide functions
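The three loop modes can be thought of as filters over the same call stack. The sketch below models a stack as a list of (kind, name) entries and applies each mode; the data layout, the mode strings, and the frame names are assumptions for illustration, not the tool's internal representation.

```python
def view_stack(stack, mode):
    """Filter a stack of (kind, name) entries according to the loop mode."""
    if mode == "functions-only":
        return [name for kind, name in stack if kind == "function"]
    if mode == "functions-and-loops":
        return [name for _, name in stack]
    if mode == "loops-only":
        return [name for kind, name in stack if kind == "loop"]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical stack, outermost frame first.
stack = [
    ("function", "main"),
    ("loop", "[loop in main at line 42]"),
    ("function", "multiply"),
    ("loop", "[loop in multiply at line 10]"),
    ("loop", "[loop in multiply at line 11]"),
]

for mode in ("functions-only", "functions-and-loops", "loops-only"):
    print(mode, "->", " / ".join(view_stack(stack, mode)))
```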

Intel VTune Amplifier XE Loop Analysis - Example

Intel VTune Amplifier XE Vectorization analysis
Enabled via the AMPLXE_EXPERIMENTAL environment variable

Intel VTune Amplifier XE Loop and Vectorization Analysis combined

Call Stack & Context Switch Analysis with Event Based Sampling

Intel VTune Amplifier XE Enabling Context Switch and Call Stack analysis for EBS
- Select Lightweight Hotspots or any Event-Based Sampling collection
- Check the Collect stacks checkbox
For more information: http://vtune-qa.inn.intel.com/twiki/pub/locollectors/articlesandpatents/event-based-stack-sampling-reference.pdf

Intel VTune Amplifier XE Context Switch Metrics
- Synchronization Context Switches
- Preemption Context Switches
- Wait Time
- Inactive Time
- Idle Time
- Idle Wakeups

Intel VTune Amplifier XE Context Switch and Call Stack for EBS
- The OS executes all software threads in time slices, usually referred to in the literature as thread execution quanta
- The VTune Amplifier XE profiler handles thread quantum switches and performs all monitoring operations in correlation with the thread quantum layout

Intel VTune Amplifier XE Context Switch and Call Stack for EBS
- The collector gains control whenever a thread gets scheduled on and then off a processor
- It measures hardware performance events or timestamps and collects a call stack at the points where the thread gets activated and inactivated
- It determines the reason for thread inactivation: either an explicit request for synchronization or a so-called thread quantum expiration
- It also measures the time the thread stays inactive, reported by inactivation reason as Wait Time or Inactive Time
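The split between Wait Time and Inactive Time can be sketched as a classification of switch-out intervals by their reason. The record layout, the reason strings ("sync" for an explicit synchronization request, "quantum" for quantum expiration), and the timestamps below are assumptions for illustration; the actual collector format is not described in these slides.

```python
def context_switch_metrics(intervals):
    """Aggregate per-thread metrics from (thread, off_ts, on_ts, reason) records.

    reason is "sync" (thread asked to wait) or "quantum" (thread was preempted
    when its quantum expired). Timestamps are in seconds.
    """
    metrics = {}
    for thread, off_ts, on_ts, reason in intervals:
        m = metrics.setdefault(thread, {
            "sync_switches": 0, "preemption_switches": 0,
            "wait_time": 0.0, "inactive_time": 0.0,
        })
        duration = on_ts - off_ts
        if reason == "sync":
            m["sync_switches"] += 1
            m["wait_time"] += duration
        elif reason == "quantum":
            m["preemption_switches"] += 1
            m["inactive_time"] += duration
        else:
            raise ValueError(f"unknown inactivation reason: {reason}")
    return metrics


# Hypothetical switch-out intervals for two threads.
intervals = [
    (0, 1.0, 1.5, "sync"),
    (1, 1.0, 1.25, "quantum"),
    (0, 2.0, 2.25, "sync"),
]
print(context_switch_metrics(intervals))
```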

Intel VTune Amplifier XE Energy Measurement
[Diagram: sampling timeline for two threads]
- Each sample carries a timestamp, a wall-clock reference, event counter values, and a stack
- Thread 0 is switched out because of synchronization, WaitForSingleObject( Handle ): thread 0 wait time
- Thread 1 is switched out at quantum end: thread 1 inactive time
- Timeline labels: IPI, active time, sampling intervals per thread
- Stacks: processelement() -> getnextitem() -> dothejob()
- Registers and Memory: [rax + rbx*2 + 85], [A0 + rcx*8]
- Branches: JNZ, JA, RET
- Was the system idle? Did we wake it up? Was the HW in a sleep state? (C-states measurable via MSRs)
- How many Joules per sample/function/call stack? (measurable via MSRs)

Performance, Parallelism, and Power Metrics Correlated
Metrics shown together:
- Hotspots
- HW events
- Idle time
- Cx state residency
- Wait and inactive times
- Wakeups from idle
- Context switches
- Consumed energy (microjoules)
- Call stack
- Number of contended waits
Observations in the example:
- System idled for ~25% of wait time
- System spent ~10% of idleness in C6 state
- Almost every wait brought the system to idle and then caused a wakeup
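The question "how many Joules per sample/function" can be illustrated by charging the energy consumed between two consecutive samples to the function seen at the later sample. This is a minimal sketch, not the profiler's method: it assumes a free-running 32-bit energy counter read at every sample and a known energy unit per count, and the sample values and attribution rule are hypothetical.

```python
COUNTER_BITS = 32  # assumption: the energy counter wraps at 2**32


def energy_per_function(samples, joules_per_unit):
    """Attribute energy to functions from (raw_counter, function) samples.

    The energy between sample i-1 and sample i is charged to the function
    observed at sample i. Counter wraparound is handled modulo 2**32.
    """
    totals = {}
    for (prev_raw, _), (raw, func) in zip(samples, samples[1:]):
        delta_units = (raw - prev_raw) % (1 << COUNTER_BITS)
        totals[func] = totals.get(func, 0.0) + delta_units * joules_per_unit
    return totals


# Hypothetical samples: raw energy counter value and the function on top of the stack.
samples = [
    (1000, "processelement"),
    (1400, "dothejob"),
    (1500, "getnextitem"),
    (2300, "dothejob"),
]

unit = 2.0 ** -16  # assumed energy unit in Joules per count
for func, joules in energy_per_function(samples, unit).items():
    print(f"{func}: {joules * 1e6:.1f} uJ")
```

The same accumulation keyed by call stack instead of function name would give energy per call stack.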

Processor Graphics support in VTune Amplifier XE

Processor Graphics support: Why?
- VTune Amplifier XE is a system-level profiler: it captures and correlates CPU/GPU activities for graphics, media, and general purpose compute applications
- VTune Amplifier XE is a general purpose compute profiler: detailed collection and analysis of compute workloads across CPU/GPU (OpenCL)

Processor Graphics support
Detailed information available at: http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-getting-started-with-opencl-performance-analysis-on-intel-hd-graphics

System-wide analysis of media application
[Screenshot: Windows Media Player, Win7, IVB]
- Select the new tab to see detailed GPU data
- DMA packets on CPU threads that originated GPU tasks
- GPU Time metric, based on DMA packets, correlated with CPU metrics
- Frames captured
- GEN GPU engines
- Turn on/off layers on the timeline
- SW queue
- Aggregated GPU usage: Engines (DMA), EU Array Usage (metrics)

Integrated GPU Media & Compute application
[Screenshot: Media OpenCL sample from OpenCL SDK, IVB]
- Grouping for GEN compute tasks
- OpenCL kernels
- Video decoding thread
- Kernel work spaces
- Average values for HW metric per kernel
- GPU HW metrics
- Multi-thread rendering
- OpenCL kernel invocations

Case Study: NBody application
- N bodies moving in a gravity field
- Runs on CPU and then on GPU
- 64k bodies for CPU, 256k bodies for GPU to maintain comparable execution times (similar statistical errors)
- Test system: Intel Core i7 3667U with Intel HD Graphics 4000
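To give a feel for the workload, here is a minimal all-pairs gravity step and a comparison of the pair counts for the two problem sizes. The slides do not show the sample's algorithm, so the all-pairs formulation, the function names, the softening parameter, and reading "k" as 1024 are assumptions.

```python
import math


def accelerations(positions, masses, g=1.0, softening=0.0):
    """Compute the gravitational acceleration on each body by direct summation."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + softening * softening
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += g * masses[j] * d[k] * inv_r3
    return acc


def pair_interactions(n):
    """Number of ordered body pairs evaluated per step by the all-pairs loop."""
    return n * (n - 1)


# Two unit masses one unit apart attract each other with unit acceleration.
print(accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0]))

ratio = pair_interactions(256 * 1024) / pair_interactions(64 * 1024)
print(f"GPU run evaluates about {ratio:.0f}x the pairs of the CPU run")
```

Under the all-pairs assumption, four times as many bodies means roughly sixteen times as many interactions per step, which is why the larger GPU problem size can still yield a comparable execution time.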

Locating Issues on GPU
- Ugly: High rate of L3 misses and GPU memory references
- Bad: GPU stalled 60% of time
- Good: GPU fully utilized

Optimized for Shared Local Memory
- Pretty: Utilizing GPU Shared Local Memory lowered L3 misses
- Stalls dropped to 40%; gained 10% performance

Summary
- VTune Amplifier XE extended its capabilities for the Xeon family and Xeon Phi coprocessors
- Helping with vectorization, parallelism, and data locality analysis
- We recommend our tuning guide at http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
- Many new features and usage improvements

Acknowledgments
This presentation was originally composed by Levent Akyil with contributions from:
- Alexei Alexandrov
- Stanislav Bratanov
- Naftaly Shalev

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, Xeon Phi, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.
Copyright 2013. Intel Corporation. http://intel.com/software/products