Simplified and Effective Serial and Parallel Performance Optimization

1 HPC Code Modernization Workshop at LRZ
Simplified and Effective Serial and Parallel Performance Optimization
Performance Tuning Using Intel VTune Performance Profiler

2 Performance Tuning Methodology
Goal: minimize the time it takes your program / module / function to execute.
- Identify hotspots and focus on them. Frequently just a few functions matter (20% of the code takes 80% of the time).
- Optimize these parts (with compiler or hand optimizations).
- Check for hotspots again, and find new ones.
How to optimize the hotspots? Maximize CPU utilization and/or minimize elapsed time.
- Keep all cores busy with useful work: optimal thread/task parallelism.
- Most efficient code execution on each core:
  - Best instruction-level parallelism, e.g. 4 micro-instructions/cycle on the Intel Core architecture
  - SIMD-parallel (packed SSE/AVX instructions) code instead of scalar code
  - Minimize stalls caused by memory access
  - Avoid long-latency, non-pipelined instructions like division
  - Optimal branch prediction: avoid bad speculation, no assists
  - ... and a lot more

3 Intel VTune Amplifier XE Supports All Steps of a Systematic Performance Analysis
Where is my application...
- Spending time? Focus tuning on functions taking time: hotspots, call stacks, time on source
- Wasting time? See cache misses on your source; see functions sorted by # of cache misses
- Waiting too long? See locks by wait time; red/green for CPU utilization during wait
Windows & Linux. Low overhead. No special compiler requirements.
Advanced Profiling For Scalable Multicore Performance

4 Intel VTune Amplifier XE: Tune Applications for Scalable Multicore Performance
Fast, accurate performance profiles
- Hotspot (statistical call tree)
- Call counts (statistical)
- Hardware-event sampling
Thread profiling
- Visualize thread interactions on timeline
- Balance workloads
Easy set-up
- Pre-defined performance profiles
- Use a normal production build
Find answers fast
- Filter extraneous data
- View results on the source / assembly
Compatible
- Microsoft, GCC, Intel compilers
- C/C++, Fortran, Assembly, .NET, Java
- Latest Intel processors and compatible processors [1]
Windows or Linux
- Visual Studio integration (Windows)
- Standalone user interface and command line
- 32 and 64-bit
[1] IA32 and Intel 64 architectures. Many features work with compatible processors. Event-based sampling requires a genuine Intel processor.

5 Intel VTune Amplifier XE Feature Highlights
Basic Hot Spot Analysis (Statistical Call Graph)
- Locates the time-consuming regions of your application
- Provides associated call stacks that let you know how you got to these time-consuming regions
- Call tree built using these call stacks
Thread Profiling
- Visualize thread activity and lock transitions in the timeline
- Provides lock profiling capability
- Shows CPU/core utilization and concurrency information
Advanced Hotspot and Architecture Analysis
- Based on hardware event-based sampling (EBS)
- Pre-defined tuning experiments
- General Exploration analysis

6 Hotspots analysis
[Screenshot of the Hotspots viewpoint. Callouts: hotspot functions with function CPU time (partial list shown); adjust data grouping; change viewpoint; click [+] for call stack; thread timeline; call stack pane; filter by timeline selection (or by grid selection); filter by module and other controls.]

7 Hotspots analysis: Source View
[Screenshot of the source view next to the assembly view.]
- Self and total time on source / asm
- Right click for instruction reference manual
- Quick asm navigation: select source to highlight asm
- Click jump to scroll asm
- Quickly scroll to hot spots: the scroll bar "heat map" is an overview of hot spots

8 Thread Profiling: Wait, Overhead and Spin Time
[Figure: timeline of three threads over six 1-second intervals, marking when each thread is running, waiting, or spending overhead/spin time inside threading library internals.]
Elapsed time: 6 seconds
CPU time: T1 (4s) + T2 (3s) + T3 (3s) = 10 seconds
Wait time: T1 (2s) + T2 (2s) + T3 (2s) = 6 seconds
Overhead and spin time: T1 (1s) + T2 (1s) + T3 (1s) = 3 seconds
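The accounting on this slide can be reproduced with a few lines of C. This is only an illustrative sketch: the `thread_profile` and `profile_totals` types and the `sum_profiles` function are hypothetical names introduced here, not part of VTune. It shows that the per-thread times are simply summed over threads, which is why the aggregated CPU time (10 s) can exceed the elapsed time (6 s).

```c
/* Per-thread times in seconds, as read off a thread timeline.
   Hypothetical types for illustration only. */
typedef struct {
    double cpu;            /* time the thread was running on a core  */
    double wait;           /* time the thread was blocked            */
    double overhead_spin;  /* time spent in overhead or spin-waiting */
} thread_profile;

typedef struct {
    double cpu;
    double wait;
    double overhead_spin;
} profile_totals;

/* Aggregate the per-thread values over all n threads. */
profile_totals sum_profiles(const thread_profile *t, int n)
{
    profile_totals s = {0.0, 0.0, 0.0};
    for (int i = 0; i < n; i++) {
        s.cpu += t[i].cpu;
        s.wait += t[i].wait;
        s.overhead_spin += t[i].overhead_spin;
    }
    return s;
}
```

Calling `sum_profiles` with the three threads from the slide, `{4, 2, 1}`, `{3, 2, 1}` and `{3, 2, 1}`, gives totals of 10 s CPU time, 6 s wait time and 3 s overhead and spin time.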

9 Concurrency Analysis
Bottom-Up view.
[Screenshot: timeline with legend entries for CPU usage, overhead, wait, thread is running, thread is waiting, and thread transitions.]

10 OpenMP* Performance Analysis A new feature in release

11 OpenMP Programming Model
Fork-join parallelism:
- Master thread spawns a team of threads as needed (OpenMP team := master + workers)
- Parallelism is added incrementally: that is, the sequential program evolves into a parallel program
[Figure: master thread forking into parallel regions.]

12 OpenMP Parallel Region Example
    // assume N=12
    #pragma omp parallel
    #pragma omp for
    for (i = 1; i < N+1; i++)
        c[i] = a[i] + b[i];
[Figure: the 12 iterations divided into three blocks (i = 1..4, i = 5..8, i = 9..12), followed by an implicit barrier.]
- Threads are assigned an independent set of iterations
- Threads must wait at the end of the worksharing construct
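A self-contained version of this loop might look as follows. It is a minimal sketch: the function name `vector_add` and the use of `double` arrays are assumptions made here, and the combined `parallel for` directive stands in for the separate `parallel` and `for` directives on the slide. Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
/* Computes c[i] = a[i] + b[i] for i = 1..n, keeping the 1-based
   indexing of the slide, so each array needs at least n + 1 elements.
   Hypothetical helper for illustration. */
void vector_add(const double *a, const double *b, double *c, int n)
{
    /* The iterations are independent, so the runtime may hand each
       thread its own subset. All threads wait at the implicit barrier
       at the end of the loop before the function returns. */
    #pragma omp parallel for
    for (int i = 1; i < n + 1; i++)
        c[i] = a[i] + b[i];
}
```

With GCC, for example, OpenMP is enabled with the `-fopenmp` flag.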

13 VTune Amplifier XE/OpenMP Analysis
OpenMP analysis is enhanced with a set of metrics that answer the following questions:
- Is the serial time of my application significant enough to prevent scaling?
- How efficient is my OpenMP parallelization?
- How much can I gain if I invest in reducing load imbalance/overhead?
- Which regions are the most promising to invest in?
Metrics are based on elapsed time, so they show direct improvement possibilities for the application's wall-clock time.

14 VTune - OpenMP* Analysis
Serial time: time spent by the application outside any OpenMP* region in the master thread during collection:
    [Elapsed time] - (sum of [Elapsed time of all parallel regions])
Effective CPU time of a parallel region instance:
    [CPU time] - [Spin time] - [Overhead time]
    where CPU, spin and overhead time are aggregated over the threads in the region instance
Estimated ideal time of a region instance:
    [Effective CPU time] / [Number of OpenMP threads]
Potential gain of a parallel region instance:
    [Region instance elapsed time] - [Estimated ideal time of the region instance]
Potential gain of a region: sum of [Potential gain of all instances of the region]
Potential gain of a program: sum of [Potential gain of all regions]
[Figure: one region instance between fork and join, with the region instance elapsed time split into effective CPU time, spin (busy wait: imbalance, lock contention), passive wait (not consuming CPU) and overhead (creation, scheduling, reduction); the potential gain is the part beyond the estimated ideal time.]
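The region-instance metrics can be written down as plain arithmetic. The sketch below assumes that effective CPU time is CPU time minus spin and overhead time, summed over the threads of the instance, and that potential gain is the instance's elapsed time minus its estimated ideal time. The `thread_times` type and the function names are hypothetical and only illustrate the formulas; they are not a VTune API.

```c
/* Times in seconds for one thread inside one parallel region instance.
   Hypothetical type for illustration only. */
typedef struct {
    double cpu_time;
    double spin_time;
    double overhead_time;
} thread_times;

/* Effective CPU time: CPU time minus spin and overhead time,
   aggregated over all threads of the region instance. */
double effective_cpu_time(const thread_times *t, int num_threads)
{
    double total = 0.0;
    for (int i = 0; i < num_threads; i++)
        total += t[i].cpu_time - t[i].spin_time - t[i].overhead_time;
    return total;
}

/* Estimated ideal time: the effective work spread evenly over the threads. */
double estimated_ideal_time(const thread_times *t, int num_threads)
{
    return effective_cpu_time(t, num_threads) / num_threads;
}

/* Potential gain: how much shorter the instance could have been. */
double potential_gain(const thread_times *t, int num_threads,
                      double region_elapsed)
{
    return region_elapsed - estimated_ideal_time(t, num_threads);
}
```

As a made-up example, four threads that each spend 2.0 s of CPU time, 0.5 s of it spinning, in a region instance that takes 2.0 s give an effective CPU time of 6.0 s, an estimated ideal time of 1.5 s and a potential gain of 0.5 s.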

15 VTune Amplifier XE/OpenMP Analysis
Tracing of OpenMP is used to provide region/work-sharing context
- Provided to VTune by the Intel OpenMP runtime: fork-join time points of parallel regions with the number of working threads
- The overhead of tracing can be substantial, so it is used carefully: per region instance, at region fork-join points
Sampling is used to determine different kinds of overhead, synchronization spinning, etc.
- Any type of VTune analysis that supports CPU time calculation (such as hotspots, advanced-hotspots with or without stacks, etc.)

16 VTune Amplifier XE/OpenMP Analysis: Metrics in Summary

17 VTune Amplifier XE/OpenMP Analysis: Metrics in Grid
- Improved CPU time hierarchy
- The potential gain metric can be more important than CPU or elapsed time, because it doesn't focus on the top time-consuming region; it focuses on the region where you get the maximum results from tuning
- OpenMP regions marked in timeline pane

18 VTune Amplifier XE/OpenMP Analysis
- Drilldown to region source from grid
- Region/.. grouping

19 VTune Amplifier XE/OpenMP Analysis: Interpreting OpenMP Analysis Data
Scenario: executing serial code
- Find ways to minimize the serial execution sections, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for the sections that seem unavoidably serial
- For high core count machines, serial sections have a severe negative impact on potential scaling and should be minimized as much as possible

20 VTune Amplifier XE/OpenMP Analysis: Interpreting OpenMP Analysis Data
Scenario: synchronization objects and waiting time
- Big potential gain
- Spin time: both load imbalance and lock contention
- Try to avoid synchronization inside regions by using an OpenMP reduction, the omp atomic construct or thread-local storage where possible
- To detect which particular synchronization object causes the problem, collect a Locks and Waits analysis
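As an illustration of the first suggestion, a sum over an array can be parallelized with an OpenMP reduction so that no lock or critical section is needed inside the region. This is a minimal sketch and `sum_array` is a hypothetical function name. Each thread accumulates into its own private copy of `total`, and the runtime combines the copies at the end of the loop.

```c
/* Sums x[0..n-1] using an OpenMP reduction instead of protecting a
   shared accumulator with a lock. Hypothetical helper for illustration. */
double sum_array(const double *x, int n)
{
    double total = 0.0;

    /* reduction(+:total) gives every thread a private partial sum and
       adds the partial sums together when the loop finishes. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i];

    return total;
}
```

Because the partial sums may be combined in a different order from run to run, floating-point results can differ in the last bits between runs and thread counts.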

21 VTune Amplifier XE/OpenMP Analysis: Interpreting OpenMP Analysis Data
Scenario: load imbalance
- Notice the significant potential gain of 1.061s out of an overall 5.561s; this means there is room for improvement

22 VTune Amplifier XE/OpenMP Analysis: Interpreting OpenMP Analysis Data
Scenario: a well-balanced parallel region
- All threads are busy, no overhead or spinning (no red color)
- Potential gain is small
- Majority of CPU time is effective
- This doesn't mean that everything is perfect, e.g. there may be micro-architectural issues, but no issues are indicated in OpenMP with thread-based parallelism and load balancing

23 Advanced Hotspot and Architecture Analysis Using Performance Monitoring Events

24 4th Generation Intel Core Microarchitecture
[Block diagram of one core.
Front end: 32K L1 instruction cache, BPU, legacy decode pipeline, decoded ICache, MSROM, micro-op queue (UOPS_ISSUED, width 4).
Out-of-order execution: allocate/rename/retirement, scheduler, load, store and reorder buffers, INT / AVX / SSE / X87 register stacks, execution units (ALU, shift, logical, X87 MUL, X87 DIV, SIMD ALU, SIMD MUL, SIMD shift, SIMD logical, FMA, branch, load data, store address, store data) (UOPS_EXECUTED); retirement (UOPS_RETIRED, width 4).
Caches: 32K L1 data cache, line fill buffers, 256K L2 unified cache, connection to the uncore.]

25 Performance Monitoring Unit (PMU)
- Available in all Intel processors: Core and Uncore, to watch events
- 1000s of events in the current CPU generation
- Performance counters can be programmed to count events through specific MSRs
- Events can be divided into the following categories, depending on how they are collected and interpreted:
  - Fixed events
  - Programmable events
  - Precise events

26 Performance Monitoring Unit: Performance Counters
Core performance monitoring of CPUs of the IVB-HSW generations
- Each core has 8 counters; 4 per thread with SMT
- Measure 7 performance events at a time (4 programmable, 3 fixed)
- Same in mobile and server CPU lines
Measure Uncore events in addition to Core events
- Distributed design with separate blocks of counters in different architectural units (MC, LLC, GT, etc.)
- Mobile and server lines have different designs
- Not thread-specific. Thread-specific counting can only be done in the core
The event names change for each processor generation, but the performance analysis concepts stay the same!

27 Performance Monitoring Unit: Event Names
PMU events of the Core architecture are typically qualified by one or more masks (qualifier, U-Mask), which create sub-events from main events via a dot notation (setting the k bits in the control register).
Example: MEM_UOPS_RETIRED.SPLIT_LOADS_PS
- MEM: an event related to the memory subsystem
- UOPS_RETIRED: counts the number of retired µops
- SPLIT_LOADS: load µops that split a cache line
- PS: precise event
Thus this event counts the number of line-split load µops retired. In total, 14 sub-events exist for the main event MEM_UOPS_RETIRED (including PS versions).

28 Event Based Performance Analysis: Event-based Sampling (EBS)
Processor events can be monitored using sampling and counting technologies.
Sampling:
- Allows profiling of all active software on the system, including operating system, device driver, and application software
- Sample: a hardware interrupt occurs after N events have been counted; N is programmable
- In a sample we automatically collect:
  - Thread and process IDs
  - Load module
  - Instruction pointer (IP)
- The instruction pointer is then used to derive the function name and source line number from the debug information created at compile time

29 How Event-based Sampling Works
[Figure: a multi-core CPU; the PMU interrupts the processor when a thread's event counter reaches N, then again at 2N, 3N, 4N, ... At each interrupt the execution context and performance data are saved.]
- N = sample after value (the number of event occurrences between samples)
- Events = samples * sample after value
At the end of a sampling session we get a statistical snapshot of the system, where we can see:
- how many and what samples were collected by each active module (binary)
- how many and what samples were collected by each function in a module
- how many and what samples were collected by a specific instruction
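The relation Events = samples * sample after value is easy to apply by hand, but a short sketch makes the statistical nature of the result explicit. The function names below are hypothetical, and the share computed by `sample_share` is only an estimate of how the events are distributed, since each sample stands for roughly N events.

```c
#include <stdint.h>

/* Estimated number of event occurrences behind a set of samples:
   every sample represents sample_after_value events.
   Hypothetical helper for illustration. */
uint64_t estimate_event_count(uint64_t samples, uint64_t sample_after_value)
{
    return samples * sample_after_value;
}

/* Fraction of all samples attributed to one function (or module, or
   instruction). Returns 0.0 when no samples were collected at all. */
double sample_share(uint64_t function_samples, uint64_t total_samples)
{
    if (total_samples == 0)
        return 0.0;
    return (double)function_samples / (double)total_samples;
}
```

For instance, 1500 samples collected with a sample after value of 2000003 correspond to about 3.0 billion events, and a function that received 375 of those samples accounts for roughly a quarter of them.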

30 How to Use Event Based Sampling: VTune Amplifier XE
Support for processor families from Intel Atom to Intel Xeon and Xeon Phi is in VTune Amplifier XE.
VTune Amplifier XE for EBS
- Driver[less] based collection
- Driver SDK for unsupported Linux OSes
- Several predefined profiles: Advanced Hotspots, General Exploration, Memory Access Bandwidth
- Timeline view integrated into all analysis types
- Source/assembly viewing
- Compatible with C/C++, Fortran, Assembly, Java, .NET C#
- Command-line or standalone interface for Windows* or Linux*, 32- or 64-bit

31 Systematically Determine the (Primary) Bottleneck
A top-down hierarchy implemented by the General Exploration analysis classifies the application's utilization of the CPU cores on the top level into 4 categories:
- Front-End Bound
- Back-End Bound
- Bad Speculation
- Retiring
The metric of CPU core utilization is what is being done in each cycle for each of the potentially 4 micro-operation slots ("pipeline slots"). The Core architecture can execute up to 4 µops per cycle!
The primary bottleneck has the highest fraction of pipeline slots, and should be investigated first!

32 Simplified Pipeline Flow
[Diagram: µops flow through three stages, four per cycle.]
- Front-End: fetch & decode instructions, predict branches (UOPS_ISSUED)
- Back-End (execution core): re-order & execute instructions (UOPS_EXECUTED)
- Retirement: retire, commit results to memory (UOPS_RETIRED)

33 Bottleneck Domain
Performance is classified according to what happened for each slot available to the application or hotspot:
- Micro-ops issued? Yes: micro-op ever retires? Yes: Retiring. No: Bad Speculation.
- Micro-ops issued? No: allocation stall? Yes: BE Bound. No: FE Bound.
What each domain means:
- FE Bound: Back-End not stalled and Front-End delivers less than 4 micro-ops / cycle
- BE Bound: memory accesses, execution, dispatch and allocation bottlenecks
- Bad Speculation: speculative execution needs to be reverted
- Retiring: successful retirement of instructions; path length consumes cycles

34 Bottleneck Domain
Transparently, VTune uses the following events and ratios to determine the pipeline slots belonging to each of the 4 domains:
    FE_Bound        = IDQ_UOPS_NOT_DELIVERED.CORE / N
    Bad_Speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS) / N
    Retiring        = UOPS_RETIRED.RETIRE_SLOTS / N
    BE_Bound        = 1 - (FE_Bound + Retiring + Bad_Speculation)
BE_Bound covers memory issues and execution issues.
Since the width of the pipe is 4 slots at key stages: N = 4 * CPU_CLK_UNHALTED.THREAD
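Given raw counts for the four events, the top-level breakdown is a handful of divisions. The sketch below follows the simplified ratios on this slide; the `topdown` struct and the `classify_slots` function are hypothetical names, and the counter values are assumed to have been collected elsewhere (for example with a profiler or a perf-based tool).

```c
#include <stdint.h>

/* Fractions of all pipeline slots in each top-level domain.
   Hypothetical type for illustration only. */
typedef struct {
    double fe_bound;
    double bad_speculation;
    double retiring;
    double be_bound;
} topdown;

/* Applies the slide's ratios to raw event counts. The parameters are
   named after the PMU events they are assumed to hold. */
topdown classify_slots(uint64_t cpu_clk_unhalted_thread,
                       uint64_t idq_uops_not_delivered_core,
                       uint64_t uops_issued_any,
                       uint64_t uops_retired_retire_slots)
{
    /* N: total pipeline slots, 4 per unhalted cycle. */
    double n = 4.0 * (double)cpu_clk_unhalted_thread;
    topdown r;

    r.fe_bound = (double)idq_uops_not_delivered_core / n;
    r.bad_speculation = ((double)uops_issued_any
                         - (double)uops_retired_retire_slots) / n;
    r.retiring = (double)uops_retired_retire_slots / n;
    /* Whatever is left over is attributed to the back end. */
    r.be_bound = 1.0 - (r.fe_bound + r.retiring + r.bad_speculation);
    return r;
}
```

With made-up counts of 1024 cycles, 1024 undelivered µops, 2560 issued µops and 2048 retired slots, the result is 25% front-end bound, 12.5% bad speculation, 50% retiring and 12.5% back-end bound.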

35 General Exploration View: Top Level

36 Top Down Approach

37 Example: Top-Down with a Memory Bound Issue Drill Down DRAM Bound Function

38 Not only in VTune: General Exploration in Open-Source PMU-Tools
- Open-source tool from Andi Kleen
- It also contains a lot of interesting references for more details

39 Links for General Exploration Analysis
Whitepaper
- "How to Tune Applications Using a Top-down Characterization of Microarchitectural Issues"
Tools
- VTune Amplifier XE 2015
- Basic support in PBA (Performance Bottleneck Analyzer)
- ocperf / toplev: a wrapper on top of the Linux perf utility
Tutorial on Analysis Methodologies and Tools, ISCA `

40 Summary
The Intel VTune Amplifier XE can be used to:
- Find the source code responsible for performance bottlenecks
- Characterize the amount of parallelism in an application
- Determine which synchronization locks or APIs are limiting the parallelism in an application
- Understand problems limiting CPU instruction-level parallelism
- Instrument user code for better understanding of the execution flow defined by threading runtimes

41 Questions?

42 Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

44 VTune Amplifier XE/OpenMP Analysis
There are several major reasons why working threads wait:
1. When the master thread is executing a serial region, the worker threads are in the OpenMP runtime waiting for the next parallel region
2. When synchronization objects are used inside a parallel region, threads can wait on a lock release, contending with other threads for a shared resource (synchronization on locks)
3. When a thread finishes a parallel region, it waits at a barrier for the other threads to finish (load imbalance)
4. When the number of loop iterations is smaller than the number of working threads, several threads from the team wait at the barrier without doing any useful work (not enough parallel work)
