Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature

Intel Software Developer Conference, Frankfurt, 2017
Klaus-Dieter Oertel, Intel

Agenda
- Intel Advisor for vectorization optimization
- What is the theoretical roofline model?
- How is it implemented in Advisor?
- Some examples
- Resources

Changing Hardware Impacts Software
More Cores, More Threads, Wider Vectors

Processor                           Cores  Threads  SIMD width (bits)
Intel Xeon Processor 64-bit         1      2        128
Intel Xeon 5100 series              2      2        128
Intel Xeon 5500 series              4      8        128
Intel Xeon 5600 series              6      12       128
Intel Xeon E5-2600                  8      16       256
Intel Xeon E5-2600 v2               12     24       256
Intel Xeon E5-2600 v3               18     36       256
Intel Xeon E5-2600 v4               22     44       256
Intel Xeon Platinum 8180 processor  28     56       512
Intel Xeon Phi (Knights Landing)    72     288      512

High performance software must be both:
- Parallel (multi-thread, multi-process)
- Vectorized

*Product specification for launched and shipped products available on ark.intel.com.
Copyright 2017, Intel Corporation. All rights reserved.

Vectorize and Thread for Dramatic Performance Gains
Together they are more effective than either one alone
- Automatic vectorization is not enough: explicit pragmas and optimization are often required
- The difference is growing with each new generation of hardware

[Chart: performance of Serial, Vectorized, Threaded, and Vectorized & Threaded code on successive Intel Xeon processors; the Vectorized & Threaded series is labeled 187x.]

Processors shown in the chart:
- 2007: X5472 (codenamed Harpertown)
- 2009: X5570 (Nehalem)
- 2010: X5680 (Westmere)
- 2012: E5-2600 (Sandy Bridge)
- 2013: E5-2600 v2 (Ivy Bridge)
- 2014: E5-2600 v3 (Haswell)
- 2016: E5-2600 v4 (Broadwell)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configurations at the end of this presentation.

Intel Advisor: Vectorization Advisor
Get breakthrough vectorization performance

Faster vectorization optimization:
- Vectorize where it will pay off most
- Quickly identify what is blocking vectorization
- Tips for effective vectorization
- Safely force compiler vectorization
- Optimize memory stride

The data and guidance you need:
- Compiler diagnostics + performance data + SIMD efficiency
- Detect problems and recommend fixes
- Loop-carried dependency analysis
- Memory access patterns analysis

Optimize for AVX-512 with or without access to AVX-512 hardware.

Part of Intel Parallel Studio XE
http://intel.ly/advisor-xe

The Right Data At Your Fingertips
Get all the data you need for high impact vectorization
- Filter by which loops are vectorized
- Trip counts
- What prevents vectorization?
- Focus on hot loops
- What vectorization issues do I have?
- Which vector instructions are being used?
- How efficient is the code?

Get faster code faster!

Find Effective Optimization Strategies
Intel Advisor: Cache-aware roofline analysis (new!)

Roofs show platform limits:
- Memory, cache and compute limits

Dots are loops:
- Bigger, red dots take more time, so optimization has a bigger impact
- Dots farther from a roof have more room for improvement

Higher dot = higher GFLOPs/sec:
- Optimization moves dots up
- Algorithmic changes move dots horizontally

Which loops should we optimize? (dots labeled A to H in the chart)
- A and G are the best candidates
- B has room to improve, but will have less impact
- E, C, D, and H are poor candidates

See also: Roofline tutorial video


What is the roofline model?
Do you know how fast you should run?
- Comes from Berkeley
- Performance is limited by equations/implementation and by code generation/hardware
- 2 hardware limitations: PEAK Flops and PEAK Bandwidth
- The application performance is bounded by hardware specifications

$\text{Gflop/s} = \min(\text{Platform PEAK},\ \text{Platform BW} \times \text{AI})$

where AI is the Arithmetic Intensity (Flops/Bytes).
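The bound above can be written as a one-line function. This is a minimal sketch; the function and parameter names are illustrative and do not come from the slides or from any Intel tool.

```c
/* Attainable performance (Gflop/s) under the basic roofline model.
 * peak_gflops: platform peak compute (Gflop/s)
 * bw_gbs:      platform peak memory bandwidth (GB/s)
 * ai:          arithmetic intensity of the kernel (flop/byte)
 * All names are hypothetical, chosen for this illustration. */
double roofline_gflops(double peak_gflops, double bw_gbs, double ai)
{
    double memory_bound = bw_gbs * ai;  /* sloped part of the roof */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

A kernel with low arithmetic intensity lands on the sloped (bandwidth) part of the roof, and a kernel with high intensity lands on the flat (compute) part.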

Platform PEAK Flops
How many floating point operations per second

$\text{Gflop/s} = \min(\text{Platform PEAK},\ \text{Platform BW} \times \text{AI})$

The theoretical value can be computed from the specification. Example with 2 sockets Intel Xeon Processor E5-2697 v2:

PEAK FLOP = 2 x 2.7 x 12 x 8 x 2 = 1036.8 Gflop/s

- 2: number of sockets
- 2.7: core frequency (GHz)
- 12: number of cores per socket
- 8: number of single precision elements in a SIMD register
- 2: 1 port for addition, 1 for multiplication

A more realistic value can be obtained by running Linpack: =~ 930 Gflop/s on a 2-socket Intel Xeon Processor E5-2697 v2.
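The peak-flops product can be checked with a small helper. This is a sketch under the slide's assumptions (one add port and one multiply port per core, every lane busy every cycle); the function name and parameters are hypothetical.

```c
/* Theoretical peak single-precision Gflop/s from the specification.
 * The factors follow the slide; the names are made up for this sketch. */
double peak_gflops(int sockets, double core_ghz, int cores_per_socket,
                   int simd_lanes, int fp_ports)
{
    return sockets * core_ghz * cores_per_socket * simd_lanes * fp_ports;
}
```

Calling `peak_gflops(2, 2.7, 12, 8, 2)` reproduces the 1036.8 Gflop/s figure for the 2-socket E5-2697 v2 example.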

Platform PEAK bandwidth
How many bytes can be transferred per second

$\text{Gflop/s} = \min(\text{Platform PEAK},\ \text{Platform BW} \times \text{AI})$

The theoretical value can be computed from the specification. Example with 2 sockets Intel Xeon Processor E5-2697 v2:

PEAK BW = 2 x 1.866 x 8 x 4 = 119 GB/s

- 2: number of sockets
- 1.866: memory frequency
- 8: bytes per channel
- 4: number of memory channels

A more realistic value can be obtained by running Stream: =~ 100 GB/s on a 2-socket Intel Xeon Processor E5-2697 v2.
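The bandwidth product can be sketched the same way. The function name and parameters are hypothetical, and the memory frequency is passed in the same unit as on the slide (1.866).

```c
/* Theoretical peak memory bandwidth (GB/s) from the specification.
 * Parameter names are illustrative, not taken from any tool. */
double peak_bandwidth_gbs(int sockets, double mem_freq_ghz,
                          int bytes_per_channel, int channels_per_socket)
{
    return sockets * mem_freq_ghz * bytes_per_channel * channels_per_socket;
}
```

Calling `peak_bandwidth_gbs(2, 1.866, 8, 4)` gives about 119.4 GB/s, which the slide rounds to 119 GB/s.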

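Drawing the Roofline
Defining the speed of light

$\text{Gflop/s} = \min(\text{Platform PEAK},\ \text{Platform BW} \times \text{AI})$

2 sockets Intel Xeon Processor E5-2697 v2: Peak Flop = 1036 Gflop/s, Peak BW = 119 GB/s

[Figure: roofline plot, Gflop/s versus AI [Flop/B], with the compute roof at 1036 Gflop/s.]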

What is the performance boundary?
Manual way to do it: manual counting on matrix/matrix multiplication

    for(i=0; i<n; i++)
        for(j=0; j<n; j++)
            for(k=0; k<n; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

- # add = N * N * N
- # mul = N * N * N
- # Read = 3 * N * N * 4 bytes
- # Write = N * N * 4 bytes

$\text{AI} = \frac{2N^3}{16N^2} = \frac{N}{8}$
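The manual count can be reproduced numerically. The sketch below uses exactly the slide's counting (N^3 adds, N^3 multiplies, 3 N^2 reads and N^2 writes of 4-byte floats); the function name is hypothetical.

```c
/* Arithmetic intensity of single-precision matrix multiplication,
 * counted as on the slide. Simplifies to n / 8. */
double sgemm_ai(double n)
{
    double flops = 2.0 * n * n * n;              /* adds + multiplies */
    double bytes = (3.0 * n * n + n * n) * 4.0;  /* reads + writes, 4-byte floats */
    return flops / bytes;
}
```

For example, `sgemm_ai(8.0)` evaluates to 1.0 flop/byte and `sgemm_ai(16.0)` to 2.0, matching N/8.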

Compute the maximum performance
BW * Arithmetic Intensity

$\text{Gflop/s} = \min(\text{Platform PEAK},\ \text{Platform BW} \times \text{AI})$

2 sockets Intel Xeon Processor E5-2697 v2: Peak Flop = 1036 Gflop/s, Peak BW = 119 GB/s

For sgemm, $\text{AI} = N/8$. If N = 8, AI = 1, so sgemm should not be able to perform better than 119 Gflop/s on a 2-socket Ivy Bridge.

[Figure: roofline plot, Gflop/s versus AI [Flop/B]; at AI = 1 the bandwidth roof gives 119 Gflop/s, and the two roofs meet at AI = 8.7 and 1036 Gflop/s.]
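The two numbers read off the plot (the bound at AI = 1 and the point where the roofs meet, at AI = 8.7) can be recomputed as follows. This is an illustrative sketch; both function names are assumptions, and the sgemm intensity N/8 is the slide's manual count.

```c
/* Arithmetic intensity at which the bandwidth roof meets the compute roof. */
double ridge_point(double peak_gflops, double bw_gbs)
{
    return peak_gflops / bw_gbs;
}

/* Roofline bound for sgemm of size n, using AI = n / 8 from the manual count. */
double sgemm_bound_gflops(double peak_gflops, double bw_gbs, double n)
{
    double memory_bound = bw_gbs * (n / 8.0);
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With the slide's values, `ridge_point(1036.0, 119.0)` is about 8.7 flop/byte and `sgemm_bound_gflops(1036.0, 119.0, 8.0)` is 119 Gflop/s. Under this model, sgemm needs N of roughly 70 or more before the compute roof becomes the limit.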

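And NOW? How to get better performance?
- Vectorization + threading
- Optimize memory access

[Figure: roofline plot, Gflop/s versus AI [Flop/B], with ticks at AI = 1 and 8.7 and at 119 and 1036 Gflop/s, annotated with the two optimization directions.]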


Roofline in Intel Advisor
The cache-aware roofline model

Intel Advisor implements a Cache-Aware Roofline Model (CARM):
- Algorithmic, cumulative (L1+L2+LLC+DRAM) traffic-based
- Invariant for the given code / platform combination

How does it work?
- Counts every memory movement
- Bytes and Flops -> Instrumentation
- Time -> Sampling

Model variants:
- CARM: Cache-Aware Roofline Model
- DRAM: DRAM-aware Roofline Model
- TRAM: Theoretical Roofline Model

Typically AI_CARM < AI_DRAM < AI_TRAM
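The three measured quantities (flops, bytes, time) are enough to place a loop on the chart. The sketch below shows the arithmetic only; it is not Advisor's implementation, and the struct and function names are made up for illustration.

```c
/* Position of one loop on a roofline chart. */
struct roofline_point {
    double ai;      /* x axis: flop/byte */
    double gflops;  /* y axis: Gflop/s  */
};

/* flops and bytes would come from instrumentation, seconds from sampling. */
struct roofline_point loop_point(double flops, double bytes, double seconds)
{
    struct roofline_point p;
    p.ai = flops / bytes;
    p.gflops = flops / seconds / 1e9;
    return p;
}
```

Because a cache-aware count includes every memory movement, the byte count is at least as large as a DRAM-only count, which is consistent with AI_CARM being the smallest of the three intensities.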

Understanding the roofline in Intel Advisor

[Figure: Advisor roofline chart with one region labeled "Purely compute bound" and one labeled "Purely cache/DRAM bound".]


Roofline model and compiler optimizations

Roofline model and optimizations
Matrix/matrix addition

    void addition(float* a, float* b, float* c, int size){
        int i, j;
        for(j=0; j<size; j++){
            for(i=0; i<size; i++){
                c[i*size + j] = a[i*size + j]+b[i*size + j];
            }
        }
    }

Let's have a look at the roofline model.
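Before looking at the chart, the arithmetic intensity of this kernel can be estimated by hand: each inner iteration does 1 addition and moves 2 loads plus 1 store of a 4-byte float. The helper below is a sketch of that count, assuming every load and store is counted (as in a cache-aware model); its name and parameters are hypothetical.

```c
#include <stddef.h>

/* Flop/byte for one loop iteration, counting every load and store. */
double kernel_ai(double flops_per_iter, int loads, int stores, size_t elem_size)
{
    double bytes_per_iter = (double)(loads + stores) * (double)elem_size;
    return flops_per_iter / bytes_per_iter;
}
```

`kernel_ai(1.0, 2, 1, sizeof(float))` is 1/12, about 0.083 flop/byte. With the 119 GB/s theoretical DRAM bandwidth used earlier, that would cap the loop at roughly 9.9 Gflop/s, so this kernel is expected to sit far to the left of the chart, under the memory roofs.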

Roofline model and optimizations
Compilation with O1: very poor performance, far from the DRAM roofline!

Roofline model and optimizations
Let's look at the Memory Access Patterns analysis: a constant stride is found. The inner loop walks down a column, so consecutive iterations are size elements apart, and the two loops should be interchanged.
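A possible fix is to interchange the loops so that the inner loop runs over the contiguous index j. This is a sketch of that change, not the code shown in the talk; the function name and the const qualifiers are additions for illustration.

```c
/* Same matrix addition with the loops interchanged: the inner loop now
 * accesses consecutive elements (unit stride) instead of jumping by size. */
void addition_interchanged(const float *a, const float *b, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            c[i * size + j] = a[i * size + j] + b[i * size + j];
        }
    }
}
```

The result is identical to the original function because every element of c is computed independently; only the traversal order changes.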

Roofline model and optimizations Compilation with O3

Vectorization of Loop carried dependency

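Vectorization of loop carried dependency
Loop carried dependency

    /* pad is assumed to be defined elsewhere (it is not a parameter here) */
    void addition(float* a, float* b, float* c, int size){
        int i, j;
        for(i=0; i<size; i++){
            for(j=pad; j<size; j++){
                /* each element depends on the one written pad iterations earlier */
                c[i*size + j] = a[i*size + j]+c[i*size + j-pad];
            }
        }
    }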


Vectorization of loop carried dependency
Loop carried dependency

    /* pad is assumed to be defined elsewhere (it is not a parameter here) */
    void addition(float* a, float* b, float* c, int size){
        int i, j;
        for(i=0; i<size; i++){
            /* In this case, we assume that pad >= 4 */
            #pragma omp simd safelen(4)
            for(j=pad; j<size; j++){
                c[i*size + j] = a[i*size + j]+c[i*size + j-pad];
            }
        }
    }
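To check that safelen(4) preserves the result, the annotated loop can be compared with a plain serial version. The sketch below is self-contained and fixes pad to a compile-time constant PAD = 4, which is an assumption for this example (the slide leaves pad undefined); the function names are also illustrative.

```c
enum { PAD = 4 };  /* assumed dependency distance; must be >= safelen */

/* Version with the OpenMP SIMD hint: at most 4 consecutive iterations
 * may be executed together, which is safe because iteration j only
 * reads the element written by iteration j - PAD. */
void addition_dep_simd(const float *a, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        #pragma omp simd safelen(4)
        for (int j = PAD; j < size; j++) {
            c[i * size + j] = a[i * size + j] + c[i * size + j - PAD];
        }
    }
}

/* Reference version without any pragma. */
void addition_dep_serial(const float *a, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        for (int j = PAD; j < size; j++) {
            c[i * size + j] = a[i * size + j] + c[i * size + j - PAD];
        }
    }
}
```

Both functions should produce the same array when started from the same input. If the compiler is not asked to honor OpenMP SIMD directives, the pragma is simply ignored and the loop runs serially.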


Vectorization of loop carried dependency
Safelen was 4

Vectorization of function call

Vectorization of a function call with OMP
A function call inside a loop can prevent vectorization.

    for(int i=0; i<size; i++){
        for(int j=0; j<size; j++){
            single_line_addition(a, c, i*size + j);
        }
    }

    //function is defined in another compilation unit
    void single_line_addition(float* a, float* c, int ind){
        c[ind] = a[ind]+c[ind];
    }


Vectorization of a function call with OMP
Advisor tells you that this pattern can be a problem and proposes a solution.

Vectorization of a function call with OMP
omp declare simd

    for(int i=0; i<size; i++){
        #pragma omp simd
        for(int j=0; j<size; j++){
            single_line_addition(a, c, i*size + j);
        }
    }

    #pragma omp declare simd uniform(a, c) linear(ind)
    void single_line_addition(float* a, float* c, int ind);
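The two fragments above can be combined into one file for experimentation. This is a sketch: the wrapper function `add_all` is hypothetical, and the function is defined in the same file here, whereas the talk keeps it in another compilation unit. In that case the `declare simd` directive is expected on the definition as well as on the declaration, so that a vector variant is generated.

```c
/* uniform(a, c): the pointers are the same for all SIMD lanes.
 * linear(ind):   ind increases by 1 from one lane to the next. */
#pragma omp declare simd uniform(a, c) linear(ind)
void single_line_addition(float *a, float *c, int ind)
{
    c[ind] = a[ind] + c[ind];
}

/* Hypothetical driver: adds a into c element by element. */
void add_all(float *a, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        #pragma omp simd
        for (int j = 0; j < size; j++) {
            single_line_addition(a, c, i * size + j);
        }
    }
}
```

Without OpenMP SIMD support enabled in the compiler, both pragmas are ignored and the code still computes the same result serially.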


Vectorization of a function call with OMP

[Figure: Before / After comparison.]


References
- Roofline model proposed by Williams, Waterman, Patterson: http://www.eecs.berkeley.edu/~waterman/papers/roofline.pdf
- Cache-aware Roofline model: Upgrading the loft (Ilic, Pratas, Sousa, INESC-ID/IST, Technical University of Lisbon): http://www.inesc-id.pt/ficheiros/publicacoes/9068.pdf

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

Resources

Intel Advisor, Threading Design & Prototyping:
- Product page: overview, features, FAQs, support
- Training materials: movies, tech briefs, documentation
- Evaluation guides: step by step walk through
- Reviews

Additional analysis tools:
- Intel VTune Amplifier: performance profiler
- Intel Inspector: memory and thread checker / debugger

Additional development products:
- Intel Software Development Products
- Intel Distribution for Python*: accelerated Python distribution

Code that performs and outperforms
Download a free, 30-day trial of Intel Parallel Studio XE 2018 today: software.intel.com/en-us/intel-parallel-studio-xe

And don't forget to fill out the evaluation survey:
- via a URL that will be provided at the end of the day, OR
- watch your email for a link to the survey

P.S. Everyone who fills out the survey will receive a personalized certificate indicating completion of the training!

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Advisor works with GCC and Microsoft Compilers
Adds bonus capabilities with the Intel Compiler

Advisor using GCC, Microsoft or Intel Compiler:
- Finds un-vectorized loops
- Analyzes SIMD, AVX, AVX2, AVX-512
- Dependency Analysis: safely force vectorization with a pragma
- Memory Access Pattern Analysis: optimize stride and caching
- Trip Counts
- FLOPS metrics with masking
- Roofline Analysis: balance memory vs. compute optimization

Intel Compiler adds:
- Usually better optimized vectorization
- Better compiler optimization messages

Intel Advisor with Intel Compiler adds:
- Finds inefficiently vectorized loops and estimates performance gain
- Compiler optimization report messages displayed on the source
- More tips for improving vectorization
- Optimize for AVX-512 even without AVX-512 hardware

Configurations for 2007-2016 Benchmarks
Platform Hardware and Software Configuration

Hardware:

Platform                                 Unscaled core freq.  Cores/socket  Sockets  L1 data cache  L2 cache  L3 cache  Memory  Memory freq.  Memory access
Intel Xeon 5472 Processor                3.0 GHz              4             2        32K            6 MB      None      32 GB   800 MHz       UMA
Intel Xeon X5570 Processor               2.9 GHz              4             2        32K            256K      8 MB      48 GB   1333 MHz      NUMA
Intel Xeon X5680 Processor               3.33 GHz             6             2        32K            256K      12 MB     48 GB   1333 MHz      NUMA
Intel Xeon E5 2690 Processor             2.9 GHz              8             2        32K            256K      20 MB     64 GB   1600 MHz      NUMA
Intel Xeon E5 2697v2 Processor           2.7 GHz              12            2        32K            256K      30 MB     64 GB   1867 MHz      NUMA
Intel Xeon E5 2600v3 Processor           2.2 GHz              18            2        32K            256K      46 MB     128 GB  2133 MHz      NUMA
Intel Xeon E5 2600v4 Processor (18-core) 2.3 GHz              18            2        32K            256K      46 MB     256 GB  2400 MHz      NUMA
Intel Xeon E5 2600v4 Processor (22-core) 2.2 GHz              22            2        32K            256K      56 MB     128 GB  2133 MHz      NUMA

Software and settings:

Platform                                 H/W prefetchers  HT enabled  Turbo enabled  C states  O/S name    Operating system        Compiler version
Intel Xeon 5472 Processor                Y                N           N              Disabled  Fedora 20   3.11.10-301.fc20        icc version 14.0.1
Intel Xeon X5570 Processor               Y                Y           Y              Disabled  Fedora 20   3.11.10-301.fc20        icc version 14.0.1
Intel Xeon X5680 Processor               Y                Y           Y              Disabled  Fedora 20   3.11.10-301.fc20        icc version 14.0.1
Intel Xeon E5 2690 Processor             Y                Y           Y              Disabled  Fedora 20   3.11.10-301.fc20        icc version 14.0.1
Intel Xeon E5 2697v2 Processor           Y                Y           Y              Disabled  RHEL 7.1    3.10.0-229.el7.x86_64   icc version 14.0.1
Intel Xeon E5 2600v3 Processor           Y                Y           Y              Disabled  Fedora 20   3.13.5-202.fc20         icc version 14.0.1
Intel Xeon E5 2600v4 Processor (18-core) Y                Y           Y              Disabled  RHEL 7.0    3.10.0-123.el7.x86_64   icc version 14.0.1
Intel Xeon E5 2600v4 Processor (22-core) Y                Y           Y              Disabled  CentOS 7.2  3.10.0-327.el7.x86_64   icc version 14.0.1