Native Computing and Optimization on Intel Xeon Phi

Size: px

Start display at page:

Download "Native Computing and Optimization on Intel Xeon Phi"

Dortha Kennedy
5 years ago
Views:

1 Native Computing and Optimization on Intel Xeon Phi ISC 2015 Carlos Rosales

2 Overview Why run native? What is a native application? Building a native application Running a native application Setting affinity and pinning tasks Optimization Vectorization Alignment Parallelization Profiling with Vtune

3 What is a native application? It is an application built to run exclusively on the MIC coprocessor. MIC is not binary compatible with the host processor Instruction set is similar to Pentium, but not all 64 bit scalar extensions are included. MIC has 512 bit vector extensions, but does NOT have MMX, SSE, or AVX extensions. Native applications can t be used on the host CPU, and vice versa.

4 Why run a native application? It is possible to login and run applications on the MIC without any host intervention Easy way to get acquainted with the properties of the MIC Performance studies Single card scaling tests (OMP/MPI) No issues with data exchange with host The native code probably performs quite well on the host CPU once you build a host version

5 Building a native application Cross-compile on the host (login or compute nodes) No compilers installed on coprocessors MIC is fully supported by the Intel C/C++ and Fortran compilers (v13+): icc -openmp -mmic mysource.c -o myapp.mic ifort -openmp -mmic mysource.f90 -o myapp.mic The -mmic flag causes the compiler to generate a native mic executable It is convenient to use a.mic extension to differentiate MIC executables

6 Running a native application on Stampede Options to run from mic0 from a compute node: 1. Traditional ssh remote command execution c % ssh mic0 ls Clumsy if environment variables or directory changes needed 2. Interactively login to mic: c % ssh mic0 Then use as a normal server 3. Explicit launcher: c % micrun./a.out.mic 4. Implicit launcher: c %./a.out.mic

7 Native Application Launcher The micrun launcher has three nice features: It propagates the current working directory It propagates the shell environment (with translation) Environment variables that need to be different on host and coprocessor need to be defined using the MIC_ prefix on the host. E.g., c % export MIC_OMP_NUMTHREADS=183 c % export MIC_KMP_AFFINITY= verbose,balanced It propagates the command return code back to the host shell These features work whether the launcher is used explicitly or implicitly

8 Environmental Variables on the MIC If you ssh to mic0 and run directly from the card use the regular names: OMP_NUM_THREADS KMP_AFFINITY I_MPI_PIN_PROCESSOR_LIST If you use the launcher, use the MIC_ prefix to define them on the host: MIC_OMP_NUM_THREADS MIC_KMP_AFFINITY MIC_I_MPI_PIN_PROCESSOR_LIST You can also define a different prefix: export MIC_ENV_PREFIX=MYMIC MYMIC_OMP_NUM_THREADS...

9 Native Execution Quirks The mic runs a lightweight version of Linux Some tools are missing: w, numactl Some tools have reduced functionality: ps Relatively few libraries have been ported to the coprocessor environment Environment needs to be propagated to coprocessor This issue make the implicit or explicit launcher approach even more important

10 Best Practices For Running Native Apps Always bind processes to cores For MPI tasks (more on next presentation) I_MPI_PIN I_MPI_PIN_MODE I_MPI_PIN_PROCESSOR_LIST For threads KMP_AFFINITY={compact, scatter, balanced} KMP_AFFINITY=explicit,proclist=[0,1,2,3,4] Adding verbose will dump the full affinity information when the run starts Adding granularity=fine binds to specific thread contexts and may help in codes with heavy L1 cache reuse The MIC is a single chip, so there is no need for numactl If other affinity options can t be used the command taskset is available.

KMP_AFFINITY Example compact 1 2 3 4 5 6 7 8

11 KMP_AFFINITY Example compact balanced scatter

12 Logical to Physical Processor Mapping Hardware: Physical Cores are Logical Cores are Mapping is not what you are used to! OS proc 1 maps to package 0 core 0 thread 0 OS proc 2 maps to package 0 core 0 thread 1 OS proc 3 maps to package 0 core 0 thread 2 OS proc 4 maps to package 0 core 0 thread 3 OS proc 5 maps to package 0 core 1 thread 0 [ ] OS proc 0 maps to package 0 core 60 thread 0 OS proc 241 maps to package 0 core 60 thread 1 OS proc 242 maps to package 0 core 60 thread 2 OS proc 243 maps to package 0 core 60 thread 3 OpenMP threads start binding to logical core 1, not logical core 0 For compact mapping 240 OpenMP threads are mapped to the first 60 cores No contention for the core containing logical core 0 the core that the O/S uses most But for scatter and balanced mappings, contention for logical core 0 begins at 61 threads Not much performance impact unless O/S is very busy Best to avoid core 60 for native and symmetric jobs

13 How Do I Tune Native Applications? Vectorization and Parallelization are critical! Single-thread scalar performance: ~1 GHz Pentium Vector width is 512 bits 8 double precision values / 16 single precision values You don t want to lose factors of 8-16 in performance Compiler reports provide important information about effectiveness of compiler at vectorization Start with a simple code the compiler reports can be very long & hard to follow There are lots of options & reports! Details at: /composerxe/compiler/cpp-lin/index.htm

14 Vectorization Reports New syntax for Intel 15 and above: Set optimization report level $ -opt-report=0-5 Set optimization report phase $ -opt-report-phase=vec Set optimization report output destination $ -opt-report-file=stdout Default optimization report verbosity level is 2 Default optimization report output will be saved in a file with extension.optrpt For Intel 13 you can still use vec-report# where # is a number from1 to 7 in increasing level of verbosity.

15 Vectorization Report Characteristics Increasing verbosity levels provide more detailed diagnostic information about every loop, including Loops successfully vectorized Loops not vectorized & reasons Specific dependency info for failures to vectorize Array alignment for each loop Unrolling depth for each loop Estimated speedup factor for vectorized loop Quirks For Intel 13 most/all of the vectorization messages repeated with the line number of the call site ignore these and look at the messages with the line number of the actual loop Reported reasons for not vectorizing are not very helpful look at specific dependency info & remember about C aliasing rules This has improved significantly with Intel 15

16 Vectorization Report Example Begin optimization report for: main(int, char **) Report from: Vector optimizations [vec] LOOP BEGIN at./vector.c(34,2) remark #15527: loop was not vectorized: function call to rand(void) cannot be vectorized [./vector.c(34,33) ] LOOP END LOOP BEGIN at./vector.c(35,2) remark #15527: loop was not vectorized: function call to rand(void) cannot be vectorized [./vector.c(35,33) ] LOOP END Explicit reasons for lack of vectorization LOOP BEGIN at./vector.c(38,5) remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details remark #15346: vector dependence: assumed FLOW dependence between line 38 and line 38 LOOP END LOOP BEGIN at./vector.c(43,2) remark #15542: loop was not vectorized: inner loop was already vectorized LOOP BEGIN at./vector.c(45,3) remark #15300: LOOP WAS VECTORIZED LOOP END LOOP END Guidance for how to get additional information

17 Detailed Vectorization Report Report from: Vector optimizations [vec]. LOOP BEGIN at./vector.c(38,5) remark #15344: loop was not vectorized: vector dependence prevents vectorization remark #15346: vector dependence: assumed FLOW dependence between y line 38 and y line 38 LOOP END Variable with dependency mentioned explicitly LOOP BEGIN at./vector.c(45,3) remark #15388: vectorization support: reference x has aligned access [./vector.c(46,4) ] remark #15388: vectorization support: reference y has aligned access [./vector.c(46,4) ] remark #15388: vectorization support: reference z has aligned access [./vector.c(46,4) ] remark #15399: vectorization support: unroll factor set to 8 remark #15300: LOOP WAS VECTORIZED Alignment and unroll factor mentioned explicitly remark #15475: --- begin vector loop cost summary --- remark #15476: scalar loop cost: 11 remark #15477: vector loop cost: remark #15478: estimated potential speedup: remark #15479: lightweight vector operations: 5 remark #15488: --- end vector loop cost summary --- LOOP END Loop cost and estimated speedup

18 Detailed Vectorization Report (Host) Report from: Vector optimizations [vec]. LOOP BEGIN at./vector.c(38,5) remark #15344: loop was not vectorized: vector dependence prevents vectorization remark #15346: vector dependence: assumed FLOW dependence between y line 38 and y line 38 LOOP END LOOP BEGIN at./vector.c(45,3) remark #15388: vectorization support: reference x has aligned access [./vector.c(46,4) ] remark #15388: vectorization support: reference y has aligned access [./vector.c(46,4) ] remark #15388: vectorization support: reference z has aligned access [./vector.c(46,4) ] remark #15399: vectorization support: unroll factor set to 4 remark #15300: LOOP WAS VECTORIZED remark #15448: unmasked aligned unit stride loads: 2 remark #15449: unmasked aligned unit stride stores: 1 remark #15475: --- begin vector loop cost summary --- remark #15476: scalar loop cost: 11 remark #15477: vector loop cost: remark #15478: estimated potential speedup: remark #15479: lightweight vector operations: 4 remark #15480: medium-overhead vector operations: 1 remark #15488: --- end vector loop cost summary --- LOOP END Unroll set to 4, not 8 (AVX, 256 bit) Memory access information Much lower potential speedup

19 OpenMP complicates things Remember that OpenMP will split loops in ways that can break 64-Byte alignment alignment depends on thread count, so multiple loop version are generated by the compiler: LOOP BEGIN at./vector_omp.c(44,2) remark #15542: loop was not vectorized: inner loop was already vectorized LOOP BEGIN at./vector_omp.c(46,3) <Peeled> LOOP END LOOP BEGIN at./vector_omp.c(46,3) remark #15300: LOOP WAS VECTORIZED LOOP END LOOP BEGIN at./vector_omp.c(46,3) <Alternate Alignment Vectorized Loop> LOOP END LOOP BEGIN at./vector_omp.c(46,3) <Remainder> remark #15301: REMAINDER LOOP WAS VECTORIZED LOOP END LOOP END

20 Tuning Tools Currently there is no support for gprof when compiling native applications Profiling is supported by Intel s Vtune product Available on Stampede Example coming up! For MPI parallel applications the Intel Trace Analyzer (Itac) supports MIC execution Also available on Stampede More later

21 Starting with VTUNE Prepare the code Compile your code with "-g". Remember to add any optimization flags after the -g flag or you will revert to no optimization. Collect the performance data Load the vtune module with "module load vtune" Get an interactive session of prepare a slurm batch script Use the command line collection, for example, for hotspots on MIC: $ amplxe-cl -collect knc-hotspots -mrte-mode=native -no-summary -app-working-dir. --./myapp

22 Preset Profiling Sets MIC knc-hotspots, knc-bandwidth, knc-generalexploration Host bandwidth, general-exploration, hotspots, advanced-hotspots, snb-memory-access, snbaccess-contention, snb-port-saturation, snb-coreport-saturation

23 Important Notes You probably want to use the "-target-duration-type=medium" for mic collections, as this makes the post-processing much faster. I believe this samples every 100 ms. Use "amplxe-cl --help" to get additional help, there are may options for collection and I will not cover them all in this presentation. You can also generate text based reports, but you have to do that in the the development nodes, since vtune needs access to the license server for that: $ amplxe-cl -report summary./r000hs You may want to use the linux "xrandr" command to change the size of the remote desktop that is opened.

24 Plain Text Reports amplxe-cl -report summary r./r000hs General Exploration Metrics Parameter r000hs CPU Time Clockticks Instructions Retired CPI Rate Cache Usage 0.0 Vectorization Usage 0.0 TLB Usage 0.0 Collection and Platform Info Parameter r000hs Application Command Line./vec.mic User Name carlos Operating System Intel MIC Platform Software Stack (Built by Poky 7.0) 3.3 \n \l Computer Name c mic0.stampede.tacc.utexas.edu Result Size

25 Plain Text Reports (II) CPU --- Parameter r000hs Name Intel(R) Xeon(R) E5 processor Frequency Logical CPU Count 244 Summary Elapsed Time: CPU Time: Average CPU Usage: CPI Rate: Event summary Hardware Event Type Hardware Event Count:Self Hardware Event Sample Count:Self Events Per Sample CPU_CLK_UNHALTED INSTRUCTIONS_EXECUTED

26 Analyzing the Collected Performance Data Open a new stampede session in a different terminal Establish a vnc session. Find out how (if you have not done it before) in the Stampede user guide under "remote desktop access": Fire up the vtune GUI and point to the collected data (typically a directory named r00*): $ amplxe-gui./r000hs

27 Open Collected Data

28 Summary Tab (Hotspots) CPU usage histogram is a good tool to judge overall device utilization Looks terrible because this is serial code!

29 Summary Tab (General Exploration) No binding specified for this run, so we see some jumping around Very small case, so 100% L1 hits and no L2 (read) misses

30 Looking at the Bottlenecks High level overview, need to filter results Bottom panel allows to filter our application only (next slide)

31 Filtered Data Filtered data shows only our application information Multiple threads shown separately in the bottom panel for OMP code Time to look at the source code!

32 Source View Extremely useful and easy to use Once the hotspots are identified, you can examine the assembly code

33 Assembly View Selected source lines correspond to shaded assembly lines on the right hand side panel Easiest way I know of of correlating hotspots with generated assembly!

34 Generating New Projects

35 Specifying the Analysis Type Lots of options here, including a handy command line button at the bottom

36 Command Line Options

37 Performance Tuning Notes Xeon Phi always has multithreading enabled Four thread contexts per physical core Registers are replicated L1D, L1I, and (private, unified) L2 caches are shared Vector Unit instruction issue limitation: The vector unit can issue one instruction per cycle L1D Cache can deliver 64 Bytes (1 vector register) every cycle But a thread can issue a vector instruction every other cycle Need at least two threads to fully utilize the vector unit Using 3-4 threads does not increase maximum issue rate, but often helps tolerate latency

38 Intel reference material Main Software Developers web page: A list of links to very good training material at: Many answers can also be found in the Intel forums: Specific information about building and running native applications: Debugging:

39 More Intel reference Material Search for these at by document number This is more likely to get the most recent version than searching for the document number via Google. Primary Reference: Intel Xeon Phi Coprocessor System Software Developers Guide (document or ) Advanced Topics: Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual (document ) Intel Xeon Phi Coprocessor (codename: Knights Corner) Performance Monitoring Units (document ) WARNING: Intel sometimes describes the number of vector registers as 16 in this documents. The actual number is 32.

40 Questions? For more information:

Native Computing and Optimization. Hang Liu December 4 th, 2013

Native Computing and Optimization Hang Liu December 4 th, 2013 Overview Why run native? What is a native application? Building a native application Running a native application Setting affinity and pinning