Intel Xeon Phi Workshop. Bart Oldeman, McGill HPC May 7, 2015

1 Intel Xeon Phi Workshop Bart Oldeman, McGill HPC May 7, 2015 1

2 Online Slides OR 2

3 Outline Login and Setup Overview of Xeon Phi Interacting with Xeon Phi Linux shell Ways to program the Xeon Phi Native Programming Offload Programming Choosing how to run your code 3

4 Exercise 0: Login and Setup Please use class accounts to access reserved resources $ ssh class##@guillimin.hpc.mcgill.ca (enter password) [class##@lg-1r17-n01]$ module add ifort_icc/15.0 ifort_icc module: in Intel's documentation you will see instructions to run 'source compilervars.sh'; on Guillimin, this script is replaced by this module 4

5 Exercise 0: Workshop Files Please copy the workshop files to your home directory $ cp -R /software/workshop/phi/* ~/. Contains: Code for the exercises An example submission script Solutions to the exercises 5

6 Exercise 1: Interactive Session For the workshop, we want interactive access to the Phis [class01@lg-1r17-n01]$ qsub -I -l nodes=1:ppn=16:mics=2 -l walltime=8:00:00 [class01@aw-4r13-n01]$ module add ifort_icc/15.0 6

7 Exercise 2: OpenMP on CPU Compile the OpenMP program axpy_omp.c $ icc -openmp -o axpy axpy_omp.c Run this program on the aw node $ ./axpy This program ran on the host Use the OMP_NUM_THREADS environment variable to control the number of parallel threads 7
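For orientation, a minimal sketch of what an OpenMP axpy kernel of this kind might look like (this is not the actual axpy_omp.c from the workshop files; the vector length and values are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    long i, n = 1 << 24;                 /* illustrative vector length */
    float a = 2.0f;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* y = a*x + y, split across OMP_NUM_THREADS threads */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("OpenMP threads: %d, y[0] = %f\n", omp_get_max_threads(), y[0]);
    free(x); free(y);
    return 0;
}

Compiled with icc -openmp, the loop above scales with OMP_NUM_THREADS exactly as in the exercise.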

8 What is a Xeon Phi? 8

9 What is a Xeon Phi? A device for handling computationally expensive hot spots in your code (a 'co-processor' or 'accelerator') A large number of low-powered but low-cost (in computational overhead, power, size, and money) processors (modified Pentium cores) Supercomputer on a chip: teraflops through massive parallelism (dozens to hundreds of parallel threads) Heterogeneous computing: host and Phi can work together on the problem 9

10 What was the ASCI Red? 1997: the first teraflop supercomputer, with the same compute power as a single Xeon Phi 4,510 nodes (9,298 processors), 1,212 GB of total RAM, 12.5 TB of disk storage 850 kW vs. 225 W for a Xeon Phi 10

11 Performance vs. parallelism (c) 2013 Jim Jeffers and James Reinders, used with permission. 11

12 Performance vs. parallelism (c) 2013 Jim Jeffers and James Reinders, used with permission. 12

13 Performance vs. parallelism (c) 2013 Jim Jeffers and James Reinders, used with permission. 13

14 Performance vs. parallelism (c) 2013 Jim Jeffers and James Reinders, used with permission. 14

15 Terminology MIC = Many Integrated Cores (Intel-developed architecture) GPU = Graphics Processing Unit (the Xeon Phi is not a GPU, but we will refer to GPUs) Possibly confusing terminology: Architecture: Many Integrated Cores (MIC) Product name (uses the MIC architecture): Intel Xeon Phi Development codename: Knights Corner The 'device' (in contrast to the 'host') The target of an offload statement 15

16 MIC architecture under the hood 16

17 MIC architecture under the hood (c) 2013 Jim Jeffers and James Reinders, used with permission. 17

18 Xeon Phis on Guillimin Nodes 50 Xeon Phi nodes, 2 devices per node = 100 Xeon Phis 2 x Intel Sandy Bridge EP E5 (8-core, 2.6 GHz, 20 MB cache, 115 W) 64 GB RAM Cards 2 x Intel Xeon Phi 5110P 60 cores, 1.053 GHz, 30 MB cache, 8 GB memory (GDDR5), Peak SP FP: 2.0 TFlops, Peak DP FP: 1.0 TFlops (= 1.053 GHz * 60 cores * 8 vector lanes * 2 flops/FMA) 18

19 Comparisons Notes: Chart denotes theoretical maximum values. Actual performance is application dependent The K20 GPU has 13 streaming multiprocessors (SMXs) with 2496 CUDA cores, not directly comparable to x86 cores The K20 GPU and Xeon Phi have GDDR5 memory, the Sandy Bridge has DDR3 memory Accelerator workloads can be shared with the host CPUs 19

20 Benchmark Tests Matrix multiplication results SE10P is a Xeon Phi coprocessor with slightly higher specifications than the 5110P Source: Saule et al. 20

21 Benchmark Tests Embarrassingly parallel financial Monte-Carlo Iterative financial Monte-Carlo with regression across all paths The Tesla GPU is a K20X, which has slightly higher specifications than the K20 Source: xcelerit blog, Sept. 4, 2013 21

22 How can accelerators help you do science? Two ways of thinking about speedup from parallelism: 1: Compute a fixed-size problem faster Amdahl's law describes diminishing returns from adding more processors 2: Choose larger problems in the time you have Gustafson's law: Problem size can often scale linearly with number of processors 22
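As an aside (not on the slides), with parallel fraction p and N processors the two laws can be written as:

Amdahl:    S(N) = 1 / ((1 - p) + p/N), bounded by 1 / (1 - p) no matter how large N gets
Gustafson: S(N) = (1 - p) + p * N, i.e. the achievable (scaled) speedup keeps growing if the problem grows with N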

23 Ways to use accelerators Accelerated application Libraries Directives and Pragmas (OpenMP) Explicit parallel programming (OpenMP, MPI, OpenCL, TBB, etc.) Increasing effort 23

24 MIC Software Stack 24

25 MIC Linux Libraries The following Linux Standard Base (LSB) libraries are available on the Xeon Phi (Library - Purpose):
glibc - GNU C standard library
libc - The C standard library
libm - The math library
libdl - Dynamic linking
librt - POSIX real-time library (shared memory, time, etc.)
libcrypt - Passwords, encryption
libutil - Utility functions
libstdc++ - GNU C++ standard library
libgcc_s - Low-level functions for gcc
libz - Lossless compression
libcurses - Displaying characters in terminal
libpam - Authentication
25

26 Focus For Today Some - Accelerated Applications and Libraries Incredibly useful for research, relatively easy to use Will not teach you much about how Xeon Phis work Some - Explicit Programming Device supports your favourite parallel programming models We will keep the programming simple for the workshop Yes - Directives/Pragmas and compilation Will teach you about Xeon Phis We will focus mainly on OpenMP and MPI as parallel programming models 26

27 Scheduling Xeon Phi Jobs Workshop jobs run on a single node + one or two Xeon Phi devices $ qsub -l nodes=1:ppn=16:mics=2 $ qsub ./subjob.sh Example: submiccheck.sh 27

28 Exercise 3: Interacting with Phi Use your interactive session on the Phi nodes Log in to the Phi cards: $ ssh mic0 Try some of your favourite Linux commands: $ cat /proc/cpuinfo | less $ cat /proc/meminfo | less $ cat /etc/issue $ uname -a $ env How many cores are available? How much memory is available? What operating system is running? What special environment variables are set? 28

29 Filesystem on Phi The guillimin GPFS filesystem is mounted on the Xeon Phis using NFS You can access your home directory, project space(s), scratch, and the /software directory In general, reading and writing to the file system from the phi is very slow Performance tip: minimize data transfers (and therefore file system use) from the Phi and use /tmp for temporary files (file system in 8 GB memory) 29

30 Native mode and Offload mode There are two main ways to use the phi Compile a program to run directly on the device (native mode) Compile a program to run on the CPU, but offload hotspots to the device (offload mode, heterogeneous computing) Offload is more versatile Uses resources of both node and device 30

31 Automatic Offload mode Offloading linear algebra to the Phi using MKL Only need to set MKL_MIC_ENABLE=1 Can be used by Python, R, Octave, Matlab, etc., or from Intel C/C++/Fortran using the -mkl switch Only effective for large matrices, at least 512x512 to 9216x9216, depending on the function Uses both node and device Example: module python/2.7.3-mkl 31
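To illustrate the C/C++ route (a sketch, not one of the workshop files; the matrix size and values are arbitrary), a DGEMM call through MKL. With MKL_MIC_ENABLE=1 set in the environment, MKL may transparently offload this call to the Phi:

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void) {
    int n = 8192;                        /* large enough for automatic offload */
    long i;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0*A*B + 0.0*C; MKL decides whether to use the coprocessor */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}

Built with icc -mkl and run with MKL_MIC_ENABLE=1 (and optionally OFFLOAD_REPORT=1) set, this behaves like the matmul.py example in the next exercise.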

32 Exercise 4: Automatic offload See the script matmul.py, multiplying two random 8192x8192 matrices Run on host only: $ module add python/2.7.3-MKL $ python matmul.py 8192, , Use automatic offload: $ export MKL_MIC_ENABLE=1 $ export OFFLOAD_REPORT=1 $ python matmul.py 8192, , Now experiment with smaller and larger values of M, N, and K. 32

33 Exercise 5: Native Compilation axpy_omp.c is a regular OpenMP vector a*x+y program Only Phi-specific code is within an #ifdef OFFLOAD pre-compiler conditional Use the compiler option -mmic to compile for native MIC execution: $ icc -o axpy_omp.mic -mmic -openmp axpy_omp.c Attempt to run this program on the CPU; the Linux kernel automatically runs it on the Phi as micrun ./axpy_omp.mic, where /usr/bin/micrun is a shell script Attempt to run this program via ssh: $ ssh mic0 ./axpy_omp.mic It fails. Why? 33

34 Exercise 5: Native Compilation Using plain ssh, the library paths are not set up properly! Alternatively, use micnativeloadex to copy libraries from the host and execute the program on the MIC: $ micnativeloadex ./axpy_omp.mic Environment variables can be changed via the MIC_ prefix: MIC_OMP_NUM_THREADS=60 ./axpy_omp.mic The device number is selected with the OFFLOAD_DEVICES variable (e.g. 1): $ OFFLOAD_DEVICES=1 ./axpy_omp.mic 34

35 Results Host: $ icc -openmp -o axpy axpy_omp.c; ./axpy OpenMP threads: 16 GFLOPS = , SECS = , GFLOPS per sec = Offload: $ icc -DOFFLOAD -openmp -o axpy_offload axpy_omp.c; OMP_PLACES=threads ./axpy_offload OpenMP threads: 16 OpenMP 4 Offload GFLOPS = , SECS = 5.959, GFLOPS per sec = Native: $ icc -openmp -o axpy.mic -mmic axpy_omp.c; MIC_OMP_PLACES=threads ./axpy.mic OpenMP threads: 240 GFLOPS = , SECS = 5.317, GFLOPS per sec = Speedup vs. the host (16 Sandy Bridge cores): 1.9x for offload to Xeon Phi, 2.1x for native on Xeon Phi 35

36 Offload Mode Offload computational hotspots to a Xeon Phi device; requires instructions in the code Intel's offload pragmas: older, with more documentation available; vendor lock-in (code depends on hardware and compilers from a single supplier); used in previous workshops, see the intel_offload folder for examples OpenMP 4.0: open standard; more high-level than Intel's pragmas (the compiler knows more); device agnostic (use with hosts, GPUs, or Phis); currently only the newest compilers support it (ifort_icc/ and ifort_icc/15.0); used in this workshop OpenCL (module add intel_opencl): lower level; open standard, device agnostic Other standards will likely emerge, e.g. the CAPS compiler by French company CAPS entreprise supports OpenACC for the Xeon Phi, as will future GCC releases 36

37 Offload Mode (OpenMP4 - C/C++) Program runs on the CPU Programmer specified hotspots are 'offloaded' to the device #pragma omp target device(1) Variables and functions can be declared on the device #pragma omp declare target static int *data; #pragma omp end declare target Data is usually copied to and from the device (data can be an array section) map(tofrom:data[5:3]) Data can also be allocated on the device without copying map(alloc:data[:20]) 37
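Putting these pieces together, a minimal self-contained sketch (the function name square and the array sizes are made up for illustration):

#include <stdio.h>

#pragma omp declare target
static double square(double v) { return v * v; }   /* usable on host and device */
#pragma omp end declare target

int main(void) {
    double in[1000], out[1000];
    int i, n = 1000;
    for (i = 0; i < n; i++) in[i] = i;

    /* copy 'in' to the device, copy 'out' back when the region ends */
    #pragma omp target map(to: in[0:n]) map(from: out[0:n])
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = square(in[i]);

    printf("out[%d] = %f\n", n - 1, out[n - 1]);
    return 0;
}

Compiled with icc -openmp, the loop runs on the first available Phi, or on the host if no device is present.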

38 Offload Mode (OpenMP4 - Fortran) Program runs on the CPU Programmer specified hotspots are 'offloaded' to the device!$omp target device(0) parallel loop, parallel section!$omp end target Variables, subroutines, functions can be declared on the device!$omp declare target (data) Data is usually copied to and from the device map(tofrom:data) 38

39 Offload Mode (Intel - C/C++) Program runs on the CPU Programmer specified hotspots are 'offloaded' to the device #pragma offload target(mic:0) Variables can be declared on the device #pragma offload_attribute(push, target(mic)) static int *data; #pragma offload_attribute(pop) Data is usually copied to and from the device in(varname : length(arraylength)) out(varname : length(arraylength)) inout(varname : length(arraylength)) 39
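The same sketch expressed with the Intel pragmas above (again, the function and array names are only illustrative):

#include <stdio.h>

/* code between push/pop is also compiled for the coprocessor */
#pragma offload_attribute(push, target(mic))
static double square(double v) { return v * v; }
#pragma offload_attribute(pop)

int main(void) {
    double a[1000], b[1000];
    int i, n = 1000;
    for (i = 0; i < n; i++) a[i] = i;

    /* copy 'a' in, copy 'b' out when the offload region ends */
    #pragma offload target(mic:0) in(a : length(n)) out(b : length(n))
    {
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            b[i] = square(a[i]);
    }

    printf("b[%d] = %f\n", n - 1, b[n - 1]);
    return 0;
}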

40 Offload Mode (Intel - Fortran) Program runs on the CPU Programmer specified hotspots are 'offloaded' to the device!dir$ OFFLOAD BEGIN target(mic:0)...!dir$ END OFFLOAD Variables can be declared on the device!dir$ OPTIONS /offload_attribute_target=mic integer, dimension(:) :: data!dir$ END OPTIONS Data is usually copied to and from the device in(varname : length(arraylength)) out(varname : length(arraylength)) inout(varname : length(arraylength)) 40

41 What is the output? Please see memory.c
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(tofrom: data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
} 41

42 What is the output?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(tofrom: data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above 42

43 What is the output?
(same code as above, with #pragma omp target map(tofrom: data))
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above
Answer: B) data: 7. Explanation: the default for map(data) is map(tofrom:data), so the value is copied to the device, incremented there, and copied back. 43

44 What is the output?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(from:data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above 44

45 What is the output?
(same code as above, with #pragma omp target map(from:data))
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above
Explanation: data points to uninitialized memory on the device when 2 is added, so the value that is copied back and printed is undefined. 45

46 What is the output?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(to:data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above 46

47 What is the output?
(same code as above, with #pragma omp target map(to:data))
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above
Answer: A) data: 5. Explanation: data is changed to 7 on the device, but the modified data is never copied back to the host. 47

48 What is the output?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above 48

49 What is the output?
(same code as above, with #pragma omp target and no map clause)
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above
Answer: B) data: 7. Explanation: a variable referenced in a target construct that is not declared in the construct is implicitly treated as if it had appeared in a map clause with a map-type of tofrom. 49

50 Important Points about Memory Device memory is different than host memory Device memory not accessible to host code Host memory not accessible to device code Data is copied to and from the device using pragmas (offload mode) or scp (native mode) Some programming models may use a virtual shared memory 50

51 Exercise 6: Offload Programming The file offload.c is an OpenMP CPU program Compile and run: $ icc -o offload -openmp -O0 offload.c Modify this program to use the Phi card: Use #pragma omp declare target so that some_work() is compiled for execution on the Phi Write an appropriate pragma to offload the some_work() call in main() to the Phi device, using the correct map clauses for transferring in_array and out_array Compile and run the program Try: export OFFLOAD_REPORT=3, then run your program again (no need to re-compile) 51

52 Exercise 7: Environment Variables in Offload Programming Compile and run hello_offload.c and hello.c $ icc -openmp -o hello_offload hello_offload.c; ./hello_offload $ icc -openmp -o hello hello.c; ./hello How many OpenMP threads are used by each (default)? Change OMP_NUM_THREADS: $ export OMP_NUM_THREADS=4 Now how many OpenMP threads are used by each? Set a different value of OMP_NUM_THREADS for offload execution: $ export MIC_ENV_PREFIX=MIC $ export MIC_OMP_NUM_THREADS=5 Now how many OpenMP threads are used by each? Note: offload execution copies your environment variables to the device, unless you have one or more environment variables beginning with $MIC_ENV_PREFIX_; in that case only the prefixed (MIC_) environment variables are copied Note: set variables for a specific coprocessor with, e.g.: $ export MIC_1_OMP_NUM_THREADS=5 52
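For reference, a hedged sketch of what a hello_offload.c-style program could look like (the actual workshop file may differ); it prints the thread counts seen on the host and inside the target region, which is what the exercise asks you to observe:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int host_threads = 0, target_threads = 0;

    #pragma omp parallel
    #pragma omp single
    host_threads = omp_get_num_threads();       /* controlled by OMP_NUM_THREADS */

    /* with MIC_ENV_PREFIX=MIC set, MIC_OMP_NUM_THREADS controls this region */
    #pragma omp target map(from: target_threads)
    #pragma omp parallel
    #pragma omp single
    target_threads = omp_get_num_threads();

    printf("host threads: %d, target threads: %d\n", host_threads, target_threads);
    return 0;
}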

53 How should we compile/run this for offload execution?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) icc -mmic -openmp code.c; ./a.out
B) icc -openmp code.c; ./a.out
C) icc -mmic -openmp code.c; micnativeloadex a.out
D) icc -openmp code.c; micnativeloadex a.out
E) None of the above 53

54 How should we compile/run this for offload execution?
(same code as above)
A) icc -mmic -openmp code.c; ./a.out
B) icc -openmp code.c; ./a.out
C) icc -mmic -openmp code.c; micnativeloadex a.out
D) icc -openmp code.c; micnativeloadex a.out
E) None of the above
Answer: B) compile for the host (no -mmic) and run it there; the target region is offloaded to the Phi at run time. 54

55 How should we compile/run this for native execution?
(same code as above)
A) icc -mmic -openmp code.c; ./a.out
B) icc -openmp code.c; ./a.out
C) icc -mmic -openmp code.c; micnativeloadex a.out
D) icc -openmp code.c; micnativeloadex a.out
E) None of the above 55

56 How should we compile/run this for native execution?
(same code as above)
A) icc -mmic -openmp code.c; ./a.out
B) icc -openmp code.c; ./a.out
C) icc -mmic -openmp code.c; micnativeloadex a.out
D) icc -openmp code.c; micnativeloadex a.out
E) None of the above
For target pragmas the compiler gives a warning, but they are ignored in native-execution programs. 56

57 How should we set OMP_NUM_THREADS for native execution?
A) $ export OMP_NUM_THREADS=240
B) $ export MIC_ENV_PREFIX=MIC
   $ export MIC_OMP_NUM_THREADS=240
   ./a.out
C) $ micnativeloadex ./a.out -e OMP_NUM_THREADS=240 57

58 How should we set OMP_NUM_THREADS for native execution?
A) $ export OMP_NUM_THREADS=240
B) $ export MIC_ENV_PREFIX=MIC
   $ export MIC_OMP_NUM_THREADS=240
   ./a.out
C) $ micnativeloadex ./a.out -e OMP_NUM_THREADS=240 58

59 Memory Persistence Data transfers to/from the device are expensive and should be minimized By default, variables are allocated at the beginning of an offload segment and freed at the end Data can persist on the device between offload segments We must have a way to prevent freeing and reallocation of memory if we wish to reuse it 59

60 Memory Persistence
// Allocate the arrays only once on the target
#pragma omp target data map(to:in_data[:size]) \
                        map(from:out_data[:size])
{
    for (i = 0; i < n; i++) {
        // Do not copy data inside of the loop
        #pragma omp target
        {
            ...offload code...
        }
        // Copy out_data from target to host
        #pragma omp target update from(out_data[:size])
        // do something with out_data
    }
}
60

61 Memory Persistence
! Allocate the arrays only once on the target
!$omp target data map(to:in_data(:size)) map(from:out_data(:size))
DO i = 1, n
    ! Do not allocate or free on the target inside of the loop
    !$omp target
    ...offload code...
    !$omp end target
    ! Copy out_data from target to host
    !$omp target update from(out_data(:size))
    ! do something with out_data
END DO
!$omp end target data
61

62 Exercise 8: Memory Persistence Modify your solution to exercise 6 (offload.c) or copy the solution offload_soln.c from the solutions directory We would like to transfer, allocate, and free memory for in_array and out_array only once, instead of once per iteration 62

63 Vectorization Two main requirements to achieve good performance Multithreading Vectorization for (i=0; i<4; i++) c[i] = a[i] + b[i]; Vectorization - Compiler interprets a sequence of steps (e.g. a loop) as a single vector operation Xeon Phi has 512 bit-wide (16 floats) SIMD registers for vectorized operations 63

64 Vectorization (c) 2013 Jim Jeffers and James Reinders, used with permission. 64

65 Vectorization Use the -qopt-report[=n] and -qopt-report-phase=vec compiler options (which replace -vec-report[=n]) to get a vectorization report If your code doesn't automatically vectorize, you must tell the compiler how to vectorize: Use array notation (e.g. Intel Cilk Plus) Use #pragma omp simd (carefully) Avoid data dependencies and strided (non-sequential) memory access 65
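A small example of the #pragma omp simd route (a sketch, under the assumption that the loop iterations are truly independent; the function name is illustrative):

/* axpy-style kernel: 'omp simd' asserts there are no loop-carried
   dependencies, so the compiler can use the Phi's 512-bit vector units */
void scale_add(float *c, const float *a, const float *b, int n)
{
    int i;
    #pragma omp simd
    for (i = 0; i < n; i++)
        c[i] = 2.0f * a[i] + b[i];
}

Only add the pragma after checking the dependencies yourself; the compiler trusts it.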

66 Intel Cilk Plus C/C++ language extensions for multithreaded programming Available in Intel compilers (>= composer XE 2010) and gcc (>= 4.9) Keywords cilk_for - Parallel for loop cilk_spawn - Execute function asynchronously cilk_sync - Synchronize cilk_spawn'd tasks Array notation array-expression[lower-bound : length : stride] C[0:5:2][:] = A[:]; #pragma simd Simplest way to manually vectorize a code segment More information: 66
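A brief sketch of the two features most relevant here, array notation and cilk_for, as accepted by the Intel compilers (the variable names and sizes are illustrative):

#include <stdio.h>
#include <cilk/cilk.h>

int main(void) {
    float a[1024], b[1024], c[1024];

    /* array notation: whole-array statements the compiler can vectorize */
    a[:] = 1.0f;
    b[:] = 2.0f;
    c[:] = a[:] + b[:];

    /* cilk_for: iterations may be spread over the Cilk worker threads */
    cilk_for (int i = 0; i < 1024; i++)
        c[i] = 2.0f * c[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}

Compile with icc -std=c99; Cilk Plus is enabled by default in the Intel compilers.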

67 Exercise 9: Vectorization Compile offload_novec.c with vectorization reporting $ icc -openmp -o offload_novec -qopt-report=3 -qopt-report-phase=vec offload_novec.c sini.c Note that the loop around the call to sini() does not vectorize. Why? We know that this algorithm should vectorize (try compiling offload_soln.c with -qopt-report=3 -qopt-report-phase=vec) Add a simd clause after 'for' in the omp parallel for pragma before the loop and recompile. Alternative: use the -ipo switch. 67

68 MPI on Xeon Phi The Message Passing Interface (MPI) is a popular parallel programming standard, especially useful for parallel programs spanning multiple nodes There are three main ways to use MPI with the Xeon Phi: Native mode - directly on the device Symmetric mode - MPI processes run on both the CPU and the Xeon Phi Offload mode - MPI used for inter-node communication, code portions offloaded to the Xeon Phi (e.g. with OpenMP) 68
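The next two exercises use a hello_mpi.c program; a hedged sketch of what it might contain (the real workshop file may differ) - each rank reports the processor it runs on, which makes the native and symmetric placements easy to see:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    /* on a coprocessor the processor name typically contains "mic" */
    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}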

69 Exercise 10: Native MPI on Xeon Phi Set up your environment for Intel MPI on the Xeon Phi $ module add intel_mpi $ export I_MPI_MIC=enable Compile hello_mpi.c for native execution $ mpiicc -mmic -o hello.mic hello_mpi.c Run with mpirun (executed on the host) $ I_MPI_FABRICS=shm mpirun -n 60 -host mic0 ./hello.mic Normally you would use fewer MPI processes per node, without the explicit shm setting 69

70 Exercise 11: Symmetric MPI on Xeon Phi Use the same environment described in exercise 10 Compile binaries for MIC and CPU $ mpiicc -mmic -o hello.MIC hello_mpi.c $ mpiicc -o hello hello_mpi.c Intel MPI must know the difference between MIC and CPU binaries: $ export I_MPI_MIC_POSTFIX=.MIC Run with mpirun $ mpirun -perhost 1 -n 2 -host localhost,mic0 ./hello Or, without I_MPI_MIC_POSTFIX=.MIC: $ mpirun -host localhost -n 3 ./hello : -host mic0 -n 10 ./hello.MIC Use export I_MPI_FABRICS=shm:tcp in case of issues. 70

71 Symmetric mode load-balancing Phi tasks will run slower than host tasks Programmer's responsibility to balance workloads between fast and slow processors 71

72 Optimizing for Xeon Phi General tips Optimize for the host node Xeon processors first Expose lots of parallelism SIMD vectorization Minimize data transfers Try different numbers of threads from 60 to 240 Try different thread affinities, e.g. for OpenMP: $ export MIC_ENV_PREFIX=MIC $ export MIC_OMP_PLACES=threads (or cores) 72

73 KMP_AFFINITY/OMP_PLACES OMP_PLACES=threads or KMP_AFFINITY=compact: Likely to leave cores unused KMP_AFFINITY=scatter (default): Neighbouring threads on different cores - do not share cache KMP_AFFINITY=balanced Neighbouring threads on the same core - more efficient cache utilization 73

74 Identifying Accelerator Algorithms SIMD Parallelizability Number of concurrent threads (need dozens) Minimize conditionals and divergences Operations performed per datum transferred to device (FLOPs/GB) Data transfer is overhead Keep data on device and reuse it 74

75 Identifying Accelerator Algorithms SIMD Parallelizability Number of concurrent threads (need dozens) Minimize conditionals and divergences Operations performed per datum transferred to device (FLOPs/GB) Data transfer is overhead Keep data on device and reuse it 75

76 Which algorithm gives the most Phi performance boost? Put the following in order from least work per datum to most: i) matrix-vector multiplication ii) matrix-matrix multiplication iii) matrix trace (sum of diagonal elements) A) i, ii, iii B) iii, i, ii C) iii, ii, i D) i, iii, ii E) They are all about the same 76

77 Which algorithm gives the most Phi performance boost? Put the following in order from least work per datum to most: i) matrix-vector multiplication ii) matrix-matrix multiplication iii) matrix trace (sum of diagonal elements) A) i, ii, iii B) iii, i, ii C) iii, ii, i D) i, iii, ii E) They are all about the same Answer: B) iii, i, ii 77

78 Which algorithm gives the most Phi performance boost? Matrix trace (assume you naively transfer the entire matrix to the device): Work ~ n, Data ~ n^2 Matrix-vector multiplication: Work ~ 2n^2, Data ~ n^2 Matrix-matrix multiplication: Work ~ 2n^3, Data ~ 3n^2 78

79 Choosing a mode Study the optimized CPU run time and study native mode scaling (30, 60, 120, 240 threads) Is native mode (at any scaling) faster than the CPU? Yes: consider native mode; No: collect CPU and native mode profiling data Are there functions which execute faster on the Phi? No: consider running on CPU only; Yes: is the work-per-datum benefit > cost? Yes: consider offloading those functions; No: consider running on CPU only 79

80 Review We learned how to: Gain access to the Xeon Phis through Guillimin's scheduler Log in and explore the Xeon Phi's operating system Compile and run parallel software for native execution on the Xeon Phis Compile and run parallel software for offload execution on the Xeon Phis Offload pragmas (target, map) Data persistence (target data) Ensure your code vectorizes for maximum performance Choose when to use the Xeon Phi and which mode to use 80

81 Keep Learning... Xeon Phi documentation, training materials, example codes: General parallel programming: Xeon Phi Tutorials: Questions:

82 What Questions Do You Have? 82

83 Bonus Topics (Time permitting) 83

84 More Xeon Phi practice Intel Math Kernel Library (MKL) examples $ cp $MKLROOT/examples/* . $ tar xvf examples_mic.tgz Compile and run hybrid MPI+OpenMP for native and offload execution: misc/hybrid_mpi_omp_mv4.c Compile and run OpenCL code for offload execution: misc/vecadd_opencl.c 84
