Intel Xeon Phi Workshop. Bart Oldeman, McGill HPC May 7, 2015

1 Intel Xeon Phi Workshop Bart Oldeman, McGill HPC May 7, 2015 1

2 Online Slides OR 2

3 Outline Login and Setup Overview of Xeon Phi Interacting with Xeon Phi Linux shell Ways to program the Xeon Phi Native Programming Offload Programming Choosing how to run your code 3

4 Exercise 0: Login and Setup Please use class accounts to access reserved resources $ ssh class##@guillimin.hpc.mcgill.ca (enter password) [class##@lg-1r17-n01]$ module add ifort_icc/15.0 ifort_icc module: in Intel's documentation you will see instructions to run 'source compilervars.sh'; on Guillimin, this script is replaced by this module 4

5 Exercise 0: Workshop Files Please copy the workshop files to your home directory $ cp -R /software/workshop/phi/* ~/. Contains: Code for the exercises An example submission script Solutions to the exercises 5

6 Exercise 1: Interactive Session For the workshop, we want interactive access to the Phis [class01@lg-1r17-n01]$ qsub -I -l nodes=1:ppn=16:mics=2 -l walltime=8:00:00 [class01@aw-4r13-n01]$ module add ifort_icc/15.0 6

7 Exercise 2: OpenMP on CPU Compile the OpenMP program axpy_omp.c $ icc -openmp -o axpy axpy_omp.c Run this program on the aw node $ ./axpy This program ran on the host Use the OMP_NUM_THREADS environment variable to control the number of parallel threads 7
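For orientation, a minimal sketch of what an OpenMP axpy kernel of this kind might look like (this is not the actual axpy_omp.c from the workshop files; the vector length and values are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    long i, n = 1 << 24;                 /* illustrative vector length */
    float a = 2.0f;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* y = a*x + y, split across OMP_NUM_THREADS threads */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("OpenMP threads: %d, y[0] = %f\n", omp_get_max_threads(), y[0]);
    free(x); free(y);
    return 0;
}

Compiled with icc -openmp, the loop above scales with OMP_NUM_THREADS exactly as in the exercise.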

8 What is a Xeon Phi? 8

9 What is a Xeon Phi? A device for handling computationally expensive hot spots in your code (a 'co-processor' or 'accelerator') A large number of low-powered but low-cost (in computational overhead, power, size, and money) processors (modified Pentium cores) Supercomputer on a chip: teraflops through massive parallelism (dozens to hundreds of parallel threads) Heterogeneous computing: host and Phi can work together on the problem 9

10 What was the ASCI Red? 1997: the first teraflop supercomputer, with the same compute power as a single Xeon Phi 4,510 nodes (9,298 processors), 1,212 GB of total RAM, 12.5 TB of disk storage 850 kW vs. 225 W for a Xeon Phi 10

11 Performance vs. parallelism (c) 2013 Jim Jeffers and James Reinders, used with permission. 11

12 Performance vs. parallelism (c) 2013 Jim Jeffers and James Reinders, used with permission. 12

13 Performance vs. parallelism (c) 2013 Jim Jeffers and James Reinders, used with permission. 13

14 Performance vs. parallelism (c) 2013 Jim Jeffers and James Reinders, used with permission. 14

15 Terminology MIC = Many Integrated Cores (Intel-developed architecture) GPU = Graphics Processing Unit (the Xeon Phi is not a GPU, but we will refer to GPUs) Possibly confusing terminology: Architecture: Many Integrated Cores (MIC) Product name (uses the MIC architecture): Intel Xeon Phi Development codename: Knights Corner The 'device' (in contrast to the 'host') The target of an offload statement 15

16 MIC architecture under the hood 16

17 MIC architecture under the hood (c) 2013 Jim Jeffers and James Reinders, used with permission. 17

18 Xeon Phis on Guillimin Nodes 50 Xeon Phi nodes, 2 devices per node = 100 Xeon Phis 2 x Intel Sandy Bridge EP E5 (8-core, 2.6 GHz, 20 MB cache, 115 W) 64 GB RAM Cards 2 x Intel Xeon Phi 5110P 60 cores, 1.053 GHz, 30 MB cache, 8 GB memory (GDDR5), Peak SP FP: 2.0 TFlops, Peak DP FP: 1.0 TFlops (= 1.053 GHz * 60 cores * 8 vector lanes * 2 flops/FMA) 18

19 Comparisons Notes: Chart denotes theoretical maximum values. Actual performance is application dependent The K20 GPU has 13 streaming multiprocessors (SMXs) with 2496 CUDA cores, not directly comparable to x86 cores The K20 GPU and Xeon Phi have GDDR5 memory, the Sandy Bridge has DDR3 memory Accelerator workloads can be shared with the host CPUs 19

20 Benchmark Tests Matrix multiplication results SE10P is a Xeon Phi coprocessor with slightly higher specifications than the 5110P Source: Saule et al. 20

21 Benchmark Tests Embarrassingly parallel financial Monte-Carlo Iterative financial Monte-Carlo with regression across all paths The Tesla GPU is a K20X, which has slightly higher specifications than the K20 Source: xcelerit blog, Sept. 4, 2013 21

22 How can accelerators help you do science? Two ways of thinking about speedup from parallelism: 1: Compute a fixed-size problem faster Amdahl's law describes diminishing returns from adding more processors 2: Choose larger problems in the time you have Gustafson's law: Problem size can often scale linearly with number of processors 22
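As an aside (not on the slides), with parallel fraction p and N processors the two laws can be written as:

Amdahl:    S(N) = 1 / ((1 - p) + p/N), bounded by 1 / (1 - p) no matter how large N gets
Gustafson: S(N) = (1 - p) + p * N, i.e. the achievable (scaled) speedup keeps growing if the problem grows with N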

23 Ways to use accelerators Accelerated application Libraries Directives and Pragmas (OpenMP) Explicit parallel programming (OpenMP, MPI, OpenCL, TBB, etc.) Increasing effort 23

24 MIC Software Stack 24

25 MIC Linux Libraries The following Linux Standard Base (LSB) libraries are available on the Xeon Phi (Library - Purpose):
glibc - GNU C standard library
libc - The C standard library
libm - The math library
libdl - Dynamic linking
librt - POSIX real-time library (shared memory, time, etc.)
libcrypt - Passwords, encryption
libutil - Utility functions
libstdc++ - GNU C++ standard library
libgcc_s - Low-level functions for gcc
libz - Lossless compression
libcurses - Displaying characters in terminal
libpam - Authentication
25

26 Focus For Today Some - Accelerated Applications and Libraries Incredibly useful for research, relatively easy to use Will not teach you much about how Xeon Phis work Some - Explicit Programming Device supports your favourite parallel programming models We will keep the programming simple for the workshop Yes - Directives/Pragmas and compilation Will teach you about Xeon Phis We will focus mainly on OpenMP and MPI as parallel programming models 26

27 Scheduling Xeon Phi Jobs Workshop jobs run on a single node + one or two Xeon Phi devices $ qsub -l nodes=1:ppn=16:mics=2 $ qsub ./subjob.sh Example: submiccheck.sh 27

28 Exercise 3: Interacting with Phi Use your interactive session on the Phi nodes Log in to the Phi cards: $ ssh mic0 Try some of your favourite Linux commands: $ cat /proc/cpuinfo | less $ cat /proc/meminfo | less $ cat /etc/issue $ uname -a $ env How many cores are available? How much memory is available? What operating system is running? What special environment variables are set? 28

29 Filesystem on Phi The guillimin GPFS filesystem is mounted on the Xeon Phis using NFS You can access your home directory, project space(s), scratch, and the /software directory In general, reading and writing to the file system from the phi is very slow Performance tip: minimize data transfers (and therefore file system use) from the Phi and use /tmp for temporary files (file system in 8 GB memory) 29

30 Native mode and Offload mode There are two main ways to use the phi Compile a program to run directly on the device (native mode) Compile a program to run on the CPU, but offload hotspots to the device (offload mode, heterogeneous computing) Offload is more versatile Uses resources of both node and device 30

31 Automatic Offload mode Offloading linear algebra to the Phi using MKL Only need to set MKL_MIC_ENABLE=1 Can be used by Python, R, Octave, Matlab, etc., or from Intel C/C++/Fortran using the -mkl switch Only effective for large matrices, at least 512x512 to 9216x9216, depending on the function Uses both node and device Example: module python/2.7.3-mkl 31
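To illustrate the C/C++ route (a sketch, not one of the workshop files; the matrix size and values are arbitrary), a DGEMM call through MKL. With MKL_MIC_ENABLE=1 set in the environment, MKL may transparently offload this call to the Phi:

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void) {
    int n = 8192;                        /* large enough for automatic offload */
    long i;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0*A*B + 0.0*C; MKL decides whether to use the coprocessor */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}

Built with icc -mkl and run with MKL_MIC_ENABLE=1 (and optionally OFFLOAD_REPORT=1) set, this behaves like the matmul.py example in the next exercise.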

32 Exercise 4: Automatic offload See the script matmul.py, multiplying two random 8192x8192 matrices Run on host only: $ module add python/2.7.3-MKL $ python matmul.py 8192, , Use automatic offload: $ export MKL_MIC_ENABLE=1 $ export OFFLOAD_REPORT=1 $ python matmul.py 8192, , Now experiment with smaller and larger values of M, N, and K. 32

33 Exercise 5: Native Compilation axpy_omp.c is a regular OpenMP vector a*x+y program Only Phi-specific code is within an #ifdef OFFLOAD pre-compiler conditional Use the compiler option -mmic to compile for native MIC execution: $ icc -o axpy_omp.mic -mmic -openmp axpy_omp.c Attempt to run this program on the CPU; the Linux kernel automatically runs it on the Phi as micrun ./axpy_omp.mic, where /usr/bin/micrun is a shell script Attempt to run this program via ssh: $ ssh mic0 ./axpy_omp.mic It fails. Why? 33

34 Exercise 5: Native Compilation Using plain ssh, the library paths are not set up properly! Alternatively, use micnativeloadex to copy libraries from the host and execute the program on the MIC: $ micnativeloadex ./axpy_omp.mic Environment variables can be changed via the MIC_ prefix: MIC_OMP_NUM_THREADS=60 ./axpy_omp.mic The device number is selected with the OFFLOAD_DEVICES variable (e.g. 1): $ OFFLOAD_DEVICES=1 ./axpy_omp.mic 34

35 Results Host: $ icc -openmp -o axpy axpy_omp.c; ./axpy OpenMP threads: 16 GFLOPS = , SECS = , GFLOPS per sec = Offload: $ icc -DOFFLOAD -openmp -o axpy_offload axpy_omp.c; OMP_PLACES=threads ./axpy_offload OpenMP threads: 16 OpenMP 4 Offload GFLOPS = , SECS = 5.959, GFLOPS per sec = Native: $ icc -openmp -o axpy.mic -mmic axpy_omp.c; MIC_OMP_PLACES=threads ./axpy.mic OpenMP threads: 240 GFLOPS = , SECS = 5.317, GFLOPS per sec = Speedup vs. the host (16 Sandy Bridge cores): 1.9x for offload to Xeon Phi, 2.1x for native on Xeon Phi 35

36 Offload Mode Offload computational hotspots to a Xeon Phi device; requires instructions in the code Intel's offload pragmas: older, with more documentation available; vendor lock-in (code depends on hardware and compilers from a single supplier); used in previous workshops, see the intel_offload folder for examples OpenMP 4.0: open standard; more high-level than Intel's pragmas (the compiler knows more); device agnostic (use with hosts, GPUs, or Phis); currently only the newest compilers support it (ifort_icc/ and ifort_icc/15.0); used in this workshop OpenCL (module add intel_opencl): lower level; open standard, device agnostic Other standards will likely emerge, e.g. the CAPS compiler by French company CAPS entreprise supports OpenACC for the Xeon Phi, as will future GCC releases 36

37 Offload Mode (OpenMP4 - C/C++) Program runs on the CPU Programmer specified hotspots are 'offloaded' to the device #pragma omp target device(1) Variables and functions can be declared on the device #pragma omp declare target static int *data; #pragma omp end declare target Data is usually copied to and from the device (data can be an array section) map(tofrom:data[5:3]) Data can also be allocated on the device without copying map(alloc:data[:20]) 37
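Putting these pieces together, a minimal self-contained sketch (the function name square and the array sizes are made up for illustration):

#include <stdio.h>

#pragma omp declare target
static double square(double v) { return v * v; }   /* usable on host and device */
#pragma omp end declare target

int main(void) {
    double in[1000], out[1000];
    int i, n = 1000;
    for (i = 0; i < n; i++) in[i] = i;

    /* copy 'in' to the device, copy 'out' back when the region ends */
    #pragma omp target map(to: in[0:n]) map(from: out[0:n])
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = square(in[i]);

    printf("out[%d] = %f\n", n - 1, out[n - 1]);
    return 0;
}

Compiled with icc -openmp, the loop runs on the first available Phi, or on the host if no device is present.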

38 Offload Mode (OpenMP4 - Fortran) Program runs on the CPU Programmer specified hotspots are 'offloaded' to the device!$omp target device(0) parallel loop, parallel section!$omp end target Variables, subroutines, functions can be declared on the device!$omp declare target (data) Data is usually copied to and from the device map(tofrom:data) 38

39 Offload Mode (Intel - C/C++) Program runs on the CPU Programmer specified hotspots are 'offloaded' to the device #pragma offload target(mic:0) Variables can be declared on the device #pragma offload_attribute(push, target(mic)) static int *data; #pragma offload_attribute(pop) Data is usually copied to and from the device in(varname : length(arraylength)) out(varname : length(arraylength)) inout(varname : length(arraylength)) 39
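The same sketch expressed with the Intel pragmas above (again, the function and array names are only illustrative):

#include <stdio.h>

/* code between push/pop is also compiled for the coprocessor */
#pragma offload_attribute(push, target(mic))
static double square(double v) { return v * v; }
#pragma offload_attribute(pop)

int main(void) {
    double a[1000], b[1000];
    int i, n = 1000;
    for (i = 0; i < n; i++) a[i] = i;

    /* copy 'a' in, copy 'b' out when the offload region ends */
    #pragma offload target(mic:0) in(a : length(n)) out(b : length(n))
    {
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            b[i] = square(a[i]);
    }

    printf("b[%d] = %f\n", n - 1, b[n - 1]);
    return 0;
}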

40 Offload Mode (Intel - Fortran) Program runs on the CPU Programmer specified hotspots are 'offloaded' to the device!dir$ OFFLOAD BEGIN target(mic:0)...!dir$ END OFFLOAD Variables can be declared on the device!dir$ OPTIONS /offload_attribute_target=mic integer, dimension(:) :: data!dir$ END OPTIONS Data is usually copied to and from the device in(varname : length(arraylength)) out(varname : length(arraylength)) inout(varname : length(arraylength)) 40

41 What is the output? Please see memory.c
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(tofrom: data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
} 41

42 What is the output?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(tofrom: data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above 42

43 What is the output?
(same code as above, with #pragma omp target map(tofrom: data))
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above
Answer: B) data: 7. Explanation: the default for map(data) is map(tofrom:data), so the value is copied to the device, incremented there, and copied back. 43

44 What is the output?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(from:data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above 44

45 What is the output?
(same code as above, with #pragma omp target map(from:data))
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above
Explanation: data points to uninitialized memory on the device when 2 is added, so the value that is copied back and printed is undefined. 45

46 What is the output?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(to:data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above 46

47 What is the output?
(same code as above, with #pragma omp target map(to:data))
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above
Answer: A) data: 5. Explanation: data is changed to 7 on the device, but the modified data is never copied back to the host. 47

48 What is the output?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above 48

49 What is the output?
(same code as above, with #pragma omp target and no map clause)
A) data: 5 B) data: 7 C) data: 2 D) Error or segmentation fault E) None of the above
Answer: B) data: 7. Explanation: a variable referenced in a target construct that is not declared in the construct is implicitly treated as if it had appeared in a map clause with a map-type of tofrom. 49

50 Important Points about Memory Device memory is different than host memory Device memory not accessible to host code Host memory not accessible to device code Data is copied to and from the device using pragmas (offload mode) or scp (native mode) Some programming models may use a virtual shared memory 50

51 Exercise 6: Offload Programming The file offload.c is an OpenMP CPU program Compile and run: $ icc -o offload -openmp -O0 offload.c Modify this program to use the Phi card: Use #pragma omp declare target so that some_work() is compiled for execution on the Phi Write an appropriate pragma to offload the some_work() call in main() to the Phi device, using the correct map clauses for transferring in_array and out_array Compile and run the program Try: export OFFLOAD_REPORT=3, then run your program again (no need to re-compile) 51

52 Exercise 7: Environment Variables in Offload Programming Compile and run hello_offload.c and hello.c $ icc -openmp -o hello_offload hello_offload.c; ./hello_offload $ icc -openmp -o hello hello.c; ./hello How many OpenMP threads are used by each (default)? Change OMP_NUM_THREADS: $ export OMP_NUM_THREADS=4 Now how many OpenMP threads are used by each? Set a different value of OMP_NUM_THREADS for offload execution: $ export MIC_ENV_PREFIX=MIC $ export MIC_OMP_NUM_THREADS=5 Now how many OpenMP threads are used by each? Note: offload execution copies your environment variables to the device, unless you have one or more environment variables beginning with $MIC_ENV_PREFIX_; in that case only the prefixed (MIC_) environment variables are copied Note: set variables for a specific coprocessor with, e.g.: $ export MIC_1_OMP_NUM_THREADS=5 52
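For reference, a hedged sketch of what a hello_offload.c-style program could look like (the actual workshop file may differ); it prints the thread counts seen on the host and inside the target region, which is what the exercise asks you to observe:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int host_threads = 0, target_threads = 0;

    #pragma omp parallel
    #pragma omp single
    host_threads = omp_get_num_threads();       /* controlled by OMP_NUM_THREADS */

    /* with MIC_ENV_PREFIX=MIC set, MIC_OMP_NUM_THREADS controls this region */
    #pragma omp target map(from: target_threads)
    #pragma omp parallel
    #pragma omp single
    target_threads = omp_get_num_threads();

    printf("host threads: %d, target threads: %d\n", host_threads, target_threads);
    return 0;
}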

53 How should we compile/run this for offload execution?
#include <stdio.h>
int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}
A) icc -mmic -openmp code.c; ./a.out
B) icc -openmp code.c; ./a.out
C) icc -mmic -openmp code.c; micnativeloadex a.out
D) icc -openmp code.c; micnativeloadex a.out
E) None of the above 53

54 How should we compile/run this for offload execution?
(same code as above)
A) icc -mmic -openmp code.c; ./a.out
B) icc -openmp code.c; ./a.out
C) icc -mmic -openmp code.c; micnativeloadex a.out
D) icc -openmp code.c; micnativeloadex a.out
E) None of the above
Answer: B) compile for the host (no -mmic) and run it there; the target region is offloaded to the Phi at run time. 54

55 How should we compile/run this for native execution?
(same code as above)
A) icc -mmic -openmp code.c; ./a.out
B) icc -openmp code.c; ./a.out
C) icc -mmic -openmp code.c; micnativeloadex a.out
D) icc -openmp code.c; micnativeloadex a.out
E) None of the above 55

56 How should we compile/run this for native execution?
(same code as above)
A) icc -mmic -openmp code.c; ./a.out
B) icc -openmp code.c; ./a.out
C) icc -mmic -openmp code.c; micnativeloadex a.out
D) icc -openmp code.c; micnativeloadex a.out
E) None of the above
For target pragmas the compiler gives a warning, but they are ignored in native-execution programs. 56

57 How should we set OMP_NUM_THREADS for native execution?
A) $ export OMP_NUM_THREADS=240
B) $ export MIC_ENV_PREFIX=MIC
   $ export MIC_OMP_NUM_THREADS=240
   ./a.out
C) $ micnativeloadex ./a.out -e OMP_NUM_THREADS=240 57

58 How should we set OMP_NUM_THREADS for native execution?
A) $ export OMP_NUM_THREADS=240
B) $ export MIC_ENV_PREFIX=MIC
   $ export MIC_OMP_NUM_THREADS=240
   ./a.out
C) $ micnativeloadex ./a.out -e OMP_NUM_THREADS=240 58

59 Memory Persistence Data transfers to/from the device are expensive and should be minimized By default, variables are allocated at the beginning of an offload segment and freed at the end Data can persist on the device between offload segments We must have a way to prevent freeing and reallocation of memory if we wish to reuse it 59

60 Memory Persistence
// Allocate the arrays only once on the target
#pragma omp target data map(to:in_data[:size]) \
                        map(from:out_data[:size])
{
    for (i = 0; i < n; i++) {
        // Do not copy data inside of the loop
        #pragma omp target
        {
            ...offload code...
        }
        // Copy out_data from target to host
        #pragma omp target update from(out_data[:size])
        // do something with out_data
    }
}
60

61 Memory Persistence
! Allocate the arrays only once on the target
!$omp target data map(to:in_data(:size)) map(from:out_data(:size))
DO i = 1, n
    ! Do not allocate or free on the target inside of the loop
    !$omp target
    ...offload code...
    !$omp end target
    ! Copy out_data from target to host
    !$omp target update from(out_data(:size))
    ! do something with out_data
END DO
!$omp end target data
61

62 Exercise 8: Memory Persistence Modify your solution to exercise 6 (offload.c) or copy the solution offload_soln.c from the solutions directory We would like to transfer, allocate, and free memory for in_array and out_array only once, instead of once per iteration 62

63 Vectorization Two main requirements to achieve good performance Multithreading Vectorization for (i=0; i<4; i++) c[i] = a[i] + b[i]; Vectorization - Compiler interprets a sequence of steps (e.g. a loop) as a single vector operation Xeon Phi has 512 bit-wide (16 floats) SIMD registers for vectorized operations 63

64 Vectorization (c) 2013 Jim Jeffers and James Reinders, used with permission. 64

65 Vectorization Use the -qopt-report[=n] and -qopt-report-phase=vec compiler options (which replace -vec-report[=n]) to get a vectorization report If your code doesn't automatically vectorize, you must tell the compiler how to vectorize: Use array notation (e.g. Intel Cilk Plus) Use #pragma omp simd (carefully) Avoid data dependencies and strided (non-sequential) memory access 65
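A small example of the #pragma omp simd route (a sketch, under the assumption that the loop iterations are truly independent; the function name is illustrative):

/* axpy-style kernel: 'omp simd' asserts there are no loop-carried
   dependencies, so the compiler can use the Phi's 512-bit vector units */
void scale_add(float *c, const float *a, const float *b, int n)
{
    int i;
    #pragma omp simd
    for (i = 0; i < n; i++)
        c[i] = 2.0f * a[i] + b[i];
}

Only add the pragma after checking the dependencies yourself; the compiler trusts it.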

66 Intel Cilk Plus C/C++ language extensions for multithreaded programming Available in Intel compilers (>= composer XE 2010) and gcc (>= 4.9) Keywords cilk_for - Parallel for loop cilk_spawn - Execute function asynchronously cilk_sync - Synchronize cilk_spawn'd tasks Array notation array-expression[lower-bound : length : stride] C[0:5:2][:] = A[:]; #pragma simd Simplest way to manually vectorize a code segment More information: 66
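A brief sketch of the two features most relevant here, array notation and cilk_for, as accepted by the Intel compilers (the variable names and sizes are illustrative):

#include <stdio.h>
#include <cilk/cilk.h>

int main(void) {
    float a[1024], b[1024], c[1024];

    /* array notation: whole-array statements the compiler can vectorize */
    a[:] = 1.0f;
    b[:] = 2.0f;
    c[:] = a[:] + b[:];

    /* cilk_for: iterations may be spread over the Cilk worker threads */
    cilk_for (int i = 0; i < 1024; i++)
        c[i] = 2.0f * c[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}

Compile with icc -std=c99; Cilk Plus is enabled by default in the Intel compilers.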

67 Exercise 9: Vectorization Compile offload_novec.c with vectorization reporting $ icc -openmp -o offload_novec -qopt-report=3 -qopt-report-phase=vec offload_novec.c sini.c Note that the loop around the call to sini() does not vectorize. Why? We know that this algorithm should vectorize (try compiling offload_soln.c with -qopt-report=3 -qopt-report-phase=vec) Add a simd clause after 'for' in the omp parallel for pragma before the loop and recompile. Alternative: use the -ipo switch. 67

68 MPI on Xeon Phi The Message Passing Interface (MPI) is a popular parallel programming standard, especially useful for parallel programs spanning multiple nodes There are three main ways to use MPI with the Xeon Phi: Native mode - directly on the device Symmetric mode - MPI processes run on both the CPU and the Xeon Phi Offload mode - MPI used for inter-node communication, code portions offloaded to the Xeon Phi (e.g. with OpenMP) 68
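The next two exercises use a hello_mpi.c program; a hedged sketch of what it might contain (the real workshop file may differ) - each rank reports the processor it runs on, which makes the native and symmetric placements easy to see:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    /* on a coprocessor the processor name typically contains "mic" */
    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}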

69 Exercise 10: Native MPI on Xeon Phi Set up your environment for Intel MPI on the Xeon Phi $ module add intel_mpi $ export I_MPI_MIC=enable Compile hello_mpi.c for native execution $ mpiicc -mmic -o hello.mic hello_mpi.c Run with mpirun (executed on the host) $ I_MPI_FABRICS=shm mpirun -n 60 -host mic0 ./hello.mic Normally you would use fewer MPI processes per node, without the explicit shm setting 69

70 Exercise 11: Symmetric MPI on Xeon Phi Use the same environment described in exercise 10 Compile binaries for MIC and CPU $ mpiicc -mmic -o hello.MIC hello_mpi.c $ mpiicc -o hello hello_mpi.c Intel MPI must know the difference between MIC and CPU binaries: $ export I_MPI_MIC_POSTFIX=.MIC Run with mpirun $ mpirun -perhost 1 -n 2 -host localhost,mic0 ./hello Or, without I_MPI_MIC_POSTFIX=.MIC: $ mpirun -host localhost -n 3 ./hello : -host mic0 -n 10 ./hello.MIC Use export I_MPI_FABRICS=shm:tcp in case of issues. 70

71 Symmetric mode load-balancing Phi tasks will run slower than host tasks Programmer's responsibility to balance workloads between fast and slow processors 71

72 Optimizing for Xeon Phi General tips Optimize for the host node Xeon processors first Expose lots of parallelism SIMD vectorization Minimize data transfers Try different numbers of threads from 60 to 240 Try different thread affinities, e.g. for OpenMP: $ export MIC_ENV_PREFIX=MIC $ export MIC_OMP_PLACES=threads (or cores) 72

73 KMP_AFFINITY/OMP_PLACES OMP_PLACES=threads or KMP_AFFINITY=compact: Likely to leave cores unused KMP_AFFINITY=scatter (default): Neighbouring threads on different cores - do not share cache KMP_AFFINITY=balanced Neighbouring threads on the same core - more efficient cache utilization 73

74 Identifying Accelerator Algorithms SIMD Parallelizability Number of concurrent threads (need dozens) Minimize conditionals and divergences Operations performed per datum transferred to device (FLOPs/GB) Data transfer is overhead Keep data on device and reuse it 74

75 Identifying Accelerator Algorithms SIMD Parallelizability Number of concurrent threads (need dozens) Minimize conditionals and divergences Operations performed per datum transferred to device (FLOPs/GB) Data transfer is overhead Keep data on device and reuse it 75

76 Which algorithm gives the most Phi performance boost? Put the following in order from least work per datum to most: i) matrix-vector multiplication ii) matrix-matrix multiplication iii) matrix trace (sum of diagonal elements) A) i, ii, iii B) iii, i, ii C) iii, ii, i D) i, iii, ii E) They are all about the same 76

77 Which algorithm gives the most Phi performance boost? Put the following in order from least work per datum to most: i) matrix-vector multiplication ii) matrix-matrix multiplication iii) matrix trace (sum of diagonal elements) A) i, ii, iii B) iii, i, ii C) iii, ii, i D) i, iii, ii E) They are all about the same Answer: B) iii, i, ii 77

78 Which algorithm gives the most Phi performance boost? Matrix trace (assume you naively transfer the entire matrix to the device): Work ~ n, Data ~ n^2 Matrix-vector multiplication: Work ~ 2n^2, Data ~ n^2 Matrix-matrix multiplication: Work ~ 2n^3, Data ~ 3n^2 78

79 Choosing a mode Study the optimized CPU run time and study native mode scaling (30, 60, 120, 240 threads) Is native mode (at any scaling) faster than the CPU? Yes: consider native mode; No: collect CPU and native mode profiling data Are there functions which execute faster on the Phi? No: consider running on CPU only; Yes: is the work-per-datum benefit > cost? Yes: consider offloading those functions; No: consider running on CPU only 79

80 Review We learned how to: Gain access to the Xeon Phis through Guillimin's scheduler Log in and explore the Xeon Phi's operating system Compile and run parallel software for native execution on the Xeon Phis Compile and run parallel software for offload execution on the Xeon Phis Offload pragmas (target, map) Data persistence (target data) Ensure your code vectorizes for maximum performance Choose when to use the Xeon Phi and which mode to use 80

81 Keep Learning... Xeon Phi documentation, training materials, example codes: General parallel programming: Xeon Phi Tutorials: Questions:

82 What Questions Do You Have? 82

83 Bonus Topics (Time permitting) 83

84 More Xeon Phi practice Intel Math Kernel Library (MKL) examples $ cp $MKLROOT/examples/* . $ tar xvf examples_mic.tgz Compile and run hybrid MPI+OpenMP for native and offload execution: misc/hybrid_mpi_omp_mv4.c Compile and run OpenCL code for offload execution: misc/vecadd_opencl.c 84
