Xeon Phi Coprocessors on Turing


Bruno Henderson

Table of Contents

Overview
Using the Phi Coprocessors
Example
Intel Vtune Amplifier Example
Appendix
Sources

Information Technology Services High Performance Computing Group

Overview

The Turing cluster has ten nodes, each with two Xeon Phi coprocessors. Each coprocessor has 60 available cores. The coprocessors can be used in offload, native, or symmetric execution mode.

Using the Phi Coprocessors

Once connected to the cluster, type phi-login to drop into a shell on one of the Phi hosts. From there, run micinfo to confirm that two coprocessors are installed.

Example

In this example, the Intel C Compiler is used to compile the C code available here (the same code is listed in the Appendix). Download the file and run the commands below. Make sure to include the -mmic option, which builds the binary for the coprocessor.

    [tstil004@turing1 ~]$ phi-login
    [tstil004@crphi-008 ~]$ module load /cm/shared/compilers/intel/ics/ /base
    [tstil004@crphi-008 ~]$ icc -o helloflops3 helloflops3.c -mmic -openmp

The program will run in native mode, so the coprocessor needs to be able to find the Intel runtime libraries. Add the line below to ~/.ssh/environment, along with the paths of any other libraries that you need on the coprocessor.

    [tstil004@crphi-008 ~]$ cat ~/.ssh/environment
    LD_LIBRARY_PATH=/cm/shared/apps/ics/ /composer_xe_2013_sp /compiler/lib/mic

From here, you can ssh directly to the coprocessor and run the program. The output is included below.

    [tstil004@crphi-008 ~]$ ssh crphi-008-mic0 ~/helloflops3
    Initializing
    Starting Compute
    Using 240 threads...
    Gflops = , Secs = 3.492, GFlops per sec =
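The ~/.ssh/environment step above can be scripted so that the line is added only once, even if the setup is repeated. The following is a minimal sketch, assuming a bash shell. The function name add_env_line is hypothetical, and the library path in the demo is a placeholder rather than the real path on Turing.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of any cluster tooling): append a KEY=VALUE
# line to an environment file unless an identical line is already present.
add_env_line() {
    local file="$1" line="$2"
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # -x matches the whole line, -F treats the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demo against a scratch file; the path below is a placeholder.
demo_file="$(mktemp)"
add_env_line "$demo_file" "LD_LIBRARY_PATH=/path/to/intel/compiler/lib/mic"
add_env_line "$demo_file" "LD_LIBRARY_PATH=/path/to/intel/compiler/lib/mic"
cat "$demo_file"    # the line appears once, not twice
rm -f "$demo_file"
```

To apply it for real, you would pass "$HOME/.ssh/environment" as the first argument and the actual LD_LIBRARY_PATH line for your Intel installation as the second.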

Intel Vtune Amplifier Example

This builds on the previous example. Load the ICS module if you haven't done so already; it takes care of licensing. Then run the following to add the VTune binaries to your $PATH and launch the GUI.

    ~]$ module load /cm/shared/compilers/intel/ics/ /base
    ~]$ source /cm/shared/apps/ics/ /vtune_amplifier_xe/amplxe-vars.csh
    ~]$ amplxe-gui &

A window should appear.
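If no window appears, one thing worth checking is whether amplxe-gui actually ended up on your $PATH. The sketch below is one possible check, assuming a POSIX-style shell. The function name require_cmd is hypothetical and not part of VTune or the cluster environment.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if the named command can be found on $PATH,
# otherwise print a short message to stderr and return non-zero.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "$1 not found on PATH" >&2
        return 1
    }
}

# Report whether the VTune GUI launcher is reachable from this shell.
if require_cmd amplxe-gui; then
    echo "amplxe-gui is available"
else
    echo "load the ICS module and source the amplxe-vars script first"
fi
```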

At this point, go to File -> New -> Project. Give it a name and pick your location. Click Create Project. A window will appear with the project properties. Fill in the fields below.

    Application: /usr/bin/ssh
    Application parameters: crphi-008-mic0 /home/tstil004/helloflops3
    Working directory: /home/tstil004

More settings are available, but these are enough to run the example. To run natively, the application is ssh, which first logs in to mic0 on the same node, crphi-008, and starts the program there. You could also use crphi-008-mic1. Note that crphi-008 is simply the node the scheduler picked after phi-login was typed; replace crphi-008 with whichever node you were given.
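Because the node name changes from session to session, it can be convenient to build the Application parameters string from the node name instead of typing it by hand. The sketch below assumes the naming pattern seen in this example (node name followed by -mic0 or -mic1). The function names mic_host and vtune_app_params are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build a coprocessor hostname from a node name and a
# card index, following the <node>-mic<N> pattern used in this example.
mic_host() {
    printf '%s-mic%s\n' "$1" "$2"
}

# Hypothetical helper: build the "Application parameters" field, which is
# the coprocessor hostname followed by the path to the binary.
vtune_app_params() {
    printf '%s %s\n' "$(mic_host "$1" "$2")" "$3"
}

node="crphi-008"    # substitute the node you were given by phi-login
vtune_app_params "$node" 0 "/home/tstil004/helloflops3"
# prints: crphi-008-mic0 /home/tstil004/helloflops3
```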

Click OK.

Then go to File -> New -> Analysis. Under Knights Corner Platform, click on Hotspots. In this example, we're using mic0 so leave the 0 there. Click Start.

You will see the output in your terminal and the results of the run will appear in the appropriate tabs.
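The GFlops figures in the terminal output come from the formula in helloflops3.c: 1.0e-9 * numthreads * LOOP_COUNT * MAXFLOPS_ITERS * FLOPSPERCALC, divided by the elapsed seconds. The sketch below reproduces that arithmetic with awk. The thread count (240), LOOP_COUNT (128), and FLOPSPERCALC (2) are taken from this document, while the iteration count (1000000) and the elapsed time (2.0 s) are made-up illustrative values, not measurements from Turing. The function names are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: total work in Gflop, following the formula in
# helloflops3.c. Arguments: threads, loop count, iterations, flops per calc.
gflop_total() {
    LC_ALL=C awk -v t="$1" -v l="$2" -v n="$3" -v f="$4" \
        'BEGIN { printf "%.3f\n", 1.0e-9 * t * l * n * f }'
}

# Hypothetical helper: rate in GFlops per second.
# Arguments: total Gflop, elapsed seconds.
gflop_rate() {
    LC_ALL=C awk -v g="$1" -v s="$2" 'BEGIN { printf "%.3f\n", g / s }'
}

# Illustrative values only: the iteration count and time are assumptions.
total="$(gflop_total 240 128 1000000 2)"
echo "Gflops = $total"                                  # 61.440
echo "GFlops per sec = $(gflop_rate "$total" 2.0)"      # 30.720
```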

Appendix

The C code example can also be found below [1].

    //
    // helloflops3
    //
    // A simple example that gets lots of Flops on
    // Intel(r) Xeon Phi(tm) coprocessors using openmp to scale
    //
    // Taken from Jeffers and Reinders.
    // "Intel Xeon Phi Coprocessor High-Performance Programming."
    //

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <omp.h>
    #include <sys/time.h>

    // dtime - utility routine to return the current wall clock time
    double dtime()
    {
        double tseconds = 0.0;
        struct timeval mytime;
        gettimeofday(&mytime, (struct timezone*)0);
        tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
        return( tseconds );
    }

    #define FLOPS_ARRAY_SIZE (1024*1024)
    #define MAXFLOPS_ITERS      /* iteration count not preserved in this copy */
    #define LOOP_COUNT 128

    // Floating pt ops per inner loop iteration
    #define FLOPSPERCALC 2

    // define some arrays - 64 byte aligned for fast cache access
    float fa[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));
    float fb[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));

    // Main program - pedal to the metal...calculate tons o'flops!
    int main(int argc, char *argv[])
    {
        int numthreads;
        int i,j,k;
        double tstart, tstop, ttime;
        double gflops = 0.0;
        float a=1.1;

        // initialize the compute arrays
        printf("Initializing\r\n");

        // omp_set_num_threads(2);
        // kmp_set_defaults("KMP_AFFINITY=compact");

        #pragma omp parallel for
        for(i=0; i<FLOPS_ARRAY_SIZE; i++)
        {
            if (i==0) numthreads = omp_get_num_threads();
            fa[i] = (float)i + 0.1;
            fb[i] = (float)i + 0.2;
        }

        printf("Starting Compute\n");

        tstart = dtime();

        // use omp to scale the calculation across the threads requested
        // need to set environment variables OMP_NUM_THREADS and KMP_AFFINITY
        #pragma omp parallel for private(j,k)
        for (i=0; i<numthreads; i++)
        {
            // each thread will work on its own array section
            int offset = i*LOOP_COUNT;

            // loop many times to get lots of calculations
            for(j=0; j<MAXFLOPS_ITERS; j++)
            {
                // scale 1st array and add in the 2nd array
                for(k=0; k<LOOP_COUNT; k++)
                {
                    fa[k+offset] = a * fa[k+offset] + fb[k+offset];
                }
            }
        }

        tstop = dtime();

        // # of gigaflops we just calculated
        gflops = (double)( 1.0e-9 * numthreads * LOOP_COUNT *
                           MAXFLOPS_ITERS * FLOPSPERCALC );

        // elapsed time
        ttime = tstop - tstart;

        // Print the results
        if ((ttime) > 0.0)
        {
            printf("Using %d threads...\n", numthreads);
            printf("Gflops = %10.3lf, Secs = %10.3lf, GFlops per sec = %10.3lf\n",
                   gflops, ttime, gflops/ttime);
        }
        return( 0 );
    }
    // end main()

Sources

1. Jeffers, Jim, and James Reinders. Intel Xeon Phi Coprocessor High-performance Programming. Amsterdam: Elsevier; Print.