Beacon Quickstart Guide at AACE/NICS

Beacon Intel MIC Cluster

Beacon Overview
- Each compute node has two 8-core Intel Xeon E5-2670 processors with 256 GB of RAM.
- All compute nodes also contain 4 KNC cards (mic0/1/2/3) with 8 GB of RAM each, which can be accessed directly using micssh.
- Throughout this document, "beacon#" represents the name of a generic Beacon compute node and should be replaced with the actual node name, while "beacon#-mic#" represents the name of a MIC on a compute node.
- A queuing system is in place that gives users their own compute nodes, which helps prevent users from accessing the same MIC resource at the same time.

MIC Programming Models

Native Mode
- All code runs directly on the MIC card.
- Any libraries used will need to be recompiled for native mode.
- To compile a program for native mode, use the compiler flag -mmic.
- Parallelism across the cores is typically achieved through threads.
- The executable, input files, and all libraries need to be copied over to the MIC card.
- The location of all native mode libraries, custom or provided by a module, needs to be added to the LD_LIBRARY_PATH environment variable.

Offload Mode
- Code starts running on the host.
- Parallel regions of code are specified to run on the MIC using pragmas/directives.
- Data is copied to the card either explicitly or implicitly (implicit copying is used for complex data types involving pointers and is only available in C++).
- Automatic Offload (AO) is available for certain MKL functions (see the sketch below):
  o ?gemm, ?trsm, ?trmm, ?potrf, ?geqrf, and ?getrf
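
A minimal sketch of using Automatic Offload, not taken from the original guide: for a host binary that is linked against MKL and calls one of the functions above, AO can typically be enabled through environment variables alone. The binary name my_mkl_app is hypothetical.

export MKL_MIC_ENABLE=1    # ask MKL to offload eligible calls (e.g. large ?gemm) to the KNC cards
export OFFLOAD_REPORT=2    # optional: report what was offloaded and how long it took
./my_mkl_app               # hypothetical host executable linked against MKL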

Access and Login

Beacon can be accessed via SSH:

ssh USERNAME@beacon.nics.utk.edu

NOTE: There is only one production Beacon system; beacon-login1.nics.utk.edu is now Beacon. The PASSCODE is your 4-digit PIN followed by the 6-digit number displayed on the OTP token. Once connected to Beacon, you will be placed on the login node.

Preventative Maintenance

Unless otherwise stated in the Message of the Day upon logging into Beacon, Beacon is scheduled for preventative maintenance every Wednesday from 8am-12pm EST/EDT.

Compiling

Upon connecting to beacon-lgn, all Intel compilers are available for use immediately. Only the Intel compilers support the Intel MIC architecture at this time.

Language    Intel Compiler / MPI Wrapper
C           icc / mpiicc
C++         icpc / mpiicpc
Fortran     ifort / mpiifort

Notes about configure scripts:

When trying to build an application/library for native mode use with configure, the environment should be set up to use the proper compiler flag:

export CC="icc -mmic"      OR   export CC="mpiicc -mmic"
export CXX="icpc -mmic"    OR   export CXX="mpiicpc -mmic"
export F77="ifort -mmic"   OR   export F77="mpiifort -mmic"
export F90="ifort -mmic"   OR   export F90="mpiifort -mmic"

Sometimes the configure script tries to run test binaries on the host. Since native MIC binaries cannot run on the host, the script will complain and exit. One workaround is to force a cross-compilation using --host=x86_64-k1om-linux, as in the sketch below.
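
A minimal sketch of the cross-compilation route, assuming a generic autotools package; the install prefix is hypothetical:

export CC="icc -mmic"
export CXX="icpc -mmic"
export F77="ifort -mmic"
export F90="ifort -mmic"
./configure --host=x86_64-k1om-linux --prefix=$HOME/mylib-mic
make
make install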

export CC="icc -DMMIC" export CXX="icpc -DMMIC" export F77="ifort -DMMIC" export F90="ifort -DMMIC" Run configure files=$(find./* -name Makefile) perl -p -i -e 's/-dmmic/-mmic/g' $files export CC="icc -mmic" export CXX="icpc -mmic" export F77="ifort -mmic" export F90="ifort -mmic" Run make The compiler flag may affect files other than the Makefiles, and you may need to adjust them manually. You can find them using grep grep -R DMMIC./ In some instances, you may need to specify the native mode linker and archiver export LD="/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-ld" export AR="/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-ar"

Requesting Compute Nodes with the Scheduler

While compiling should be done on the login nodes, actual computations should be done on the compute nodes. Compute nodes are named beacon# and their corresponding MICs are beacon#-mic0 through beacon#-mic1.

- Users can request an interactive session on a compute node using:
  qsub -I -A ACCOUNT_NAME
- Users can get more than one node by using:
  qsub -I -A ACCOUNT_NAME -l nodes=#
- By default, interactive jobs last 1 hour, but users can request more time by using:
  qsub -I -A ACCOUNT_NAME -l nodes=#,walltime=hh:mm:ss
- An example ACCOUNT_NAME would be UT-AACE.
- Users can also submit jobs using a submission script (a usage sketch follows the sample script below).

Sample Submission Script

#!/bin/bash
#PBS -N jobname
#PBS -A ACCOUNT_NAME
#PBS -l nodes=1
#PBS -l walltime=2:00:00

# Change to directory where script was called
cd $PBS_O_WORKDIR

# Run executable
./program arguments
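
A minimal usage sketch for the script above, not from the original guide; the file name submit.pbs is hypothetical:

qsub submit.pbs     # submit the batch script; the scheduler prints the job id
qstat -u $USER      # check the status of your jobs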

File Systems

There are 3 file systems on Beacon:

1. NFS home space at /nics/[a-d]/home/$USER
2. Lustre scratch space at /lustre/medusa/$USER
3. Local SSD scratch space at $TMPDIR

- The login node only has access to the NFS and Lustre scratch spaces.
- All compute nodes have access to all three file systems, with each compute node having its own unique local SSD scratch space.
- The MICs only have access to the Lustre and local SSD scratch spaces.

TMPDIR

The environment variable TMPDIR is created for users once compute nodes have been allocated through the queuing system. Its absolute path is determined by the job id assigned by the scheduler. The compute nodes mount the local SSD scratch space at $TMPDIR. Given the speed of the SSD drives, using $TMPDIR is preferable to using the Lustre scratch space.

On compute nodes, unique temporary directories are found at $TMPDIR/mic0, $TMPDIR/mic0/lib, and $TMPDIR/mic0/bin to aid in copying files to the KNC cards. A similar directory structure exists for mic1: $TMPDIR/mic1, $TMPDIR/mic1/lib, and $TMPDIR/mic1/bin. The first and second KNC cards mount these directories, respectively.

For mic0, if the local SSD scratch space is to be used, the native mode binary should be copied to $TMPDIR/mic0, and native mode libraries should be copied to $TMPDIR/mic0/lib. Native mode MPI and OpenMP libraries are copied by default; all other libraries, including those from modules, need to be copied manually. If there are additional utility binaries, they can be copied to $TMPDIR/mic0/bin. Similar file transfers can be made to mic1, if necessary. Alternatively, the Lustre scratch space can be used directly if any issues are found using $TMPDIR.

[Diagram: the local SSD on compute node beacon# is mounted at $TMPDIR; coprocessor beacon#-mic0 mounts $TMPDIR/mic0 and beacon#-mic1 mounts $TMPDIR/mic1 as their own $TMPDIR.]
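
A quick illustrative check of this layout from an allocated compute node, not from the original guide:

echo $TMPDIR                                           # job-specific path on the local SSD
ls -d $TMPDIR/mic0 $TMPDIR/mic0/lib $TMPDIR/mic0/bin   # staging directories for the first KNC card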

Custom Beacon Scripts

Any secure communication with a MIC requires unique ssh keys that are automatically generated once the scheduler assigns compute nodes. Custom scripts have been created to use these ssh keys, which prevents prompts asking users for passwords.

Traditional Command    Custom Beacon Script
ssh                    micssh
scp                    micscp
mpirun/mpiexec         micmpiexec

Running Jobs and Copying Files to the KNC Cards

After compiling source code, request a compute node to run the executable. Once connected to a compute node, offload mode executables can be run directly. Native mode executables require manual copying of libraries, binaries, and input data to either the local SSD scratch space:

cp native_program.mic $TMPDIR/mic0
cp necessary_library.so $TMPDIR/mic0/lib

or the Lustre scratch space, with a folder_name of your choice:

mkdir /lustre/medusa/$USER/folder_name
cp native_program.mic /lustre/medusa/$USER/folder_name
mkdir /lustre/medusa/$USER/folder_name/lib
cp necessary_library.so /lustre/medusa/$USER/folder_name/lib

Once files are copied over, direct access to a KNC card is available through the micssh command:

micssh beacon#-mic0

To see the files that were copied to the local SSD scratch space, change directory to $TMPDIR:

cd $TMPDIR
ls

If native mode libraries were copied to the Lustre scratch space, then LD_LIBRARY_PATH needs to be modified accordingly:

export LD_LIBRARY_PATH=/lustre/medusa/$USER/folder_name/lib:$LD_LIBRARY_PATH

After the native mode application is run, type exit to return to the compute node host. Output files located on the local SSD scratch space can then be copied from

$TMPDIR/mic0 and/or $TMPDIR/mic1 to the user's home directory or to the Lustre scratch space. Files not copied from the local SSD scratch space will be lost once the interactive session is over.

If you are planning to run MPI on MICs on multiple nodes with the local SSD scratch space, you also need to copy files to the MICs on the other assigned compute nodes. This can be done manually by first determining which nodes you have been assigned using cat $PBS_NODEFILE and then, for each assigned node, copying the necessary files using micssh or micscp:

micssh beacon# cp absolute_path/file_to_copy $TMPDIR/mic#

or

micscp absolute_path/file_to_copy beacon#:$TMPDIR/mic#

Instead of doing this manually, the custom allmicput script can be used (a usage sketch follows the option list below).

Allmicput

The allmicput script can easily copy files to $TMPDIR on all assigned MICs.

Usage: allmicput [[-t] FILE...] [-l LIBRARY...] [-x BINARY...] [-d DIR FILE...]

Copy the listed files to the corresponding directory on every MIC card in the current PBS job.

[-t] FILE...      the specified file(s) are copied to $TMPDIR on each MIC
-T LISTFILE       the files in LISTFILE are copied to $TMPDIR on each MIC
-l LIBRARY...     the specified file(s) are copied to $TMPDIR/lib on each MIC
-L LISTFILE       the files in LISTFILE are copied to $TMPDIR/lib on each MIC
-x BINARY...      the specified file(s) are copied to $TMPDIR/bin on each MIC
-X LISTFILE       the files in LISTFILE are copied to $TMPDIR/bin on each MIC
-d DIR FILE...    the specified file(s) are copied to $TMPDIR/DIR on each MIC
-D DIR LISTFILE   the files in LISTFILE are copied to $TMPDIR/DIR on each MIC
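
A minimal usage sketch for allmicput, reusing the file names from the copy examples above:

allmicput native_program.mic -l necessary_library.so    # binary to $TMPDIR and library to $TMPDIR/lib on every assigned MIC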

Native Mode Shared Libraries

Unless provided by a module, all shared libraries need to be recompiled for native mode use.

1. Compile the library source code:
   icc -mmic -c -fpic mylib.c
   icpc -mmic -c -fpic mylib.cpp
   ifort -mmic -c -fpic mylib.f90
2. Use the -shared compiler flag to create the library from the object file:
   icc -mmic -shared -o libmylib.so mylib.o
   icpc -mmic -shared -o libmylib.so mylib.o
   ifort -mmic -shared -o libmylib.so mylib.o
3. Compile and link the native application code with the native shared object:
   icc -mmic main.c libmylib.so
   icpc -mmic main.cpp libmylib.so
   ifort -mmic main.f90 libmylib.so
4. Copy the binary and library over to the MIC before executing:
   cp a.out $TMPDIR/mic#
   cp libmylib.so $TMPDIR/mic#/lib

The location of all native mode libraries, custom or provided by a module, needs to be added to the LD_LIBRARY_PATH environment variable (a run sketch for this example follows the Debugging notes below).

Debugging

Intel debuggers are available for both the host and the KNC cards.
- idbc is the command line debugger for the host.
- micidbc is the command line debugger for the KNC cards.
  o Usage: micidbc -wdir $TMPDIR -tco -rconnect=tcpip:mic0:2000
  o Refer to the Debugging on Beacon Lab for further details.
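
Returning to the shared library steps above, a minimal run sketch assuming the files were staged for mic0; the assumption here is that $TMPDIR/mic0/lib on the host appears as $TMPDIR/lib on the card, as described in the TMPDIR section:

micssh beacon#-mic0
cd $TMPDIR
export LD_LIBRARY_PATH=$TMPDIR/lib:$LD_LIBRARY_PATH   # assumption: custom libraries were copied to $TMPDIR/mic0/lib on the host
./a.out
exit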

Modules

The modules software package is installed on Beacon; it allows you to dynamically modify your user environment by using modulefiles. Typical uses of modulefiles include adjusting the PATH and LD_LIBRARY_PATH environment variables for use with a particular module. Below are some commands for working with modules (a short usage sketch follows the table).

Command          Description
module list      Show what modules are currently loaded
module avail     Show what modules can be loaded
module load      Load a module
module unload    Unload a module
module swap      Swap a currently loaded module for an unloaded module
module help      Display a description of the module
module show      Display how a module would affect the environment if it were loaded

Documentation and Sample Code

Official Intel documentation can always be found at /global/opt/intel/composerxe/documentation/en_us. Intel's sample codes can always be found at /global/opt/intel/composerxe/samples/en_us. More detailed information on how to program for the MIC can be found at Intel's website: http://software.intel.com/mic-developer
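
A short usage sketch of the module commands above; the intel-mpi module name is taken from the Intel MPI section later in this guide:

module list               # see what is already loaded (intel-mpi is loaded by default)
module avail              # list the modules that can be loaded
module show intel-mpi     # display how loading intel-mpi would change the environment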

Native Mode Example

We will take a simple OpenMP code that calculates pi and run it on a MIC card on Beacon.

1. SSH into Beacon.
2. Request a compute node.
3. Make a folder if you wish and change directory to it.
4. Using your favorite text editor, create the file omp_pi_native.c from the following:

#include <stdio.h>

int main()
{
    int num_steps = 1000000;
    double step;
    int i;
    double pi, sum = 0.0;

    step = 1.0/(double) num_steps;

    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= num_steps; i++) {
        double x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;

    printf("pi is calculated to be = %f\n", pi);
    return 0;
}

5. Compile for native use on the MIC:
   icc -openmp -mmic -o omp_pi_native omp_pi_native.c
6. Copy omp_pi_native to mic0:
   cp omp_pi_native $TMPDIR/mic0
7. SSH to mic0 via micssh:
   micssh beacon#-mic0
8. Change to the TMPDIR directory:
   cd $TMPDIR
9. Set the environment variable OMP_NUM_THREADS to specify the number of threads to be used:
   export OMP_NUM_THREADS=30
10. Run the executable:
   ./omp_pi_native
11. Exit the ssh session and return to the host:
   exit

Offload Mode Example

We will take the previous OpenMP code that calculates pi and specify a parallel region to run on a MIC card on Beacon via the offload target(mic) pragma.

1. SSH into Beacon.
2. Make a folder if you wish and change directory to it.
3. Using your favorite text editor, create the file omp_pi_offload.c from the following:

#include <stdio.h>

int main()
{
    int num_steps = 1000000;
    double step;
    int i;
    double pi, sum = 0.0;

    step = 1.0/(double) num_steps;

    #pragma offload target(mic)
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= num_steps; i++) {
        double x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;

    printf("pi is calculated to be = %f\n", pi);
    return 0;
}

4. Compile for offload mode use on the MIC:
   icc -openmp -o omp_pi_offload omp_pi_offload.c
5. Request a compute node.
6. Change to the directory containing omp_pi_offload.
7. Set the environment variables MIC_ENV_PREFIX and MIC_OMP_NUM_THREADS to specify the number of threads to be used on the coprocessor:
   export MIC_ENV_PREFIX=MIC_
   export MIC_OMP_NUM_THREADS=30
8. Run the binary:
   ./omp_pi_offload
9. Try setting the environment variable OFFLOAD_REPORT to 2 and run it again:
   export OFFLOAD_REPORT=2

Offload Mode Example Using Shared Libraries

This example is similar to the previous one, only now the function calc_pi will be called from a library.

First, make a file named calc_pi.h from the following:

#ifndef CALC_PI_H
#define CALC_PI_H

// Function prototype for calc_pi.
// Note how this function is marked for use with the Intel MIC coprocessor.
__attribute__((target(mic))) double calc_pi(int num_steps);

#endif

Next, make a file named calc_pi.c from the following:

#include "calc_pi.h"

double calc_pi(int num_steps)
{
    double step;
    int i;
    double pi, sum = 0.0;

    step = 1.0/(double) num_steps;

    #pragma omp parallel for
    for (i = 1; i <= num_steps; i++) {
        double x = (i-0.5)*step;
        #pragma omp critical
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    return pi;
}

Now make a file named calc_pi_shrd.c from the following:

#include <stdio.h>
#include "calc_pi.h"

int main()
{
    int num_steps = 1000000;
    double pi;

    #pragma offload target(mic) out(pi)
    // Call the function found in the libcalc_pi.so library
    pi = calc_pi(num_steps);

    printf("pi is calculated to be = %f\n", pi);
    return 0;
}

On the login node, compile using:

1. icc -c -fpic -openmp calc_pi.c
2. icc -shared -o libcalc_pi.so calc_pi.o
3. icc -L. -openmp -o calc_pi_shrd calc_pi_shrd.c -lcalc_pi

Request a compute node, change directory to where calc_pi_shrd is located, and then run using:

1. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:.
2. ./calc_pi_shrd

Asynchronous Offload Example

Create the file async_offload_example.c from the following:

#include <stdio.h>
#include <stdlib.h>

// NOTE: You will want to set OFFLOAD_REPORT=2 to see the offload in action.
int main()
{
    char sv;
    int i;
    int *p;

    p = (int*)calloc(1000000, sizeof(int));

    // initialize p
    for (i = 0; i < 1000000; i++)
        p[i] = -i;
    printf("\non host p[20] = %d\n", p[20]);
    fflush(stdout);

    // offload the loop to the MIC
    // can use in, out, inout, nocopy
    #pragma offload target(mic:0) inout(p:length(1000000)) signal(&sv)
    for (i = 0; i < 1000000; i++)
        p[i] = 2*p[i];

    // immediately returns so the CPU can do its own calculations
    for (i = 0; i < 1000000; i++)
        p[i] = -p[i];
    printf("\non host after computation p[20] = %d\n", p[20]);
    fflush(stdout);

    // blocks until the offload completes and sends its data back
    #pragma offload_wait target(mic:0) wait(&sv)

    // now the host's value should change
    printf("\non host after offload completes p[20] = %d\n", p[20]);
    fflush(stdout);

    // free memory on the host
    free(p);
    return 0;
}

This example allows the CPU and MIC to do work simultaneously. On the login node, compile using:

icc -o async_offload_example async_offload_example.c

then request a compute node and run the binary.

Note: The Fortran equivalent needs to initialize the signal variable sv to a unique integer value (one not shared with any other signal variable) greater than or equal to 1.

Asynchronous Offload Transfer Example

Create the file async_offload_transfer_example.c from the following:

#include <stdio.h>
#include <stdlib.h>

#define ALLOC alloc_if(1) free_if(0)
#define FREE  alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

// NOTE: You will want to set OFFLOAD_REPORT=2 to see the offload in action.
int main()
{
    char sv, sv1;
    int i;
    int *p, *q;

    p = (int*)calloc(1000000, sizeof(int));
    q = (int*)calloc(1000000, sizeof(int));

    // Now let's look at allocating and freeing memory on the MIC,
    // which can also be done asynchronously.

    // allocate memory on mic:0 without copying data
    #pragma offload_transfer target(mic:0) nocopy(p,q:length(1000000) ALLOC) signal(&sv)

    // You can now do CPU work, since the transfer returns immediately.
    // initialize p and q
    for (i = 0; i < 1000000; i++) {
        p[i] = -i;
        q[i] = -i;
    }
    printf("\non host during data transfer p[20] = %d, q[20] = %d\n", p[20], q[20]);
    fflush(stdout);

    // now do the offload: copy p in, copy q out
    #pragma offload target(mic:0) in(p:length(1000000) REUSE) out(q:length(1000000) REUSE) wait(&sv) signal(&sv1)
    for (i = 0; i < 1000000; i++)
        q[i] = -p[i];

    // free memory on mic:0 without copying data
    #pragma offload_transfer target(mic:0) nocopy(p,q:length(1000000) FREE) wait(&sv1)

    printf("\non host after offload p[20] = %d, q[20] = %d\n", p[20], q[20]);
    fflush(stdout);

    // free memory on the host
    free(p);
    free(q);
    return 0;
}

Compile similarly to the previous example and then run the resulting binary. In this example, the second offload waits until the first one finishes. During that time the CPU is initializing the values of p and q locally, while mic0 is allocating memory.

Intel MPI on the MIC Architecture

Access to the Intel MPI tools and libraries on Beacon is managed through the module system. The intel-mpi module is loaded by default upon login. Part of the Intel MPI environment is the mpiicc command, which ensures that icc is invoked with the necessary options for MPI. Note: for Fortran, do NOT use mpif90, but rather mpiifort.

The simple mpi_hello.c example below demonstrates how to compile and run MPI applications on Beacon:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char name[64];
    int rank, size, length;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);

    printf("Hello, World. I am %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

To compile a version that runs on the host system:

mpiicc -o mpi_hello mpi_hello.c

To compile a version that runs on a MIC card:

mpiicc -mmic -o mpi_hello.mic mpi_hello.c

The .mic suffix is there to distinguish that the binary is to be executed on a MIC card. Any suffix can be used, but a separate binary must be compiled.

Now that the binaries are created, compute nodes can be requested and the MPI applications can be launched. The micmpiexec command can be used to launch the MPI program on the host (Xeon) node by specifying the host node:

micmpiexec -n 2 -host beacon# ./mpi_hello

In order to run it on a MIC, you must first copy the correct binary to the card(s):

cp mpi_hello.mic $TMPDIR/mic0/
cp mpi_hello.mic $TMPDIR/mic1/

and/or, if using MICs not on the working node:

micssh beacon# cp mpi_hello.mic $TMPDIR/mic0
micssh beacon# cp mpi_hello.mic $TMPDIR/mic1

In order to tell mpiexec to use the micssh command to access the MIC, we use the micmpiexec command to run the MPI program:

micmpiexec -n 2 -host beacon#-mic0 -wdir $TMPDIR $TMPDIR/mpi_hello.mic

And to use both MIC cards:

micmpiexec -n 2 -wdir $TMPDIR -host beacon#-mic0 $TMPDIR/mpi_hello.mic : -n 2 -wdir $TMPDIR -host beacon#-mic1 $TMPDIR/mpi_hello.mic

MPI can also be used in heterogeneous mode, utilizing both the Xeon host and one or more MIC cards from any node you have been allocated:

micmpiexec -n 2 -wdir $TMPDIR -host beacon#-mic0 $TMPDIR/mpi_hello.mic : -n 2 -host beacon# ./mpi_hello

micmpiexec -n 2 -wdir $TMPDIR -host beacon#-mic0 $TMPDIR/mpi_hello.mic : -n 2 -wdir $TMPDIR -host beacon#-mic1 $TMPDIR/mpi_hello.mic : -n 2 -host beacon# ./mpi_hello

micmpiexec -n 2 -host beacon# ./mpi_hello.mic : -n 2 -host beacon# ./mpi_hello.mic

Using Custom Native Libraries with MPI

If custom native libraries are to be used, they should be properly copied over to $TMPDIR/mic#/lib. If the application is launched using micmpiexec, then the environment should already be properly set to use these libraries.
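
A minimal batch-mode sketch, not from the original guide, combining the earlier sample submission script with a native MIC run of mpi_hello; the job name and rank count are illustrative, and it assumes the job script executes on the allocated compute node so that hostname expands to that node's beacon# name:

#!/bin/bash
#PBS -N mpi_hello_mic
#PBS -A ACCOUNT_NAME
#PBS -l nodes=1
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR

# Stage the native binary onto the first KNC card of this node
cp mpi_hello.mic $TMPDIR/mic0/

# Launch 2 ranks on the card
micmpiexec -n 2 -wdir $TMPDIR -host $(hostname)-mic0 $TMPDIR/mpi_hello.mic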

MPI Machine File

Instead of listing all the MPI hosts by hand on the command line, a machine file can be created and used. The contents of the machine file should be of the form:

<host>:<number of ranks>

The following is an example machine file named hosts_file:

beacon11:8
beacon12:8
beacon11-mic0:2
beacon11-mic1:2
beacon12-mic0:2
beacon12-mic1:2

This machine file could be used to launch an MPI application with:

micmpiexec -machinefile hosts_file -n 16 ./application : -wdir $TMPDIR -n 8 $TMPDIR/application.mic

generate-mic-hostlist

A custom script named generate-mic-hostlist has been created for Beacon that generates machine files for you (a worked sketch appears after the Getting Help section):

generate-mic-hostlist TYPE NUM_MIC NUM_XEON > machines

where:
- TYPE is offload, micnative, or hybrid
  (Note: if TYPE=offload, the generated machine file simply lists each node the scheduler has assigned once.)
- NUM_MIC is the number of MPI ranks to place on each MIC
- NUM_XEON is the number of MPI ranks to place on each CPU host
- machines is the name of the machine file to be created

Getting Help

If you need assistance using the Beacon resources, or have any questions, comments, suggestions, or concerns regarding the use of Beacon, please send an email to help@nics.utk.edu.
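
Returning to generate-mic-hostlist, a minimal sketch that ties the generated machine file to micmpiexec; the rank counts are illustrative and mpi_hello reuses the earlier example binaries:

generate-mic-hostlist hybrid 2 8 > machines    # 2 ranks per MIC, 8 ranks per host CPU
micmpiexec -machinefile machines -n 16 ./mpi_hello : -wdir $TMPDIR -n 8 $TMPDIR/mpi_hello.mic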

Issues to Look Out For

If, at any time, you experience any of the following issues, please report them in an email to help@nics.utk.edu. In most cases, simply resubmitting your job will work, but we still need to know about the issues encountered.

- Failure to mount any directory. Example:

  mount: mounting beacon1:/lustre/medusa/user on /lustre/medusa/user failed: Device or resource busy

- The number of available nodes is 0. Use showq to see the status of nodes and any reservations that might be present. Example:

  [user@beacon-lgn lib]$ showq
  active jobs------------------------
  JOBID   USERNAME   STATE     PROCS   REMAINING
  ####    user1      Running   16      1:42:17
  ####    user2      Running   16      18:22:25
  ####    user3      Running   64      5:34:07
  ####    user4      Running   32      11:34:07
  ####    user5      Running   32      5:48:04

  5 active jobs
  80 of 80 processors in use by local jobs (100.00%)
  4 of 5 nodes active (80.00%)

- Your job is sitting in the queue or an interactive job is not starting. Please refer to the MOTD for information about maintenance (PM or EM). This is usually on Wednesday and occasionally on Tuesday when the file system is down. If you are submitting an interactive job with a walltime that crosses into the reservation for PM, it will not give you a node; try submitting with a shorter walltime.

Example MOTD:

  Preventative maintenance (PM) will be performed on Beacon every Wednesday from 8am to noon Eastern time unless noted otherwise. The MIC driver on the compute nodes has been updated to version 2.1.3653-10. The PM for 4/3 has been cancelled.

- You get the following message when a job is submitted:

  Warning: Cannot access allocation software. Please contact help@xsede.org if you need assistance.

  This is due to a scheduler/batch incompatibility issue that will be fixed once we obtain a new license. This does NOT mean your job didn't go through. Please ignore this message.

This material is based upon work supported by the National Science Foundation under Grant Number 1137097. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.