Vincent C. Betro, Ph.D. NICS March 6, 2014


2 NSF Acknowledgement This material is based upon work supported by the National Science Foundation under Grant Number. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

3 Programming Models
Native Mode: everything runs on the MIC. May have issues with libraries that do not exist on the card and need to be copied over (e.g., MKL).
Offload Mode: the serial portion runs on the host; parallel portions are offloaded and run on the MIC.

4 Offload Mode Code starts running on the host, and regions designated for offload via pragmas are run on the MIC card when encountered. The host CPU and the MIC cards do not share memory in hardware, so data is passed to and from the MIC card either explicitly or implicitly (C/C++ only).
C/C++ syntax: #pragma offload <clauses> <statement>
Fortran syntax: !dir$ offload <clauses> <statement>
The statement immediately following the offload pragma/directive is run on a coprocessor.
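As a minimal, self-contained C sketch of this model (the array name and size are illustrative, not from the slides):

#include <stdio.h>

int main(void)
{
    double data[1000];
    int i;

    /* The block after the pragma executes on the coprocessor; a named
       array of known size like data is copied in and out by default. */
    #pragma offload target(mic)
    {
        for (i = 0; i < 1000; i++)
            data[i] = 2.0 * i;
    }

    printf("data[10] = %f\n", data[10]);
    return 0;
}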

5 I/O Proxies Standard I/O calls are proxied from the MIC card to the host. For example, opening files or displaying output to the console, while called from the MIC card, is actually performed by the host. I/O can also be performed from an offload section; this requires an NFS-mounted file system, the file pointer to be passed into the offload clause through nocopy, and permissions on the file to be +xrw for micuser.

6 Marking Variables/Functions for Use on the MIC In offload mode, the compiler needs to know ahead of time which functions will run on the MIC.
C/C++ syntax: __attribute__((target(mic)))
Fortran syntax: !dir$ attributes offload:mic :: <rtn-name>
Any variables that are to exist on both the host and the MIC also need to be known by the compiler. Whole blocks of declarations can be marked at once (C/C++ only, since virtual shared memory is used):
#pragma offload_attribute(push, target(mic))
#pragma offload_attribute(pop)
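A short C sketch of both mechanisms (the function and variable names are hypothetical):

/* Mark a single function for use on both host and MIC */
__attribute__((target(mic)))
int square(int x) { return x * x; }

/* Mark a whole block of declarations at once */
#pragma offload_attribute(push, target(mic))
int lookup_table[256];      /* visible on host and MIC */
double scale_factor;
#pragma offload_attribute(pop)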

7 Explicit Copy The programmer identifies the variables that need copying to and from the card in the offload directive.
C/C++ example: #pragma offload target(mic) in(data:length(size))
Fortran example: !dir$ offload target(mic) in(data:length(size))
Variables and pointers to be copied are restricted to scalars, structs of scalars, and arrays of scalars; i.e., double *var is allowed, but not double **var.

8 Explicit Copy Clauses and Modifiers

Clauses:
  target(name[:card_number])     Target specification: where to run the construct
  if (condition)                 Conditional offload: Boolean expression
  in(var-list [modifiers])       Inputs: copy from host to coprocessor
  out(var-list [modifiers])      Outputs: copy from coprocessor to host
  inout(var-list [modifiers])    Inputs & outputs: copy host to coprocessor and back when the offload completes
  nocopy(var-list [modifiers])   Non-copied data: data is local to the target

Modifiers:
  length(element-count-expr)     Copy N elements of the pointer's type
  alloc_if(condition)            Allocate memory to hold data referenced by the pointer if condition is TRUE
  free_if(condition)             Free memory used by the pointer if condition is TRUE
  align(expression)              Specify minimum memory alignment on the target
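A hedged C sketch combining several of these clauses and modifiers in one function (names and the size threshold are illustrative):

void scale_on_mic(double *in_buf, double *out_buf, int n)
{
    /* Offload only when n is large enough to be worth the transfer;
       length() gives the element counts for the two pointers, and
       alloc_if/free_if allocate card-side buffers on entry and free
       them when the offload completes. */
    #pragma offload target(mic:0) if(n > 1000) \
            in(in_buf  : length(n) alloc_if(1) free_if(1)) \
            out(out_buf: length(n) alloc_if(1) free_if(1))
    {
        for (int i = 0; i < n; i++)
            out_buf[i] = 2.0 * in_buf[i];
    }
}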

9 Implicit Copy This method is available only in C/C++. Sections of memory are maintained at the same virtual address on both the host and the MIC, which enables sharing of complex data structures that contain pointers. This shared memory is synchronized when entering and exiting an offload call, and only modified data is transferred between the CPU and the MIC.

10 Dynamic Memory Allocation Using Implicit Copies Special functions are needed in order to allocate and free dynamic memory for implicit copies:
_Offload_shared_malloc
_Offload_shared_aligned_malloc
_Offload_shared_free
_Offload_shared_aligned_free
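A hedged usage sketch, assuming the conventional malloc-style signatures for these functions (byte count in, pointer out); _Cilk_shared is the compiler spelling of the keyword the slides abbreviate as _Shared:

#include <stdio.h>

int _Cilk_shared *shared_data;   /* local pointer to shared data */

int main(void)
{
    int n = 1000;
    /* Allocate n ints in the shared arena */
    shared_data = (int _Cilk_shared *)_Offload_shared_malloc(n * sizeof(int));
    /* ... shared_data is now usable from both host and coprocessor ... */
    _Offload_shared_free(shared_data);
    return 0;
}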

11 The _Shared Keyword for Data and Functions

Function: int _Shared f(int x) { return x+1; }
  Versions are generated for both the CPU and the card; may be called from either side.
Global: _Shared int x = 0;
  Visible on both sides.
File/function static: static _Shared int x;
  Visible on both sides, but only to code within the file/function.
Class: class _Shared x { ... };
  Class methods, members, and operators are available on both sides.
Pointer to shared data: int _Shared *p;
  p is local (not shared), and can point to shared data.
A shared pointer: int *_Shared p;
  p is shared; it should only point at shared data.
Entire blocks of code: #pragma offload_attribute(push, _Shared) ... #pragma offload_attribute(pop)
  Mark entire files or large blocks of code _Shared using this pragma pair.

12 Offloading Using Implicit Copy Rather than using a pragma directive, the keyword _Offload is used when calling a function to be run on the MIC.
Examples:
x = _Offload function(y);
x = _Offload_to (card_number) function(y);
Note: the function needs to be defined using the _Cilk_shared keyword.
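Put together, a minimal sketch (the function name is illustrative; newer Intel compilers also spell the call keyword _Cilk_offload):

/* Defined with _Cilk_shared so versions exist for both host and MIC */
int _Cilk_shared add_one(int y)
{
    return y + 1;
}

int main(void)
{
    int x = _Offload add_one(41);   /* runs add_one() on the MIC */
    return x == 42 ? 0 : 1;
}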

13 Explicit/Implicit Copy Comparison

Offload via explicit data copying:
  Language support: Fortran, C, C++ (C++ functions may be called, but C++ classes cannot be transferred)
  Syntax: pragmas/directives, i.e., #pragma offload in C/C++ and !dir$ offload in Fortran
  Used for: offloads that transfer contiguous blocks of data

Offload via implicit data copying:
  Language support: C, C++
  Syntax: keywords _Shared and _Offload
  Used for: offloads that transfer all or parts of complex data structures, or many small pieces of data

14 Compiling Instructions For offload mode, no special compiler flag is needed. To generate host-only code (i.e., to ignore offload pragmas/directives), use the compiler flag -no-offload.
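For example, with a hypothetical source file hello.c:

icc -o hello_offload hello.c            # offload pragmas honored by default
icc -no-offload -o hello_host hello.c   # host-only binary; pragmas ignored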

15 Running Code that Offloads on Beacon Request a compute node from Beacon: qsub -I -A UT-AACE. Then locate the generated binary and execute the host binary, i.e., ./a.out

16 Useful Environment Variables The following applies only to offload-mode execution. All environment variables defined on the host are replicated on the MIC in offload mode. To modify MIC-specific values, MIC_ENV_PREFIX must be defined:
OMP_NUM_THREADS=8
OMP_STACKSIZE=16M
MIC_ENV_PREFIX=MIC_
MIC_OMP_NUM_THREADS=96
MIC_OMP_STACKSIZE=4M
For csh: setenv ENV_VARIABLE VALUE
For sh: export ENV_VARIABLE=VALUE

17 Useful Environment Variables, Part 2 OFFLOAD_REPORT can be useful when trying to debug code that offloads:
OFFLOAD_REPORT=1 gives basic information (e.g., CPU time) about whether code blocks marked for offload are running on the host or the coprocessor.
OFFLOAD_REPORT=2 gives detailed information (e.g., CPU time and data transfer) about the offload process.
Use MIC_HOST_LOG to output traces to a file, e.g., MIC_HOST_LOG=~/app/mic.log

18 Using Intel's Math Kernel Library (MKL) in Automatic Offload Mode Currently, only the following functions are automatic-offload enabled:
BLAS: ?GEMM, ?TRSM, ?TRMM, ?SYRK, and ?HERK
LAPACK: SGETRF, SPOTRF, and SGEQRF
Just call the function and the magic happens behind the scenes:

#include <mkl.h>   /* necessary to use the service functions */

/* The following must be run to use automatic offload */
mkl_mic_enable();  /* or set MKL_MIC_ENABLE in the environment */

float *A, *B, *C;  /* matrices */
sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);

19 Using Intel's Math Kernel Library (MKL) in Automatic Offload Mode The following environment variables also need to be set:
export MKL_MIC_ENABLE=1
export OFFLOAD_DEVICES=0,1
export OFFLOAD_REPORT=2 (if you want to see the report)
Currently, Intel only supports automatic offload to mic0 and mic1.

20 Using Intel's Math Kernel Library (MKL) in Automatic Offload Mode The row or column size of the matrix must be greater than 2048, or else the call just runs on the host. Row and column sizes also need to be multiples of 16 at the moment. Compile with the -mkl compiler flag.
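Putting the last three slides together, here is a hedged, self-contained sketch of an automatic-offload SGEMM call; the matrix size (4096, a multiple of 16 and greater than 2048) and the initialization are illustrative. Compile with icc -mkl sgemm_ao.c.

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    int n = 4096;                 /* multiple of 16 and > 2048 */
    float alpha = 1.0f, beta = 0.0f;
    float *A = malloc((size_t)n * n * sizeof(float));
    float *B = malloc((size_t)n * n * sizeof(float));
    float *C = malloc((size_t)n * n * sizeof(float));

    for (long i = 0; i < (long)n * n; i++) {
        A[i] = 1.0f;
        B[i] = 2.0f;
    }

    mkl_mic_enable();             /* enable automatic offload */
    sgemm("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
    printf("C[0] = %f\n", C[0]);  /* expect 2.0 * n */

    free(A); free(B); free(C);
    return 0;
}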

21 Using Intel's Math Kernel Library (MKL) in Automatic Offload Mode Functions to fine-tune automatic offload:
mkl_mic_set_workdivision(double work_div, int device_num)
  Specify how much work is done on each device as a number between 0.0 and 1.0. The sum over all devices must be 1.0; the runtime adjusts it if it is not, so you only need to set this for N-1 devices. The host is device 0 (MKL_MIC_HOST_DEVICE), the first MIC is device 1, the second is device 2, etc. Setting the division to -1 invokes automatic load balancing (MKL_MIC_AUTO_WORKDIVISION).
mkl_mic_get_workdivision(double *work_div, int device_num)
  Lets you find out how the runtime actually divided the work.
int mkl_mic_get_device_count()
  Returns how many MIC cards were detected.
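A sketch using the signatures as printed on this slide; actual MKL releases may order or type the arguments differently, so verify against mkl.h before relying on it:

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    int ndev = mkl_mic_get_device_count();   /* MIC cards detected */
    printf("%d MIC card(s) found\n", ndev);

    /* Give the host (device 0) 20% of the work; the runtime
       distributes the remainder across the cards. */
    mkl_mic_set_workdivision(0.2, MKL_MIC_HOST_DEVICE);

    double wd;
    mkl_mic_get_workdivision(&wd, 1);        /* share given to card 1 */
    printf("card 1 work share: %f\n", wd);
    return 0;
}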

22 Offload Transfer This pragma/directive simply transfers variables or arrays to/from the specified target using either all in clauses or all out clauses.
Examples:
#pragma offload_transfer target(mic:0) in(var_a,var_b,var_c) in(array_1:length(8))
!dir$ offload_transfer target(mic:0) out(var_a,var_b,var_c) out(array_1:length(8))
This pragma/directive is synchronous: the next statement is executed (on the host) only after the data transfer is complete.

23 Using Offload Transfer to Allocate/Free Memory There may be times when you want to allocate/free memory on the MIC without any data transfer, so that persistent data can be used between multiple offload calls.
Allocate memory on mic0:
#pragma offload_transfer target(mic:0) nocopy(array_a:length(8) alloc_if(1) free_if(0))
!dir$ offload_transfer target(mic:0) nocopy(array_a:length(8) alloc_if(.true.) free_if(.false.))
Use the allocated memory on mic0:
#pragma offload target(mic:0) inout(array_a:length(8) alloc_if(0) free_if(0))
!dir$ offload target(mic:0) inout(array_a:length(8) alloc_if(.false.) free_if(.false.))
Free the memory on mic0:
#pragma offload_transfer target(mic:0) nocopy(array_a:length(8) alloc_if(0) free_if(1))
!dir$ offload_transfer target(mic:0) nocopy(array_a:length(8) alloc_if(.false.) free_if(.true.))
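Tying those three steps into one hedged C sketch (the function name, pointer name, and length are illustrative):

#include <stdlib.h>
#define N 8

void persistent_demo(void)
{
    double *array_a = malloc(N * sizeof(double));

    /* 1. Allocate on mic0; nocopy means no data moves */
    #pragma offload_transfer target(mic:0) \
            nocopy(array_a : length(N) alloc_if(1) free_if(0))

    /* 2. Reuse the card-side buffer across offloads; no alloc, no free */
    #pragma offload target(mic:0) \
            inout(array_a : length(N) alloc_if(0) free_if(0))
    {
        for (int i = 0; i < N; i++)
            array_a[i] = i;
    }

    /* 3. Free the card-side buffer when done */
    #pragma offload_transfer target(mic:0) \
            nocopy(array_a : length(N) alloc_if(0) free_if(1))

    free(array_a);
}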

24 Asynchronous Offload with signal()/wait() The CPU can do work while the coprocessor(s) are executing an offload statement/block; the signal/wait specifiers are used to denote asynchronous operations:
#pragma offload signal(&tag)
#pragma offload wait(&tag)
!dir$ offload signal(tag)
!dir$ offload wait(tag)
Here tag is an integer. The offload call with the wait specifier will wait until the previous offload call finishes; this is often used while data is being transferred with offload_transfer.
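A hedged C sketch of overlapping a transfer with host work (the buffer name and the doubling kernel are illustrative):

void async_demo(double *buf, int n)
{
    int tag;   /* an integer tag, as on this slide */

    /* Start the copy to mic0 and return immediately; keep the
       card-side buffer alive with free_if(0). */
    #pragma offload_transfer target(mic:0) \
            in(buf : length(n) alloc_if(1) free_if(0)) signal(&tag)

    /* ... the host can do useful work here while data is in flight ... */

    /* Runs only after the transfer above completes; copies results
       back and releases the card-side buffer. */
    #pragma offload target(mic:0) wait(&tag) \
            out(buf : length(n) alloc_if(0) free_if(1))
    {
        for (int i = 0; i < n; i++)
            buf[i] *= 2.0;
    }
}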

25 Simultaneous Computing Using OpenMP Any offloading call blocks until the statement completes. To use both the host and the MIC simultaneously, multiple threads need to be executed on the host: one or more threads contain an offload call, while other threads have the host do some work. With OpenMP, this is achieved using OpenMP task calls.

26 OpenMP Task Calls in C/C++

#pragma omp parallel
#pragma omp single
{
    #pragma omp task
    {
        #pragma offload target(mic)
        {
            <various serial code>
            #pragma omp parallel for
            for (int i = 0; i < limit; i++)
                <parallel loop body>
        }
    }
    #pragma omp task
    {
        <host code or another offload>
    }
}
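Fleshed out into a minimal complete program (array contents and sizes are illustrative); compile with icc -openmp:

#include <stdio.h>
#define N 1000000

/* a is written on the coprocessor, so it must be visible there */
__attribute__((target(mic))) double a[N];
double b[N];

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task        /* this task blocks in the offload... */
        {
            #pragma offload target(mic) out(a)
            {
                #pragma omp parallel for
                for (int j = 0; j < N; j++)
                    a[j] = j * 0.5;
            }
        }

        #pragma omp task        /* ...while this task keeps the host busy */
        {
            for (int j = 0; j < N; j++)
                b[j] = j * 2.0;
        }
    }   /* the implicit barrier waits for both tasks */

    printf("a[1] = %f, b[1] = %f\n", a[1], b[1]);
    return 0;
}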

27 Intel's OpenMP Pi Offload Example Copy the following Intel file to the current working directory:
cp /opt/intel/composerxe_mic/samples/en_us/c++/mic_samples/intro_samplec/samplec08.c ./
Note that additional sample files are located there. Modify it so that it can be compiled and run as a standalone program: rename the function to main, and have it return a value of 0:
int main() { ... return 0; }
Compile for offload mode on the MIC (the DEBUG keyword simply prints out the value of Pi in this code):
icc -offload-build -DDEBUG -o omp_pi_offload samplec08.c

28 Offload Example, Continued Run the binary on a compute node. Try setting OFFLOAD_REPORT to 2 and running it again:
export OFFLOAD_REPORT=2
Now reset the OFFLOAD_REPORT environment variable:
export OFFLOAD_REPORT=0
Key points:
using #pragma offload target(mic) to offload the OpenMP parallel for loop
using the environment variables MIC_ENV_PREFIX and MIC_OMP_NUM_THREADS to change the number of OpenMP threads used by the MIC
using the environment variable OFFLOAD_REPORT to see timing/communication information between the CPU and the MIC

29 Select Offload Examples The offload mode allows select portions of a code to run on the Intel MIC while the rest of it runs on the host; ideally, the offload regions are highly parallel. What follows are selected offload examples, provided by Intel, that demonstrate how to move data to and from the Intel MIC cards. Intel has many offload examples located in the following directory:
/global/opt/intel/composerxe_mic/samples/en_us/c++/mic_samples/intro_samplec/
They can be copied to a directory of your choice and then compiled with make mic.

30 SampleC01 This code computes Pi on the MIC using #pragma offload:

float pi = 0.0f;
int count = 10000;
int i;

#pragma offload target(mic)
for (i = 0; i < count; i++) {
    float t = (float)((i + 0.5f) / count);
    pi += 4.0f / (1.0f + t * t);
}
pi /= count;

#pragma offload target(mic) runs the very next line (or block of code if braces are used) on the Intel MIC; in this case the whole for loop is run on the Intel MIC. Note that pi was declared outside of the offload region, and it did not need to be explicitly copied to the MIC since it is a scalar.

31 SampleC02 This code initializes two arrays on the host, and then has the Intel MIC add the arrays together and store the result in a third array:

typedef double T;
#define SIZE 1000

#pragma offload_attribute(push, target(mic))
static T in1_02[SIZE];
static T in2_02[SIZE];
static T res_02[SIZE];
#pragma offload_attribute(pop)

static void populate_02(T* a, int s);

The #pragma offload_attribute(push/pop) pair marks the block of code between them to be used on both the host and the Intel MIC. They could instead have been marked individually with __attribute__((target(mic))). Without those statements, the Intel MIC would not be able to see/use the three arrays.

32 SampleC02, Continued The sum of the two arrays is done by the Intel MIC. Note that only a single Intel MIC core is used:

void sample02()
{
    int i;
    populate_02(in1_02, SIZE);
    populate_02(in2_02, SIZE);
    #pragma offload target(mic)
    for (i = 0; i < SIZE; i++)
        res_02[i] = in1_02[i] + in2_02[i];
}

33 SampleC03 This program is similar to SampleC02, except that it avoids unnecessary data transfer:

void sample03()
{
    int i;
    populate_03(in1_03, SIZE);
    populate_03(in2_03, SIZE);
    #pragma offload target(mic) in(in1_03, in2_03) out(res_03)
    for (i = 0; i < SIZE; i++)
        res_03[i] = in1_03[i] + in2_03[i];
}

Previously, all three arrays were copied to the card at the start of the offload call and then copied back at the end of the offload call. Now, only the in1_03 and in2_03 arrays are copied to the card, and only the res_03 array is copied back.

34 SampleC04 This program is similar to the previous two samples, but now we are dealing with pointers instead of the static arrays directly:

void sample04()
{
    T *p1, *p2;
    int i, s;
    populate_04(in1_04, SIZE);
    populate_04(in2_04, SIZE);
    p1 = in1_04;
    p2 = in2_04;
    s = SIZE;
    #pragma offload target(mic) in(p1, p2:length(s)) out(res_04)
    for (i = 0; i < s; i++)
        res_04[i] = p1[i] + p2[i];
}

Since the length of a pointer is not known, it must be explicitly passed via the length modifier. res_04 is still a static array in this sample.

35 SampleC05 This program is like the last, except that the sum of the arrays, via pointers, is now stored in a pointer to the result array; this pointer needs to have its length specified as well. Also, the summation now happens in the function get_result(). get_result() did not need to be marked with __attribute__((target(mic))) because it is called by the host and not by the Intel MIC:

void sample05()
{
    T my_result[SIZE];
    populate_05(in1_05, SIZE);
    populate_05(in2_05, SIZE);
    get_result(in1_05, in2_05, my_result, SIZE);
}

static void get_result(T* pin1, T* pin2, T* res, int s)
{
    int i;
    #pragma offload target(mic) \
        in(pin1, pin2 : length(s)) \
        out(res : length(s))
    for (i = 0; i < s; i++)
        res[i] = pin1[i] + pin2[i];
}

36 SampleC07 In this program, an array of data is sent from the host to the Intel MIC in one offload call. The array values are then doubled on the MIC in a separate offload call, as long as a MIC card exists:

#define SIZE 1000
__attribute__((target(mic))) int array1[SIZE];
__attribute__((target(mic))) int send_array(int* p, int s);
__attribute__((target(mic))) void compute07(int* out, int size);

void sample07()
{
    int in_data[16] = { 1, 2, 3, 4, 5, 6, 7, 8,
                        9, 10, 11, 12, 13, 14, 15, 16 };
    int out_data[16];
    int array_sent = 0;
    int num_devices;

    // Check if coprocessor(s) are installed and available
    num_devices = _Offload_number_of_devices();

    #pragma offload target(mic : 0)
    array_sent = send_array(in_data, 16);

    #pragma offload target(mic : 0) if(array_sent) out(out_data)
    compute07(out_data, 16);
}

37 SampleC07, Continued As a reminder, __attribute__((target(mic))) makes it so both the host and the Intel MIC can see/use the variable/function. The function _Offload_number_of_devices() returns how many Intel MIC cards are available. The macro __MIC__ lets you know whether the MIC (retval of 1) or the host (retval of 0) is currently evaluating the statements:

__attribute__((target(mic))) int send_array(int* p, int s)
{
    int retval;
    int i;
    for (i = 0; i < s; i++)
        array1[i] = p[i];
#ifdef __MIC__
    retval = 1;
#else
    retval = 0;
#endif
    // Return 1 if array initialization was done on target
    return retval;
}

__attribute__((target(mic))) void compute07(int* out, int size)
{
    int i;
    for (i = 0; i < size; i++)
        out[i] = array1[i] * 2;
}

38 SampleC08 This program is like SampleC01, except that now the Pi calculation is done using an OpenMP for loop on the Intel MIC to utilize the many cores:

float pi = 0.0f;
int count = 10000;
int i;

#pragma offload target(mic)
#pragma omp parallel for reduction(+:pi)
for (i = 0; i < count; i++) {
    float t = (float)((i + 0.5f) / count);
    pi += 4.0f / (1.0f + t * t);
}
pi /= count;

39 OS Thread Affinity Mapping The Intel MIC coprocessor has N cores, each with 4 hardware thread contexts, for a total of M = 4*N threads. The OS maps the hardware threads to M procs:

OS proc    MIC core    MIC thread
0          N-1         M-1
M-3        N-1         M-4
M-2        N-1         M-3
M-1        N-1         M-2

The OS runs on proc 0, which lives on MIC core N-1. Therefore, avoid using procs 0, (M-3), (M-2), and (M-1) to avoid contention with the OS; for example, on a 60-core card M = 240, so avoid procs 0, 237, 238, and 239. This is especially important when using the offload model due to data transfer activity.

40 Additional Resources Several sample MIC programs are provided by Intel and can be found in:
/global/opt/intel/composerxe_mic/Samples/en_US/C++/mic_samples/intro_samplec
/global/opt/intel/composerxe_mic/Samples/en_US/Fortran/mic_samples/intro_samplef
Other documentation, presentations, and even a community forum can be found at the AACE Wiki, which is limited to Beacon partners due to past NDA materials.

41 Contact Vincent Betro NICS Support
