6/14/2017. The Intel Xeon Phi. Setup. Xeon Phi Internals. Fused Multiply-Add. Getting to rabbit and setting up your account. Xeon Phi Peak Performance

Size: px

Start display at page:

Download "6/14/2017. The Intel Xeon Phi. Setup. Xeon Phi Internals. Fused Multiply-Add. Getting to rabbit and setting up your account. Xeon Phi Peak Performance"

Calvin Perry
6 years ago
Views:

The Intel Xeon Phi 1 Setup 2 Xeon system Mike Bailey mjb@cs.oregonstate.

edu 2 E5-2630 Xeon Processors 8 Cores 64 GB of memory 2 TB of disk NVIDIA Titan Black 15 SMs 2880 CUDA cores 6 GB of memory OpenGL support OpenCL support Xeon Phi support: icc, icpc, libraries,

pptx Xeon Phi Internals 3 Fused Multiply-Add 4 Each Xeon Phi core has: Instruction Decode Scalar Unit Vector Unit Scalar Registers Vector Registers 32K L1 Instruction Cache The Xeon Phi chip contains

1 The Intel Xeon Phi 1 Setup 2 Xeon system Mike Bailey mjb@cs.oregonstate.edu rabbit.engr.oregonstate.edu 2 E Xeon Processors 8 Cores 64 GB of memory 2 TB of disk NVIDIA Titan Black 15 SMs 2880 CUDA cores 6 GB of memory OpenGL support OpenCL support Xeon Phi support: icc, icpc, libraries, drivers 1 core for Linux + 56 cores * 4 hyperthreads/core = 224 hyperthreads for you to use 31S1P Xeon Phi system mic0 57 Cores 22 nm 8 GB of memory No disk Application support XeonPhi.pptx Xeon Phi Internals 3 Fused Multiply-Add 4 Each Xeon Phi core has: Instruction Decode Scalar Unit Vector Unit Scalar Registers Vector Registers 32K L1 Instruction Cache The Xeon Phi chip contains almost 5B transistors! Many scientific and engineering computations take the form: D = A + (B*C); A normal multiply-add would handle this as: tmp = B*C; D = A + tmp; A fused multiply-add does it all at once, that is, when the low-order bits of B*C are ready, they are immediately added into the low-order bits of A at the same time the higher-order bits of B*C are being multiplied. 32K L1 Data Cache 512K L2 Cache / Core Ring Bus Cache is 8-way set associative. Cache lines are 64 bytes. Assembly instructions are executed in-order. Thus, hyperthreading is important! Vector registers are 512 bits wide = 16 floats. They can perform Fused Multiply-Add (FMA). Theoretical performance = almost 1 TFLOPS Consider a Base 10 example: ( 123*456 ) 123 x Can start adding the 9 the moment the 8 is produced! 56,877 Note: Normal A+(B*C) FMA A+(B*C) Xeon Phi Peak Performance 5 Getting to rabbit and setting up your account 6 Lowercase letter L To login to rabbit: ssh rabbit.engr.oregonstate.edu l yourengrusername Clock freq x # cores x # vector lanes x 2 FMA / 2 cycles to decode = GHz x 56 x 16 x 2 / 2 = Put this in your rabbit account s.cshrc : setenv INTEL_LICENSE_FILE 28518@linlic.engr.oregonstate.edu setenv SINK_LD_LIBRARY_PATH /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/composer_xe_ /compiler/lib/mic/ setenv ICCPATH /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/composer_xe_2015/bin/ set path=( $path $ICCPATH ) source /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/bin/iccvars.csh intel TFLOPS Then activate these values like this: source.cshrc FMA stands for Fused Multiply-Add. (These will be activated automatically the next time you login.) To verify that the Xeon Phi card is there: ping mic0 To see the Xeon Phi card characteristics: micinfo To run some operational tests on the Xeon Phi: miccheck 1

2 Running ping 7 Running micinfo 8 rabbit 150% ping mic0 PING rabbit-mic0.engr.oregonstate.edu ( ) 56(84) bytes of data. 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=1 ttl=64 time=290 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=2 ttl=64 time=0.385 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=3 ttl=64 time=0.242 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=4 ttl=64 time=0.230 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=5 ttl=64 time=0.225 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=6 ttl=64 time=0.261 ms rabbit 151% micinfo MicInfo Utility Log Created Mon Jan 12 10:21: System Info HOST OS : Linux OS Version : el6.x86_64 Driver Version : MPSS Version : Host Physical Memory : MB Device No: 0, Device Name: mic0 Coprocessor Model : 0x01 Coprocessor Model Ext : 0x00 Coprocessor Type : 0x00 Coprocessor Family : 0x0b Coprocessor Family Ext : 0x00 Coprocessor Stepping : B1 Board SKU : B1PRQ-31S1P ECC Mode : Enabled SMC HW Revision : Product 300W Passive CS Cores Total No of Active Cores : 57 Voltage : uv Frequency : khz Version Flash Version : SMC Firmware Version : SMC Boot Loader Version : uos Version : mpss3.4.2 Device Serial Number : ADKC Board Vendor ID : 0x8086 Device ID : 0x225e Subsystem ID : 0x2500 Coprocessor Stepping ID : 3 PCIe Width : Insufficient Privileges PCIe Speed : Insufficient Privileges PCIe Max payload size : Insufficient Privileges PCIe Max read req size : Insufficient Privileges Thermal Fan Speed Control Fan RPM Fan PWM Die Temp GDDR GDDR Vendor GDDR Version GDDR Density GDDR Size GDDR Technology GDDR Speed GDDR Frequency GDDR Voltage : 40 C : Elpida : 0x1 : 2048 Mb : 7936 MB : GDDR5 : GT/s : khz : uv Running miccheck 9 Running micsmc, I 10 rabbit 153% micsmc -a rabbit 152% miccheck MicCheck r1 Copyright 2013 Intel Corporation All Rights Reserved Executing default tests for host Test 0: Check number of devices the OS sees in the system... pass Test 1: Check mic driver is loaded... pass Test 2: Check number of devices driver sees in the system... pass Test 3: Check mpssd daemon is running... Pass Executing default tests for device: 0 Test 4 (mic0): Check device is in online state and its postcode is FF... pass Test 5 (mic0): Check ras daemon is available in device... pass Test 6 (mic0): Check running flash version is correct... pass Test 7 (mic0): Check running SMC firmware version is correct... pass mic0 (info): Device Series:... Intel(R) Xeon Phi(TM) coprocessor x100 family Device ID:... 0x225e Number of Cores: OS Version: mpss3.4.2 Flash Version: Driver Version: (root@rabbit.engr.oregonstate.edu) Stepping:... 0x3 Substepping:... 0x0 mic0 (temp): Cpu Temp: C Memory Temp: C Fan-In Temp: C Fan-Out Temp: C Core Rail Temp: C Uncore Rail Temp: C Memory Rail Temp: C Status: OK mic0 (freq): Core Frequency: GHz Total Power: Watts Low Power Limit: Watts High Power Limit: Watts Physical Power Limit: Watts mic0 (mem): Free Memory: MB Total Memory: MB Memory Usage: MB Running micsmc, II 11 Cross-compiling and running from rabbit 12 mic0 (cores): Device Utilization: User: 0.00%, System: 0.09%, Idle: 99.91% Per Core Utilization (57 cores in use) Core #1: User: 0.00%, System: 0.27%, Idle: 99.73% Core #2: User: 0.00%, System: 0.27%, Idle: 99.73% Core #3: User: 0.00%, System: 0.00%, Idle: % Core #4: User: 0.00%, System: 0.00%, Idle: % Core #5: User: 0.00%, System: 0.00%, Idle: % Core #6: User: 0.00%, System: 0.00%, Idle: % Core #7: User: 0.00%, System: 0.00%, Idle: % Core #8: User: 0.00%, System: 0.27%, Idle: 99.73% Core #9: User: 0.00%, System: 0.00%, Idle: % Core #10: User: 0.00%, System: 0.27%, Idle: 99.73% To compile on rabbit for rabbit: icpc -o.cpp -lm -openmp -align -qopt-report=3 -qopt-report-phase=vec To cross-compile on rabbit for the Xeon Phi: Note: the summary of vectorization success or failure is in a *.optvec file Core #50: User: 0.00%, System: 0.00%, Idle: % Core #52: User: 0.00%, System: 0.27%, Idle: 99.73% Core #53: User: 0.00%, System: 0.00%, Idle: % Core #54: User: 0.00%, System: 0.27%, Idle: 99.73% Core #55: User: 0.00%, System: 0.00%, Idle: % Core #56: User: 0.00%, System: 0.27%, Idle: 99.73% Core #57: User: 0.00%, System: 0.54%, Idle: 99.46% To execute on the Xeon Phi, type this on rabbit: 2

3 Gaining Access to the Cores, I 13 Gaining Access to the Cores, II 14 #pragma omp parallel for #pragma omp parallel sections #pragma omp section #pragma omp section float sum = 0.; #pragma omp parallel for reduction(+:sum) sum += A[ i ] * B[ i ] ; #pragma omp task Gaining Access to the Vector Units 15 Turning Off All Vectorization 16 icpc -mmic -o.cpp -lm -openmp -no-vec #pragma omp parallel for simd C[0:N] = A[0:N] * B[0:N] ; The sole (?) reason to do this is when running benchmarks to compare vector vs. scalar array processing. The Intel compiler does a great job of automatically vectorizing where it can. Warning: just because you didn t deliberately vectorize your code doesn t mean it didn t end up vectorized! Use the -no-vec flag. Vectorizing Conditionals 17 Reducing a Vector 18 if( D[ i ] == 0 ) else C[ i ] = A[ i ] + B[ i ] ; float f = sec_reduce_add( A[0:N] ); float f = sec_reduce_mul( A[0:N] ); float f = sec_reduce_max( A[0:N] ); float f = sec_reduce_min( A[0:N] ); int i = sec_reduce_max_ind( A[0:N] ); int i = sec_reduce_min_ind( A[0:N] ); In my tests, this was 3-4x as fast as this. C[ i ] = ( D[ i ] == 0 )? A[ i ] * B[ i ] : A[ i ] + B[ i ] ; boolean b = sec_reduce_all_zero( A[0:N] ); boolean b = sec_reduce_all_nonzero( A[0:N] ); boolean b = sec_reduce_any_zero( A[0:N] ); boolean b = sec_reduce_any_nonzero( A[0:N] ); You must specify the array length. An argument of A[:] will throw a compiler error. 3

4 float sum = 0.; sum += A[ i ]; Reducing a Vector 19 float *A = new float[nums] float *B = new float[nums]; float *C = new float[nums]; double *dtp = new double; double Time0 = omp_get_wtime( ); Offload Mode Data to send over Data to bring back on rabbit #pragma offload target(mic) in(a:length(nums)) in(b:length(nums)) out(c:length(nums)) out(dtp:length(1)) omp_set_num_threads( NUMT ); double time0 = omp_get_wtime( ); 20 In my tests, this was the same speed as this. #pragma omp parallel for simd for( int i = 0; i < NUMS; i++ ) C[i] = A[i] * B[i] ; double time1 = omp_get_wtime( ); *dtp = time1 - time0; on the Xeon Phi float sum = sec_reduce_add( A[0:N] ); double Time1 = omp_get_wtime( ); double overalldt = Time1 - Time0; double offloaddt = *dtp; fprintf( stderr, "%6d\t%6d\t%8.5lf\t%8.5lf\t%8.4f%%\t%8.2lf\n", NUMT, NUMS, overalldt, offloaddt, 100.*offloaddt/overalldt, ((double)nums)/offloaddt/ ); on rabbit Offload Mode 21 Offload Mode: Persistence Between Offloads 22 #define ALLOC #define REUSE alloc_if(1) alloc_if(0) You don t need to do anything special with the compile line: icpc -o.cpp -lm -openmp -align -qopt-report=3 -qopt-report-phase=vec./ #define RETAIN free_if(0) #define FREE free_if(1) #pragma offload target(mic) in(a:length(nums), ALLOC, RETAIN) out(c:length(nums), ALLOC, FREE ) #pragma offload target(mic) in(a:length(nums), REUSE, RETAIN) out(d:length(nums), ALLOC, RETAIN) #pragma offload target(mic) in(a:length(nums), REUSE, FREE) out(d:length(nums), REUSE, FREE ) To ensure alignment, replace this To ensure alignment, replace this float Temperature[NUMN]; float *A = (float *) malloc( NUMS*sizeof(float) ); float *B = (float *) malloc( NUMS*sizeof(float) ); float *C = (float *) malloc( NUMS*sizeof(float) ); with this: #define ALIGN64 declspec(align(64)) with this float *A = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *B = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *C = (float * )_mm_malloc( NUMS*sizeof(float), 64 ); ALIGN64 float Temperature[NUMN]; You then free memory with: _mm_free( A ); _mm_free( B ); _mm_free( C ); 4

If you want to ensure alignment, but still want to use C++ s new and delete, replace this: float *A = new float [NUMS]; float *B = new float [NUMS]; float *C = new float [NUMS]; 25 As You Create More

5 If you want to ensure alignment, but still want to use C++ s new and delete, replace this: float *A = new float [NUMS]; float *B = new float [NUMS]; float *C = new float [NUMS]; 25 As You Create More and More Threads, On Which Cores Do They End Up? If you want them spread out onto as many cores as possible, execute this: kmp_set_defaults( KMP_AFFINITY=scatter ); 26 with this float *pa = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *pb = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *pc = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *A = new(pa) float [NUMS]; float *B = new(pb) float [NUMS]; float *C = new(pc) float [NUMS]; You then free memory with: delete [ ] A; delete [ ] B; delete [ ] C; An advantage of using new and delete instead of malloc is that they allow you to use C++ constructors and destructors. If you want them packed onto the first core until it has 4, than onto the second core until it has 4, etc., execute this: kmp_set_defaults( KMP_AFFINITY=compact ); Use the scatter-mode if you want as much core-power applied to each thread as possible. Use the compact-mode if there is an advantage to some threads sharing a core s local memory with other threads. Running the Volume-Integration Program 27 Running the Volume-Integration Program 28 # of Threads MegaHeights per Second # of Divisions MegaHeights per Second # of Threads # of Divisions Multicore, no vectorization Reservation System

rabbit.engr.oregonstate.edu What is rabbit?

rabbit.engr.oregonstate.edu What is rabbit? 1 rabbit.engr.oregonstate.edu Mike Bailey mjb@cs.oregonstate.edu rabbit.pptx What is rabbit? 2 NVIDIA Titan Black PCIe Bus 15 SMs 2880 CUDA cores 6 GB of memory OpenGL support OpenCL support Xeon system