Lecture 8: Introduction to Programming the Intel Xeon Phi Coprocessor
1 Lecture 8: Introduction to Programming the Intel Xeon Phi Coprocessor. Mikhail Georgievich Kurnosov, mkurnosov@gmail.com. Course "Parallel Programming", Siberian State University of Telecommunications and Information Science (Novosibirsk). Fall semester, 2016. CC-BY
2 Intel Xeon Phi

Intel Teraflops Research Chip (Polaris, 2006-2007): 80 cores, 80 routers, self-correction system
Single-Chip Cloud Computer (SCC, 2009): 48 P54C Pentium cores, 4x6 2D-mesh of tiles (2 cores per tile)
Intel Larrabee (200X-2010): GPGPU: x86, cache coherency, 1024-bit ring bus, 4-way SMT; development cancelled in 2010
Larrabee GPU (SIGGRAPH, 2008)
Intel MIC (Intel Many Integrated Core Architecture): cache-coherent shared-memory multiprocessor, hardware multithreading (4-way SMT), wide vector registers (512 bits)
Knights Ferry (2010): 32 in-order cores, 4-way SMT, ring bus, 750 GFLOPS
Knights Corner / Xeon Phi (2012): 22 nm, >= 50 cores
Knights Landing (2nd gen., 14 nm): Silvermont (Atom) cores, 4 threads per core
Knights Hill (3rd gen., 10 nm)
3 Intel Knights Corner micro-architecture

More than 50 Pentium P54C-derived cores: x86 ISA, in-order, 4-way SMT, 512-bit SIMD units
Cache: 32 KB L1 data/instruction caches, coherent L2 cache (512 KB per core)
Ring bus: bidirectional
Attached to the host over the PCI Express bus

cnmic (oak): Intel Xeon Phi 3120A
Cores: 57, 4-way SMT (core #57 is for the OS only!): 224 threads
RAM: 6 GiB GDDR5
Intel Compiler: $ source /opt/intel_cluster/bin/iccvars.sh intel64
4 Prime counting (serial version)

int is_prime_number(int n)
{
    int limit = sqrt(n) + 1;
    for (int i = 2; i <= limit; i++) {
        if (n % i == 0)
            return 0;
    }
    return (n > 1) ? 1 : 0;
}

Number of operations: O(√n)

int count_prime_numbers(int a, int b)
{
    int nprimes = 0;
    if (a <= 2) {
        /* Count '2' as a prime number */
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
    for (int i = a; i <= b; i += 2) {
        if (is_prime_number(i))
            nprimes++;
    }
    return nprimes;
}

Checks every odd number in the interval [a, b]
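The helpers run_host_serial and run_host_parallel that main() calls later are not shown on the slides; a minimal sketch of the serial one, assuming a and b are the same file-scope globals the offload versions use, and wtime() is the timing helper (a sketch of it appears after slide 7 below):

double run_host_serial()
{
    double t = wtime();
    int n = count_prime_numbers(a, b);
    t = wtime() - t;
    printf("Result (host serial): %d\n", n);
    return t;
}

run_host_parallel is identical except that it calls count_prime_numbers_omp.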
5 Prime counting (OpenMP)

int count_prime_numbers_omp(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    #pragma omp parallel
    {
        /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
        #pragma omp for schedule(dynamic) reduction(+:nprimes)
        for (int i = a; i <= b; i += 2) {
            if (is_prime_number(i))
                nprimes++;
        }
    }
    return nprimes;
}
6 Prime counting (OpenMP)

(count_prime_numbers_omp from slide 5)

cnmic (oak.cpct.sibsutis.ru)
System board: ASUS Z10PE-D8 WS (dual CPU)
CPU: 2x Intel Xeon E5-2620 v3 (2.4 GHz, Haswell, 6 cores)
RAM: 64 GiB DDR4 (4x8 GiB + 4x8 GiB)
Coprocessor: Intel Xeon Phi 3120A (cores: 56, 4-way SMT; RAM: 6 GiB GDDR5)

$ icc -std=c99 -Wall -O2 -fopenmp -c primes.c -o primes.o
$ icc -o primes primes.o -lm -fopenmp

$ export OMP_NUM_THREADS=12
$ ./primes
Count prime numbers in [, ]
Result (host serial): 686
Result (host parallel): 686
Execution time (host serial): .76
Execution time (host parallel): .7
Speedup host_serial/host_omp: .7
7 Prime counting: Xeon Phi offload (serial)

#include <offload.h>

double run_phi_serial()
{
#ifdef __INTEL_OFFLOAD
    printf("Intel Xeon Phi devices: %d\n", _Offload_number_of_devices());
#endif
    int n;
    double t = wtime();
    #pragma offload target(mic) out(n)
    n = count_prime_numbers_phi(a, b);
    t = wtime() - t;
    printf("Result (phi serial): %d\n", n);
    return t;
}

The structured block is offloaded for execution on the coprocessor.
You must specify:
- the objects to copy into the coprocessor memory before the block executes (in)
- the objects to copy out of the coprocessor memory after the block executes (out)
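The wtime() helper used throughout these listings is not shown on the slides; a minimal sketch based on gettimeofday (omp_get_wtime() would serve equally well):

#include <sys/time.h>

double wtime()
{
    /* Wall-clock time in seconds with microsecond resolution */
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + (double)t.tv_usec * 1E-6;
}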
8 Prime counting: Xeon Phi offload (serial)

__attribute__((target(mic)))
int count_prime_numbers_phi(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
    for (int i = a; i <= b; i += 2) {
        if (is_prime_number(i))
            nprimes++;
    }
    return nprimes;
}

Functions and variables that are accessed on the coprocessor are marked with the __attribute__((target(mic))) attribute (heterogeneous compilation).
9 Prime counting: Xeon Phi offload (serial)

__attribute__((target(mic)))
int is_prime_number(int n)
{
    int limit = sqrt(n) + 1;
    for (int i = 2; i <= limit; i++) {
        if (n % i == 0)
            return 0;
    }
    return (n > 1) ? 1 : 0;
}

Functions and variables that are accessed on the coprocessor are marked with the __attribute__((target(mic))) attribute (heterogeneous compilation).
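When many declarations need the attribute, the Intel compiler can also mark a whole region of declarations for heterogeneous compilation with a push/pop pragma pair; a sketch (the declarations inside the pair are illustrative):

#pragma offload_attribute(push, target(mic))
#include <math.h>                 /* math functions callable on the coprocessor */
int is_prime_number(int n);
int count_prime_numbers_phi(int a, int b);
#pragma offload_attribute(pop)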
10 Prime counting: Xeon Phi offload (serial)

int main(int argc, char **argv)
{
    printf("Count prime numbers in [%d, %d]\n", a, b);
    double thost_serial = run_host_serial();
    double thost_par = run_host_parallel();
    double tphi_serial = run_phi_serial();
    printf("Execution time (host serial): %.6f\n", thost_serial);
    printf("Execution time (host parallel): %.6f\n", thost_par);
    printf("Execution time (phi serial): %.6f\n", tphi_serial);
    printf("Ratio phi_serial/host_serial: %.2f\n", tphi_serial / thost_serial);
    printf("Speedup host_serial/host_omp: %.2f\n", thost_serial / thost_par);
    return 0;
}

On this test a core of the Intel Xeon Phi 3120A is ~5 times slower than a core of the Intel Xeon E5-2620 v3.

$ export OMP_NUM_THREADS=12
$ ./primes
Count prime numbers in [, ]
Result (host serial): 686
Result (host parallel): 686
Intel Xeon Phi devices: 1
Result (phi serial): 686
Execution time (host serial): .8
Execution time (host parallel): .8
Execution time (phi serial): 8.96
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .74
11 Prime counting: Xeon Phi offload (OpenMP)

double run_phi_parallel()
{
#ifdef __INTEL_OFFLOAD
    printf("Intel Xeon Phi devices: %d\n", _Offload_number_of_devices());
#endif
    int n;
    double t = wtime();
    #pragma offload target(mic) out(n)
    n = count_prime_numbers_phi_omp(a, b);
    t = wtime() - t;
    printf("Result (phi parallel): %d\n", n);
    return t;
}
12 Prime counting: Xeon Phi offload (OpenMP)

__attribute__((target(mic)))
int count_prime_numbers_phi_omp(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic) reduction(+:nprimes)
        for (int i = a; i <= b; i += 2) {
            if (is_prime_number(i))
                nprimes++;
        }
    }
    return nprimes;
}
13 Prime counting: Xeon Phi offload (OpenMP)

int main(int argc, char **argv)
{
    printf("Count prime numbers in [%d, %d]\n", a, b);
    double thost_serial = run_host_serial();
    double thost_par = run_host_parallel();
    double tphi_serial = run_phi_serial();
    double tphi_par = run_phi_parallel();
    printf("Execution time (host serial): %.6f\n", thost_serial);
    printf("Execution time (host parallel): %.6f\n", thost_par);
    printf("Execution time (phi serial): %.6f\n", tphi_serial);
    printf("Execution time (phi parallel): %.6f\n", tphi_par);
    printf("Ratio phi_serial/host_serial: %.2f\n", tphi_serial / thost_serial);
    printf("Speedup host_serial/host_omp: %.2f\n", thost_serial / thost_par);
    printf("Speedup host_omp/phi_omp: %.2f\n", thost_par / tphi_par);
    printf("Speedup host_serial/phi_omp: %.2f\n", thost_serial / tphi_par);
    printf("Speedup phi_serial/phi_omp: %.2f\n", tphi_serial / tphi_par);
    return 0;
}
14 Prime counting: Xeon Phi offload (OpenMP)

(main from slide 13)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=224
./primes

$ ./launch.sh
Count prime numbers in [, ]
Execution time (host serial): .78
Execution time (host parallel): .466
Execution time (phi serial): 8.6
Execution time (phi parallel): .7784
Ratio phi_serial/host_serial: 4.85
Speedup host_serial/host_omp: .6
Speedup host_omp/phi_omp: .4
Speedup host_serial/phi_omp: .58
Speedup phi_serial/phi_omp: .4

The Phi OpenMP version on 224 threads is several times faster than the serial Phi version.
The Phi OpenMP version is ~7 times slower than the host OpenMP version.
15 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=verbose
./primes

Execution time (host serial): .9
Execution time (host parallel): .86
Execution time (phi serial): 8.78
Execution time (phi parallel): .546
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .4
Speedup phi_serial/phi_omp: .

OMP: Info #204: KMP_AFFINITY: decoding x2apic ids.
OMP: Info #205: KMP_AFFINITY: cpuid leaf 11 not supported - decoding legacy APIC ids.
OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 1,2,3,...,224
OMP: Info #156: KMP_AFFINITY: 224 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 1 packages x 56 cores/pkg x 4 threads/core (56 total cores)
16 Xeon Phi: binding threads to cores (affinity)

OMP: Info #160: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 0 thread 2
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 0 thread 3
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 1 thread 1
...
OMP: Info #171: KMP_AFFINITY: OS proc 221 maps to package 0 core 55 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 222 maps to package 0 core 55 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 223 maps to package 0 core 55 thread 2
OMP: Info #171: KMP_AFFINITY: OS proc 224 maps to package 0 core 55 thread 3

[Figure: cores 0 .. 55, four hardware threads each]
17 Xeon Phi: binding threads to cores (affinity)

OMP: Info #242: KMP_AFFINITY: pid 4 thread 0 bound to OS proc set 1
OMP: Info #242: KMP_AFFINITY: pid 4 thread 1 bound to OS proc set 5
OMP: Info #242: KMP_AFFINITY: pid 4 thread 2 bound to OS proc set 9
OMP: Info #242: KMP_AFFINITY: pid 4 thread 3 bound to OS proc set 13
...
OMP: Info #242: KMP_AFFINITY: pid 4 thread 54 bound to OS proc set 217
OMP: Info #242: KMP_AFFINITY: pid 4 thread 55 bound to OS proc set 221

Default affinity
[Figure: OMP threads 0 .. 55 placed one per core on cores 0 .. 55]
18 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,balanced
./primes

Execution time (host serial): .5
Execution time (host parallel): .67
Execution time (phi serial):
Execution time (phi parallel): .4887
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .7
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .5
Speedup phi_serial/phi_omp: 7.4

granularity=fine,balanced
[Figure: OMP threads 0 .. 55 distributed evenly across cores 0 .. 55]
19 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,compact
./primes

Execution time (host serial): .5
Execution time (host parallel): .78
Execution time (phi serial):
Execution time (phi parallel): .4787
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .7
Speedup host_omp/phi_omp: .7
Speedup host_serial/phi_omp: .8
Speedup phi_serial/phi_omp: .9

granularity=fine,compact
[Figure: OMP threads packed four per core onto cores 0 .. 13; cores 14 .. 55 idle]
20 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,scatter
./primes

Execution time (host serial): .4
Execution time (host parallel): .878
Execution time (phi serial):
Execution time (phi parallel): .55
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .5
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .
Speedup phi_serial/phi_omp: .7

granularity=fine,scatter
[Figure: OMP threads 0 .. 55 spread round-robin, one per core, across cores 0 .. 55]
21 Xeon Phi: binding threads to cores (affinity)

With NUM_THREADS greater than the number of cores:

fine,scatter: core_id = thread_id % ncores; threads are dealt round-robin across the cores, so consecutive thread ids land on different cores.
[Figure: thread placement under fine,scatter across cores 0 .. 55]

fine,balanced: threads are distributed as evenly as possible across the cores, but consecutive thread ids stay on the same core.
[Figure: thread placement under fine,balanced across cores 0 .. 55]

fine,compact: each core is filled with 4 threads before the next core is used; trailing cores stay idle.
[Figure: thread placement under fine,compact]
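To see where threads actually land, a small probe can be offloaded to the coprocessor; a sketch (not from the slides) that reports each OpenMP thread's current logical CPU via sched_getcpu():

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

__attribute__((target(mic)))
void report_affinity()
{
    #pragma omp parallel
    {
        /* Each thread prints the OS proc it is currently running on */
        printf("thread %3d on OS proc %3d\n",
               omp_get_thread_num(), sched_getcpu());
    }
}

int main()
{
    #pragma offload target(mic)
    report_affinity();
    return 0;
}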
22 Matrix multiplication: SGEMM
23 Matrix multiplication: SGEMM (serial)

enum { N = ..., M = ..., Q = ..., NREPS = ... };

/* Naive matrix multiplication C[n, q] = A[n, m] * B[m, q] */
void sgemm_host(float *a, float *b, float *c, int n, int m, int q)
{
    /* FP ops: 2 * n * q * m */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < q; j++) {
            float s = 0.0;
            for (int k = 0; k < m; k++)
                s += a[i * m + k] * b[k * q + j];
            c[i * q + j] = s;
        }
    }
}

GigaFLOPs = 2 * M * Q * N / 10^9
24 Matrix multiplication: SGEMM (serial)

double run_host(const char *msg,
                void (*sgemm_fun)(float *, float *, float *, int, int, int))
{
    double gflop = 2.0 * N * Q * M * 1E-9;
    float *a, *b, *c;
    a = malloc(sizeof(*a) * N * M);
    b = malloc(sizeof(*b) * M * Q);
    c = malloc(sizeof(*c) * N * Q);
    srand(0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i * M + j] = rand() % 100;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < Q; j++)
            b[i * Q + j] = rand() % 100;
    /* Warmup */
    double twarmup = wtime();
    sgemm_fun(a, b, c, N, M, Q);
    twarmup = wtime() - twarmup;
25 Matrix multiplication: SGEMM (serial)

    /* Measures */
    double tavg = 0.0;
    double tmin = 1E6;
    double tmax = 0.0;
    for (int i = 0; i < NREPS; i++) {
        double t = wtime();
        sgemm_fun(a, b, c, N, M, Q);
        t = wtime() - t;
        tavg += t;
        tmin = (tmin > t) ? t : tmin;
        tmax = (tmax < t) ? t : tmax;
    }
    tavg /= NREPS;
    printf("%s (%d runs): perf %.2f GFLOPS; time: tavg %.6f, tmin %.6f, tmax %.6f, twarmup %.6f\n",
           msg, NREPS, gflop / tavg, tavg, tmin, tmax, twarmup);
    free(c);
    free(b);
    free(a);
    return tavg;
}

# cnmic: Intel Xeon CPU E5-2620 v3
SGEMM N =, M =, Q =
Host serial ( runs): perf .8 GFLOPS; time: tavg .974, tmin .888, tmax .844, twarmup .684
SGEMM N =, M =, Q =
Host serial ( runs): perf .54 GFLOPS; time: tavg .58897, tmin .89, tmax .5474, twarmup .5786
26 Matrix multiplication: SGEMM (serial): opt

/* Matrix multiplication C[n, q] = A[n, m] * B[m, q] */
void sgemm_host_opt(float *a, float *b, float *c, int n, int m, int q)
{
    /* Permute loops k and j to improve cache utilization:
       the innermost accesses to b and c become stride-1 */
    for (int i = 0; i < n * q; i++)
        c[i] = 0.0;
    /* FP ops: 2 * n * m * q */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < m; k++)
            for (int j = 0; j < q; j++)
                c[i * q + j] += a[i * m + k] * b[k * q + j];
}
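A further common cache optimization, not shown on the slides, is to block (tile) the loops so that the working set of one tile fits in cache; a sketch, with the tile size BS as a tunable assumption:

enum { BS = 64 };  /* tile size: an assumption, tune for the target cache */

void sgemm_host_blocked(float *a, float *b, float *c, int n, int m, int q)
{
    for (int i = 0; i < n * q; i++)
        c[i] = 0.0;
    for (int i0 = 0; i0 < n; i0 += BS)
        for (int k0 = 0; k0 < m; k0 += BS)
            for (int j0 = 0; j0 < q; j0 += BS)
                /* Multiply one BS x BS tile in the same i-k-j order */
                for (int i = i0; i < i0 + BS && i < n; i++)
                    for (int k = k0; k < k0 + BS && k < m; k++)
                        for (int j = j0; j < j0 + BS && j < q; j++)
                            c[i * q + j] += a[i * m + k] * b[k * q + j];
}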
27 Matrix multiplication: SGEMM (OpenMP)

/* Matrix multiplication C[n, q] = A[n, m] * B[m, q] */
void sgemm_host_omp(float *a, float *b, float *c, int n, int m, int q)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            int k = i * q;
            for (int j = 0; j < q; j++)
                c[k++] = 0.0;
        }
        #pragma omp for
        for (int i = 0; i < n; i++)
            for (int k = 0; k < m; k++)
                for (int j = 0; j < q; j++)
                    c[i * q + j] += a[i * m + k] * b[k * q + j];
    }
}
28 Matrix multiplication: SGEMM (OpenMP)

int main(int argc, char **argv)
{
    int omp_only = (argc > 1) ? 1 : 0;
    printf("SGEMM N = %d, M = %d, Q = %d\n", N, M, Q);
    if (!omp_only) {
        double t_host = run_host("Host serial", &sgemm_host);
        double t_host_opt = run_host("Host opt", &sgemm_host_opt);
        double t_host_omp = run_host("Host OMP", &sgemm_host_omp);
        printf("Speedup (host/host_opt): %.2f\n", t_host / t_host_opt);
        printf("Speedup (host_opt/host_omp): %.2f\n", t_host_opt / t_host_omp);
    } else {
        char buf[256];
        sprintf(buf, "Host OMP %d", omp_get_max_threads());
        run_host(buf, &sgemm_host_omp);
    }
    return 0;
}
29 Matrix multiplication: SGEMM (OpenMP)

(main from slide 28)

# cnmic: Intel Xeon CPU E5-2620 v3 (12 threads)
SGEMM N =, M =, Q =
Host serial ( runs): perf .8 GFLOPS; time: tavg .85, tmin .976, tmax .9, twarmup .6
Host opt ( runs): perf 8.78 GFLOPS; time: tavg .788, tmin .78, tmax .85, twarmup .8679
Host OMP ( runs): perf 4.6 GFLOPS; time: tavg .964, tmin .94, tmax .98, twarmup .67
Speedup (host/host_opt): 4.87
Speedup (host_opt/host_omp): .89

SGEMM N =, M =, Q =
Host serial ( runs): perf .6 GFLOPS; time: tavg 9.989, tmin, tmax, twarmup
Host opt ( runs): perf 6.58 GFLOPS; time: tavg .4979, tmin .487, tmax .467, twarmup .474
Host OMP ( runs): perf 4. GFLOPS; time: tavg .785, tmin .6996, tmax .794, twarmup .8474
Speedup (host/host_opt): 4.
Speedup (host_opt/host_omp): 6.5
30 Matrix multiplication: SGEMM (Xeon Phi)

/* Matrix multiplication C[n, q] = A[n, m] * B[m, q] */
void sgemm_phi(float *a, float *b, float *c, int n, int m, int q)
{
    #pragma offload target(mic) in(a:length(n * m)) in(b:length(m * q)) out(c:length(n * q))
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            int k = i * q;
            for (int j = 0; j < q; j++)
                c[k++] = 0.0;
        }
        #pragma omp for
        for (int i = 0; i < n; i++)
            for (int k = 0; k < m; k++)
                for (int j = 0; j < q; j++)
                    c[i * q + j] += a[i * m + k] * b[k * q + j];
    }
}
31 Matrix multiplication: SGEMM (Xeon Phi)

(sgemm_phi from slide 30)

# Intel Xeon Phi 3120A
SGEMM N =, M =, Q =
Phi OMP 56 (5 runs): perf .49 GFLOPS; time: tavg .657, tmin .6, tmax .66, twarmup .855
SGEMM N =, M =, Q =
Phi OMP 112 (5 runs): perf 4.44 GFLOPS; time: tavg .49456, tmin .4549, tmax .666, twarmup
SGEMM N =, M =, Q =
Phi OMP 224 (5 runs): perf 9.4 GFLOPS; time: tavg .585, tmin .475, tmax .5659, twarmup
32 Matrix multiplication: SGEMM (Xeon Phi)

(sgemm_phi from slide 30)

N = M = Q = 5

# Intel Xeon Phi 3120A
SGEMM N = 5, M = 5, Q = 5
Phi OMP 112 (5 runs): perf GFLOPS; time: tavg .9989, tmin .7967, tmax .7986, twarmup
SGEMM N = 5, M = 5, Q = 5
Phi OMP 224 (5 runs): perf 8.86 GFLOPS; time: tavg .769, tmin .9698, tmax .445, twarmup
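Each call to sgemm_phi retransfers a and b over PCI Express. When the same input matrices are reused across several offloads, the copies can be kept resident on the coprocessor with the alloc_if/free_if modifiers (used on the slides' bandwidth test below); a sketch:

/* First offload: allocate on the card and keep the data (free_if(0)) */
#pragma offload target(mic) in(a:length(n * m) alloc_if(1) free_if(0)) \
                            in(b:length(m * q) alloc_if(1) free_if(0))
{ }

/* Later offloads reuse the resident copies without retransfer */
#pragma offload target(mic) nocopy(a:length(n * m) alloc_if(0) free_if(0)) \
                            nocopy(b:length(m * q) alloc_if(0) free_if(0)) \
                            out(c:length(n * q))
{
    /* ... compute on a, b, c here, as in sgemm_phi ... */
}

/* Final offload: release the card-side buffers */
#pragma offload target(mic) nocopy(a:length(n * m) alloc_if(0) free_if(1)) \
                            nocopy(b:length(m * q) alloc_if(0) free_if(1))
{ }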
33 Matrix multiplication SGEMM (Intel MKL): TeraFLOPS performance
34 Matrix multiplication: SGEMM (Intel MKL)

#include <mkl.h> // for sgemm

/* Matrix multiplication C[n, q] = A[n, m] * B[m, q] */
void sgemm_phi_mkl(float *a, float *b, float *c, int n, int m, int q)
{
    /* cblas_sgemm: C[] = alpha * A[] x B[] + beta * C[] */
    float alpha = 1.0;
    float beta = 0.0;
    #pragma offload target(mic) in(a:length(n * m)) in(b:length(m * q)) out(c:length(n * q))
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, q, m, alpha, a, m, b, q, beta, c, q);
}

double run_phi(const char *msg,
               void (*sgemm_fun)(float *, float *, float *, int, int, int))
{
    a = mkl_malloc(sizeof(*a) * N * M, 64);
    // ...
    mkl_free(a);
    return tavg;
}

# Makefile
CFLAGS := -Wall -g -std=c99 -fopenmp -mkl -O2
LDFLAGS := -mkl -lm -fopenmp
35 Matrix multiplication: SGEMM (Intel MKL)

(sgemm_phi_mkl and run_phi from slide 34)

# Intel Xeon Phi 3120A
SGEMM N =, M =, Q =
Phi MKL 224 ( runs): perf 7.6 GFLOPS; time: tavg .777, tmin .4476, tmax .479, twarmup .59
SGEMM N = 5, M = 5, Q = 5
Phi MKL 224 ( runs): perf 75.8 GFLOPS; time: tavg, tmin .6547, tmax .7788, twarmup .764
SGEMM N =, M =, Q =
Phi MKL 224 ( runs): perf GFLOPS; time: tavg .848, tmin .76794, tmax .94655, twarmup 8.59
SGEMM N = 5, M = 5, Q = 5
Phi MKL 224 ( runs): perf GFLOPS; time: tavg .6985, tmin .5898, tmax 4.55, twarmup
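Besides the compiler-assisted offload shown above, Intel MKL also provides Automatic Offload, which splits sufficiently large GEMM calls between the host and the coprocessor with no pragmas at all; a sketch (the 0.5 work fraction is an illustrative assumption):

#include <mkl.h>

void sgemm_auto_offload(float *a, float *b, float *c, int n, int m, int q)
{
    /* Enable Automatic Offload at run time
       (equivalent to setting MKL_MIC_ENABLE=1 in the environment) */
    mkl_mic_enable();
    /* Optionally pin 50% of the work to coprocessor 0 (illustrative value) */
    mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);
    /* A plain host-side call: MKL decides what runs on the Phi */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, q, m, 1.0, a, m, b, q, 0.0, c, q);
}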
36 Matrix multiplication: SGEMM (Intel MKL)

# launch.sh
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=67
export MIC_USE_2MB_BUFFERS=2M
export MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-224:1]
./sgemm

The Intel compiler offload runtime allocates memory with 2 MB pages when the size of an allocation exceeds the value of the MIC_USE_2MB_BUFFERS environment variable.

# Intel Xeon Phi 3120A
SGEMM N =, M =, Q =
Phi MKL 67 (5 runs): perf 99. GFLOPS; time: tavg .5566, tmin .49, tmax .6796, twarmup 4.87
SGEMM N = 5, M = 5, Q = 5
Phi MKL 67 (5 runs): perf GFLOPS; time: tavg 7.66, tmin 7.58, tmax 7.78, twarmup
SGEMM N =, M =, Q =
Phi MKL 67 (5 runs): perf 7.86 GFLOPS; time: tavg .95547, tmin .9675, tmax, twarmup

TeraFLOPS!

See: How to Use Huge Pages to Improve Application Performance on Intel Xeon Phi Coprocessor
37 Matrix multiplication: SGEMM (Xeon Phi)

(sgemm_phi from slide 30)

# Intel Xeon Phi 3120A + MIC_USE_2MB_BUFFERS + thread affinity
SGEMM N =, M =, Q =
Phi OMP 67 ( runs): perf 75. GFLOPS; time: tavg, tmin 5.998, tmax, twarmup
38 PCI Express bandwidth test
39 PCI Express bandwidth test

int main(int argc, char **argv)
{
    printf("Xeon Phi bwtest: nreps = %d\n", NREPS);
    printf("size alignment talloc tsend trecv\n");
    int s[] = {2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
    int a[] = {32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384};
    for (int j = 0; j < NELEMS(a); j++)
        for (int i = 0; i < NELEMS(s); i++)
            testbw(s[i] * 1024 * 1024, a[j]);
    return 0;
}
40 PCI Express bandwidth test

void testbw(int size, int alignment)
{
    uint8_t *buf = _mm_malloc(size, alignment);
    if (!buf) {
        fprintf(stderr, "Can't allocate memory\n");
        exit(EXIT_FAILURE);
    }
    // Init buffer (allocate pages)
    memset(buf, 0, size);
    double t, talloc;
    double tsend = 0.0;
    double trecv = 0.0;
    // Allocate buffer on Phi
    talloc = wtime();
    #pragma offload target(mic) in(buf:length(size) free_if(0))
    { }
    talloc = wtime() - talloc;

[Figure: buf[] now resident in Intel Xeon Phi memory]
41 PCI Express bandwidth test

    // Measures
    for (int i = 0; i < NREPS; i++) {
        // Copy to Phi
        t = wtime();
        #pragma offload target(mic) in(buf:length(size) alloc_if(0) free_if(0))
        { }
        tsend += wtime() - t;
        // Copy from Phi
        t = wtime();
        #pragma offload target(mic) out(buf:length(size) alloc_if(0) free_if(0))
        { }
        trecv += wtime() - t;
    }
    // Free on Phi
    #pragma offload target(mic) in(buf:length(size) alloc_if(0) free_if(1))
    { }
    tsend /= NREPS;
    trecv /= NREPS;
    printf("%-12d %-8d %-12.6f %-12.6f %-12.6f\n", size, alignment, talloc, tsend, trecv);
    _mm_free(buf);
}
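testbw reports average times; converting them to bandwidth is one line per direction (a sketch to add just before the final printf):

    /* Effective PCI Express bandwidth in GiB/s for each direction */
    double bw_send = (double)size / tsend / (1 << 30);
    double bw_recv = (double)size / trecv / (1 << 30);
    printf("send %.2f GiB/s, recv %.2f GiB/s\n", bw_send, bw_recv);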
42 PCI Express bandwidth test
43 Running a program in Native Mode

Native Mode:
1. The program is compiled on the host machine with the cross-compiler (icc): sequential code, OpenMP, Intel Cilk Plus, Intel TBB, Intel MPI.
2. The executable and all libraries are copied to the Intel Xeon Phi (scp, NFS).
3. The program is launched on the Intel Xeon Phi (ssh).

Limitations:
- the Intel Xeon Phi memory size (Phi 3120A: 6 GiB GDDR5)
- no non-volatile HDD/SSD storage (the /home/micshare directory is mounted over NFS through a virtual network over PCI Express)
44 Matrix multiplication: SGEMM (OpenMP)

/* Matrix multiplication C[n, q] = A[n, m] * B[m, q] */
void sgemm_omp(float *a, float *b, float *c, int n, int m, int q)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            int k = i * q;
            for (int j = 0; j < q; j++)
                c[k++] = 0.0;
        }
        #pragma omp for
        for (int i = 0; i < n; i++)
            for (int k = 0; k < m; k++)
                for (int j = 0; j < q; j++)
                    c[i * q + j] += a[i * m + k] * b[k * q + j];
    }
}

# Build with the -mmic flag
$ icc -mmic -Wall -g -std=c99 -O2 -fopenmp -c sgemm.c -o sgemm.o
$ icc -o sgemm sgemm.o -mmic -lm -fopenmp
45 Copying the executable to the Intel Xeon Phi

[Figure: host cnmic connected to the Intel Xeon Phi (mic0) over PCI Express;
/home/micshare is exported over NFS and visible on both sides;
the coprocessor home directory /home/user lives in RAM]

# Copy to the home directory
$ scp ./sgemm mic0:
Password:
sgemm 100% 5KB 5.KB/s

# Launch on the Xeon Phi
$ ssh mic0 ./sgemm
Password:
./sgemm: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
46 Copying the executable to the Intel Xeon Phi

# Copy and run with the micnativeloadex utility
$ cat ./launch.sh
#!/bin/sh
export SINK_LD_LIBRARY_PATH=/home/micshare/libmic
micnativeloadex ./sgemm -d 0 -e "OMP_NUM_THREADS=67 KMP_AFFINITY=explicit,granularity=fine,proclist=[1-224:1]"

$ ./launch.sh
SGEMM N =, M =, Q =
Xeon Phi memory usage: MiB (.9 %)
Phi OMP Native 67 ( runs): perf 64. GFLOPS; time: tavg .98, tmin .99, tmax .4, twarmup .67
47 Intel Xeon Phi: AVX-512

void distance(float *x, float *y, float *z, float *d, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = sqrtf(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
}

void distance_vec_mic(float *x, float *y, float *z, float *d, int n)
{
    __m512 *xx = (__m512 *)x;
    __m512 *yy = (__m512 *)y;
    __m512 *zz = (__m512 *)z;
    __m512 *dd = (__m512 *)d;
    int k = n / 16;
    for (int i = 0; i < k; i++) {
        __m512 t1 = _mm512_mul_ps(xx[i], xx[i]);
        __m512 t2 = _mm512_mul_ps(yy[i], yy[i]);
        __m512 t3 = _mm512_mul_ps(zz[i], zz[i]);
        t1 = _mm512_add_ps(t1, t2);
        t1 = _mm512_add_ps(t1, t3);
        dd[i] = _mm512_sqrt_ps(t1);
    }
    /* Scalar loop for the remainder elements */
    for (int i = k * 16; i < n; i++)
        d[i] = sqrtf(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
}
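The multiply-add pairs above can be fused; a sketch of the same vector loop body using the fused multiply-add intrinsic (assuming the compiler headers expose _mm512_fmadd_ps for the target):

    for (int i = 0; i < k; i++) {
        __m512 t = _mm512_mul_ps(xx[i], xx[i]);
        t = _mm512_fmadd_ps(yy[i], yy[i], t);   /* t = yy*yy + t */
        t = _mm512_fmadd_ps(zz[i], zz[i], t);   /* t = zz*zz + t */
        dd[i] = _mm512_sqrt_ps(t);
    }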
48 OpenMP + SIMD

void distance(float *x, float *y, float *z, float *d, int n)
{
    // SIMD only
    #pragma omp simd
    for (int i = 0; i < n; i++)
        d[i] = sqrtf(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
}

void distance(float *x, float *y, float *z, float *d, int n)
{
    // Threading + SIMD
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        d[i] = sqrtf(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
}
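If the arrays are known to be 64-byte aligned (e.g. allocated with _mm_malloc(size, 64)), the aligned clause lets the compiler emit aligned 512-bit loads and stores; a sketch:

#include <math.h>

void distance_aligned(float *x, float *y, float *z, float *d, int n)
{
    /* The 64-byte alignment promise must hold for all four arrays */
    #pragma omp parallel for simd aligned(x, y, z, d: 64)
    for (int i = 0; i < n; i++)
        d[i] = sqrtf(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
}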
55 Using MPI on the Intel Xeon Phi

Native model: all processes of the MPI program run on the coprocessor. The program is copied to the coprocessor and launched there.

[Figure: host cnmic with two NUMA nodes (CPU 0 + MEM, CPU 1 + MEM), Ethernet, and the Xeon Phi with its own MEM attached over PCI-E]
56 Using MPI on the Intel Xeon Phi

Native model: all processes of the MPI program run on the coprocessor. The program is copied to the coprocessor and launched there.

[Figure: the same node diagram, with the MPI processes placed on the Xeon Phi]
57 Using MPI on the Intel Xeon Phi

Symmetric model: the processes of the program run both on the host machine and on the coprocessor.

[Figure: node diagram with MPI processes on the host CPUs and on the Xeon Phi]
58 Using MPI on the Intel Xeon Phi

Offload model: the processes of the MPI program run on the host machine and offload parts of the code to the coprocessor (#pragma offload, Intel MKL (MIC)).

[Figure: node diagram with MPI processes on the host CPUs offloading work to the Xeon Phi]
59 MPI Native mode

#include <unistd.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char procname[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(procname, &len);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Rank %d on %s\n", rank, procname);
    MPI_Finalize();
    return 0;
}

# Build the program
$ cat ./build.sh
#!/bin/sh
mpiicc -mmic -Wall -std=c99 -O2 -o \
    hello ./hello.c

# Run the program
$ cat ./launch.sh
#!/bin/sh
TMP=`mktemp -p /home/micshare/tmp -u`
cp -f ./hello $TMP
export I_MPI_MIC=enable
mpirun -n 3 -f hosts $TMP
rm -f $TMP

$ ./launch.sh
Rank 0 on cnmic-mic0
Rank 1 on cnmic-mic0
Rank 2 on cnmic-mic0
60 MPI Symmetric mode

(hello.c from slide 59)

# Build the program
$ cat ./build.sh
#!/bin/sh
# Binary for Xeon Phi
mpiicc -mmic -Wall -std=c99 -O2 -o hello.mic ./hello.c
# Binary for Host
mpiicc -Wall -std=c99 -O2 -o hello ./hello.c

# Run the program on Host + Phi, variant 1
$ cat ./launch.sh
#!/bin/sh
# Copy program to the NFS folder
TMP=`mktemp -p /home/micshare/tmp -u`
cp -f ./hello.mic $TMP
# Launch the program (see hosts file)
export I_MPI_MIC=enable
export I_MPI_FABRICS=shm:tcp
mpirun -n 2 -host cnmic ./hello : -n 3 -host mic0 $TMP
rm -f $TMP

$ ./launch.sh
Rank 0 on cnmic
Rank 1 on cnmic
Rank 2 on cnmic-mic0
Rank 3 on cnmic-mic0
Rank 4 on cnmic-mic0
61 MPI Symmetric mode

(hello.c from slide 59)

# Build the program
$ cat ./build.sh
#!/bin/sh
# Binary for Xeon Phi
mpiicc -mmic -Wall -std=c99 -O2 -o hello.mic ./hello.c
# Binary for Host
mpiicc -Wall -std=c99 -O2 -o hello ./hello.c

# Run the program on Host + Phi, variant 2
$ cat ./launch.sh
#!/bin/sh
# Copy program to the NFS folder
TMP=`mktemp -p /home/micshare/tmp -u`
cp -f ./hello.mic "$TMP.mic"
cp -f ./hello $TMP
# Launch the program (see hosts file)
export I_MPI_MIC=enable
export I_MPI_MIC_POSTFIX=.mic
export I_MPI_FABRICS=shm:tcp
mpirun -machinefile hosts $TMP
rm -f $TMP "$TMP.mic"

# hosts file:
cnmic:2
mic0:3

$ ./launch.sh
Rank 0 on cnmic
Rank 1 on cnmic
Rank 2 on cnmic-mic0
Rank 3 on cnmic-mic0
Rank 4 on cnmic-mic0
62 MPI Offload mode

int main(int argc, char **argv)
{
    int rank, len;
    char procname[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(procname, &len);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Rank %d on %s\n", rank, procname);
    #pragma offload target(mic) in(rank)
    printf("Hello, from Xeon Phi (offloaded from proc %d)\n", rank);
    MPI_Finalize();
    return 0;
}

# Build the program
$ cat ./build.sh
#!/bin/sh
mpiicc -Wall -std=c99 -O2 -o hello ./hello.c

# Run the program
$ cat ./launch.sh
#!/bin/sh
mpirun -n 4 -host cnmic ./hello

$ ./launch.sh
Rank 0 on cnmic
Rank 1 on cnmic
Rank 2 on cnmic
Rank 3 on cnmic
Hello, from Xeon Phi (offloaded from proc 0)
Hello, from Xeon Phi (offloaded from proc 1)
Hello, from Xeon Phi (offloaded from proc 2)
Hello, from Xeon Phi (offloaded from proc 3)
63 Configuration of the hybrid NUMA node cnmic

$ hwloc-ls
Machine (64GB)
  NUMANode L#0 (P#0 32GB)
    Socket L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      ...
    HostBridge L#0
      PCIBridge
        PCI 10de:0f..
          GPU L#0 "card0"
          GPU L#1 "renderD128"
          GPU L#2 "controlD64"
      PCI 8086:8d26
      PCIBridge
        PCI 8086:1533
          Net L#3 "enp6s0"
      PCIBridge
        PCI 8086:1533
          Net L#4 "enp7s0"
      PCIBridge
        PCI 1b..:06..
        PCI 8086:8d02
          Block L#5 "sda"
  NUMANode L#1 (P#1 32GB)
    Socket L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      ...
    HostBridge L#5
      PCIBridge
        PCI 8086:225d
          CoProc L#6 "mic0"

[Figure: node diagram: two NUMA nodes (CPU + MEM each), Ethernet, Xeon Phi with MEM over PCI-E]
More informationNative Computing and Optimization on the Intel Xeon Phi Coprocessor. Lars Koesterke John D. McCalpin
Native Computing and Optimization on the Intel Xeon Phi Coprocessor Lars Koesterke John D. McCalpin lars@tacc.utexas.edu mccalpin@tacc.utexas.edu Intro (very brief) Outline Compiling & Running Native Apps
More informationIntel Xeon Phi Coprocessor Offloading Computation
Intel Xeon Phi Coprocessor Offloading Computation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture
More informationBenchmark results on Knight Landing (KNL) architecture
Benchmark results on Knight Landing (KNL) architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Roma 23/10/2017 KNL, BDW, SKL A1 BDW A2 KNL A3 SKL cores per node 2 x 18 @2.3
More informationGrowth in Cores - A well rehearsed story
Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
More informationA Unified Approach to Heterogeneous Architectures Using the Uintah Framework
DOE for funding the CSAFE project (97-10), DOE NETL, DOE NNSA NSF for funding via SDCI and PetaApps A Unified Approach to Heterogeneous Architectures Using the Uintah Framework Qingyu Meng, Alan Humphrey
More informationVincent C. Betro, Ph.D. NICS March 6, 2014
Vincent C. Betro, Ph.D. NICS March 6, 2014 NSF Acknowledgement This material is based upon work supported by the National Science Foundation under Grant Number 1137097 Any opinions, findings, and conclusions
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationBring your application to a new era:
Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.
More informationIntroduction to OpenMP
Introduction to OpenMP Ricardo Fonseca https://sites.google.com/view/rafonseca2017/ Outline Shared Memory Programming OpenMP Fork-Join Model Compiler Directives / Run time library routines Compiling and
More informationCS 470 Spring Mike Lam, Professor. OpenMP
CS 470 Spring 2018 Mike Lam, Professor OpenMP OpenMP Programming language extension Compiler support required "Open Multi-Processing" (open standard; latest version is 4.5) Automatic thread-level parallelism
More informationPROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec
PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization
More informationParallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops
Parallel Programming Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Single computers nowadays Several CPUs (cores) 4 to 8 cores on a single chip Hyper-threading
More informationParallel Computing. Lecture 17: OpenMP Last Touch
CSCI-UA.0480-003 Parallel Computing Lecture 17: OpenMP Last Touch Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Some slides from here are adopted from: Yun (Helen) He and Chris Ding
More informationDay 5: Introduction to Parallel Intel Architectures
Day 5: Introduction to Parallel Intel Architectures Lecture day 5 Ryo Asai Colfax International colfaxresearch.com April 2017 colfaxresearch.com Welcome Colfax International, 2013 2017 Disclaimer 2 While
More informationOpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013
OpenMP António Abreu Instituto Politécnico de Setúbal 1 de Março de 2013 António Abreu (Instituto Politécnico de Setúbal) OpenMP 1 de Março de 2013 1 / 37 openmp what? It s an Application Program Interface
More information