Seminar 1 (23): Programming the Intel Xeon Phi Coprocessor
1 Seminar 1 (23): Programming the Intel Xeon Phi Coprocessor. Mikhail Kurnosov. WWW: www.mkurnosov.net. Seminar series "Fundamentals of Parallel Programming". Rzhanov Institute of Semiconductor Physics SB RAS, Novosibirsk, 2016
2 Intel Xeon Phi
- Intel Teraflops Research Chip (Polaris, 2006-2007): 80 cores, 80 routers, self-correction system
- Single-chip Cloud Computer (SCC, 2009): 48 P54C Pentium cores, 4x6 2D mesh of tiles (2 cores per tile)
- Intel Larrabee (2008-2010): GPGPU: x86, cache coherency, ring bus, 4-way SMT; development cancelled in 2010 (Larrabee GPU, SIGGRAPH 2008)
- Intel MIC (Intel Many Integrated Core Architecture): a cache-coherent shared-memory multiprocessor with hardware multithreading (4-way SMT) and wide vector registers (512 bits)
  - Knights Ferry (2010): 32 in-order cores, 4-way SMT, ring bus, 750 GFLOPS
  - Knights Corner / Xeon Phi (2012): 22 nm, >= 50 cores
  - Knights Landing (2nd gen, 14 nm, 2016): 72 Airmont (Atom) cores, bootable
  - Knights Hill (3rd gen, 10 nm)
3 Intel Knights Corner micro-architecture
- More than 50 cores derived from the Pentium P54C: x86 ISA, in-order, 4-way SMT, 512-bit SIMD units
- Cache: 32 KB L1 data/instruction caches, coherent L2 cache (512 KB per core)
- Ring bus: bidirectional
- Attached to the host over the PCI Express bus

cnmic (oak): Intel Xeon Phi A
- Cores: 57, 4-way SMT (core #57 is reserved for the OS!): 228 hardware threads
- RAM: 6 GiB GDDR5

Intel Compiler: $ source /opt/intel_cluster/bin/iccvars.sh intel64
4 Counting prime numbers (serial version)

int is_prime_number(int n)
{
    int limit = sqrt(n) + 1;
    for (int i = 2; i <= limit; i++) {
        if (n % i == 0)
            return 0;
    }
    return (n > 1) ? 1 : 0;
}

Number of operations: O(sqrt(n))

int count_prime_numbers(int a, int b)
{
    int nprimes = 0;
    if (a <= 2) {
        /* Count '2' as a prime number */
        nprimes = 1;
        a = 3;
    }
    if (a % 2 == 0) {
        /* Shift 'a' to an odd number */
        a++;
    }
    /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
    for (int i = a; i <= b; i += 2) {
        if (is_prime_number(i))
            nprimes++;
    }
    return nprimes;
}

Tests every odd number in the interval [a, b].
5 Counting prime numbers (OpenMP)

int count_prime_numbers_omp(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    #pragma omp parallel
    {
        /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
        #pragma omp for schedule(dynamic) reduction(+:nprimes)
        for (int i = a; i <= b; i += 2) {
            if (is_prime_number(i))
                nprimes++;
        }
    }
    return nprimes;
}
6 Counting prime numbers (OpenMP)

int count_prime_numbers_omp(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    #pragma omp parallel
    {
        /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
        #pragma omp for schedule(dynamic) reduction(+:nprimes)
        for (int i = a; i <= b; i += 2) {
            if (is_prime_number(i))
                nprimes++;
        }
    }
    return nprimes;
}

cnmic (oak.cpct.sibsutis.ru):
- System board: ASUS Z10PE-D8 WS (dual CPU)
- CPU: 2 x Intel Xeon E5-2620 v3 (2.4 GHz, Haswell, 6 cores)
- RAM: 64 GiB DDR4 (4 x 8 GiB + 4 x 8 GiB)
- Coprocessor: Intel Xeon Phi A (cores: 56, 4-way SMT; RAM: 6 GiB GDDR5)

$ icc -std=c99 -Wall -O -fopenmp -c primes.c -o primes.o
$ icc -o primes primes.o -lm -fopenmp
$ export OMP_NUM_THREADS=
$ ./primes
Count prime numbers in [, ]
Result (host serial): 686
Result (host parallel): 686
Execution time (host serial): .76
Execution time (host parallel): .7
Speedup host_serial/host_omp: .7
7 Counting prime numbers: Xeon Phi offload (serial)

#include <offload.h>

double run_phi_serial()
{
#ifdef __INTEL_OFFLOAD
    printf("Intel Xeon Phi devices: %d\n", _Offload_number_of_devices());
#endif
    int n;
    double t = wtime();

    #pragma offload target(mic) out(n)
    n = count_prime_numbers_phi(a, b);

    t = wtime() - t;
    printf("Result (phi serial): %d\n", n);
    return t;
}

The structured block is offloaded for execution on the coprocessor. You must specify:
- the objects that have to be copied into coprocessor memory before the block executes (in)
- the objects that have to be copied out of coprocessor memory after the block executes (out)
8 Counting prime numbers: Xeon Phi offload (serial)

__attribute__((target(mic)))
int count_prime_numbers_phi(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
    for (int i = a; i <= b; i += 2) {
        if (is_prime_number(i))
            nprimes++;
    }
    return nprimes;
}

Functions and variables that are accessed on the coprocessor are marked with the attribute __attribute__((target(mic))) (heterogeneous compilation).
9 Counting prime numbers: Xeon Phi offload (serial)

__attribute__((target(mic)))
int is_prime_number(int n)
{
    int limit = sqrt(n) + 1;
    for (int i = 2; i <= limit; i++) {
        if (n % i == 0)
            return 0;
    }
    return (n > 1) ? 1 : 0;
}

Functions and variables that are accessed on the coprocessor are marked with the attribute __attribute__((target(mic))) (heterogeneous compilation).
10 Counting prime numbers: Xeon Phi offload (serial)

int main(int argc, char **argv)
{
    printf("Count prime numbers in [%d, %d]\n", a, b);
    double thost_serial = run_host_serial();
    double thost_par = run_host_parallel();
    double tphi_serial = run_phi_serial();
    printf("Execution time (host serial): %.6f\n", thost_serial);
    printf("Execution time (host parallel): %.6f\n", thost_par);
    printf("Execution time (phi serial): %.6f\n", tphi_serial);
    printf("Ratio phi_serial/host_serial: %.2f\n", tphi_serial / thost_serial);
    printf("Speedup host_serial/host_omp: %.2f\n", thost_serial / thost_par);
    return 0;
}

$ export OMP_NUM_THREADS=
$ ./primes
Count prime numbers in [, ]
Result (host serial): 686
Result (host parallel): 686
Intel Xeon Phi devices:
Result (phi serial): 686
Execution time (host serial): .8
Execution time (host parallel): .8
Execution time (phi serial): 8.96
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .74

On this test, a core of the Intel Xeon Phi A is roughly 5 times slower than a core of the Intel Xeon E5-2620 v3.
11 Counting prime numbers: Xeon Phi offload (OpenMP)

double run_phi_parallel()
{
#ifdef __INTEL_OFFLOAD
    printf("Intel Xeon Phi devices: %d\n", _Offload_number_of_devices());
#endif
    int n;
    double t = wtime();

    #pragma offload target(mic) out(n)
    n = count_prime_numbers_phi_omp(a, b);

    t = wtime() - t;
    printf("Result (phi parallel): %d\n", n);
    return t;
}
12 Counting prime numbers: Xeon Phi offload (OpenMP)

__attribute__((target(mic)))
int count_prime_numbers_phi_omp(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    #pragma omp parallel
    {
        /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
        #pragma omp for schedule(dynamic) reduction(+:nprimes)
        for (int i = a; i <= b; i += 2) {
            if (is_prime_number(i))
                nprimes++;
        }
    }
    return nprimes;
}
13 Counting prime numbers: Xeon Phi offload (OpenMP)

int main(int argc, char **argv)
{
    printf("Count prime numbers in [%d, %d]\n", a, b);
    double thost_serial = run_host_serial();
    double thost_par = run_host_parallel();
    double tphi_serial = run_phi_serial();
    double tphi_par = run_phi_parallel();
    printf("Execution time (host serial): %.6f\n", thost_serial);
    printf("Execution time (host parallel): %.6f\n", thost_par);
    printf("Execution time (phi serial): %.6f\n", tphi_serial);
    printf("Execution time (phi parallel): %.6f\n", tphi_par);
    printf("Ratio phi_serial/host_serial: %.2f\n", tphi_serial / thost_serial);
    printf("Speedup host_serial/host_omp: %.2f\n", thost_serial / thost_par);
    printf("Speedup host_omp/phi_omp: %.2f\n", thost_par / tphi_par);
    printf("Speedup host_serial/phi_omp: %.2f\n", thost_serial / tphi_par);
    printf("Speedup phi_serial/phi_omp: %.2f\n", tphi_serial / tphi_par);
    return 0;
}
14 Counting prime numbers: Xeon Phi offload (OpenMP)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=4
./primes

$ ./launch.sh
Count prime numbers in [, ]
Execution time (host serial): .78
Execution time (host parallel): .466
Execution time (phi serial): 8.6
Execution time (phi parallel): .7784
Ratio phi_serial/host_serial: 4.85
Speedup host_serial/host_omp: .6
Speedup host_omp/phi_omp: .4
Speedup host_serial/phi_omp: .58
Speedup phi_serial/phi_omp: .4

On 4 threads, the Phi OpenMP version is faster than the serial Phi version; the Phi OpenMP version is ~7 times slower than the host OpenMP version.
15 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=verbose
./primes

Execution time (host serial): .9
Execution time (host parallel): .86
Execution time (phi serial): 8.78
Execution time (phi parallel): .546
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .4
Speedup phi_serial/phi_omp: .

OMP: Info #4: KMP_AFFINITY: decoding xAPIC ids.
OMP: Info #5: KMP_AFFINITY: cpuid leaf not supported - decoding legacy APIC ids.
OMP: Info #49: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info #54: KMP_AFFINITY: Initial OS proc set respected: <list of all available OS procs>
OMP: Info #56: KMP_AFFINITY: 224 available OS procs
OMP: Info #57: KMP_AFFINITY: Uniform topology
OMP: Info #59: KMP_AFFINITY: 1 packages x 56 cores/pkg x 4 threads/core (56 total cores)
16 Xeon Phi: binding threads to cores (affinity)

OMP: Info #6: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #7: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 0
OMP: Info #7: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1
OMP: Info #7: KMP_AFFINITY: OS proc 3 maps to package 0 core 0 thread 2
OMP: Info #7: KMP_AFFINITY: OS proc 4 maps to package 0 core 0 thread 3
...
OMP: Info #7: KMP_AFFINITY: OS proc 224 maps to package 0 core 55 thread 3

Four consecutive OS procs map to the four hardware threads of each core, from core 0 through core 55.
17 Xeon Phi: binding threads to cores (affinity)

[KMP_AFFINITY verbose log: each of the 56 OpenMP threads is reported bound to an OS proc set of its own, e.g. "OMP: Info #4: KMP_AFFINITY: pid 4 thread 55 bound to OS proc set ..."]

Default affinity: the 56 OpenMP threads (0-55) are spread one per core across cores 0-55.
18 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,balanced
./primes

Execution time (host serial): .5
Execution time (host parallel): .67
Execution time (phi serial):
Execution time (phi parallel): .4887
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .7
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .5
Speedup phi_serial/phi_omp: 7.4

fine,balanced: the 56 OpenMP threads (0-55) are placed one per core across cores 0-55.
19 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,compact
./primes

Execution time (host serial): .5
Execution time (host parallel): .78
Execution time (phi serial):
Execution time (phi parallel): .4787
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .7
Speedup host_omp/phi_omp: .7
Speedup host_serial/phi_omp: .8
Speedup phi_serial/phi_omp: .9

fine,compact: threads are packed four per core, so the 56 threads occupy only cores 0-13.
20 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,scatter
./primes

Execution time (host serial): .4
Execution time (host parallel): .878
Execution time (phi serial):
Execution time (phi parallel): .55
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .5
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .
Speedup phi_serial/phi_omp: .7

fine,scatter: the 56 OpenMP threads (0-55) are distributed round-robin, one per core, across cores 0-55.
21 Xeon Phi: binding threads to cores (affinity)

Summary of the placement policies for NUM_THREADS threads on ncores cores:
- fine,scatter: core_id = thread_id % ncores; threads are distributed round-robin across all cores, so neighboring thread ids land on different cores.
- fine,balanced: consecutive thread ids share a core, and the threads are spread evenly over all cores (each core receives either floor or ceil of NUM_THREADS/ncores threads).
- fine,compact: threads are packed onto as few cores as possible, four (one per hardware thread) per core.
22 Assignment
Implement computation of a definite integral by the rectangle method on the Intel Xeon Phi (a template is provided in the _task_integrate directory):
1. Offload execution of the integrate_phi_omp function to the coprocessor.
2. Implement the integrate_phi_omp function with OpenMP.
More informationAAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors
AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors Kaixi Hou, Hao Wang, Wu-chun Feng {kaixihou,hwang121,wfeng}@vt.edu Pairwise Sequence Alignment Algorithms
More informationMulticore Performance and Tools. Part 1: Topology, affinity, clock speed
Multicore Performance and Tools Part 1: Topology, affinity, clock speed Tools for Node-level Performance Engineering Gather Node Information hwloc, likwid-topology, likwid-powermeter Affinity control and
More informationIntroduc)on to Xeon Phi
Introduc)on to Xeon Phi MIC Training Event at TACC Lars Koesterke Xeon Phi MIC Xeon Phi = first product of Intel s Many Integrated Core (MIC) architecture Co- processor PCI Express card Stripped down Linux
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationScientific Computing WS 2017/2018. Lecture 25. Jürgen Fuhrmann Lecture 25 Slide 1
Scientific Computing WS 2017/2018 Lecture 25 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 25 Slide 1 Why parallelization? Computers became faster and faster without that... [Source: spiralgen.com]
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationSymmetric Computing. SC 14 Jerome VIENNE
Symmetric Computing SC 14 Jerome VIENNE viennej@tacc.utexas.edu Symmetric Computing Run MPI tasks on both MIC and host Also called heterogeneous computing Two executables are required: CPU MIC Currently
More informationIntel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,
Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ, 27.6.- 29.6.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi Products Programming models Native
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationIntra-MIC MPI Communication using MVAPICH2: Early Experience
Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University
More informationAdvanced programming with OpenMP. Libor Bukata a Jan Dvořák
Advanced programming with OpenMP Libor Bukata a Jan Dvořák Programme of the lab OpenMP Tasks parallel merge sort, parallel evaluation of expressions OpenMP SIMD parallel integration to calculate π User-defined
More informationIntel MIC Architecture. Dr. Momme Allalen, LRZ, PRACE PATC: Intel MIC&GPU Programming Workshop
Intel MKL @ MIC Architecture Dr. Momme Allalen, LRZ, allalen@lrz.de PRACE PATC: Intel MIC&GPU Programming Workshop 1 2 Momme Allalen, HPC with GPGPUs, Oct. 10, 2011 What is the Intel MKL? Math library
More informationSymmetric Computing. Jerome Vienne Texas Advanced Computing Center
Symmetric Computing Jerome Vienne Texas Advanced Computing Center Symmetric Computing Run MPI tasks on both MIC and host Also called heterogeneous computing Two executables are required: CPU MIC Currently
More informationSymmetric Computing. John Cazes Texas Advanced Computing Center
Symmetric Computing John Cazes Texas Advanced Computing Center Symmetric Computing Run MPI tasks on both MIC and host and across nodes Also called heterogeneous computing Two executables are required:
More informationrabbit.engr.oregonstate.edu What is rabbit?
1 rabbit.engr.oregonstate.edu Mike Bailey mjb@cs.oregonstate.edu rabbit.pptx What is rabbit? 2 NVIDIA Titan Black PCIe Bus 15 SMs 2880 CUDA cores 6 GB of memory OpenGL support OpenCL support Xeon system
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationParallel Programming on Ranger and Stampede
Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationSpecial Course on Computer Architecture
Special Course on Computer Architecture #9 Simulation of Multi-Processors Hiroki Matsutani and Hideharu Amano Outline: Simulation of Multi-Processors Background [10min] Recent multi-core and many-core
More informationIntel MIC Programming Workshop, Hardware Overview & Native Execution. IT4Innovations, Ostrava,
, Hardware Overview & Native Execution IT4Innovations, Ostrava, 3.2.- 4.2.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi (MIC) Programming models Native mode programming
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationAn Introduction to the Intel Xeon Phi. Si Liu Feb 6, 2015
Training Agenda Session 1: Introduction 8:00 9:45 Session 2: Native: MIC stand-alone 10:00-11:45 Lunch break Session 3: Offload: MIC as coprocessor 1:00 2:45 Session 4: Symmetric: MPI 3:00 4:45 1 Last
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationCS 470 Spring Mike Lam, Professor. OpenMP
CS 470 Spring 2018 Mike Lam, Professor OpenMP OpenMP Programming language extension Compiler support required "Open Multi-Processing" (open standard; latest version is 4.5) Automatic thread-level parallelism
More informationParallel Programming Libraries and implementations
Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationSummary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts. Abdeslem DJAOUI PPD Rutherford Appleton Laboratory
Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts Abdeslem DJAOUI PPD Rutherford Appleton Laboratory Background to Workshop Organised by CERN OpenLab and held in IT Auditorium,
More informationThe Intel Xeon PHI Architecture
The Intel Xeon PHI Architecture (codename Knights Landing ) Dr. Christopher Dahnken Senior Application Engineer Software and Services Group Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT
More informationEARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA
EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility
More information8/28/12. CSE 820 Graduate Computer Architecture. Richard Enbody. Dr. Enbody. 1 st Day 2
CSE 820 Graduate Computer Architecture Richard Enbody Dr. Enbody 1 st Day 2 1 Why Computer Architecture? Improve coding. Knowledge to make architectural choices. Ability to understand articles about architecture.
More information