Seminar 1 (23). Programming the Intel Xeon Phi Coprocessor


1 Seminar 1 (23). Programming the Intel Xeon Phi Coprocessor
Mikhail Kurnosov
E-mail: mkurnosov@gmail.com
WWW: www.mkurnosov.net
Seminar series "Fundamentals of Parallel Programming"
Rzhanov Institute of Semiconductor Physics SB RAS
Novosibirsk, 2016

2 Intel Xeon Phi
Intel Teraflops Research Chip (Polaris, 2006-2007): 80 cores, 80 routers, self-correction system
Single-Chip Cloud Computer (SCC, 2009): 48 P54C Pentium cores, 4x6 2D-mesh of tiles (2 cores per tile)
Intel Larrabee (200X-2010): GPGPU with x86 cores, cache coherency, 1024-bit ring bus, 4-way SMT; development cancelled in 2010. Larrabee GPU (SIGGRAPH, 2008)
Intel MIC (Intel Many Integrated Core Architecture): cache-coherent shared-memory multiprocessor, hardware multithreading (4-way SMT), wide vector registers (512 bits)
Knights Ferry (2010): 32 in-order cores, 4-way SMT, ring bus, 750 GFLOPS
Knights Corner / Xeon Phi (2012): 22 nm, >= 50 cores
Knights Landing (2nd gen., 14 nm, 2016): 72 Airmont (Atom) cores, bootable
Knights Hill (3rd gen., 10 nm)

3 Intel Knights Corner micro-architecture
More than 50 cores derived from the Pentium P54C: x86 ISA, in-order, 4-way SMT, 512-bit SIMD units
Cache: 32 KB L1 data/instruction cache per core, coherent L2 cache (512 KB per core)
Ring bus: bidirectional; the coprocessor attaches to the host via PCI Express
cnmic (oak): Intel Xeon Phi 3120A
Cores: 57, 4-way SMT (core #57 is reserved for the OS!): 224 application threads
RAM: 6 GiB GDDR5
Intel Compiler: $ source /opt/intel_cluster/bin/iccvars.sh intel64

4 Counting prime numbers (serial version)

int is_prime_number(int n)
{
    int limit = sqrt(n) + 1;
    for (int i = 2; i <= limit; i++) {
        if (n % i == 0)
            return 0;
    }
    return (n > 1) ? 1 : 0;
}

Number of operations: O(sqrt(n))

int count_prime_numbers(int a, int b)
{
    int nprimes = 0;
    if (a <= 2) {
        /* Count '2' as a prime number */
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
    for (int i = a; i <= b; i += 2) {
        if (is_prime_number(i))
            nprimes++;
    }
    return nprimes;
}

Tests every odd number in the interval [a, b].
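The run_* timing harness on the later slides calls a wtime() helper that is never shown; a minimal sketch, assuming POSIX gettimeofday():

#include <sys/time.h>

/* Wall-clock time in seconds; a stand-in for the wtime()
   used (but not defined) on the slides. */
double wtime(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + (double)t.tv_usec * 1E-6;
}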

5 Counting prime numbers (OpenMP)

int count_prime_numbers_omp(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    #pragma omp parallel
    {
        /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
        #pragma omp for schedule(dynamic, 100) reduction(+:nprimes)  /* chunk size 100 is assumed */
        for (int i = a; i <= b; i += 2) {
            if (is_prime_number(i))
                nprimes++;
        }
    }
    return nprimes;
}
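main() on the later slides also calls run_host_serial() and run_host_parallel(), which the slides omit; a minimal sketch of the parallel one, assuming the global bounds a and b used elsewhere in the program:

#include <stdio.h>

double run_host_parallel(void)
{
    double t = wtime();
    /* Host-side OpenMP version from this slide */
    int n = count_prime_numbers_omp(a, b);
    t = wtime() - t;
    printf("Result (host parallel): %d\n", n);
    return t;  /* main() prints the elapsed time */
}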

6 Counting prime numbers (OpenMP)

The count_prime_numbers_omp code is the same as on the previous slide; the test machine and build:

cnmic (oak.cpct.sibsutis.ru)
System board: ASUS Z10PE-D8 WS (dual CPU)
CPU: 2 x Intel Xeon E5-2620 v3 (2.4 GHz, Haswell, 6 cores)
RAM: 64 GiB DDR4 2133 MHz (4 x 8 GiB + 4 x 8 GiB)
Coprocessor: Intel Xeon Phi 3120A (56 cores 4-way SMT, RAM: 6 GiB GDDR5)

$ icc -std=c99 -Wall -O2 -fopenmp -c primes.c -o primes.o
$ icc -o primes primes.o -lm -fopenmp
$ export OMP_NUM_THREADS=12
$ ./primes
Count prime numbers in [, ]
Result (host serial): 686
Result (host parallel): 686
Execution time (host serial): .76
Execution time (host parallel): .7
Speedup host_serial/host_omp: .7

7 Counting prime numbers: Xeon Phi offload (serial)

#include <offload.h>

double run_phi_serial()
{
#ifdef __INTEL_OFFLOAD
    printf("Intel Xeon Phi devices: %d\n", _Offload_number_of_devices());
#endif
    int n;
    double t = wtime();
    #pragma offload target(mic) out(n)
    n = count_prime_numbers_phi(a, b);
    t = wtime() - t;
    printf("Result (phi serial): %d\n", n);
    return t;
}

The structured block is offloaded for execution on the coprocessor. You must specify:
- the objects to copy into coprocessor memory before the block runs (in)
- the objects to copy out of coprocessor memory after the block runs (out)
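Scalars like n above need only out(n); for arrays the clauses also carry a length. A sketch with a hypothetical sum_phi() and data buffer (not from the slides):

__attribute__((target(mic)))
double sum_phi(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

double offload_sum(double *data, int n)
{
    double s;
    /* Copy n elements of 'data' to the coprocessor before the block,
       copy the scalar 's' back out after it. */
    #pragma offload target(mic) in(data:length(n)) out(s)
    s = sum_phi(data, n);
    return s;
}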

8 Counting prime numbers: Xeon Phi offload (serial)

__attribute__((target(mic)))
int count_prime_numbers_phi(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
    for (int i = a; i <= b; i += 2) {
        if (is_prime_number(i))
            nprimes++;
    }
    return nprimes;
}

Functions and variables that are accessed on the coprocessor are marked with the attribute __attribute__((target(mic))) (heterogeneous compilation).
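When many declarations need the attribute, the Intel compiler also provides a bracketing pragma; a short sketch:

/* Everything declared between push and pop is compiled
   for both the host and the coprocessor. */
#pragma offload_attribute(push, target(mic))

int is_prime_number(int n);
int count_prime_numbers_phi(int a, int b);

#pragma offload_attribute(pop)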

9 Counting prime numbers: Xeon Phi offload (serial)

__attribute__((target(mic)))
int is_prime_number(int n)
{
    int limit = sqrt(n) + 1;
    for (int i = 2; i <= limit; i++) {
        if (n % i == 0)
            return 0;
    }
    return (n > 1) ? 1 : 0;
}

Functions and variables that are accessed on the coprocessor are marked with the attribute __attribute__((target(mic))) (heterogeneous compilation).

10 Counting prime numbers: Xeon Phi offload (serial)

int main(int argc, char **argv)
{
    printf("Count prime numbers in [%d, %d]\n", a, b);
    double thost_serial = run_host_serial();
    double thost_par = run_host_parallel();
    double tphi_serial = run_phi_serial();
    printf("Execution time (host serial): %.6f\n", thost_serial);
    printf("Execution time (host parallel): %.6f\n", thost_par);
    printf("Execution time (phi serial): %.6f\n", tphi_serial);
    printf("Ratio phi_serial/host_serial: %.2f\n", tphi_serial / thost_serial);
    printf("Speedup host_serial/host_omp: %.2f\n", thost_serial / thost_par);
    return 0;
}

$ export OMP_NUM_THREADS=12
$ ./primes
Count prime numbers in [, ]
Result (host serial): 686
Result (host parallel): 686
Intel Xeon Phi devices: 1
Result (phi serial): 686
Execution time (host serial): .8
Execution time (host parallel): .8
Execution time (phi serial): 8.96
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .74

On this test a single Intel Xeon Phi 3120A core is roughly 5 times slower than a single Intel Xeon E5-2620 v3 core.

11 Counting prime numbers: Xeon Phi offload (OpenMP)

double run_phi_parallel()
{
#ifdef __INTEL_OFFLOAD
    printf("Intel Xeon Phi devices: %d\n", _Offload_number_of_devices());
#endif
    int n;
    double t = wtime();
    #pragma offload target(mic) out(n)
    n = count_prime_numbers_phi_omp(a, b);
    t = wtime() - t;
    printf("Result (phi parallel): %d\n", n);
    return t;
}

12 Counting prime numbers: Xeon Phi offload (OpenMP)

__attribute__((target(mic)))
int count_prime_numbers_phi_omp(int a, int b)
{
    int nprimes = 0;
    /* Count '2' as a prime number */
    if (a <= 2) {
        nprimes = 1;
        a = 3;
    }
    /* Shift 'a' to an odd number */
    if (a % 2 == 0)
        a++;
    /* Loop over odd numbers: a, a + 2, a + 4, ..., b */
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic, 100) reduction(+:nprimes)  /* chunk size 100 is assumed */
        for (int i = a; i <= b; i += 2) {
            if (is_prime_number(i))
                nprimes++;
        }
    }
    return nprimes;
}
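The OpenMP runtime inside an offloaded region is the coprocessor's own; a quick way to see how many threads it will use (a sketch, not from the slides):

#include <omp.h>
#include <stdio.h>

void print_phi_threads(void)
{
    int nthreads, nprocs;
    /* Both calls execute on the coprocessor, so they report
       the MIC runtime (controlled by MIC_OMP_NUM_THREADS). */
    #pragma offload target(mic) out(nthreads, nprocs)
    {
        nthreads = omp_get_max_threads();
        nprocs = omp_get_num_procs();
    }
    printf("MIC: %d threads on %d logical procs\n", nthreads, nprocs);
}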

13 Counting prime numbers: Xeon Phi offload (OpenMP)

int main(int argc, char **argv)
{
    printf("Count prime numbers in [%d, %d]\n", a, b);
    double thost_serial = run_host_serial();
    double thost_par = run_host_parallel();
    double tphi_serial = run_phi_serial();
    double tphi_par = run_phi_parallel();
    printf("Execution time (host serial): %.6f\n", thost_serial);
    printf("Execution time (host parallel): %.6f\n", thost_par);
    printf("Execution time (phi serial): %.6f\n", tphi_serial);
    printf("Execution time (phi parallel): %.6f\n", tphi_par);
    printf("Ratio phi_serial/host_serial: %.2f\n", tphi_serial / thost_serial);
    printf("Speedup host_serial/host_omp: %.2f\n", thost_serial / thost_par);
    printf("Speedup host_omp/phi_omp: %.2f\n", thost_par / tphi_par);
    printf("Speedup host_serial/phi_omp: %.2f\n", thost_serial / tphi_par);
    printf("Speedup phi_serial/phi_omp: %.2f\n", tphi_serial / tphi_par);
    return 0;
}

14 Counting prime numbers: Xeon Phi offload (OpenMP)

main() is the same as on the previous slide; it is now run through a launch script:

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=224
./primes

$ ./launch.sh
Count prime numbers in [, ]
Execution time (host serial): .78
Execution time (host parallel): .466
Execution time (phi serial): 8.6
Execution time (phi parallel): .7784
Ratio phi_serial/host_serial: 4.85
Speedup host_serial/host_omp: .6
Speedup host_omp/phi_omp: .4
Speedup host_serial/phi_omp: .58
Speedup phi_serial/phi_omp: .4

With MIC_ENV_PREFIX=MIC set, every MIC_-prefixed variable is forwarded to the coprocessor runtime (MIC_OMP_NUM_THREADS sets OMP_NUM_THREADS on the device).

The Phi OpenMP version on 224 threads is many times faster than the serial Phi version, yet still about 7 times slower than the host OpenMP version.
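One way to confirm the forwarding is to read the variable inside an offload region (a sketch, not from the slides; on the MIC side the forwarded variable appears without the MIC_ prefix):

#include <stdio.h>
#include <stdlib.h>

void check_mic_env(void)
{
    char buf[64] = "";
    /* getenv() runs on the coprocessor, so it sees the
       forwarded value as plain OMP_NUM_THREADS. */
    #pragma offload target(mic) out(buf)
    {
        const char *v = getenv("OMP_NUM_THREADS");
        snprintf(buf, sizeof(buf), "%s", v ? v : "(unset)");
    }
    printf("MIC OMP_NUM_THREADS = %s\n", buf);
}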

15 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=verbose
./primes

Execution time (host serial): .9
Execution time (host parallel): .86
Execution time (phi serial): 8.78
Execution time (phi parallel): .546
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .4
Speedup phi_serial/phi_omp: .

OMP: Info: KMP_AFFINITY: decoding x2apic ids.
OMP: Info: KMP_AFFINITY: cpuid leaf 11 not supported - decoding legacy APIC ids.
OMP: Info: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info: KMP_AFFINITY: Initial OS proc set respected: 1,2,3,...,224 (all 224 OS procs)
OMP: Info: KMP_AFFINITY: 224 available OS procs
OMP: Info: KMP_AFFINITY: Uniform topology
OMP: Info: KMP_AFFINITY: 1 packages x 56 cores/pkg x 4 threads/core (56 total cores)

16 Xeon Phi: binding threads to cores (affinity)

OMP: Info: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 0
OMP: Info: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1
OMP: Info: KMP_AFFINITY: OS proc 3 maps to package 0 core 0 thread 2
OMP: Info: KMP_AFFINITY: OS proc 4 maps to package 0 core 0 thread 3
OMP: Info: KMP_AFFINITY: OS proc 5 maps to package 0 core 1 thread 0
...
OMP: Info: KMP_AFFINITY: OS proc 221 maps to package 0 core 55 thread 0
OMP: Info: KMP_AFFINITY: OS proc 222 maps to package 0 core 55 thread 1
OMP: Info: KMP_AFFINITY: OS proc 223 maps to package 0 core 55 thread 2
OMP: Info: KMP_AFFINITY: OS proc 224 maps to package 0 core 55 thread 3

In other words, OS proc p (p = 1, ..., 224) maps to core (p - 1) / 4, hardware thread (p - 1) % 4.

17 Xeon Phi: binding threads to cores (affinity)

OMP: Info: KMP_AFFINITY: pid 4 thread 0 bound to OS proc set 1
OMP: Info: KMP_AFFINITY: pid 4 thread 1 bound to OS proc set 5
OMP: Info: KMP_AFFINITY: pid 4 thread 2 bound to OS proc set 9
OMP: Info: KMP_AFFINITY: pid 4 thread 3 bound to OS proc set 13
...
OMP: Info: KMP_AFFINITY: pid 4 thread 54 bound to OS proc set 217
OMP: Info: KMP_AFFINITY: pid 4 thread 55 bound to OS proc set 221

Default affinity: the 56 OpenMP threads are spread one per core (thread i on core i), cores 0 through 55.

18 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,balanced
./primes

Execution time (host serial): .5
Execution time (host parallel): .67
Execution time (phi serial):
Execution time (phi parallel): .4887
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .7
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .5
Speedup phi_serial/phi_omp: 7.4

fine, balanced: threads are distributed across cores as evenly as possible, and consecutive thread numbers are kept on the same or neighboring core.
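To verify a binding from inside the program, each thread can report the logical CPU it is running on; a sketch using the Linux-specific sched_getcpu() (not from the slides):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

__attribute__((target(mic)))
void report_binding(void)
{
    #pragma omp parallel
    {
        /* sched_getcpu() returns the OS proc the calling
           thread is currently executing on. */
        printf("thread %d runs on OS proc %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
}
/* Call it inside a '#pragma offload target(mic)' region
   to inspect the binding on the coprocessor. */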

19 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,compact
./primes

Execution time (host serial): .5
Execution time (host parallel): .78
Execution time (phi serial):
Execution time (phi parallel): .4787
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .7
Speedup host_omp/phi_omp: .7
Speedup host_serial/phi_omp: .8
Speedup phi_serial/phi_omp: .9

fine, compact: each core is filled with 4 threads before the next core is used; 56 threads occupy cores 0-13 and the remaining cores stay idle.

20 Xeon Phi: binding threads to cores (affinity)

$ cat ./launch.sh
#!/bin/sh
# Host OpenMP
export OMP_NUM_THREADS=12
# Xeon Phi OpenMP
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=56
export MIC_KMP_AFFINITY=granularity=fine,scatter
./primes

Execution time (host serial): .4
Execution time (host parallel): .878
Execution time (phi serial):
Execution time (phi parallel): .55
Ratio phi_serial/host_serial: 4.86
Speedup host_serial/host_omp: .5
Speedup host_omp/phi_omp: .
Speedup host_serial/phi_omp: .
Speedup phi_serial/phi_omp: .7

fine, scatter: threads are distributed round-robin across cores (core_id = thread_id % ncores), so consecutive threads land on different cores.

21 Xeon Phi: binding threads to cores (affinity)

Summary of the three policies (the original slide shows thread-to-core placement diagrams for a thread count larger than the number of cores):

fine, scatter: core_id = thread_id % ncores; consecutive threads go to different cores, wrapping around until all threads are placed.

fine, balanced: threads are divided across cores as evenly as possible, but consecutive thread numbers stay on the same or neighboring core.

fine, compact: threads are packed 4 per core; when there are not enough threads for the whole chip, the remaining cores stay idle.
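The placement rules above can be written down directly; a small illustration computing the core index for each policy, assuming ncores cores and 4 hardware threads per core (the balanced rule is the simple even-split model, the runtime's exact rule can differ when the division is uneven):

#include <stdio.h>

/* Core index of OpenMP thread t under each KMP_AFFINITY policy. */
int core_scatter(int t, int ncores) { return t % ncores; }
int core_compact(int t, int ncores) { (void)ncores; return t / 4; }
int core_balanced(int t, int nthreads, int ncores)
{
    /* Spread threads evenly, keeping consecutive numbers together. */
    int per_core = (nthreads + ncores - 1) / ncores;
    return t / per_core;
}

int main(void)
{
    int nthreads = 56, ncores = 56;
    for (int t = 0; t < 8; t++)
        printf("thread %2d: scatter->%2d balanced->%2d compact->%2d\n",
               t, core_scatter(t, ncores),
               core_balanced(t, nthreads, ncores),
               core_compact(t, ncores));
    return 0;
}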

22 Assignment

Implement computation of a definite integral by the rectangle method on the Intel Xeon Phi (a template is provided in the _task_integrate directory):
1. Offload execution of the integrate_phi_omp function to the coprocessor.
2. Implement integrate_phi_omp with OpenMP.
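One possible shape of the solution, sketched with the midpoint rectangle rule and a hypothetical integrand func(); the actual template in _task_integrate may differ:

__attribute__((target(mic)))
double func(double x)
{
    return x * x;  /* hypothetical integrand */
}

__attribute__((target(mic)))
double integrate_phi_omp(double a, double b, int n)
{
    double h = (b - a) / n;  /* width of each rectangle */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += func(a + h * (i + 0.5));  /* midpoint rule */
    return sum * h;
}

/* Offloading the call, as in run_phi_parallel():
   double res;
   #pragma offload target(mic) out(res)
   res = integrate_phi_omp(a, b, n);
*/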
