Code optimization in a 3D diffusion model

1 Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18th 2016, Barcelona

2 Agenda Background Diffusion algorithm Performance: baseline Scaling: OpenMP Vectorization: #pragma simd Peeling out Note on bandwidth Summary 2

3 References Ref: Chapter 4 of Intel's Xeon Phi Coprocessor High Performance Programming. Author of the code: Naoya Maruyama of the RIKEN Advanced Institute for Computational Science in Japan. The code simulates the diffusion of a solute through a volume of liquid over time within a 3D container; a three-dimensional seven-point stencil operation is used. 3

4 Diffusion model Diffusion of a solute over time through an enclosed volume 4

5 The diffusion equation: φ(r, t) is the density of the diffusing material at location r and time t; D(φ, r) is the collective diffusion coefficient for density φ at location r. If D is constant, the equation reduces to the simpler form shown below. 5
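
The equation images on this slide are not reproduced in the transcription; the standard forms implied by the definitions above are

    \frac{\partial \phi(\mathbf{r},t)}{\partial t} = \nabla \cdot \big( D(\phi,\mathbf{r})\, \nabla \phi(\mathbf{r},t) \big)

and, for constant D,

    \frac{\partial \phi}{\partial t} = D\, \nabla^2 \phi .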

6 Numerical approach: finite differences. Regular meshing in 3D; forward time centred space (FTCS) discretization (see the sketch below). Parallelization options: threading, vectorization, MPI domain decomposition, hybrid computing. 6
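
The FTCS update itself is not in the transcription. Assuming a constant D and a uniform spacing h in all three directions (an assumption consistent with the kernel used later), the seven-point update is

    \phi^{n+1}_{x,y,z} = \phi^{n}_{x,y,z} + \frac{D\,\Delta t}{h^2}\Big( \phi^{n}_{x+1,y,z} + \phi^{n}_{x-1,y,z} + \phi^{n}_{x,y+1,z} + \phi^{n}_{x,y-1,z} + \phi^{n}_{x,y,z+1} + \phi^{n}_{x,y,z-1} - 6\,\phi^{n}_{x,y,z} \Big)

which maps onto the code coefficients as ce = cw = cn = cs = ct = cb = DΔt/h² and cc = 1 − 6DΔt/h².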

7 Seven-point stencil (figure): North, South, East, West, Top and Bottom neighbours along the x, y and z axes. The 3D stencil is used to calculate the diffusion of a solute through a liquid volume. 7

8 Diffusion algorithm in principle

    for (i = 0; i < niter; i++) {            // the time loop
      for (z = 0; z < nz; z++)               // walk the mesh
        for (y = 0; y < ny; y++)
          for (x = 0; x < nx; x++)
            f2[z,y,x] = cc*f1[z,y,x] + cw*f1[z,y,x-1] + ce*f1[z,y,x+1]
                      + cn*f1[z,y-1,x] + cs*f1[z,y+1,x]
                      + cb*f1[z-1,y,x] + ct*f1[z+1,y,x];   // update the mesh
      temp = f2; f2 = f1; f1 = temp;         // switch buffers
    }

9 Boundary conditions Molecular density for sub-volumes that sit next to the edges of the container. The boundary conditions occur for any sub-volume that has x = 0, y = 0 or z = 0, or x = nx-1, y = ny-1 or z = nz-1. Replace the value of the neighbour volume with the target central density value to get a reasonable approximation of the diffusion at that point. Bounds check: no overstepping the bounds! Reshape the code around the sides of the box. Linearize f1[] and f2[] using the stencil indices by adding w, e, n, s, b, t (west, east, north, south, bottom, top) variables. 9

10 Diffusion base kernel with boundary coordinates

    for (int i = 0; i < count; ++i) {
      for (int z = 0; z < nz; z++) {
        for (int y = 0; y < ny; y++) {
          for (int x = 0; x < nx; x++) {
            int c, w, e, n, s, b, t;
            c = x + y * nx + z * nx * ny;          // boundary coordinates
            w = (x == 0)      ? c : c - 1;
            e = (x == nx - 1) ? c : c + 1;
            n = (y == 0)      ? c : c - nx;
            s = (y == ny - 1) ? c : c + nx;
            b = (z == 0)      ? c : c - nx * ny;
            t = (z == nz - 1) ? c : c + nx * ny;
            f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                    + cs * f1_t[s] + cn * f1_t[n]
                    + cb * f1_t[b] + ct * f1_t[t];  // diffusion base kernel
          }
        }
      }
      REAL *t = f1_t; f1_t = f2_t; f2_t = t;        // swap buffers
    }

11 Diffusion: baseline code diffusion_base.c 11

12 Performance metrics Floating-point performance: the update f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e] + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; is 13 floating-point operations per inner-loop iteration (7 multiplies and 6 adds). Memory bandwidth (in GB/s): the number of bytes of volume data read and written during the call. 12
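
A minimal sketch of how these two metrics can be derived from the measured time (my own illustration; the 13-flop count and the two-transfers-per-cell assumption follow the kernel above, while the function name and the REAL typedef are assumptions):

    #include <stdio.h>

    typedef float REAL;   /* assumption: single precision */

    /* Report MFlops and GB/s for the seven-point stencil:
       13 flops per cell per sweep; one REAL read of f1 and one REAL write
       of f2 per cell (neighbour loads are assumed to hit cache). */
    void report_metrics(int nx, int ny, int nz, int count, double elapsed)
    {
        double cells  = (double)nx * ny * nz * count;
        double mflops = cells * 13.0 / elapsed * 1.0e-6;
        double gbs    = cells * 2.0 * (double)sizeof(REAL) / elapsed * 1.0e-9;
        printf("FLOPS      : %.3f (MFlops)\n", mflops);
        printf("Throughput : %.3f (GB/s)\n", gbs);
    }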

13 Baseline 13

14 Compilation: native, aggressive, for the Intel Xeon Phi (debug symbols, OpenMP switch, native Xeon Phi target, aggressive optimization, vectorization reports):

    $ icc -g -openmp -mmic -std=c99 -O3 -vec-report=3 diffusion_base.c -o diffusion_base

Environment: set on the Xeon Phi (what and where)

    export OMP_NUM_THREADS=1 ... 228
    export KMP_AFFINITY=compact | scatter

Execution on the Xeon Phi: ssh to the mic card and run natively

    % ./diffusion_base

15 Runtime results: running the diffusion base kernel

    diffusion_base_xphi thread num = 1 affinity =
    count is 65
    Running diffusion kernel 65 times
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

16 VTune: run on the host; requires a script, Run.sh:

    #!/bin/bash
    source /home/rogerphilp/psxevars.sh
    export OMP_NUM_THREADS=1
    export KMP_AFFINITY=
    echo diffusion_base_xphi thread num = ${OMP_NUM_THREADS} affinity = ${KMP_AFFINITY}
    /home/rogerphilp/diffusion/diffusion_base_xphi

17 VTune baseline statistics: the CPI rate may be too high, caused by memory stalls, instruction starvation, branch misprediction, or long-latency instructions. 17

18 Baseline thread analysis: but only one core is in use. 18

19 VTune analysis of the baseline code: where the time is being spent; CPU activity. 19

20 VTune analysis of the diffusion_baseline code: red = regions of poor performance. 20

21 Baseline vectorization report, diffusion_base.optrpt (vec-report=3):

    diffusion_base.c(103,3)  remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    diffusion_base.c(106,9)  (inner loop: f1, f2 dependency)
                             remark #15344: loop was not vectorized: vector dependence prevents vectorization.
                                            First dependence is shown below. Use level 5 report for details
                             remark #15346: vector dependence: assumed FLOW dependence between f2 line 115 and f1 line 115
    diffusion_base.c(105,7)  (middle loops: f1, f2 dependency)
                             remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    diffusion_base.c(104,5)  (temporal loop)
                             remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

22 Performance requirements To improve performance we initially need two key elements: scaling (OpenMP directives) and vectorization (SIMD pragmas). 22

23 Scaling: OpenMP 23

24 Scaling: OpenMP See the updated function diffusion_openmp(): mark the region with #pragma omp parallel and collapse the z and y loops with #pragma omp for collapse(2), effectively creating a single loop for (yz = 0; yz < nz*ny; ++yz). This enables each thread to be assigned larger chunks of data and allows more efficiency on each pass through the loop. 24

25 diffusion_omp.c

    #pragma omp parallel                        // the section is marked as parallel
    {
      REAL *f1_t = f1, *f2_t = f2;
      for (int i = 0; i < count; ++i) {         // each thread runs the same time-loop index
    #pragma omp for collapse(2)                 // the z and y loops are collapsed
        for (int z = 0; z < nz; z++) {
          for (int y = 0; y < ny; y++) {
            for (int x = 0; x < nx; x++) {
              ...
              f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                      + cs * f1_t[s] + cn * f1_t[n]
                      + cb * f1_t[b] + ct * f1_t[t];
              ...
            }
    ...
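
The slide elides the index computation inside the collapsed loops; a self-contained sketch of the same routine, reusing the clamped indices from the baseline kernel (the function signature and the REAL typedef are my assumptions, the loop structure follows the slides), might look like:

    typedef float REAL;   /* assumption: single precision */

    void diffusion_openmp(REAL *f1, REAL *f2, int nx, int ny, int nz,
                          REAL ce, REAL cw, REAL cn, REAL cs,
                          REAL ct, REAL cb, REAL cc, int count)
    {
    #pragma omp parallel
      {
        REAL *f1_t = f1, *f2_t = f2;             /* private pointer copies per thread */
        for (int i = 0; i < count; ++i) {
    #pragma omp for collapse(2)                  /* z and y loops form one iteration space */
          for (int z = 0; z < nz; z++) {
            for (int y = 0; y < ny; y++) {
              for (int x = 0; x < nx; x++) {
                int c = x + y * nx + z * nx * ny;
                int w = (x == 0)      ? c : c - 1;
                int e = (x == nx - 1) ? c : c + 1;
                int n = (y == 0)      ? c : c - nx;
                int s = (y == ny - 1) ? c : c + nx;
                int b = (z == 0)      ? c : c - nx * ny;
                int t = (z == nz - 1) ? c : c + nx * ny;
                f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                        + cs * f1_t[s] + cn * f1_t[n]
                        + cb * f1_t[b] + ct * f1_t[t];
              }
            }
          }
          REAL *t = f1_t; f1_t = f2_t; f2_t = t; /* swap after the implicit barrier */
        }
      }
    }

Declaring f1_t and f2_t inside the parallel region makes them private, so every thread can swap its own copies after the implicit barrier at the end of the omp for.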

26 diffusion_omp.c: compact, 228 threads (thread number and thread arrangement shown in the output)

    diffusion_omp_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

    For comparison, diffusion_base_xphi: FLOPS : (MFlops)

27 Speedup Experiment with the number of threads per core. Chart: speedup using OpenMP vs. number of threads, for omp compact (c) and omp scatter (s). 27

28 CPI rate is better at 4.39 with 228-thread compact OpenMP; the CPI rate is worse with 228-thread scatter OpenMP (CPU usage screenshots). 28

29 Diffusion OpenMP CPU usage histograms: threads = 228, affinity = compact vs. threads = 228, affinity = scatter. 29

30 Diffusion OpenMP usage histograms: threads = 228, affinity = compact vs. threads = 228, affinity = scatter. 30

31 diffusion_omp.c: compact vs. diffusion_omp.c: scatter 31

32 Hotspots: threads = 228, affinity = compact (OpenMP) vs. threads = 228, affinity = scatter (OpenMP). 32

33 OpenMP vectorization report, diffusion_omp.optrpt (vec-report=3):

    diffusion_omp.c(106,5) inlined into diffusion_omp.c(202,3)    (temporal loop)
        remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    diffusion_omp.c(108,7) inlined into diffusion_omp.c(202,3)    (middle loops: f1, f2 dependency)
        remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    diffusion_omp.c(110,11) inlined into diffusion_omp.c(202,3)   (inner loops: f1, f2 dependency)
        remark #15346: vector dependence: assumed FLOW dependence between f2_t line 119 and f1_t line

34 Vectorization 34

35 Forcing vectorization Add the vectorization pragma #pragma simd, which tells the compiler to ignore suspected dependencies. 35

36 diffusion_ompvect.c

    #pragma omp parallel
    ...
    for (int i = 0; i < count; ++i) {
    #pragma omp for collapse(2)
      for (int z = 0; z < nz; z++) {
        for (int y = 0; y < ny; y++) {
    #pragma simd                              // vectorization pragma: ignore suspected dependencies
          for (int x = 0; x < nx; x++) {
            ...
            f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                    + cs * f1_t[s] + cn * f1_t[n]
                    + cb * f1_t[b] + ct * f1_t[t];
            ...
          }
    ...

37 diffusion_ompvect: compact

    diffusion_ompvect_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

    For comparison, diffusion_omp_base_xphi: FLOPS : (MFlops)

38 Speedup Chart: speedup using OpenMP and vectorization vs. number of threads, for ompvec and omp with compact (c) and scatter (s) affinities. 38

39 diffusion_ompvect: threads = 228, affinity = compact. CPI has increased over OpenMP alone. 39

40 OpenMP only (threads = 228, affinity = compact) vs. OpenMP + vectorization (threads = 228, affinity = compact): OpenMP usage histograms. 40

41 Diffusion: threads = 228, affinity = compact; OpenMP + vectorization vs. OpenMP only. 41

42 OpenMP + vectorization report, diffusion_ompvect.optrpt (vec-report=3):

    diffusion_ompvect.c(111,11) remark #15301: SIMD LOOP WAS VECTORIZED
                                remark #15476: scalar loop cost: 66
                                remark #15477: vector loop cost:
                                remark #15478: estimated potential speedup:
    diffusion_ompvect.c(111,11) remark #15301: REMAINDER LOOP WAS VECTORIZED
    diffusion_ompvect.c(111,11) remark #15301: PEEL LOOP WAS VECTORIZED

43 Peel and remainder 43

44 Boundary update: pesky boundaries (inner mesh vs. boundary mesh). Currently the boundary update is mixed in with the kernel, and its conditionals can cause vectorization issues. The boundary update only has to occur before the buffer pointer swap of f1 and f2, so it can occur before, after, or before and after the execution of the main kernel. 44

45 New main kernel: diffusion_peel The new vectorised peeled kernel starts from index 1, ends at index nx-2 and has no explicit conditionals:

    #pragma simd
    for (x = 1; x < nx-1; x++) {
      ++c; ++n; ++s; ++b; ++t;
      f2_t[c] = cc*f1_t[c] + cw*f1_t[c-1] + ce*f1_t[c+1]
              + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];
    }

46

    int x, c, n, s, b, t;
    x = 0;                                     // first set of boundaries updated (x = 0)
    c = x + y*nx + z*nx*ny;
    n = (y == 0)    ? c : c - nx;
    s = (y == ny-1) ? c : c + nx;
    b = (z == 0)    ? c : c - nx*ny;
    t = (z == nz-1) ? c : c + nx*ny;
    f2_t[c] = cc*f1_t[c] + cw*f1_t[c] + ce*f1_t[c+1]
            + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];

    // New simd kernel goes here

    ++c; ++n; ++s; ++b; ++t;                   // second set of boundaries updated (x = nx-1)
    f2_t[c] = cc*f1_t[c] + cw*f1_t[c-1] + ce*f1_t[c]
            + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];
        }
      }
      REAL *t = f1_t; f1_t = f2_t; f2_t = t;   // system updated: swap buffers
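
Read together with the previous slide, one (y, z) row of the peeled kernel is processed roughly as in the sketch below; this is a reconstruction stitched from the two fragments and sits inside the collapsed z and y loops of the OpenMP version (all names as in the earlier slides):

    /* One row of diffusion_peel: explicit x = 0 and x = nx-1 boundary
       updates bracket a branch-free, unit-stride interior loop. */
    int x, c, n, s, b, t;

    x = 0;                                   /* west boundary: w clamped to c */
    c = x + y * nx + z * nx * ny;
    n = (y == 0)      ? c : c - nx;
    s = (y == ny - 1) ? c : c + nx;
    b = (z == 0)      ? c : c - nx * ny;
    t = (z == nz - 1) ? c : c + nx * ny;
    f2_t[c] = cc * f1_t[c] + cw * f1_t[c]     + ce * f1_t[c + 1]
            + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];

    #pragma simd                             /* interior: no conditionals */
    for (x = 1; x < nx - 1; x++) {
      ++c; ++n; ++s; ++b; ++t;
      f2_t[c] = cc * f1_t[c] + cw * f1_t[c - 1] + ce * f1_t[c + 1]
              + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
    }

    ++c; ++n; ++s; ++b; ++t;                 /* east boundary: e clamped to c */
    f2_t[c] = cc * f1_t[c] + cw * f1_t[c - 1] + ce * f1_t[c]
            + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];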

47 diffusion_peel.c: compact, 228 threads (thread number and thread arrangement shown in the output)

    diffusion_peel_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

    For comparison, diffusion_base_xphi: FLOPS : (MFlops)

48 diffusion_peel.c speedup. Chart: speedup vs. number of threads for ompvecpl, ompvec and omp, each with balanced (b), compact (c) and scatter (s) affinity. 48

49 diffusion_peel.c: compact, 228 threads CPI is higher still 49

50 Affinity = compact, 228 threads: OpenMP + vectorization + peel vs. OpenMP + vectorization. 50

51 OpenMP + vectorization report, diffusion_peel.optrpt (vec-report=3):

    diffusion_peel.c(120,11) remark #15301: PEEL LOOP WAS VECTORIZED
                             remark #15301: SIMD LOOP WAS VECTORIZED
                             remark #15450: unmasked unaligned unit stride loads: 7
                             remark #15475: --- begin vector loop cost summary ---
                             remark #15476: scalar loop cost: 52
                             remark #15477: vector loop cost:
                             remark #15478: estimated potential speedup:

52 A note on bandwidth Chart: bandwidth (GB/s, up to 100) vs. number of threads for ompvecpl, ompvec and omp with compact (c) and scatter (s) affinities. The problem may move from compute bound to memory bound. 52
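
A rough bound that is not on the slide but explains the effect: assuming single-precision REAL and one read of f1 plus one write of f2 per cell (neighbour loads served from cache), the kernel's arithmetic intensity is about

    \frac{13\ \text{flops}}{2 \times 4\ \text{bytes}} \approx 1.6\ \text{flops/byte},

so at a sustained bandwidth of, say, 100 GB/s the kernel cannot exceed roughly 160 GFlop/s, however well it is threaded and vectorized.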

53 Note Intel's compilers are in a constant state of improvement, particularly with regard to finding vectorization opportunities. The compiler version used here was unable to vectorize the loop automatically; a newer compiler version may succeed, but this one did not. The compiler may need a little more information to vectorize. 53

54 Summary 54

55 Overview of procedure for optimizing the diffusion code: compile with -O3 optimisation; move from baseline, to OpenMP, to OpenMP + vectorization, to peel, to tiling; profile and review the opt-report at each step; tune affinity and thread count. 55

56 diffusion_peel.c speedup (repeated). Chart: speedup vs. number of threads for ompvecpl, ompvec and omp, each with balanced (b), compact (c) and scatter (s) affinity. 56

57 Summary Compiled everything with -O3. Generated a baseline performance figure. Applied multiple thread counts. Applied two affinities: compact and scatter. Reviewed the optimisation reports. Analysed the program using VTune to find hotspots. As a consequence we achieved a speed-up of 400 times. 57

58 Thank you 58
