Code optimization in a 3D diffusion model
1 Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18th 2016, Barcelona
2 Agenda Background Diffusion algorithm Performance: baseline Scaling: OpenMP Vectorization: #pragma simd Peeling out Note on bandwidth Summary
3 References Ref: Chapter 4 of Intel's Xeon Phi Coprocessor High Performance Programming Author of the code: Naoya Maruyama of RIKEN Advanced Institute for Computational Science in Japan Simulate diffusion of a solute through a volume of liquid over time within a 3D container A three-dimensional seven-point stencil operation is used
4 Diffusion model Diffusion of a solute over time through an enclosed volume
5 The diffusion equation ϕ(r, t) is the density of the diffusing material at location r and time t. D(ϕ, r) is the collective diffusion coefficient for density ϕ at location r. If D is constant, the equation reduces to the simple diffusion equation.
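The slide's formula images did not survive transcription; in standard form (a reconstruction from the definitions above, not copied from the deck) the diffusion equation and its constant-D reduction read:

```latex
\frac{\partial \phi(\mathbf{r},t)}{\partial t}
  = \nabla \cdot \bigl( D(\phi,\mathbf{r})\,\nabla \phi(\mathbf{r},t) \bigr),
\qquad
\frac{\partial \phi}{\partial t} = D\,\nabla^{2}\phi \quad\text{for constant } D.
```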
6 Numerical approach: Finite differences Regular meshing in 3D Forward time centred space (FTCS) Threading Vectorization MPI Domain decomposition Hybrid computing
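The FTCS update the slide's lost "where" equation pointed at can be reconstructed (standard second-order central differences; a reconstruction, not the deck's own typesetting):

```latex
\phi^{n+1}_{x,y,z} = \phi^{n}_{x,y,z}
  + D\,\Delta t\left[
      \frac{\phi^{n}_{x+1,y,z} - 2\phi^{n}_{x,y,z} + \phi^{n}_{x-1,y,z}}{\Delta x^{2}}
    + \frac{\phi^{n}_{x,y+1,z} - 2\phi^{n}_{x,y,z} + \phi^{n}_{x,y-1,z}}{\Delta y^{2}}
    + \frac{\phi^{n}_{x,y,z+1} - 2\phi^{n}_{x,y,z} + \phi^{n}_{x,y,z-1}}{\Delta z^{2}}
  \right]
```

Collecting terms gives the per-neighbour weights used by the kernels later in the deck: cw = ce = DΔt/Δx², cn = cs = DΔt/Δy², cb = ct = DΔt/Δz², and cc = 1 − (cw + ce + cn + cs + cb + ct).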
7 Seven-point stencil [Figure: 3D stencil on x, y, z axes with North/South, East/West, Top/Bottom neighbours] Used to calculate the diffusion of a solute through a liquid volume.
8 Diffusion algorithm in principle

    for (i = 0; i < niter; i++) {                  // the time loop
      for (z = 0; z < nz; z++)                     // walk the mesh
        for (y = 0; y < ny; y++)
          for (x = 0; x < nx; x++)
            f2[z,y,x] = cc*f1[z,y,x] + cw*f1[z,y,x-1] + ce*f1[z,y,x+1]
                      + cn*f1[z,y-1,x] + cs*f1[z,y+1,x]
                      + cb*f1[z-1,y,x] + ct*f1[z+1,y,x];   // update the mesh
      temp = f2; f2 = f1; f1 = temp;               // switch buffers
    }
9 Boundary conditions Molecular density for sub-volumes that sit next to the edges of our container The boundary conditions occur for any sub-volume that has x = 0, y = 0, or z = 0, or x = nx-1, y = ny-1, or z = nz-1 (the sides of the box) Replace the value of the neighbour volume with the target central density value to get a reasonable approximation of the diffusion at that point Bounds check: no overstepping the bounds! Reshape the code: linearize f1[] and f2[] using the stencil indices by adding w, e, n, s, b, t (west, east, north, south, bottom, top) variables
10 Diffusion base kernel

    for (int i = 0; i < count; ++i) {
      for (int z = 0; z < nz; z++) {
        for (int y = 0; y < ny; y++) {
          for (int x = 0; x < nx; x++) {
            int c, w, e, n, s, b, t;
            c = x + y * nx + z * nx * ny;          // boundary coordinates
            w = (x == 0)      ? c : c - 1;
            e = (x == nx - 1) ? c : c + 1;
            n = (y == 0)      ? c : c - nx;
            s = (y == ny - 1) ? c : c + nx;
            b = (z == 0)      ? c : c - nx * ny;
            t = (z == nz - 1) ? c : c + nx * ny;
            f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                    + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
          }
        }
      }
      REAL *t = f1_t; f1_t = f2_t; f2_t = t;       // switch buffers
    }
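A self-contained, compilable restatement of the base kernel (the names `diffusion_step` and `run_diffusion` are mine, not from the deck). With symmetric coefficients the clamped boundaries conserve the total amount of solute, which makes a handy correctness check:

```c
#include <stdlib.h>

typedef double REAL;

/* One FTCS sweep with clamped boundaries, as in the base kernel above. */
static void diffusion_step(const REAL *f1, REAL *f2, int nx, int ny, int nz,
                           REAL cc, REAL cw, REAL ce, REAL cn, REAL cs,
                           REAL cb, REAL ct) {
  for (int z = 0; z < nz; z++)
    for (int y = 0; y < ny; y++)
      for (int x = 0; x < nx; x++) {
        int c = x + y * nx + z * nx * ny;
        int w = (x == 0)      ? c : c - 1;
        int e = (x == nx - 1) ? c : c + 1;
        int n = (y == 0)      ? c : c - nx;
        int s = (y == ny - 1) ? c : c + nx;
        int b = (z == 0)      ? c : c - nx * ny;
        int t = (z == nz - 1) ? c : c + nx * ny;
        f2[c] = cc * f1[c] + cw * f1[w] + ce * f1[e]
              + cn * f1[n] + cs * f1[s] + cb * f1[b] + ct * f1[t];
      }
}

/* Diffuse a unit spike for `count` sweeps; return the total solute left.
   Symmetric coefficients + clamped boundaries conserve it, so the result
   should stay at 1.0. */
REAL run_diffusion(int nx, int ny, int nz, int count) {
  int npts = nx * ny * nz;
  REAL k = 0.1;                       /* D*dt/dx^2, inside the stability limit */
  REAL cc = 1.0 - 6.0 * k;
  REAL *f1 = calloc((size_t)npts, sizeof(REAL));
  REAL *f2 = calloc((size_t)npts, sizeof(REAL));
  f1[npts / 2] = 1.0;                 /* unit spike in the middle */
  for (int i = 0; i < count; ++i) {
    diffusion_step(f1, f2, nx, ny, nz, cc, k, k, k, k, k, k);
    REAL *tmp = f1; f1 = f2; f2 = tmp;    /* switch buffers */
  }
  REAL total = 0.0;
  for (int c = 0; c < npts; ++c) total += f1[c];
  free(f1); free(f2);
  return total;
}
```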
11 Diffusion: baseline code diffusion_base.c
12 Performance metrics Floating-point performance: f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e] + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t] is 13 floating-point operations per inner-loop iteration Memory bandwidth (in GB/s): the number of bytes of volume data read and written during the call
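The two metrics can be computed as below (a sketch; the constant 13 is the slide's per-point flop count, and the byte count assumes one REAL read from f1 and one REAL written to f2 per point per sweep, which is how I read the slide — actual cache traffic on real hardware will differ):

```c
#include <stddef.h>

/* Floating-point rate in MFlops: 13 flops per mesh point per sweep. */
double mflops(int nx, int ny, int nz, long count, double elapsed_s) {
  double flops = 13.0 * (double)nx * (double)ny * (double)nz * (double)count;
  return flops / elapsed_s * 1.0e-6;
}

/* Memory bandwidth in GB/s: one REAL read and one REAL written
   per mesh point per sweep (volume data only). */
double bandwidth_gbs(int nx, int ny, int nz, long count,
                     double elapsed_s, size_t sizeof_real) {
  double bytes = 2.0 * (double)sizeof_real
               * (double)nx * (double)ny * (double)nz * (double)count;
  return bytes / elapsed_s * 1.0e-9;
}
```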
13 Baseline
14 Compilation: native, aggressive, for the Intel Xeon Phi

    $ icc -g -openmp -mmic -std=c99 -O3 -vec-report=3 diffusion_base.c -o diffusion_base

(-openmp: OpenMP switch; -mmic: build natively for the Xeon Phi; -O3: aggressive optimization; -vec-report=3: vectorization reports)

Environment, set on the Xeon Phi (what and where):

    export OMP_NUM_THREADS=<1 .. 228>
    export KMP_AFFINITY=<compact | scatter>

Execution on the Xeon Phi (ssh to the mic card and run natively):

    % ./diffusion_base
15 Runtime results

    diffusion_base_xphi thread num = 1 affinity =
    count is 65
    Running diffusion kernel 65 times
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09
16 vtune Run on the host; requires a script, run.sh:

    #!/bin/bash
    source /home/rogerphilp/psxevars.sh
    export OMP_NUM_THREADS=1
    export KMP_AFFINITY=
    echo diffusion_base_xphi thread num = ${OMP_NUM_THREADS} affinity = ${KMP_AFFINITY}
    /home/rogerphilp/diffusion/diffusion_base_xphi
17 vtune baseline statistics CPI may be too high: memory stalls, instruction starvation, branch misprediction, or long-latency instructions
18 Baseline thread analysis But only one core
19 vtune analysis of the baseline code Where the time is being spent CPU activity
20 vtune analysis of the diffusion_baseline code Red = regions of poor performance
21 Baseline vectorization report (diffusion_base.optrpt, -vec-report=3)

    diffusion_base.c(103,3)   outer loop:
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    diffusion_base.c(106,9)   inner loop (f1/f2 dependency):
      remark #15344: loop was not vectorized: vector dependence prevents vectorization.
        First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed FLOW dependence between f2 (line 115) and f1 (line 115)
    diffusion_base.c(105,7)   middle loop (f1/f2 dependency):
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    diffusion_base.c(104,5)   temporal loop:
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
22 Performance requirements To improve performance we initially need two key elements Scaling: OpenMP directives Vectorization: SIMD pragmas
23 Scaling: OpenMP
24 Scaling: OpenMP See updated function diffusion_openmp() #pragma omp parallel And collapse the z and y loops with #pragma omp for collapse(2) Effectively creating a loop for (zy = 0; zy < nz*ny; ++zy) Enables each thread to be assigned larger chunks of data Allows more efficiency on each pass through the loop
25 diffusion_omp.c

    #pragma omp parallel                   // section is marked as parallel
    {
      REAL *f1_t = f1, *f2_t = f2;
      for (int i = 0; i < count; ++i) {    // each thread gets the same index
    #pragma omp for collapse(2)            // z and y loops are collapsed
        for (int z = 0; z < nz; z++) {
          for (int y = 0; y < ny; y++) {
            for (int x = 0; x < nx; x++) {
              ...
              f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                      + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
              ...
            }
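A minimal compilable illustration of the collapse (a sketch: it uses the combined `#pragma omp parallel for` form rather than the deck's split parallel/for regions, and a toy scaling kernel in place of the stencil body; `scale_volume` is my name):

```c
/* The z and y loops are collapsed into one iteration space of nz*ny rows,
   so many threads can each be handed whole x-rows of length nx. */
void scale_volume(double *out, const double *in,
                  int nx, int ny, int nz, double a) {
#pragma omp parallel for collapse(2)
  for (int z = 0; z < nz; z++)
    for (int y = 0; y < ny; y++)
      for (int x = 0; x < nx; x++) {
        int c = x + y * nx + z * nx * ny;   /* same linearization as the kernel */
        out[c] = a * in[c];
      }
}
```

Without OpenMP enabled the pragma is ignored and the loops simply run serially, so the function stays correct either way.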
26 diffusion_omp.c: compact; 228 threads

    diffusion_omp_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

For comparison, diffusion_base_xphi FLOPS : (MFlops)
27 Speedup Experiment with the number of threads per core. [Chart: speedup using OpenMP vs. number of threads, for omp compact and omp scatter]
28 [vtune screenshots: 228-thread compact OpenMP CPU usage, where the CPI rate is better at 4.39, vs. 228-thread scatter OpenMP CPU usage, where the CPI rate is worse]
29 Diffusion OpenMP CPU usage histograms [threads = 228, affinity = compact vs. threads = 228, affinity = scatter]
30 Diffusion OpenMP usage histograms [threads = 228, affinity = compact vs. threads = 228, affinity = scatter]
31 diffusion_omp.c: compact vs. scatter
32 Hotspots [threads = 228, affinity = compact vs. threads = 228, affinity = scatter]
33 OpenMP vectorization report (diffusion_omp.optrpt, -vec-report=3)

    temporal loop: diffusion_omp.c(106,5) inlined into diffusion_omp.c(202,3)
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    middle loop (f1/f2 dependency): diffusion_omp.c(108,7) inlined into diffusion_omp.c(202,3)
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    inner loop (f1/f2 dependency): diffusion_omp.c(110,11) inlined into diffusion_omp.c(202,3)
      remark #15346: vector dependence: assumed FLOW dependence between f2_t line 119 and f1_t line
34 Vectorization
35 Forcing vectorization Add the vectorization pragma #pragma simd: tells the compiler to ignore suspected dependencies
36 diffusion_ompvect.c

    #pragma omp parallel
    for (int i = 0; i < count; ++i) {
    #pragma omp for collapse(2)
      for (int z = 0; z < nz; z++) {
        for (int y = 0; y < ny; y++) {
    #pragma simd                           // ignore suspected dependencies
          for (int x = 0; x < nx; x++) {
            ...
            f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                    + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
            ...
          }
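`#pragma simd` is Intel's legacy spelling; the standardized OpenMP 4.0 form is `#pragma omp simd`, used in this sketch. A toy 1-D variant with only the west/centre/east terms keeps it short, and `restrict` gives the compiler the same no-aliasing promise for the two buffers that the pragma asserts (`row_update` is my name):

```c
/* Update one x-row with the dependence assumption overridden, so the
   compiler vectorizes despite f1/f2 looking like they might alias. */
void row_update(double *restrict f2, const double *restrict f1,
                int nx, double cc, double cw, double ce) {
#pragma omp simd
  for (int x = 1; x < nx - 1; x++)
    f2[x] = cc * f1[x] + cw * f1[x - 1] + ce * f1[x + 1];
}
```

As with the kernel in the deck, the pragma is a promise to the compiler: if the two pointers really did overlap, forcing vectorization would silently produce wrong answers.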
37 diffusion_ompvect: compact

    diffusion_ompvect_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

For comparison, diffusion_omp_base_xphi FLOPS : (MFlops)
38 Speedup [Chart: speedup using OpenMP and vectorization vs. number of threads, for ompvec compact/scatter and omp compact/scatter]
39 diffusion_ompvect: threads = 228, affinity = compact. CPI has increased over OpenMP alone.
40 [Comparison: OpenMP only vs. OpenMP + vectorization, threads = 228, affinity = compact, OpenMP usage]
41 Diffusion: threads = 228, affinity = compact [OpenMP + vectorization vs. OpenMP only]
42 OpenMP vectorization report (diffusion_ompvect.optrpt, -vec-report=3)

    diffusion_ompvect.c(111,11)
      remark #15301: SIMD LOOP WAS VECTORIZED
      remark #15476: scalar loop cost: 66
      remark #15477: vector loop cost:
      remark #15478: estimated potential speedup:
    diffusion_ompvect.c(111,11)
      remark #15301: REMAINDER LOOP WAS VECTORIZED
    diffusion_ompvect.c(111,11)
      remark #15301: PEEL LOOP WAS VECTORIZED
43 Peel and remainder
44 Boundary update: pesky boundaries [Inner mesh vs. boundary mesh] Currently the boundary update is mixed in with the kernel, and its conditionals can cause vectorization issues. The boundary update only has to occur before the buffer pointer swap of f1 and f2, so it can run before, after, or before and after the execution of the main kernel.
45 New main kernel: diffusion_peel

    #pragma simd
    for (x = 1; x < nx-1; x++) {           // starts at index 1, ends at nx-2
      ++c; ++n; ++s; ++b; ++t;
      f2_t[c] = cc*f1_t[c] + cw*f1_t[c-1] + ce*f1_t[c+1]
              + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];
    }

No explicit conditionals: the new vectorised peeled kernel.
46

    int x, c, n, s, b, t;
    x = 0;                                 // first boundary (x == 0) updated
    c = x + y*nx + z*nx*ny;
    n = (y == 0)    ? c : c - nx;
    s = (y == ny-1) ? c : c + nx;
    b = (z == 0)    ? c : c - nx*ny;
    t = (z == nz-1) ? c : c + nx*ny;
    f2_t[c] = cc*f1_t[c] + cw*f1_t[c] + ce*f1_t[c+1]
            + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];

    // new simd kernel goes here

    ++c; ++n; ++s; ++b; ++t;               // second boundary (x == nx-1) updated
    f2_t[c] = cc*f1_t[c] + cw*f1_t[c-1] + ce*f1_t[c]
            + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];
        }
      }
      REAL *t = f1_t; f1_t = f2_t; f2_t = t;   // system updated
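Putting slide 45's interior loop together with these boundary updates, one full x-row looks roughly as follows (a sketch: `update_row` is my name, and `#pragma omp simd` stands in for the deck's Intel-specific `#pragma simd`; indices follow the deck's x + y*nx + z*nx*ny layout):

```c
/* One x-row of diffusion_peel: explicit x == 0 and x == nx-1 boundary
   points, plus a branch-free interior loop that vectorizes cleanly. */
void update_row(double *f2, const double *f1, int nx, int ny, int nz,
                int y, int z, double cc, double cw, double ce,
                double cn, double cs, double cb, double ct) {
  int c = y * nx + z * nx * ny;            /* x == 0 */
  int n = (y == 0)      ? c : c - nx;
  int s = (y == ny - 1) ? c : c + nx;
  int b = (z == 0)      ? c : c - nx * ny;
  int t = (z == nz - 1) ? c : c + nx * ny;
  /* west neighbour clamps to the point itself on the x == 0 face */
  f2[c] = cc * f1[c] + cw * f1[c] + ce * f1[c + 1]
        + cs * f1[s] + cn * f1[n] + cb * f1[b] + ct * f1[t];
#pragma omp simd
  for (int x = 1; x < nx - 1; x++)         /* interior: no conditionals */
    f2[c + x] = cc * f1[c + x] + cw * f1[c + x - 1] + ce * f1[c + x + 1]
              + cs * f1[s + x] + cn * f1[n + x]
              + cb * f1[b + x] + ct * f1[t + x];
  int cl = c + nx - 1;                     /* x == nx-1: east clamps to itself */
  f2[cl] = cc * f1[cl] + cw * f1[cl - 1] + ce * f1[cl]
         + cs * f1[s + nx - 1] + cn * f1[n + nx - 1]
         + cb * f1[b + nx - 1] + ct * f1[t + nx - 1];
}
```

With coefficients that sum to one, a uniform row maps to itself, which gives a quick sanity check that the three pieces agree at the seams.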
47 diffusion_peel.c: compact, 228 threads

    diffusion_peel_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

For comparison, diffusion_base_xphi FLOPS : (MFlops)
48 diffusion_peel.c [Chart: speedup vs. number of threads for omp, ompvec, and ompvecpl with balanced (b), compact (c), and scatter (s) affinities]
49 diffusion_peel.c: compact, 228 threads. CPI is higher still.
50 Affinity = compact, 228 threads [OpenMP + vectorization + peel vs. OpenMP + vectorization]
51 OpenMP vectorization report (diffusion_peel.optrpt, -vec-report=3)

    diffusion_peel.c(120,11)
      remark #15301: PEEL LOOP WAS VECTORIZED
      remark #15301: SIMD LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 7
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 52
      remark #15477: vector loop cost:
      remark #15478: estimated potential speedup:
52 A note on bandwidth [Chart: bandwidth (GB/s) vs. number of threads for ompvecpl, ompvec, and omp with compact and scatter affinities] The problem may move from compute bound to memory bound.
53 Note Intel's compilers are in a constant state of improvement, particularly with regard to finding vectorization opportunities. The compiler version used here was unable to vectorize automatically; a newer compiler version may succeed, but this one did not. The compiler may need a little more information to vectorize.
54 Summary
55 Overview of the procedure for optimizing the diffusion code: with -O3 optimisation throughout, profile the baseline, then OpenMP, then OpenMP + vectorization, then peel, reading the opt-report at each step; then consider tiling, affinity, and thread count.
56 diffusion_peel.c [Chart: speedup vs. number of threads for omp, ompvec, and ompvecpl with balanced (b), compact (c), and scatter (s) affinities]
57 Summary Compiled everything with -O3 Generated a baseline performance figure Applied multiple thread counts Applied two affinities: compact and scatter Reviewed the optimisation reports Analysed the program using vtune to find hotspots As a consequence we achieved a speedup of 400 times
58 Thank you
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationA Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA
A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région
More informationInvestigation of Intel MIC for implementation of Fast Fourier Transform
Investigation of Intel MIC for implementation of Fast Fourier Transform Soren Goyal Department of Physics IIT Kanpur e-mail address: soren@iitk.ac.in The objective of the project was to run the code for
More informationEarly Experiences Writing Performance Portable OpenMP 4 Codes
Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic
More informationHybrid MPI - A Case Study on the Xeon Phi Platform
Hybrid MPI - A Case Study on the Xeon Phi Platform Udayanga Wickramasinghe Center for Research on Extreme Scale Technologies (CREST) Indiana University Greg Bronevetsky Lawrence Livermore National Laboratory
More informationKNL tools. Dr. Fabio Baruffa
KNL tools Dr. Fabio Baruffa fabio.baruffa@lrz.de 2 Which tool do I use? A roadmap to optimization We will focus on tools developed by Intel, available to users of the LRZ systems. Again, we will skip the
More informationLecture 4: OpenMP Open Multi-Processing
CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP
More informationOpenMP: Vectorization and #pragma omp simd. Markus Höhnerbach
OpenMP: Vectorization and #pragma omp simd Markus Höhnerbach 1 / 26 Where does it come from? c i = a i + b i i a 1 a 2 a 3 a 4 a 5 a 6 a 7 a 8 + b 1 b 2 b 3 b 4 b 5 b 6 b 7 b 8 = c 1 c 2 c 3 c 4 c 5 c
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationConvey Vector Personalities FPGA Acceleration with an OpenMP-like programming effort?
Convey Vector Personalities FPGA Acceleration with an OpenMP-like programming effort? Björn Meyer, Jörn Schumacher, Christian Plessl, Jens Förstner University of Paderborn, Germany 2 ), - 4 * 4 + - 6-4.
More informationOpenMP 4.0. Mark Bull, EPCC
OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!
More informationOP2 FOR MANY-CORE ARCHITECTURES
OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC
More informationBenchmark results on Knight Landing architecture
Benchmark results on Knight Landing architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Milano, 21/04/2017 KNL vs BDW A1 BDW A2 KNL cores per node 2 x 18 @2.3 GHz 1 x 68
More informationHPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,
HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:
More informationAutomatic Polyhedral Optimization of Stencil Codes
Automatic Polyhedral Optimization of Stencil Codes ExaStencils 2014 Stefan Kronawitter Armin Größlinger Christian Lengauer 31.03.2014 The Need for Different Optimizations 3D 1st-grade Jacobi smoother Speedup
More informationIntel profiling tools and roofline model. Dr. Luigi Iapichino
Intel profiling tools and roofline model Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimization (and to the next hour) We will focus on tools developed
More informationRevealing the performance aspects in your code
Revealing the performance aspects in your code 1 Three corner stones of HPC The parallelism can be exploited at three levels: message passing, fork/join, SIMD Hyperthreading is not quite threading A popular
More informationCode Optimization Process for KNL. Dr. Luigi Iapichino
Code Optimization Process for KNL Dr. Luigi Iapichino luigi.iapichino@lrz.de About the presenter Dr. Luigi Iapichino Scientific Computing Expert, Leibniz Supercomputing Centre Member of the Intel Parallel
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationOpenMP on the IBM Cell BE
OpenMP on the IBM Cell BE PRACE Barcelona Supercomputing Center (BSC) 21-23 October 2009 Marc Gonzalez Tallada Index OpenMP programming and code transformations Tiling and Software Cache transformations
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationSymmetric Computing. Jerome Vienne Texas Advanced Computing Center
Symmetric Computing Jerome Vienne Texas Advanced Computing Center Symmetric Computing Run MPI tasks on both MIC and host Also called heterogeneous computing Two executables are required: CPU MIC Currently
More informationSIMD Exploitation in (JIT) Compilers
SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input
More informationOptimisation Myths and Facts as Seen in Statistical Physics
Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY
More information