Advanced Parallel Programming II
1 Advanced Parallel Programming II Alexander Leutgeb, RISC Software GmbH, Johannes Kepler University Linz
2 Introduction to Vectorization
3 Motivation Increase in the number of cores; threading techniques improve performance. But flops per cycle of the vector units increased as much as the number of cores, so not using vector units wastes flops/watt. For best performance: use all cores and make efficient use of the vector units. Ignoring the potential of vector units is as inefficient as using only one core.
4 Vector Unit Single Instruction Multiple Data (SIMD) units, mostly for floating point operations; data parallelization with one instruction. A 64-bit unit does 1 DP flop or 2 SP flops per operation, a 128-bit unit 2 DP flops or 4 SP flops. Multiple data elements are loaded into vector registers and used by the vector units. Some architectures execute more than one vector instruction per cycle (e.g. Sandy Bridge).
5 Parallel Execution The scalar version works on one element at a time: a[i] = b[i] + c[i] * d[i]. The vector version carries out the same instruction on many elements at a time: a[i:8] = b[i:8] + c[i:8] * d[i:8], i.e. a[i] = b[i] + c[i] * d[i] through a[i+7] = b[i+7] + c[i+7] * d[i+7] in a single operation.
6 Vector Registers
7 Vector Unit Usage (Programmer's View) From ease of use to programmer control: use vectorized libraries (e.g. Intel MKL); fully automatic vectorization; auto vectorization hints (#pragma ivdep); the SIMD feature (#pragma simd and SIMD function annotation); vector intrinsics (e.g. _mm_add_ps()); ASM code (e.g. addps).
8 Auto Vectorization
9 Auto Vectorization Modern compilers analyse loops in serial code to identify candidates for vectorization, perform loop transformations to enable it, and use the instruction set of the target architecture.
10 Common Compiler Switches GCC and ICC Disable optimization: -O0. Optimize for speed, no code size increase: -O1. Optimize for speed (default): -O2. High-level optimizer, e.g. loop unrolling: -O3. Aggressive optimizations (e.g. ipo, -O3, ...): -fast. Create symbols for debugging: -g. Generate assembly files: -S. OpenMP support: -openmp (ICC; GCC uses -fopenmp).
11 Architecture Specific Compiler Switches GCC Optimize for current machine: -march=native. Generate SSE v1 code: -msse. Generate SSE v2 code (default, may also emit SSE v1 code): -msse2. Generate SSE v3 code (may also emit SSE v1 and v2 code): -msse3. Generate SSSE3 code (may also emit SSE v1, v2, and v3 code): -mssse3. Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code): -msse4.1. Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and v4.1 code): -msse4.2. Generate AVX code: -mavx. Generate AVX v2 code: -mavx2.
12 Architecture Specific Compiler Switches ICC Optimize for current machine: -xhost. Generate SSE v1 code: -xsse1. Generate SSE v2 code (default, may also emit SSE v1 code): -xsse2. Generate SSE v3 code (may also emit SSE v1 and v2 code): -xsse3. Generate SSE v3 code for Atom-based processors: -xsse_atom. Generate SSSE3 code (may also emit SSE v1, v2, and v3 code): -xssse3. Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code): -xsse4.1. Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and v4.1 code): -xsse4.2. Generate AVX code: -xavx. (For Intel processors use -x, for non-Intel processors use -m.)
13 Example 4 Simple Vector Addition 1. Go to the directory example_4. 2. Compile simple.cpp without vectorization and execute the binary: icc -std=c++11 -O2 -no-vec simple.cpp (g++ -std=c++11 -O2 simple.cpp). 3. Compile simple.cpp with vectorization and execute the binary: icc -std=c++11 -O2 simple.cpp (g++ -std=c++11 -O2 -ftree-vectorize simple.cpp). 4. What is the difference in the execution times? 5. Can the execution times be improved further?
14 SIMD Vectorization Basics Vectorization offers a good performance improvement on floating point intensive code. Vectorized code can compute slightly different results than non-vectorized code (x87 FPU: 80-bit, SIMD: 64-bit arithmetic). Even for scalar operations the vector unit is used. Vectorization is only one aspect of improving performance; efficient use of the cache is also necessary.
15 Parallelization at No Cost We tell the compiler to vectorize, but that's not the whole story: there are cases where a compiler cannot vectorize the code. How can we analyse such situations? With the vectorization report, generated by the compiler. The most interesting part is what was not done and why: focus on the code paths that were not vectorized.
16 Example 5 Vectorization Report 1. Go to the directory example_5. 2. Compile simple.cpp with vectorization report generation enabled: icc -std=c++11 -O2 -vec-report=2 simple.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed simple.cpp). 3. Which positive/negative information does the vectorization report give? 4. Insert the following code before std::cout in calcsp() and compile again. Have a look at the vectorization report. for (i = 1; i < VECSIZE; i++) { a[i] = a[i] + a[i - 1]; }
17 Vectorization Report ICC Which code was vectorized? Which code was not vectorized? Compiler switch -vec-report<n>: n=0: no diagnostic information; n=1 (default): vectorized loops; n=2: vectorized/non-vectorized loops (and why); n=3: additional dependency information; n=4: only non-vectorized loops; n=5: only non-vectorized loops and dependency information; n=6: vectorized/non-vectorized loops with details.
18 Loop Unrolling Unrolling allows the compiler to reconstruct the loop for vector operations. for (i = 0; i < N; i++) { a[i] = b[i] * c[i]; } becomes for (i = 0; i < N; i += 4) { a[i] = b[i] * c[i]; a[i + 1] = b[i + 1] * c[i + 1]; a[i + 2] = b[i + 2] * c[i + 2]; a[i + 3] = b[i + 3] * c[i + 3]; } which maps to: load b(i, ..., i+3); load c(i, ..., i+3); operate b * c -> a; store a(i, ..., i+3).
19 Requirements for Auto Vectorization Countable, single entry and exit: while (i < 100) { a[i] = b[i] * c[i]; if (a[i] < 0.0) break; ++i; } // data-dependent exit condition: loop not vectorized. Straight-line code (no switch; if only with masking): for (int i = 0; i < length; i++) { float s = b[i] * b[i] - 4 * a[i] * c[i]; if (s >= 0) x[i] = sqrt(s); else x[i] = 0.; } // loop vectorized (because of masking)
20 Requirements for Auto Vectorization Only the innermost loop (caution in case of loop interchange or loop collapsing). No function calls, except: intrinsic math functions (sin, log, ...), inline functions, and elemental functions (__attribute__((vector))).
21 Inhibitors of Auto Vectorization Non-contiguous data: // arrays accessed with stride 2: for (int i=0; i<size; i+=2) b[i] += a[i] + x[i]; // inner loop accesses a with stride SIZE: for (int j=0; j<size; j++) for (int i=0; i<size; i++) b[i] += a[i][j] * x[j]; // indirect addressing of x using an index array: for (int i=0; i<size; i+=2) b[i] += a[i] * x[index[i]]; Stride 1 is best. Caution in case of multi-dimensional arrays: the Fortran 90 loop nest do j=1,n / do i=1,n / a(i,j)=b(i,j)*s / enddo / enddo corresponds to the C loop nest for (j=0; j<n; j++) for (i=0; i<n; i++) a[j][i] = b[j][i] * s; (Fortran is column-major, C is row-major).
22 Inhibitors of Auto Vectorization Inability to rule out aliasing (overlapping data). Runtime check possible: multi-versioned code (vectorized/non-vectorized): void my_cp(int nx, double* a, double* b) { for (int i = 0; i < nx; i++) a[i] = b[i]; } Runtime check not possible: non-vectorized: void my_combine(int* ioff, int nx, double* a, double* b, double* c) { for (int i = 0; i < nx; i++) { a[i] = b[i] + c[i + *ioff]; } } This would vectorize with strict aliasing (-ansi-alias).
23 Inhibitors of Auto Vectorization Vector dependencies. Read after write (RAW), non-vectorizable: for (i = 1; i < N; i++) a[i] = a[i - 1] + b[i]; Write after read (WAR), vectorizable: for (i = 0; i < N - 1; i++) a[i] = a[i + 1] + b[i]; Read after read (RAR), vectorizable: for (i = 0; i < N; i++) a[i] = b[i % M] + c[i]; Write after write (WAW), non-vectorizable: for (i = 0; i < N; i++) a[i % M] = b[i] + c[i];
24 Efficiency Aspects for Auto Vectorization Alignment Address alignment (SSE: 16 bytes, AVX: 32 bytes) is checked at compile time where possible; otherwise it is checked at runtime, producing peel and remainder loops. Explicit definition in code: local/global: __attribute__((aligned(16))); heap: _mm_malloc, _mm_free; compiler support: __assume_aligned(x, 16). Example: void fill(char* x) { for (int i=0; i<1024; i++) x[i]=1; } Peeling: peel = (uintptr_t)x & 0x0f; if (peel != 0) { peel = 16 - peel; for (i = 0; i < peel; i++) x[i] = 1; } for (i = peel; i < 1024; i++) x[i] = 1;
25 Efficiency Aspects for Auto Vectorization Data Layout Structure of Arrays (SoA) instead of Array of Structures (AoS). struct Vector3d { // AoS double x; double y; double z; }; gives the memory layout x y z x y z x y z x y z. struct Vectors3d { // SoA double* x; double* y; double* z; }; gives the layout x x x x y y y y z z z z.
26 Example 6 Non-Contiguous Data 1. Go to the directory example_6. 2. Compile contig.cpp with the vectorization report enabled: icc -std=c++11 -O2 -vec-report=2 contig.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed contig.cpp). 3. What does the vectorization report tell you?
27 Compiler Directives ICC Many compilers have directives for vectorization hints. ivdep (C: #pragma ivdep, Fortran: !dec$ ivdep): asserts no vector dependency in the loop (usage in the presence of an actual dependency leads to incorrect code). vector always (C: #pragma vector always, Fortran: !dec$ vector always). Elemental functions: __attribute__((vector)).
28 Compiler Directives ICC SIMD Extensions SIMD pragma and SIMD function annotation __attribute__((simd)). Differences to the traditional ivdep and vector always: the traditional pragmas are more like hints, the new SIMD extension is more like an assertion. Fine control over auto vectorization with additional clauses (vectorlength, private, linear, reduction, assert).
29 Example 7 Vector Dependency Go to the directory example_7. Compile the file forward.cpp and execute the binary: icc -std=c++11 -O2 -vec-report=2 forward.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec forward.cpp). What is the execution time? What does the vectorization report tell you? What happens if you split the inner loop into two separate update loops for b and a?
30 Vector Intrinsics
31 Vector Intrinsics void vec_eltwise_product_avx(vec_t* a, vec_t* b, vec_t* c) { size_t i; __m256 va; __m256 vb; __m256 vc; for (i = 0; i < a->size; i += 8) { va = _mm256_loadu_ps(&a->data[i]); vb = _mm256_loadu_ps(&b->data[i]); vc = _mm256_mul_ps(va, vb); _mm256_storeu_ps(&c->data[i], vc); } }
32 Vector Intrinsics Different data types for different architectures: SSE: __m128, __m128d, __m128i; AVX: __m256, __m256d, __m256i. Different operations for different architectures: SSE: _mm_add_ps(), _mm_add_pd(), ...; AVX: _mm256_add_ps(), _mm256_add_pd(), ... Portability: implement a wrapper for the different architectures.
33 Cilk Plus Array Notation
34 Cilk Plus Array Notation C/C++ language extension with support for data parallel operations. New language array expression: array-expression[lower-bound : length : stride]. Default values of each argument in [:]: lower-bound: 0; length: length of the array; stride: 1 (if the stride is default, the second : may be omitted). array-expression[:]: the array section is the entire array of known length with stride 1.
35 Cilk Plus Array Notation array-expression[:][:] denotes a two-dimensional array section. Two new terms: rank (number of array sections of a single array; rank zero means scalar) and shape (length of each array section). In a statement, all expressions must have the same rank and shape, or rank zero (which is broadcast to each element). Built-in functions (reductions, ...). Overlap between LHS and RHS array expressions is undefined behaviour (unless they overlap exactly).
36 Cilk Plus Array Notation No new data types (use of the existing array types in C and C++). Short-hand for an entire array: array[:]. Exception: dynamically allocated arrays require array[start : length]. Availability: Intel C/C++ compiler, GCC 5, a Clang/LLVM fork. Arrays used in the following examples: int a[10]; int b[10]; int c[10][10]; int d[10];
37 Examples Cilk array notation and equivalent scalar C/C++ code: a[:] = 5; ≡ for (i = 0; i < 10; i++) a[i] = 5; a[0:7] = 5; ≡ for (i = 0; i < 7; i++) a[i] = 5; a[7:3] = 4; ≡ for (i = 7; i < (7 + 3); i++) a[i] = 4; a[0:5:2] = 5; ≡ for (i = 0; i < 10; i += 2) a[i] = 5; a[1:5:2] = 4; ≡ for (i = 1; i < 10; i += 2) a[i] = 4; a[:] = b[:]; ≡ for (i = 0; i < 10; i++) a[i] = b[i]; a[:] = b[:] + 5; ≡ for (i = 0; i < 10; i++) a[i] = b[i] + 5;
38 Examples Cilk array notation and equivalent scalar C/C++ code: d[:] = a[:] + b[:]; ≡ for (i = 0; i < 10; i++) d[i] = a[i] + b[i]; a[0:n] = 5; ≡ for (i = 0; i < n; i++) a[i] = 5; c[:][:] = 12; ≡ for (i = 0; i < 10; i++) for (j = 0; j < 10; j++) c[i][j] = 12; c[0:5:2][:] = 12; ≡ for (i = 0; i < 10; i += 2) for (j = 0; j < 10; j++) c[i][j] = 12; c[4][:] = a[:]; ≡ for (j = 0; j < 10; j++) c[4][j] = a[j]; func(a[:]); ≡ for (i = 0; i < 10; i++) func(a[i]); d[:] = a[b[:]]; ≡ for (i = 0; i < 10; i++) d[i] = a[b[i]]; a[b[:]] = d[:]; ≡ for (i = 0; i < 10; i++) a[b[i]] = d[i];
39 Examples Cilk array notation and equivalent scalar C/C++ code: if (5 == a[:]) b[:] = 1; else b[:] = 0; ≡ for (i = 0; i < 10; i++) { if (5 == a[i]) b[i] = 1; else b[i] = 0; } if ((5 == a[:]) || (8 == a[:])) b[:] = 1; else b[:] = 0; ≡ for (i = 0; i < 10; i++) { if ((5 == a[i]) || (8 == a[i])) b[i] = 1; else b[i] = 0; } a[:] = b[:] < 5 ? b[:] : a[:]; ≡ for (i = 0; i < 10; i++) a[i] = b[i] < 5 ? b[i] : a[i];
40 Functions A scalar function is applied to all elements of an array section (element type overloading in C++). Example: a[:] = sin(b[:]). The compiler may use a vectorized version of the function: a built-in function or a user-defined SIMD-enabled (elemental) function: __attribute__((vector)).
41 Builtin Functions sec_reduce_add(a[:]): scalar that is the sum of all elements in the array section. sec_reduce_mul(a[:]): scalar that is the product of all elements in the array section. sec_reduce_max(a[:]): scalar that is the largest element in the array section. sec_reduce_min(a[:]): scalar that is the smallest element in the array section. sec_reduce_max_ind(a[:]): integer index of the largest element in the array section. sec_reduce_min_ind(a[:]): integer index of the smallest element in the array section. sec_reduce_all_zero(a[:]): 1 if all elements of the array section are zero, else 0. sec_reduce_all_nonzero(a[:]): 1 if all elements of the array section are non-zero, else 0. sec_reduce_any_zero(a[:]): 1 if any element of the array section is zero, else 0. sec_reduce_any_nonzero(a[:]): 1 if any element of the array section is non-zero, else 0.
42 Example 8 Minimum Distance Computation
43 Example 8 Minimum Distance Computation Go to the directory example_8. Compile BinarySTLReader.cpp and distance.cpp: icc -O2 -openmp BinarySTLReader.cpp distance.cpp. Execute the binary. What are the execution times? Enable OpenMP support (currently commented out), compile, and execute again. What are the execution times?
44 Portability
45 Portability Initial situation: auto vectorization is enabled by default (Intel), with SSE2 as the default instruction set (-msse2). Goal: optimal performance on the target machine by using the latest instruction set architecture (-xavx). Problem: illegal instruction exception on older hardware. Solution: create several binaries; better: use CPU dispatch.
46 ICC CPU Dispatch Generation of multiple code paths via the compiler switch -ax. Base line path: other switches (e.g. -O3) apply to the base line path; specified via -x or -m (default -msse2). Alternative path: specified via -ax (e.g. -axavx). The path is selected based on the executing CPU.
47 Manual CPU Dispatch Usage of __attribute__((cpu_dispatch(cpuid, ...))). Example: __attribute__((cpu_dispatch(generic, future_cpu_16))) void dispatch_func() {}; __attribute__((cpu_specific(generic))) void dispatch_func() { /* Code for generic */ } __attribute__((cpu_specific(future_cpu_16))) void dispatch_func() { /* Code for future_cpu_16 */ } int main() { dispatch_func(); }
48 Conclusion Vectorization can improve the runtime performance of floating point intensive loops (increasing flops/watt). Auto vectorization: many factors inhibit auto vectorization (some do not apply to processors with features like scatter/gather); the vectorization report helps to identify (non-)vectorized code; if one compiler fails, another may succeed; the Intel compiler can provide hints for code modifications. Explicit vectorization: vector intrinsics are an abstraction over vector assembly; Cilk Plus array notation is a good abstraction for data parallelism.
49 Thank You! Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn, Valais, Switzerland
More informationPerformance Analysis and Optimization MAQAO Tool
Performance Analysis and Optimization MAQAO Tool Andrés S. CHARIF-RUBIAL Emmanuel OSERET {achar,emmanuel.oseret}@exascale-computing.eu Exascale Computing Research 11th VI-HPS Tuning Workshop MAQAO Tool
More informationCompilers and optimization techniques. Gabriele Fatigati - Supercomputing Group
Compilers and optimization techniques Gabriele Fatigati - g.fatigati@cineca.it Supercomputing Group The compilation is the process by which a high-level code is converted to machine languages. Born to
More informationImproving graphics processing performance using Intel Cilk Plus
Improving graphics processing performance using Intel Cilk Plus Introduction Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationOpenCL Vectorising Features. Andreas Beckmann
Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels
More informationIntel Array Building Blocks (Intel ArBB) Technical Presentation
Intel Array Building Blocks (Intel ArBB) Technical Presentation Copyright 2010, Intel Corporation. All rights reserved. 1 Noah Clemons Software And Services Group Developer Products Division Performance
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationTechnical Report. Research Lab: LERIA
Technical Report Improvement of Fitch function for Maximum Parsimony in Phylogenetic Reconstruction with Intel AVX2 assembler instructions Research Lab: LERIA TR20130624-1 Version 1.0 24 June 2013 JEAN-MICHEL
More informationCS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010
Parallel Programming Lecture 7: Introduction to SIMD Mary Hall September 14, 2010 Homework 2, Due Friday, Sept. 10, 11:59 PM To submit your homework: - Submit a PDF file - Use the handin program on the
More informationBoundary element quadrature schemes for multi- and many-core architectures
Boundary element quadrature schemes for multi- and many-core architectures Jan Zapletal, Michal Merta, Lukáš Malý IT4Innovations, Dept. of Applied Mathematics VŠB-TU Ostrava jan.zapletal@vsb.cz Intel MIC
More information( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture
( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline
More informationImproving performance of the N-Body problem
Improving performance of the N-Body problem Efim Sergeev Senior Software Engineer at Singularis Lab LLC Contents Theory Naive version Memory layout optimization Cache Blocking Techniques Data Alignment
More informationCode optimization in a 3D diffusion model
Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18 th 2016, Barcelona Agenda Background Diffusion
More informationDan Stafford, Justine Bonnot
Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationCS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions
CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions Randal E. Bryant David R. O Hallaron January 14, 2016 Notice The material in this document is supplementary material to
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationIntel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2
Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationIntel Software and Services, Kirill Rogozhin
Intel Software and Services, 2016 Kirill Rogozhin Agenda Motivation for vectorization OpenMP 4.x programming model Intel Advisor: focus and characterize Enable vectorization on scalar code Speed-up your
More informationUnder the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.
Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.
More informationOpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat
OpenMP 4.0 implementation in GCC Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Work started in April 2013, C/C++ support with host fallback only
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical
More informationSSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals
SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions
More informationIntel Compilers for C/C++ and Fortran
Intel Compilers for C/C++ and Fortran Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Important Optimization Options for HPC High Level Optimizations (HLO) Pragmas Interprocedural
More informationMasterpraktikum Scientific Computing
Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Logins Levels of Parallelism Single Processor Systems Von-Neumann-Principle
More informationCilk Plus in GCC. GNU Tools Cauldron Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation
Cilk Plus in GCC GNU Tools Cauldron 2012 Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation July 10, 2012 Presentation Outline Introduction Cilk Plus components Implementation GCC Project Status
More informationCS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions
CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions Randal E. Bryant David R. O Hallaron October 12, 2015 Notice The material in this document is supplementary material to
More informationOPENMP FOR ACCELERATORS
7th International Workshop on OpenMP Chicago, Illinois, USA James C. Beyer, Eric J. Stotzer, Alistair Hart, and Bronis R. de Supinski OPENMP FOR ACCELERATORS Accelerator programming Why a new model? There
More informationPRACE PATC Course: Vectorisation & Basic Performance Overview. Ostrava,
PRACE PATC Course: Vectorisation & Basic Performance Overview Ostrava, 7-8.2.2017 1 Agenda Basic Vectorisation & SIMD Instructions IMCI Vector Extension Intel compiler flags Hands-on Intel Tool VTune Amplifier
More informationWhat s P. Thierry
What s new@intel P. Thierry Principal Engineer, Intel Corp philippe.thierry@intel.com CPU trend Memory update Software Characterization in 30 mn 10 000 feet view CPU : Range of few TF/s and
More informationAdvanced programming with OpenMP. Libor Bukata a Jan Dvořák
Advanced programming with OpenMP Libor Bukata a Jan Dvořák Programme of the lab OpenMP Tasks parallel merge sort, parallel evaluation of expressions OpenMP SIMD parallel integration to calculate π User-defined
More informationMAQAO Hands-on exercises FROGGY Cluster
MAQAO Hands-on exercises FROGGY Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/projects/pr-vi-hps-tw18/tutorial/maqao.tar.bz2
More informationParallel processing with OpenMP. #pragma omp
Parallel processing with OpenMP #pragma omp 1 Bit-level parallelism long words Instruction-level parallelism automatic SIMD: vector instructions vector types Multiple threads OpenMP GPU CUDA GPU + CPU
More informationNative Computing and Optimization. Hang Liu December 4 th, 2013
Native Computing and Optimization Hang Liu December 4 th, 2013 Overview Why run native? What is a native application? Building a native application Running a native application Setting affinity and pinning
More informationVectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition
Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition Aaron Birkland Consultant Cornell Center for Advanced Computing December 11, 2012 1 Simple Vectorization This lab serves as an
More informationOpenMP 4.5: Threading, vectorization & offloading
OpenMP 4.5: Threading, vectorization & offloading Michal Merta michal.merta@vsb.cz 2nd of March 2018 Agenda Introduction The Basics OpenMP Tasks Vectorization with OpenMP 4.x Offloading to Accelerators
More informationDynamic SIMD Scheduling
Dynamic SIMD Scheduling Florian Wende SC15 MIC Tuning BoF November 18 th, 2015 Zuse Institute Berlin time Dynamic Work Assignment: The Idea Irregular SIMD execution Caused by branching: control flow varies
More informationAccelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture
Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome
More informationCilk Plus GETTING STARTED
Cilk Plus GETTING STARTED Overview Fundamentals of Cilk Plus Hyperobjects Compiler Support Case Study 3/17/2015 CHRIS SZALWINSKI 2 Fundamentals of Cilk Plus Terminology Execution Model Language Extensions
More informationAn Intel Xeon Phi Backend for the ExaStencils Code Generator
Bachelor thesis An Intel Xeon Phi Backend for the ExaStencils Code Generator Thomas Lang Supervisor: Tutor: Prof. Christian Lengauer, Ph.D. Dr. Armin Größlinger 27th April 2016 Abstract Stencil computations
More informationDouble-precision General Matrix Multiply (DGEMM)
Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply
More informationCompiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005
Compiling for Performance on hp OpenVMS I64 Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compilers discussed C, Fortran, [COBOL, Pascal, BASIC] Share GEM optimizer
More informationSIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016
SIMD Instructions outside and inside Oracle 2c Laurent Léturgez 206 Whoami Oracle Consultant since 200 Former developer (C, Java, perl, PL/SQL) Owner@Premiseo: Data Management on Premise and in the Cloud
More informationShort Notes of CS201
#includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system
More information