Advanced Parallel Programming II


1 Advanced Parallel Programming II
Alexander Leutgeb, RISC Software GmbH
RISC Software GmbH Johannes Kepler University Linz

2 Introduction to Vectorization

3 Motivation
- Increase in the number of cores; threading techniques improve performance
- But the flops per cycle of the vector units have increased as much as the number of cores
- Not using the vector units wastes flops/watt
- For best performance: use all cores and make efficient use of the vector units
- Ignoring the potential of the vector units is as inefficient as using only one core

4 Vector Unit
- Single Instruction Multiple Data (SIMD) units, mostly for floating point operations
- Data parallelization with one instruction: a 64-bit unit does 1 DP / 2 SP flops, a 128-bit unit 2 DP / 4 SP flops
- Multiple data elements are loaded into vector registers and used by the vector units
- Some architectures can issue more than one vector instruction per cycle (e.g. Sandy Bridge)

5 Parallel Execution
- The scalar version works on one element at a time: a[i] = b[i] + c[i] * d[i];
- The vector version carries out the same instruction on many elements at a time: a[i:8] = b[i:8] + c[i:8] * d[i:8];
- i.e. a[i] = b[i] + c[i] * d[i], a[i+1] = b[i+1] + c[i+1] * d[i+1], ..., a[i+7] = b[i+7] + c[i+7] * d[i+7] are computed in one step

6 Vector Registers

7 Vector Unit Usage (Programmer's View)
From ease of use to full programmer control:
- Use vectorized libraries (e.g. Intel MKL)
- Fully automatic vectorization
- Auto vectorization hints (#pragma ivdep)
- SIMD feature (#pragma simd and simd function annotation)
- Vector intrinsics (e.g. _mm_add_ps())
- ASM code (e.g. addps)

8 Auto Vectorization

9 Auto Vectorization
- Modern compilers analyse loops in serial code to identify candidates for vectorization
- They perform loop transformations to enable vectorization
- They use the instruction set of the target architecture

10 Common Compiler Switches GCC and ICC
- -O0: disable optimization
- -O1: optimize for speed (no code size increase)
- -O2: optimize for speed (default)
- -O3: high-level optimizer (e.g. loop unrolling)
- -fast: aggressive optimizations (e.g. ipo, -O3, ...)
- -g: create symbols for debugging
- -S: generate assembly files
- -openmp: OpenMP support

11 Architecture Specific Compiler Switches GCC
- -march=native: optimize for the current machine
- -msse: generate SSE v1 code
- -msse2: generate SSE v2 code (default, may also emit SSE v1 code)
- -msse3: generate SSE v3 code (may also emit SSE v1 and v2 code)
- -mssse3: generate SSSE3 code (may also emit SSE v1, v2, and v3 code)
- -msse4.1: generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)
- -msse4.2: generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and SSE4.1 code)
- -mavx: generate AVX code
- -mavx2: generate AVX v2 code

12 Architecture Specific Compiler Switches ICC *
- -xhost: optimize for the current machine
- -xsse1: generate SSE v1 code
- -xsse2: generate SSE v2 code (default, may also emit SSE v1 code)
- -xsse3: generate SSE v3 code (may also emit SSE v1 and v2 code)
- -xsse3_atom: generate SSE v3 code for Atom-based processors
- -xssse3: generate SSSE3 code (may also emit SSE v1, v2, and v3 code)
- -xsse4.1: generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)
- -xsse4.2: generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and SSE4.1 code)
- -xavx: generate AVX code
* For Intel processors use -x..., for non-Intel processors use -m...

13 Example 4 Simple Vector Addition
1. Go to the directory example_4.
2. Compile simple.cpp without vectorization and execute the binary: icc -std=c++11 -O2 -no-vec simple.cpp (g++ -std=c++11 -O2 simple.cpp)
3. Compile simple.cpp with vectorization and execute the binary: icc -std=c++11 -O2 simple.cpp (g++ -std=c++11 -O2 -ftree-vectorize simple.cpp)
4. What is the difference in the execution times?
5. Can the execution times be further improved?

14 SIMD Vectorization Basics
- Vectorization offers a good performance improvement on floating point intensive code
- Vectorized code may compute slightly different results than non-vectorized code (x87 FPU: 80-bit, SIMD: 64-bit)
- Even for scalar operations the vector unit is used
- Vectorization is only one aspect of performance: efficient use of the cache is also necessary

15 Parallelization at No Cost?
- We tell the compiler to vectorize, but that's not the whole story
- There are cases where a compiler cannot vectorize the code
- How can we analyse such situations? The vectorization report, generated by the compiler
- More interesting is what was not done and why: focus on the code paths that were not vectorized

16 Example 5 Vectorization Report
1. Go to the directory example_5.
2. Compile simple.cpp with vectorization report generation enabled: icc -std=c++11 -O2 -vec-report=2 simple.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed simple.cpp)
3. Which positive/negative information does the vectorization report give?
4. Insert the following code before std::cout in calcsp() and compile again. Have a look at the vectorization report.
for (i = 1; i < VECSIZE; i++) { a[i] = a[i] + a[i - 1]; }

17 Vectorization Report ICC
Which code was vectorized? Which code was not vectorized? Compiler switch -vec-report<n>:
- n=0: no diagnostic information
- n=1: vectorized loops (default)
- n=2: vectorized/non-vectorized loops (and why)
- n=3: additional dependency information
- n=4: only non-vectorized loops
- n=5: only non-vectorized loops and dependency information
- n=6: vectorized/non-vectorized loops with details

18 Loop Unrolling
Unrolling allows the compiler to reconstruct the loop for vector operations:
for (i = 0; i < N; i++) { a[i] = b[i] * c[i]; }
becomes
for (i = 0; i < N; i += 4) { a[i] = b[i] * c[i]; a[i + 1] = b[i + 1] * c[i + 1]; a[i + 2] = b[i + 2] * c[i + 2]; a[i + 3] = b[i + 3] * c[i + 3]; }
which maps to: load b(i, ..., i + 3); load c(i, ..., i + 3); operate b * c -> a; store a(i, ..., i + 3)

19 Requirements for Auto Vectorization
- Countable loop with a single entry and a single exit:
while (i < 100) { a[i] = b[i] * c[i]; if (a[i] < 0.0) break; ++i; } // not vectorized: data-dependent exit condition
- Straight-line code (no switch; if only where the compiler can apply masking):
for (int i = 0; i < length; i++) { float s = b[i] * b[i] - 4 * a[i] * c[i]; if (s >= 0) x[i] = sqrt(s); else x[i] = 0.; } // vectorized (via masking)

20 Requirements for Auto Vectorization
- Only the innermost loop (caution in case of loop interchange or loop collapsing)
- No function calls, except: intrinsic math functions (sin, log, ...), inline functions, elemental functions (__attribute__((vector)))

21 Inhibitors of Auto Vectorization
Non-contiguous data:
// arrays accessed with stride 2
for (int i = 0; i < size; i += 2) b[i] += a[i] + x[i];
// inner loop accesses a with stride SIZE
for (int j = 0; j < size; j++) for (int i = 0; i < size; i++) b[i] += a[i][j] * x[j];
// indirect addressing of x using an index array
for (int i = 0; i < size; i += 2) b[i] += a[i] * x[index[i]];
Stride 1 is best. Caution with multidimensional arrays: the Fortran 90 loop nest do j=1,n; do i=1,n; a(i,j)=b(i,j)*s; enddo; enddo corresponds to the C loop nest for (j=0; j<n; j++) for (i=0; i<n; i++) a[j][i]=b[j][i]*s; (both access memory with stride 1)

22 Inhibitors of Auto Vectorization
Inability to prove the absence of aliasing (overlapping data):
- Runtime check possible -> multi-versioned code (vectorized/non-vectorized):
void my_cp(int nx, double* a, double* b) { for (int i = 0; i < nx; i++) a[i] = b[i]; }
- Runtime check not possible -> not vectorized:
void my_combine(int* ioff, int nx, double* a, double* b, double* c) { for (int i = 0; i < nx; i++) { a[i] = b[i] + c[i + *ioff]; } }
Would vectorize under strict aliasing (-ansi-alias)

23 Inhibitors of Auto Vectorization
Vector dependencies:
- Read after write (RAW): not vectorizable: for (i = 1; i < N; i++) a[i] = a[i - 1] + b[i];
- Write after read (WAR): vectorizable: for (i = 0; i < N - 1; i++) a[i] = a[i + 1] + b[i];
- Read after read (RAR): vectorizable: for (i = 0; i < N; i++) a[i] = b[i % M] + c[i];
- Write after write (WAW): not vectorizable: for (i = 0; i < N; i++) a[i % M] = b[i] + c[i];

24 Efficiency Aspects for Auto Vectorization Alignment
- Address alignment (SSE: 16 bytes, AVX: 32 bytes)
- Alignment known at compile time -> aligned accesses; otherwise checked at runtime -> peel and remainder loops
- Explicit definition in code: local/global: __attribute__((aligned(16))); heap: _mm_malloc, _mm_free; compiler support: __assume_aligned(x, 16)
Example: void fill(char* x) { for (int i = 0; i < 1024; i++) x[i] = 1; }
Peeling: peel = x & 0x0f; if (peel != 0) { peel = 16 - peel; for (i = 0; i < peel; i++) x[i] = 1; } for (i = peel; i < 1024; i++) x[i] = 1;

25 Efficiency Aspects for Auto Vectorization Data Layout
Structure of Arrays (SoA) instead of Array of Structures (AoS):
struct Vector3d { double x; double y; double z; }; // AoS -> memory layout x y z x y z x y z x y z
struct Vectors3d { double* x; double* y; double* z; }; // SoA -> memory layout x x x x y y y y z z z z

26 Example 6 Non-Contiguous Data
1. Go to the directory example_6.
2. Compile contig.cpp with the vectorization report enabled: icc -std=c++11 -O2 -vec-report=2 contig.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed contig.cpp)
3. What does the vectorization report tell you?

27 Compiler Directives ICC
Many compilers have directives for vectorization hints:
- ivdep (C: #pragma ivdep, Fortran: !dec$ ivdep): asserts there is no vector dependency in the loop (using it despite an existing dependency leads to incorrect code)
- vector always (C: #pragma vector always, Fortran: !dec$ vector always)
- Elemental functions: __attribute__((vector))

28 Compiler Directives ICC SIMD Extensions
- SIMD pragma (#pragma simd) and SIMD function annotation (__attribute__((simd)))
- Difference to the traditional ivdep and vector always: the traditional pragmas are hints, the new SIMD extensions are more like assertions
- Fine control over auto vectorization with additional clauses (vectorlength, private, linear, reduction, assert)

29 Example 7 Vector Dependency
1. Go to the directory example_7.
2. Compile the file forward.cpp and execute the binary: icc -std=c++11 -O2 -vec-report=2 forward.cpp (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec forward.cpp)
3. What is the execution time? What does the vectorization report tell?
4. What happens if you split the inner loop into two separate update loops for b and a?

30 Vector Intrinsics

31 Vector Intrinsics
void vec_eltwise_product_avx(vec_t* a, vec_t* b, vec_t* c) {
  size_t i;
  __m256 va, vb, vc;
  for (i = 0; i < a->size; i += 8) {
    va = _mm256_loadu_ps(&a->data[i]);
    vb = _mm256_loadu_ps(&b->data[i]);
    vc = _mm256_mul_ps(va, vb);
    _mm256_storeu_ps(&c->data[i], vc);
  }
}

32 Vector Intrinsics
- Different data types for different architectures: SSE: __m128, __m128d, __m128i; AVX: __m256, __m256d, __m256i
- Different operations for different architectures: SSE: _mm_add_ps(), _mm_add_pd(), ...; AVX: _mm256_add_ps(), _mm256_add_pd(), ...
- Portability: implement a wrapper for the different architectures

33 Cilk Plus Array Notation

34 Cilk Plus Array Notation
- C/C++ language extension with support for data parallel operations
- New array expression syntax: array-expression[lower-bound : length : stride]
- Default values of the arguments in [:]: lower-bound: 0; length: length of the array; stride: 1 (with the default stride, the second : may be omitted)
- array-expression[:] is an array section covering the entire array (of known length) with stride 1

35 Cilk Plus Array Notation
- array-expression[:][:] denotes a two-dimensional array section
- Two new terms: rank (number of array sections of a single array; rank zero = scalar) and shape (length of each array section)
- In a statement, all expressions must have the same rank and shape, or rank zero (a scalar is broadcast to each element)
- Built-in functions (reductions, ...)
- Overlap between LHS and RHS array expressions is undefined behaviour (unless they overlap exactly)

36 Cilk Plus Array Notation
- No new data types (uses the existing array types in C and C++)
- Short-hand for an entire array: array[:]; exception: dynamically allocated arrays need array[start : length]
- Availability: Intel C/C++ compiler, GCC 5, Clang/LLVM fork
- Declarations used in the following examples: int a[10]; int b[10]; int c[10][10]; int d[10];

37 Examples
Cilk array notation -> scalar C/C++ code:
- a[:] = 5; -> for (i = 0; i < 10; i++) a[i] = 5;
- a[0:7] = 5; a[7:3] = 4; -> for (i = 0; i < 7; i++) a[i] = 5; for (i = 7; i < (7 + 3); i++) a[i] = 4;
- a[0:5:2] = 5; a[1:5:2] = 4; -> for (i = 0; i < 10; i += 2) a[i] = 5; for (i = 1; i < 10; i += 2) a[i] = 4;
- a[:] = b[:]; -> for (i = 0; i < 10; i++) a[i] = b[i];
- a[:] = b[:] + 5; -> for (i = 0; i < 10; i++) a[i] = b[i] + 5;

38 Examples
Cilk array notation -> scalar C/C++ code:
- d[:] = a[:] + b[:]; -> for (i = 0; i < 10; i++) d[i] = a[i] + b[i];
- a[0:n] = 5; -> for (i = 0; i < n; i++) a[i] = 5;
- c[:][:] = 12; -> for (i = 0; i < 10; i++) for (j = 0; j < 10; j++) c[i][j] = 12;
- c[0:5:2][:] = 12; -> for (i = 0; i < 10; i += 2) for (j = 0; j < 10; j++) c[i][j] = 12;
- c[4][:] = a[:]; -> for (j = 0; j < 10; j++) c[4][j] = a[j];
- func(a[:]); -> for (i = 0; i < 10; i++) func(a[i]);
- d[:] = a[b[:]]; -> for (i = 0; i < 10; i++) d[i] = a[b[i]];
- a[b[:]] = d[:]; -> for (i = 0; i < 10; i++) a[b[i]] = d[i];

39 Examples
Cilk array notation -> scalar C/C++ code:
- if (5 == a[:]) b[:] = 1; else b[:] = 0; -> for (i = 0; i < 10; i++) { if (5 == a[i]) b[i] = 1; else b[i] = 0; }
- if ((5 == a[:]) || (8 == a[:])) b[:] = 1; else b[:] = 0; -> for (i = 0; i < 10; i++) { if ((5 == a[i]) || (8 == a[i])) b[i] = 1; else b[i] = 0; }
- a[:] = b[:] < 5 ? b[:] : a[:]; -> for (i = 0; i < 10; i++) a[i] = b[i] < 5 ? b[i] : a[i];

40 Functions
- A scalar function is applied to all elements of an array section (element type overloading in C++), e.g. a[:] = sin(b[:]);
- The compiler may use a vectorized version of the function: a built-in function, or a user-defined SIMD-enabled (elemental) function: __attribute__((vector))

41 Builtin Functions
- __sec_reduce_add(a[:]): scalar sum of all the elements in the array section
- __sec_reduce_mul(a[:]): scalar product of all the elements in the array section
- __sec_reduce_max(a[:]): largest element in the array section
- __sec_reduce_min(a[:]): smallest element in the array section
- __sec_reduce_max_ind(a[:]): integer index of the largest element in the array section
- __sec_reduce_min_ind(a[:]): integer index of the smallest element in the array section
- __sec_reduce_all_zero(a[:]): 1 if all elements of the array section are zero, else 0
- __sec_reduce_all_nonzero(a[:]): 1 if all elements of the array section are non-zero, else 0
- __sec_reduce_any_zero(a[:]): 1 if any element of the array section is zero, else 0
- __sec_reduce_any_nonzero(a[:]): 1 if any element of the array section is non-zero, else 0

42 Example 8 Minimum Distance Computation

43 Example 8 Minimum Distance Computation
1. Go to the directory example_8.
2. Compile the files BinarySTLReader.cpp and distance.cpp: icc -O2 -openmp BinarySTLReader.cpp distance.cpp
3. Execute the binary. What are the execution times?
4. Enable OpenMP support (comment it back in). Compile and execute again. What are the execution times?

44 Portability

45 Portability
- Initial situation: auto vectorization is enabled by default (Intel); default instruction set is SSE2 (-msse2)
- Goal: optimal performance on the target machine using the latest instruction set (-xavx)
- Problem: illegal instruction exception on older hardware
- Solution: create several binaries; better: use CPU dispatch

46 ICC CPU Dispatch
- Generation of multiple code paths via the compiler switch -ax
- Base line path: the other switches (e.g. -O3) apply to it; specified via -x or -m (default -msse2)
- Alternative path: specified via -ax (e.g. -axavx)
- Path selection happens at runtime based on the executing CPU

47 Manual CPU Dispatch
Usage of __attribute__((cpu_dispatch(cpuid, ...))). Example:
__attribute__((cpu_dispatch(generic, future_cpu_16))) void dispatch_func() {}
__attribute__((cpu_specific(generic))) void dispatch_func() { /* code for generic */ }
__attribute__((cpu_specific(future_cpu_16))) void dispatch_func() { /* code for future_cpu_16 */ }
int main() { dispatch_func(); }

48 Conclusion
- Vectorization can improve the runtime performance of floating point intensive loops (increases flops/watt)
- Auto vectorization: many factors inhibit it (some, like scatter/gather, do not apply to certain processors); the vectorization report helps to identify (non-)vectorized code; if one compiler fails, another may succeed; the Intel compiler can provide hints for code modifications
- Explicit vectorization: vector intrinsics are an abstraction over vector assembly; Cilk Plus array notation is a good abstraction for data parallelism

49 Thank You! (Photo: Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn, Valais, Switzerland)


Programming Methods for the Pentium III Processor s Streaming SIMD Extensions Using the VTune Performance Enhancement Environment Programming Methods for the Pentium III Processor s Streaming SIMD Extensions Using the VTune Performance Enhancement Environment Joe H. Wolf III, Microprocessor Products Group, Intel Corporation Index

More information

Code Quality Analyzer (CQA)

Code Quality Analyzer (CQA) Code Quality Analyzer (CQA) CQA for Intel 64 architectures Version 1.5b October 2018 www.maqao.org 1 1 Introduction MAQAO-CQA (MAQAO Code Quality Analyzer) is the MAQAO module addressing the code quality

More information

PERFORMANCE OPTIMISATION

PERFORMANCE OPTIMISATION PERFORMANCE OPTIMISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Hardware design Image from Colfax training material Pipeline Simple five stage pipeline: 1. Instruction fetch get instruction

More information

Lecture 3. Vectorization Memory system optimizations Performance Characterization

Lecture 3. Vectorization Memory system optimizations Performance Characterization Lecture 3 Vectorization Memory system optimizations Performance Characterization Announcements Submit NERSC form Login to lilliput.ucsd.edu using your AD credentials? Scott B. Baden /CSE 260/ Winter 2014

More information

SIMD Exploitation in (JIT) Compilers

SIMD Exploitation in (JIT) Compilers SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input

More information

Introduction to tuning on many core platforms. Gilles Gouaillardet RIST

Introduction to tuning on many core platforms. Gilles Gouaillardet RIST Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions Why do we need

More information

C Language Constructs for Parallel Programming

C Language Constructs for Parallel Programming C Language Constructs for Parallel Programming Robert Geva 5/17/13 1 Cilk Plus Parallel tasks Easy to learn: 3 keywords Tasks, not threads Load balancing Hyper Objects Array notations Elemental Functions

More information

Vectorization with Haswell. and CilkPlus. August Author: Fumero Alfonso, Juan José. Supervisor: Nowak, Andrzej

Vectorization with Haswell. and CilkPlus. August Author: Fumero Alfonso, Juan José. Supervisor: Nowak, Andrzej Vectorization with Haswell and CilkPlus August 2013 Author: Fumero Alfonso, Juan José Supervisor: Nowak, Andrzej CERN openlab Summer Student Report 2013 Project Specification This project concerns the

More information

CS4961 Parallel Programming. Lecture 9: Task Parallelism in OpenMP 9/22/09. Administrative. Mary Hall September 22, 2009.

CS4961 Parallel Programming. Lecture 9: Task Parallelism in OpenMP 9/22/09. Administrative. Mary Hall September 22, 2009. Parallel Programming Lecture 9: Task Parallelism in OpenMP Administrative Programming assignment 1 is posted (after class) Due, Tuesday, September 22 before class - Use the handin program on the CADE machines

More information

Program Optimization Through Loop Vectorization

Program Optimization Through Loop Vectorization Program Optimization Through Loop Vectorization María Garzarán, Saeed Maleki William Gropp and David Padua Department of Computer Science University of Illinois at Urbana-Champaign Program Optimization

More information

What s New August 2015

What s New August 2015 What s New August 2015 Significant New Features New Directory Structure OpenMP* 4.1 Extensions C11 Standard Support More C++14 Standard Support Fortran 2008 Submodules and IMPURE ELEMENTAL Further C Interoperability

More information

Performance Analysis and Optimization MAQAO Tool

Performance Analysis and Optimization MAQAO Tool Performance Analysis and Optimization MAQAO Tool Andrés S. CHARIF-RUBIAL Emmanuel OSERET {achar,emmanuel.oseret}@exascale-computing.eu Exascale Computing Research 11th VI-HPS Tuning Workshop MAQAO Tool

More information

Compilers and optimization techniques. Gabriele Fatigati - Supercomputing Group

Compilers and optimization techniques. Gabriele Fatigati - Supercomputing Group Compilers and optimization techniques Gabriele Fatigati - g.fatigati@cineca.it Supercomputing Group The compilation is the process by which a high-level code is converted to machine languages. Born to

More information

Improving graphics processing performance using Intel Cilk Plus

Improving graphics processing performance using Intel Cilk Plus Improving graphics processing performance using Intel Cilk Plus Introduction Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

OpenCL Vectorising Features. Andreas Beckmann

OpenCL Vectorising Features. Andreas Beckmann Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels

More information

Intel Array Building Blocks (Intel ArBB) Technical Presentation

Intel Array Building Blocks (Intel ArBB) Technical Presentation Intel Array Building Blocks (Intel ArBB) Technical Presentation Copyright 2010, Intel Corporation. All rights reserved. 1 Noah Clemons Software And Services Group Developer Products Division Performance

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

Technical Report. Research Lab: LERIA

Technical Report. Research Lab: LERIA Technical Report Improvement of Fitch function for Maximum Parsimony in Phylogenetic Reconstruction with Intel AVX2 assembler instructions Research Lab: LERIA TR20130624-1 Version 1.0 24 June 2013 JEAN-MICHEL

More information

CS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010

CS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010 Parallel Programming Lecture 7: Introduction to SIMD Mary Hall September 14, 2010 Homework 2, Due Friday, Sept. 10, 11:59 PM To submit your homework: - Submit a PDF file - Use the handin program on the

More information

Boundary element quadrature schemes for multi- and many-core architectures

Boundary element quadrature schemes for multi- and many-core architectures Boundary element quadrature schemes for multi- and many-core architectures Jan Zapletal, Michal Merta, Lukáš Malý IT4Innovations, Dept. of Applied Mathematics VŠB-TU Ostrava jan.zapletal@vsb.cz Intel MIC

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Improving performance of the N-Body problem

Improving performance of the N-Body problem Improving performance of the N-Body problem Efim Sergeev Senior Software Engineer at Singularis Lab LLC Contents Theory Naive version Memory layout optimization Cache Blocking Techniques Data Alignment

More information

Code optimization in a 3D diffusion model

Code optimization in a 3D diffusion model Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18 th 2016, Barcelona Agenda Background Diffusion

More information

Dan Stafford, Justine Bonnot

Dan Stafford, Justine Bonnot Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing

More information

Optimising for the p690 memory system

Optimising for the p690 memory system Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor

More information

CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions

CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions Randal E. Bryant David R. O Hallaron January 14, 2016 Notice The material in this document is supplementary material to

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting

More information

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization

More information

Allows program to be incrementally parallelized

Allows program to be incrementally parallelized Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP

More information

Intel Software and Services, Kirill Rogozhin

Intel Software and Services, Kirill Rogozhin Intel Software and Services, 2016 Kirill Rogozhin Agenda Motivation for vectorization OpenMP 4.x programming model Intel Advisor: focus and characterize Enable vectorization on scalar code Speed-up your

More information

Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.

Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.

More information

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Work started in April 2013, C/C++ support with host fallback only

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions

More information

Intel Compilers for C/C++ and Fortran

Intel Compilers for C/C++ and Fortran Intel Compilers for C/C++ and Fortran Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Important Optimization Options for HPC High Level Optimizations (HLO) Pragmas Interprocedural

More information

Masterpraktikum Scientific Computing

Masterpraktikum Scientific Computing Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Logins Levels of Parallelism Single Processor Systems Von-Neumann-Principle

More information

Cilk Plus in GCC. GNU Tools Cauldron Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation

Cilk Plus in GCC. GNU Tools Cauldron Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation Cilk Plus in GCC GNU Tools Cauldron 2012 Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation July 10, 2012 Presentation Outline Introduction Cilk Plus components Implementation GCC Project Status

More information

CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions

CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions CS:APP3e Web Aside OPT:SIMD: Achieving Greater Parallelism with SIMD Instructions Randal E. Bryant David R. O Hallaron October 12, 2015 Notice The material in this document is supplementary material to

More information

OPENMP FOR ACCELERATORS

OPENMP FOR ACCELERATORS 7th International Workshop on OpenMP Chicago, Illinois, USA James C. Beyer, Eric J. Stotzer, Alistair Hart, and Bronis R. de Supinski OPENMP FOR ACCELERATORS Accelerator programming Why a new model? There

More information

PRACE PATC Course: Vectorisation & Basic Performance Overview. Ostrava,

PRACE PATC Course: Vectorisation & Basic Performance Overview. Ostrava, PRACE PATC Course: Vectorisation & Basic Performance Overview Ostrava, 7-8.2.2017 1 Agenda Basic Vectorisation & SIMD Instructions IMCI Vector Extension Intel compiler flags Hands-on Intel Tool VTune Amplifier

More information

What s P. Thierry

What s P. Thierry What s new@intel P. Thierry Principal Engineer, Intel Corp philippe.thierry@intel.com CPU trend Memory update Software Characterization in 30 mn 10 000 feet view CPU : Range of few TF/s and

More information

Advanced programming with OpenMP. Libor Bukata a Jan Dvořák

Advanced programming with OpenMP. Libor Bukata a Jan Dvořák Advanced programming with OpenMP Libor Bukata a Jan Dvořák Programme of the lab OpenMP Tasks parallel merge sort, parallel evaluation of expressions OpenMP SIMD parallel integration to calculate π User-defined

More information

MAQAO Hands-on exercises FROGGY Cluster

MAQAO Hands-on exercises FROGGY Cluster MAQAO Hands-on exercises FROGGY Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/projects/pr-vi-hps-tw18/tutorial/maqao.tar.bz2

More information

Parallel processing with OpenMP. #pragma omp

Parallel processing with OpenMP. #pragma omp Parallel processing with OpenMP #pragma omp 1 Bit-level parallelism long words Instruction-level parallelism automatic SIMD: vector instructions vector types Multiple threads OpenMP GPU CUDA GPU + CPU

More information

Native Computing and Optimization. Hang Liu December 4 th, 2013

Native Computing and Optimization. Hang Liu December 4 th, 2013 Native Computing and Optimization Hang Liu December 4 th, 2013 Overview Why run native? What is a native application? Building a native application Running a native application Setting affinity and pinning

More information

Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition

Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition Aaron Birkland Consultant Cornell Center for Advanced Computing December 11, 2012 1 Simple Vectorization This lab serves as an

More information

OpenMP 4.5: Threading, vectorization & offloading

OpenMP 4.5: Threading, vectorization & offloading OpenMP 4.5: Threading, vectorization & offloading Michal Merta michal.merta@vsb.cz 2nd of March 2018 Agenda Introduction The Basics OpenMP Tasks Vectorization with OpenMP 4.x Offloading to Accelerators

More information

Dynamic SIMD Scheduling

Dynamic SIMD Scheduling Dynamic SIMD Scheduling Florian Wende SC15 MIC Tuning BoF November 18 th, 2015 Zuse Institute Berlin time Dynamic Work Assignment: The Idea Irregular SIMD execution Caused by branching: control flow varies

More information

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome

More information

Cilk Plus GETTING STARTED

Cilk Plus GETTING STARTED Cilk Plus GETTING STARTED Overview Fundamentals of Cilk Plus Hyperobjects Compiler Support Case Study 3/17/2015 CHRIS SZALWINSKI 2 Fundamentals of Cilk Plus Terminology Execution Model Language Extensions

More information

An Intel Xeon Phi Backend for the ExaStencils Code Generator

An Intel Xeon Phi Backend for the ExaStencils Code Generator Bachelor thesis An Intel Xeon Phi Backend for the ExaStencils Code Generator Thomas Lang Supervisor: Tutor: Prof. Christian Lengauer, Ph.D. Dr. Armin Größlinger 27th April 2016 Abstract Stencil computations

More information

Double-precision General Matrix Multiply (DGEMM)

Double-precision General Matrix Multiply (DGEMM) Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply

More information

Compiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005

Compiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compiling for Performance on hp OpenVMS I64 Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compilers discussed C, Fortran, [COBOL, Pascal, BASIC] Share GEM optimizer

More information

SIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016

SIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016 SIMD Instructions outside and inside Oracle 2c Laurent Léturgez 206 Whoami Oracle Consultant since 200 Former developer (C, Java, perl, PL/SQL) Owner@Premiseo: Data Management on Premise and in the Cloud

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information