Introduction to tuning on many core platforms. Gilles Gouaillardet RIST


Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp

Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions

Why do we need many core platforms?

Why many-core platforms? Von Neumann architecture (1945). Moore's law (1965): the number of transistors in a dense integrated circuit doubles approximately every two years.

CPU trends over the past 40 years (source: AMD, 2011)

The free lunch is over Back in the day, Moore's law meant the processor frequency doubled every two years, and memory bandwidth increased too: upgrading the hardware was enough to increase application performance. Processor frequency cannot increase any more (physical, power and thermal constraints). More transistors now means more, and more capable, cores. Memory bandwidth and cache size keep increasing, but memory bandwidth and cache size per core tend to decrease.

The road to exascale An exascale supercomputer built from evolutionary technology would be too expensive and too power hungry (> 100 MW, whereas < 20 MW is considered acceptable). Disruptive technology is needed: FPGA? GPU? Many core?

Disruptive technology is needed for exascale A good compromise between absolute performance, performance per Watt, cost and programmability is required. Many core is a popular option, and easy programming is an important factor: Sunway TaihuLight (RISC - #1 on Top500), KNL (x86_64 - #5 and #6 on Top500), Post-K (ARMv8).

Challenges More, and more capable, but slower cores. A good algorithm is critical. Parallelism is no longer optional: instruction level parallelism (vectorization), block level parallelism (MPI and/or OpenMP). Hierarchical memory. Hyperthreading. Code modernization is mandatory.

Manycore tuning challenges "With great computational power comes great algorithmic responsibility" - David E. Keyes (KAUST)

Single-thread optimization

Single thread optimization Maximum single-thread performance can only be achieved with optimized code that fully exploits all the hardware features: vectorization (instruction level parallelism), use of all the floating-point units (FMA), maximizing the FLOP-to-byte ratio.

Memory and caches Most architectures are based on cache-coherent memory (some exceptions: Cell and TaihuLight only have scratchpad memory). A read instruction moves a line into the L1 cache, and a write usually places the line in the L1 cache as well (unless non-temporal stores are used). Caches are n-way associative, and a Least Recently Used (LRU) replacement policy is common. Reuse data as much as possible (see the blocking sketch below).
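
A minimal sketch of loop blocking (tiling), one common way to reuse cached data; the matrix transpose and the tile size B are illustrative assumptions, not taken from the slides.

#include <stddef.h>

#define N 4096
#define B 64    /* tile size, tune so a tile fits in L1/L2 */

/* blocked transpose: each BxB tile of a and b stays cache-resident
   while it is being touched, instead of streaming whole rows */
void transpose_blocked(double a[N][N], double b[N][N])
{
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = ii; i < ii + B; i++)
                for (size_t j = jj; j < jj + B; j++)
                    b[j][i] = a[i][j];
}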

KNL memory configuration

Vectorization Scalar (one instruction produces one result) SIMD processing (one instruction can produce multiple results)
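
A minimal sketch of the same addition written scalar and with AVX-512 intrinsics, assuming a KNL-class CPU (compile with AVX-512 support) and N divisible by 8; the function names are illustrative.

#include <immintrin.h>

#define N 1024

/* scalar: one addition per instruction */
void add_scalar(double *a, const double *b, const double *c) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* SIMD: one 512-bit instruction adds 8 doubles at a time */
void add_simd(double *a, const double *b, const double *c) {
    for (int i = 0; i < N; i += 8) {
        __m512d vb = _mm512_loadu_pd(&b[i]);
        __m512d vc = _mm512_loadu_pd(&c[i]);
        _mm512_storeu_pd(&a[i], _mm512_add_pd(vb, vc));
    }
}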

When to vectorize? Vectorization can be seen as instruction level parallelism. A loop-carried dependence prevents vectorization.

#define N 1024
double a[N][N];
/* loop-carried dependence in the j loop, no loop-carried dependence in the i loop */
for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    a[i][j] = a[i][j-1] + 1;

#define N 1024
double a[N][N];
/* no loop-carried dependence in the j loop, loop-carried dependence in the i loop */
for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    a[i][j] = a[i-1][j] + 1;

Compilers are conservative If there might be a dependence, the compiler will not vectorize. Compiler-generated vectorization reports are very valuable. The developer knows best: if there is no dependence, tell the compiler, either with a compiler-specific pragma or with the standard OpenMP 4 simd directive (see the sketch below).
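
A minimal sketch of the OpenMP 4 simd directive, assuming the code is built with OpenMP (SIMD) support enabled; the function and array names are illustrative.

#define N 1024

/* the developer asserts there is no loop-carried dependence,
   so the compiler may vectorize without generating runtime checks */
void scale(double *restrict a, const double *restrict b) {
    #pragma omp simd
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
}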

Vectorized loop internals Aligned memory accesses are more efficient. A vectorized loop is made of 3 phases: peeling (scalar), so that aligned accesses can be used in the main body; the main body (vectorized), with only aligned accesses; and the remainder (scalar), aligned accesses but not a full vector. Align data and inform the compiler to get rid of loop peeling (directive or compiler option), as sketched below.
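
One possible way to remove loop peeling, assuming C11 and OpenMP: allocate 64-byte aligned data and declare the alignment with the OpenMP aligned clause. The sizes and names here are illustrative.

#include <stdlib.h>

#define N 1024

void add_one(double *a, const double *b) {
    /* promise the compiler both pointers are 64-byte aligned,
       so no scalar peeling loop is needed */
    #pragma omp simd aligned(a, b : 64)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 1.0;
}

int main(void) {
    /* C11 aligned allocation; compiler attributes or options work too */
    double *a = aligned_alloc(64, N * sizeof(double));
    double *b = aligned_alloc(64, N * sizeof(double));
    for (int i = 0; i < N; i++) b[i] = (double)i;
    add_one(a, b);
    free(a); free(b);
    return 0;
}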

Vectorized loop internals

#define N 1020
double a[N], b[N], c[N];
for (int i=0; i<N; i++)
  a[i] = b[i] + c[i];
With 512-bit vectors, bump N to 1024 to get rid of the remainder.

integer, parameter :: N=1020, M=512
double precision :: a(N,M), b(N,M), c(N,M)
do j=1,M
  do i=1,N
    a(i,j) = b(i,j) + c(i,j)
  end do
end do
With 512-bit vectors, bump N to 1024 to get rid of the remainder in the innermost loop.

#define N 1020
double a[N], b[N];
double sum = 0;
for (int i=0; i<N; i++)
  sum += a[i] * b[i];
With 512-bit vectors, bump N to 1024 and zero a[1020:1023] and b[1020:1023] to get rid of the remainder.

AoS vs SoA As taught in Object Oriented Programming classes, the common data layout is Array of Structures (AoS).

#define N 1024
typedef struct {
  double x;
  double y;
  double z;
} point;
point p[N];
/* x-translation */
for (int i=0; i<N; i++)
  p[i].x += 1.0;

Strided access: 1 cache line contains only 2 or 3 useful doubles out of 8, and a vector uses data from 3 cache lines.

AoS vs SoA The optimized data layout is Structure of Arrays (SoA).

#define N 1024
typedef struct {
  double x[N];
  double y[N];
  double z[N];
} points;
points p;
/* x-translation */
for (int i=0; i<N; i++)
  p.x[i] += 1.0;

Streaming access: 1 cache line contains 8 useful doubles.

Indirect access

#define N 1024
double * a;
int * indices;
for (int i=0; i<N; i++)
  a[indices[i]] += 1.0;

Data must be gathered/scattered from/to several cache lines. Hardware support is available. In the general case, vectorization is incorrect!

Indirect access and collisions

#define N 1024
double * a;
int * indices;
for (int i=0; i<N; i++)
  a[indices[i]] += 1.0;

If no conflict can occur, then the compiler must be informed that vectorization is safe (see the sketch below). Hardware support might be available for collision detection (AVX-512 CD).
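
A sketch of one way to assert that the indices never collide, using the portable OpenMP simd directive (a compiler-specific pragma such as Intel's ivdep is an alternative); the function name is illustrative.

#define N 1024

/* safe only if the caller guarantees indices[] contains no duplicates */
void scatter_add(double *a, const int *indices) {
    #pragma omp simd
    for (int i = 0; i < N; i++)
        a[indices[i]] += 1.0;
}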

KNL High Bandwidth Memory KNL comes with 16 GB of MCDRAM / High Bandwidth Memory (HBM). The memory mode can be selected at boot time (and ideally on a per-job basis). Flat: HBM is exposed as a separate NUMA node next to standard memory. Cache: the application sees only standard memory, HBM is transparently used as an L3 cache. Hybrid: one portion is used as cache, the other as scratchpad.

KNL flat mode

Command line:
# use only HBM
$ numactl -m 1 a.out
# try HBM first, fall back to standard memory
$ numactl --preferred=1 a.out

Autohbw library:
$ export AUTO_HBW_SIZE=min_size[:max_size]
$ LD_PRELOAD=libautohbw.so a.out

Fortran directive:
real(r8), allocatable :: a(:)
!dir$ attributes fastmem, align:64 :: a
allocate(a(n))

Memkind library:
#include <hbwmalloc.h>
double * d = hbw_malloc(n*sizeof(double));
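
A small self-contained sketch of the memkind/hbwmalloc path, assuming the memkind library is installed and the program is linked with -lmemkind; the size N and the fallback policy are illustrative choices.

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

#define N (1 << 20)

int main(void) {
    /* hbw_check_available() returns 0 when high bandwidth memory is usable */
    int use_hbw = (hbw_check_available() == 0);
    double *d = use_hbw ? hbw_malloc(N * sizeof(double))
                        : malloc(N * sizeof(double));   /* fall back to DDR */
    if (d == NULL) return 1;
    for (int i = 0; i < N; i++) d[i] = 1.0;
    printf("allocated %s memory, d[0] = %f\n", use_hbw ? "HBM" : "standard", d[0]);
    if (use_hbw) hbw_free(d); else free(d);
    return 0;
}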

Arithmetic intensity Arithmetic intensity is the ratio of FLOP performed per byte moved to/from memory. Arithmetic intensity is determined by the algorithm, but it can be influenced by the dataset.
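
As a worked illustration (the kernel and the byte counting are assumptions for double precision data coming from memory), a daxpy-style update has very low arithmetic intensity, which already suggests it will be memory-bound on any roofline.

/* y[i] = y[i] + alpha * x[i]
   per iteration: 2 FLOP (one multiply, one add, often fused into one FMA)
   bytes moved:   load x[i] (8) + load y[i] (8) + store y[i] (8) = 24 bytes
   arithmetic intensity ~ 2 / 24 = 1/12 FLOP/byte  -> memory-bound */
void daxpy(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}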

Roofline analysis Kernels with low arithmetic intensity are memory-bound; kernels with high arithmetic intensity are compute-bound. Roofline analysis is a visual method to find out how well a kernel is performing. There are several roofs: memory type (L1 / L2 / HBM / standard memory) and CPU features (no vectorization, vectorization, vectorization + FMA).

Roofline analysis w/ Intel Advisor 2017 The roofline model has to be built once per processor (vendor knows best). The process is fully automated from Advisor 2017 Update 2 onwards.

False sharing Different data from the same cache line is used on different processors, and excessive time is spent ensuring cache coherency. Re-organize data so that per-thread data does not share any cache line (padding).

double local_value[num_threads];
#pragma omp parallel
{
  int me = omp_get_thread_num();
  local_value[me] = func(me);
}

typedef struct {
  double value;
  double padding[7]; /* pad each element to a full 64-byte cache line */
} local_data;

n-way associative cache Since the cache is not fully associative, part of the cache might remain unused, for example when arrays with large power-of-two strides all map to the same cache sets.

Parallelization

Parallelization The free lunch is over: more cores are needed to keep improving application performance. Many more cores are available, and they must be used effectively to achieve greater performance. Parallelization is now mandatory.

Performance scaling Strong scaling: how the solution time varies with the number of processors for a fixed total problem size. Ideally, adding processors decreases the time to solution. Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor. Ideally, the time to solution remains constant.

Amdahl's law (strong scaling) S_latency(s) = 1 / ((1 - p) + p / s), where S_latency is the theoretical speedup of the execution of the whole task; s is the speedup of the part of the task that benefits from improved system resources; and p is the proportion of execution time that the part benefiting from improved resources originally occupied.
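
A quick worked example, assuming a code whose parallelizable part accounts for 95 % of the run time (p = 0.95):

S_latency(s -> infinity) = 1 / (1 - p) = 1 / (1 - 0.95) = 20

Even with an unlimited number of cores, the speedup cannot exceed 20x.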

Gustafson's law (weak scaling) S_latency(s) = 1 - p + s * p, where S_latency is the theoretical speedup in latency of the execution of the whole task; s is the speedup in latency of the execution of the part of the task that benefits from the improvement of the resources of the system; and p is the fraction of the execution workload of the whole task, before the improvement, that corresponds to the part benefiting from the improved resources.
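
The corresponding weak-scaling view, with the same assumed p = 0.95 and s = 64 cores:

S_latency(64) = 1 - p + s * p = 0.05 + 64 * 0.95 = 60.85

so the scaled problem runs roughly 61 times faster than it would on a single core.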

At a glance Amdahl's law: we are fundamentally limited by the serial fraction. Gustafson's law: we need larger problems for larger numbers of CPUs; whilst we are still limited by the serial fraction, it becomes less important.

Parallelization models MPI is the de facto standard for inter-process communication. Flat MPI is sometimes a good option, but beware of the memory and wire-up overhead. MPI+X is the general paradigm: X handles intra-node communication and can be OpenMP, Pthreads, PGAS (OpenSHMEM, Co-arrays, ...) or even MPI (!).

OpenMP OpenMP is a common parallelization paradigm used on shared memory nodes. OpenMP is a set of directives to enable parallelization.

#define N 1024
double a[N], b[N], c[N];
for (int i=0; i<N; i++)
  a[i] = b[i] + c[i];

#define N 1024
double a[N], b[N], c[N];
#pragma omp parallel for
for (int i=0; i<N; i++)
  a[i] = b[i] + c[i];
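
For completeness, one possible compile-and-run sequence for the loop above; the compiler, flags and thread count are examples, not requirements.

$ gcc -O2 -fopenmp example.c -o example
$ OMP_NUM_THREADS=68 ./example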

OpenMP limitations Most OpenMP parallelization focuses only on loops, which is natural when iterations are independent. OpenMP has an overhead (thread creation, synchronization, reduction). OpenMP was designed when memory was flat; today NUMA is very common, and NUMA makes performance models hard to build. OpenMP is generally best kept within a NUMA node. MPI communication within an OpenMP region is not natural.

Hyperthreading A KNL core provides 4 hardware threads. Hardware threads share resources (cache, FPU, ...). When a hardware thread is waiting for memory, another hardware thread can be scheduled to perform some computation. 2 hardware threads are enough to achieve maximum performance (4 on KNC). Best is to experiment with 1 and 2 threads per core and choose the fastest option (see the sketch below).
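
A sketch of how the two configurations could be pinned with standard OpenMP environment variables, assuming 64 cores are dedicated to the application; the counts are illustrative.

# 1 thread per core
$ export OMP_NUM_THREADS=64
$ export OMP_PLACES=cores OMP_PROC_BIND=close
$ ./a.out

# 2 threads per core
$ export OMP_NUM_THREADS=128
$ export OMP_PLACES=threads OMP_PROC_BIND=close
$ ./a.out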

Intel Thread Advisor The suitability report gives a speed-up estimate.

Intel Thread Advisor Dependency analysis helps predict parallel data sharing problems (very slow, and only on annotated loops).

KNL cluster modes The KNL cluster mode is selected at boot time (BIOS parameter). The cluster mode influences the cache coherency wiring and the topology presented to the application. Commonly used modes are: All-to-all, Quadrant, SNC-4, SNC-2. Using the most appropriate mode is critical to achieve optimal performance. The best mode depends on the application and the parallelization model. Fortunately, the KNL cluster mode can be selected on a per-job basis.

KNL Cluster modes All2all/Quadrant (1 socket, 68 cores) SNC2 (2 sockets, 34+34 cores) SNC4 (4 sockets, 18+18+16+16 cores)

KNL cluster modes Rules of thumb All-to-all: do not use it! SNC-4: 4 NUMA nodes, generally best with a multiple of 4 MPI tasks per node (note 34 tiles do not split evenly into 4 quadrants!). SNC-2: 2 NUMA nodes, generally best with a multiple of 2 MPI tasks per node. Quadrant: flat memory, to be used if SNC-4/2 is not a fit.

Problem sizing on KNL KNL has both DDR4 (standard) and MCDRAM (high bandwidth) memory. MCDRAM can be configured as cache or scratchpad. The impact on performance can be significant. If your app is cache-friendly, weak scale using all available memory. If your app is not cache-friendly, it might be more effective to weak scale using only HBM.

Is your app cache-friendly? In flat mode: run with HBM only, then run with standard memory only (see the commands below). In cache mode: increase the dataset size to use all available memory, and measure the drop in performance (if any) when using all the memory in cache mode.
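
For reference, the two flat-mode runs could look like this, assuming the common quadrant-flat layout where NUMA node 0 is the standard memory and node 1 is the MCDRAM.

# standard (DDR) memory only
$ numactl --membind=0 ./a.out

# HBM (MCDRAM) only
$ numactl --membind=1 ./a.out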

Conclusions

Conclusions Moore's law is still valid. It used to mean a free lunch; it now means more cores and more complex architectures, and that comes with new challenges. Vectorization and parallelization are mandatory. Tools are available to help. We can help you too!

Questions? Regarding this presentation: gilles@rist.or.jp. Need help with your research? helpdesk@hpci-office.jp