
Parallel Computing
W. Homberg
November 20, 2017
Member of the Helmholtz Association

Why go parallel? (Slide 2)
- Problem too large for a single node
- Job requires more memory
- Shorter time to solution essential
- Better performance:
  - more instructions per second (compute bound)
  - better memory bandwidth (memory bound)
- Lower operating costs (energy to solution)

Parallel hardware: CPU (Slide 3)
Every CPU is a parallel device (since the Pentium II):
- Vector units (SSE, AVX, ...)
- Independent floating-point and integer units
- Multiple hardware threads per core (2 on JURECA)
- Multiple cores per CPU (12 on JURECA)
- Multiple sockets per node (2 on JURECA)

Von Neumann Architecture (Slide 4)
[Diagram: stored-program computer: central processing unit (control unit + arithmetic logic unit, ALU) connected to a memory unit and to input/output devices. The control unit controls program execution; the ALU carries out the computations.]
Stored-program computer: a single memory bus for both data and instructions.

Computing with real numbers / Enhancements (Slide 5)
[Diagram: the basic CPU (control unit + ALU) is extended with a floating point unit (FPU); memory access is enhanced with a cache hierarchy in front of main memory.]
Key performance metrics:
- Computational power [flops]: floating-point operations (scalar additions or multiplications) per second
- Memory bandwidth [bytes/second]: number of bytes transferred every second
- Latency [seconds]: how quickly the result of an operation (computation/memory access) is available

Going Parallel (I) (Slide 6)
1 Pacman eats 9 ghosts in 3 seconds.
3 Pacmans eat 9 ghosts in 1 second.

Going Parallel (II) (Slide 7)
Levels of hardware parallelism:
- Multiple FPUs/ALUs within a core, operating on vector data types
- Multiple cores within one CPU
- Multiple CPUs within a single compute node (shared memory)
- Multiple compute nodes within a cluster

Computation of Max. Theoretical Performance (R_peak) (Slide 8)
Example: Intel Xeon E5-2600 v3 Haswell CPU (JURECA)
Calculation (single precision, per core):
R_peak = #FPUs (= 2) * #ops per FPU per cycle (= 2) * vector length of the operands (= 8) * processor clock (= 2.5 GHz) = 80 GFLOPS
- 12 processor cores (SP): 960 GFLOPS
- 12 processor cores (DP): 480 GFLOPS
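As a minimal illustrative sketch of this calculation (hypothetical helper code, parameter values taken from the numbers above), the per-core and per-socket peaks can be reproduced as follows:

    #include <stdio.h>

    /* Illustrative only: peak performance from the parameters above. */
    int main(void) {
        double fpus          = 2.0;   /* FMA-capable FPUs per core            */
        double ops_per_cycle = 2.0;   /* operations per FPU per cycle         */
        double vector_length = 8.0;   /* single-precision elements per vector */
        double clock_ghz     = 2.5;   /* processor clock in GHz               */
        int    cores         = 12;

        double rpeak_core_sp = fpus * ops_per_cycle * vector_length * clock_ghz; /* GFLOPS */
        printf("R_peak per core (SP): %6.1f GFLOPS\n", rpeak_core_sp);
        printf("R_peak 12 cores (SP): %6.1f GFLOPS\n", rpeak_core_sp * cores);
        printf("R_peak 12 cores (DP): %6.1f GFLOPS\n", rpeak_core_sp * cores / 2.0); /* DP vectors hold half as many elements */
        return 0;
    }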

JuHYDRA GPU Server (Slide 9)
MEGWARE MiriQuid GPU server, 64 GB memory, peak 16/5 TFlops SP/DP

NVIDIA Tesla K20X (1x Kepler GK110)
- Flops: 3.94 / 1.31 TFlops SP / DP
- Compute units: 14
- Processing elements: 192 / CU
- Total # PEs: 14 x 192 = 2688
- CU frequency: 732 MHz
- Memory: 6 GB (ECC), 384 bit
- Memory frequency: 5.2 GHz
- Memory bandwidth: 250 GB/s
- Power consumption: 235 W

AMD FirePro S10000 (2x Tahiti)
- Flops: 5.91 / 1.48 TFlops SP / DP
- Compute units: 2x 28
- Processing elements: 64 / CU
- Total # PEs: 2 x 28 x 64 = 3584
- CU frequency: 825 MHz
- Memory: 6 GB (ECC), 384 bit
- Memory frequency: 5.0 GHz
- Memory bandwidth: 2x 240 GB/s
- Power consumption: 375 W

Intel Xeon Phi (MIC) Coprocessor 5110P
- Flops: 2.02 / 1.01 TFlops SP / DP
- Compute units: 60 (cores)
- Processing elements: 16 / core
- Total # PEs: 60 x 16 = 960
- Core frequency: 1.053 GHz
- Memory: 8 GB
- Memory bandwidth: 320 GB/s
- Power consumption: 225 W

Intel Xeon E5-2650 Processor (Sandy Bridge)
- Flops: 0.128 / 0.064 TFlops SP / DP
- Compute units: 8 (cores)
- Processing elements: 4 / core
- Total # PEs: 8 x 4 = 32
- Core frequency: 2.0 GHz (2.4 GHz turbo mode)
- Power consumption: 95 W

Heterogeneous systems (Slide 10)
- Different devices within a node (CPU + GPU)
- Different nodes within a cluster
- Different clusters within a grid

Flynn's characterization (Slide 11)
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MIMD: Multiple Instruction, Multiple Data
- SIMT: Single Instruction, Multiple Threads
[Diagram: one instruction stream applied to a single data element vs. to the data elements A, B, C, D in lockstep.]
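As a small, hypothetical illustration of the SIMD idea: the data-parallel loop below can be mapped by an optimising compiler (e.g. with -O3 and AVX enabled) onto vector instructions, so that a single instruction updates several array elements at once.

    #include <stdio.h>
    #include <stddef.h>

    /* One instruction stream, many data elements: compilers typically
     * auto-vectorise this loop so that one AVX instruction processes
     * eight single-precision elements at a time.                      */
    static void saxpy(size_t n, float a, const float *x, float *y) {
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        float x[16], y[16];
        for (int i = 0; i < 16; ++i) { x[i] = (float)i; y[i] = 1.0f; }
        saxpy(16, 2.0f, x, y);
        printf("y[3] = %f\n", y[3]);   /* 2*3 + 1 = 7 */
        return 0;
    }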

Memory (Slide 12)
- Registers (per core)
- L1 cache (per core)
- L2 cache (per core/shared)
- L3 cache (shared)
- Main memory
[Diagram: four cores, each with a private L1 and L2 cache, sharing a common L3 cache.]

Latency and throughput (Slide 13)
- CPU: get your calculations done as quickly as possible (latency-optimised).
- GPU: perform calculations on a lot of data in parallel (throughput-optimised).
[Diagram: CPU with a few ALUs, control logic, cache and DRAM vs. GPU with many ALUs and DRAM.]

GPU Computing (Slide 14)
[Figure: Kepler GK110 die shot (source: NVIDIA); each green square is a single FPU.]
- GPU programming means thinking in large arrays of threads; proper organisation of the threads, including data sharing, is very important.
- Kepler GPU (GK110): each of the roughly 2700 FPUs is available for a different thread. Overall, the GK110 can handle more than 30000 threads simultaneously, and a program can send billions of threads to the GPU.
- Many APIs for many different programming languages are available:
  - CUDA (NVIDIA only; e.g., the runtime C++ API)
  - OpenCL (independent of the hardware platform, also for CPUs)
  - OpenACC (for NVIDIA and AMD GPUs; based on compiler directives like OpenMP)
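To make the directive-based approach concrete, here is a minimal, hypothetical OpenACC sketch (assuming an OpenACC-capable compiler, e.g. nvc -acc or gcc -fopenacc): the pragma asks the compiler to offload the data-parallel loop to the accelerator and to manage the required data transfers.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Offload the loop to the accelerator; x is copied in, y is copied in and back. */
        #pragma acc parallel loop copyin(x) copy(y)
        for (int i = 0; i < N; ++i)
            y[i] = 3.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);   /* 3*1 + 2 = 5 */
        return 0;
    }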

Parallel Computing (I): Amdahl's Law (Slide 15)
- Runtime on a single processor: T(1)
- Runtime on P processors: T(P)
- Speedup: S(P) = T(1) / T(P)
- Serial fraction s: the part of the runtime that cannot be parallelised
- Runtime on P processors (expressed with s): T(P) = s T(1) + (1 - s) T(1) / P
- Amdahl's law: S(P) = 1 / (s + (1 - s) / P) <= 1 / s

Parallel Computing (II): Amdahl's Law (Slide 16)
Using highly parallel computers (and accelerators like GPUs) only makes sense for programs with minimal serial code fractions!
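A quick, hypothetical numerical illustration of this point: even with only 5% serial code the speedup saturates near 20, no matter how many processors are used.

    #include <stdio.h>

    /* Amdahl's law: S(P) = 1 / (s + (1 - s) / P) for serial fraction s. */
    static double amdahl(double s, double p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        double s = 0.05;                         /* 5% serial code */
        int procs[] = {1, 10, 100, 1000, 10000};
        for (int i = 0; i < 5; ++i)
            printf("P = %5d  ->  speedup %.2f\n", procs[i], amdahl(s, procs[i]));
        /* The speedup approaches 1/s = 20 but never exceeds it. */
        return 0;
    }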

Parallel Computing (III): Gustafson-Barsis's Law (Slide 17)
"Speedup should be measured by scaling the problem to the number of processors, not by fixing the problem size."
Gustafson, John L. "Reevaluating Amdahl's law." Communications of the ACM 31.5 (1988): 532-533.
- Amdahl's law: problem size fixed, minimise time-to-solution.
- Gustafson's law: execution time fixed, increase the number of processors.
With serial fraction s of the parallel runtime:
- Runtime on P processors in parallel: s + (1 - s) = 1
- Runtime on 1 (hypothetical) processor in serial: s + P (1 - s)
- Speedup: S(P) = s + P (1 - s) = P - (P - 1) s
A sufficiently large problem can be parallelised efficiently.
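For comparison with Amdahl's law (again a hypothetical illustration): with the same 5% serial fraction the scaled speedup keeps growing with P, because the parallel part of the problem grows along with the machine.

    #include <stdio.h>

    /* Gustafson's law: scaled speedup S(P) = P - (P - 1) * s. */
    static double gustafson(double s, double p) {
        return p - (p - 1.0) * s;
    }

    int main(void) {
        double s = 0.05;
        int procs[] = {10, 100, 1000};
        for (int i = 0; i < 3; ++i)
            printf("P = %4d  ->  scaled speedup %.2f\n", procs[i], gustafson(s, procs[i]));
        return 0;
    }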

Parallel Computing (IV): Gustafson-Barsis's Law (Slide 18)
http://en.wikipedia.org/wiki/gustafson's_law
A sufficiently large problem can be parallelised efficiently.

Issues and pitfalls: race condition (Slide 19)
[Diagram: the blue thread writes variable A while the red thread reads A; the result depends on which access happens first.]
- Avoid if possible.
- Must not do this with different thread blocks.
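A minimal, hypothetical OpenMP sketch of such a race (compile with -fopenmp): all threads update the shared variable sum concurrently, so increments are lost and the printed result varies from run to run; adding reduction(+:sum) to the pragma removes the race.

    #include <stdio.h>

    int main(void) {
        long sum = 0;
        #pragma omp parallel for        /* racy: add reduction(+:sum) to fix it */
        for (long i = 0; i < 1000000; ++i)
            sum += 1;                   /* unsynchronised read-modify-write     */
        printf("sum = %ld (expected 1000000)\n", sum);
        return 0;
    }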

Issues and pitfalls: deadlock (Slide 20)
[Diagram: two threads and two protected variables, A and C.]
- The blue thread acquires a lock to protect A; the red thread has to wait.
- The red thread has written to C and protects C with a lock.
- The blue thread now wants to read from C: neither thread can proceed, a deadlock.

Issues and pitfalls: lack of locality (Slide 21)
- Data on another core/GPU
- Data on the host
- Data on disk
- ...
- No memory coalescing
- Bank conflicts

Issues and pitfalls: load imbalance (Slide 22)
- Unused cores
- Adaptive refinement
- Lack of parallel work

Issues and pitfalls: overhead (Slide 23)
- Run time of the kernel too short
- Computational intensity too low
- Too much communication (e.g. over PCIe or QPI)
- I/O: done by the CPU; can be done in parallel with a compute kernel running on the GPU

Overhead: synchronization (Slide 24)
- Only threads within a thread block / work-group can be synchronized.
- Global synchronization is done with kernel calls.
- Atomics can sometimes be used to avoid synchronization.
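A small CPU-side analogue (hypothetical OpenMP example; on NVIDIA GPUs atomicAdd plays the same role): an atomic update lets many threads modify shared histogram bins without a critical section or an extra synchronisation point.

    #include <stdio.h>

    #define BINS 16

    int main(void) {
        int hist[BINS] = {0};
        #pragma omp parallel for
        for (int i = 0; i < 100000; ++i) {
            int b = (i * 31) % BINS;    /* some bin index                 */
            #pragma omp atomic
            hist[b]++;                  /* indivisible read-modify-write  */
        }
        printf("hist[0] = %d\n", hist[0]);
        return 0;
    }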

Steps to parallelise an application: profiling (Slide 25)
- Is there a hotspot?
- How long does it take?
- Does it have a high computational intensity?
- How much data has to be moved to/from the GPU?

Steps to parallelise: algorithm design (Slide 26)
- Splitting up your work: embarrassingly parallel? static/dynamic load balancing?
- Minimize overhead
- Minimize latencies
- ...

Decomposition (Slide 27)
- Functional decomposition: split the work by task (example: a climate simulation with separate atmosphere, ocean, land and vegetation components).
- Domain decomposition: split the data (the simulation domain) among the workers.
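A tiny, hypothetical sketch of 1-D domain decomposition: each of P workers gets a contiguous chunk of an array of n elements, and the first n % P workers take one extra element so the load stays balanced.

    #include <stdio.h>

    /* Compute the [lo, hi) index range owned by worker 'rank' of 'nprocs'. */
    static void my_range(long n, int nprocs, int rank, long *lo, long *hi) {
        long base = n / nprocs;     /* minimum chunk size                   */
        long rem  = n % nprocs;     /* the first 'rem' ranks get one extra  */
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void) {
        long lo, hi;
        for (int r = 0; r < 4; ++r) {
            my_range(10, 4, r, &lo, &hi);
            printf("rank %d owns indices [%ld, %ld)\n", r, lo, hi);
        }
        return 0;
    }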

Parallel programming models (Slide 28)
- Coarse-grained parallel: MPI, OpenMP, OpenACC
- Fine-grained parallel: CUDA, OpenCL
- MPI + OpenMP + OpenACC + CUDA and many other combinations possible

Message Passing Interface (MPI) (Slide 29)
Parallelism between compute nodes.
[Diagram: a cluster of compute nodes, each with several CPUs.]
- Main application of MPI: communication between compute nodes within a cluster.
- Standardized and portable message-passing system.
- Basic idea: processes (= running instances of a computer program) exchange messages via the network.
- Defined by the MPI Forum (a group of researchers from academia and industry); MPI 1.0 released in 1994.
- Many implementations for C and Fortran (and also other programming languages) are available (library, daemons, helper programs).

Message Passing Interface (MPI): example C code (Slide 30)
Code example: Wikipedia

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        int myrank, message_size = 50, tag = 99;
        char message[message_size];
        MPI_Status status;

        MPI_Init(&argc, &argv);                    /* initialise MPI processing       */
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);    /* determine rank of this process  */

        if (myrank == 0) {
            /* receive the message on rank 0 */
            MPI_Recv(message, message_size, MPI_CHAR, 1, tag, MPI_COMM_WORLD, &status);
            printf("received \"%s\"\n", message);
        } else {
            /* send the message on rank 1 */
            strcpy(message, "Hello, there");
            MPI_Send(message, strlen(message) + 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
        }

        MPI_Finalize();                            /* finalise MPI processing         */
        return 0;
    }
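A typical way to build and run the example on two processes (assuming an MPI implementation such as Open MPI is installed; the file name is arbitrary):

    mpicc mpi_hello.c -o mpi_hello
    mpirun -np 2 ./mpi_hello

Rank 1 sends the greeting and rank 0 receives and prints it.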

OpenMP (Slide 31)
Parallelism between cores.
[Diagram: a compute node with two CPUs sharing one memory.]
- Main application of OpenMP: programming interface for multiprocessing within a shared-memory system via threads.
- Multiple threads during the execution of a single program (thread = lightweight sub-process within a process).
- Application programming interface (API) for shared-memory multiprocessing (multi-platform: hardware, OS).
- Basic idea: the OpenMP runtime environment manages threads (e.g., their creation) as required during program execution, which makes life easy for the programmer.
- Defined by the nonprofit technology consortium OpenMP Architecture Review Board (a group of major computer hardware and software vendors); OpenMP 1.0 released in 1997.
- Many implementations for C, C++, and Fortran (library, OpenMP-enabled compilers).

OpenMP: example C++ code (Slide 32)
Code example: Wikipedia

    #include <iostream>
    using namespace std;
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int th_id, nthreads;

        // open parallel section: the threads run the same code in parallel
        #pragma omp parallel private(th_id) shared(nthreads)
        {
            th_id = omp_get_thread_num();       // get ID of the local thread

            // critical section: only one thread at a time may write to the screen
            #pragma omp critical
            {
                cout << "Hello World from thread " << th_id << '\n';
            }

            #pragma omp barrier                 // all threads meet here and wait for each other

            // only the master thread executes the following section
            #pragma omp master
            {
                nthreads = omp_get_num_threads();   // overall number of threads running in parallel
                cout << "There are " << nthreads << " threads" << '\n';
            }
        }
        return 0;
    }
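A typical way to build and run the example with GCC (the file name and thread count are arbitrary):

    g++ -fopenmp omp_hello.cpp -o omp_hello
    OMP_NUM_THREADS=4 ./omp_hello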

Parallelisation pattern: domain decomposition (Slide 33)
From "Introduction to Parallel Computing" @ https://computing.llnl.gov/tutorials/parallel_comp/

Pattern: Stencil (Slide 34)
[Figure: a 2-D grid of values; each point is updated from the values of its neighbouring points.]
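A minimal, hypothetical sketch of the pattern in C: a 5-point stencil sweep in which every interior grid point is recomputed from its four neighbours; the outer loop is a natural candidate for OpenMP or for domain decomposition across nodes.

    #include <stdio.h>

    #define NX 8
    #define NY 8

    /* 5-point stencil: each interior point of 'out' becomes the average
     * of the four neighbouring points in 'in'.                          */
    static void stencil_sweep(double in[NX][NY], double out[NX][NY]) {
        for (int i = 1; i < NX - 1; ++i)
            for (int j = 1; j < NY - 1; ++j)
                out[i][j] = 0.25 * (in[i - 1][j] + in[i + 1][j]
                                  + in[i][j - 1] + in[i][j + 1]);
    }

    int main(void) {
        double a[NX][NY] = {{0}}, b[NX][NY] = {{0}};
        a[NX / 2][NY / 2] = 1.0;            /* a single hot spot                 */
        stencil_sweep(a, b);
        printf("b[3][4] = %f\n", b[3][4]);  /* a neighbour of the hot spot: 0.25 */
        return 0;
    }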

Pattern: Reduction (Slide 35)
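A minimal, hypothetical OpenMP example of the reduction pattern (compile with -fopenmp): each thread accumulates a private partial sum, and the runtime combines the partial results into the final value.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double x[N];
        for (int i = 0; i < N; ++i) x[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; ++i)
            sum += x[i];

        printf("sum = %f (expected %d)\n", sum, N);
        return 0;
    }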

References (Slide 36)
- "Introduction to Parallel Computing", https://computing.llnl.gov/tutorials/parallel_comp/
- "Structured Parallel Programming" by Michael McCool, Arch D. Robison, and James Reinders