Programming Models for Multi-Threading. Brian Marshall, Advanced Research Computing


Why Do Parallel Computing?
Limits of single-CPU computing: performance, available memory, I/O rates.
Parallel computing allows us to:
- solve problems that don't fit on a single CPU
- solve problems that can't be solved in a reasonable time
We can solve larger problems, solve them faster, and run more cases.

A Change in Moore's Law!

Parallelism is the New Moore's Law
- Power and energy efficiency impose a key constraint on the design of micro-architectures
- Clock speeds have plateaued
- Hardware parallelism is increasing rapidly to make up the difference

Cluster System Architecture
[Diagram: cluster layout showing login nodes, a RAID 5 home server, 16 I/O nodes, and the HOME and WORK file systems, connected through TopSpin InfiniBand and GigE switch hierarchies; interconnects shown are GigE, InfiniBand, and Fibre Channel.]

Blade : Rack : System
- 1 node: 2 x 8 cores = 16 cores
- 1 chassis: 10 nodes = 120 cores
- 1 rack (frame): 4 chassis = 480 cores
- System: 10 racks = 4,800 cores

HPC Trends
Architecture: Single core -> Multicore -> GPU -> Cluster
Code: Serial -> OpenMP -> CUDA -> MPI

Multi-core Systems
[Diagram: several processors, each with local memory, connected by a network.]
- Current processors place multiple processor cores on a die
- Communication details are increasingly complex: cache access, main memory access, QuickPath / HyperTransport socket connections, node-to-node connection via the network

Accelerator-based Systems
[Diagram: several nodes, each pairing a CPU and local memory with a GPU, connected by a network.]
- Calculations are made on both the CPUs and the graphics processing units (GPUs)
- No longer limited to single-precision calculations
- Load balancing is critical for performance
- Requires specific libraries and compilers (CUDA, OpenCL)
- Co-processor from Intel: MIC (Many Integrated Core)

Motivation
- Where is unrealized performance, and how do we extract it?
- How broad is the performance impact?
- Hierarchical parallelism
- Increased importance of fine-grained and data parallelism
- More cores available per processor

Where is the Parallelism?
- Level 1: Single instruction multiple data (SIMD) vector registers within individual CPU cores
- Level 2: Increasing number of cores per CPU
- Level 3: Accelerator-equipped systems: general-purpose graphics processors (GPGPU), Intel Xeon Phi / many integrated core (MIC)
- Level 4: Supercomputing resources: large numbers of compute nodes, multiple levels of parallelism, increasing heterogeneity in hardware components

Motivations for Multithreading and Vectorization
- Expose parallelism that is inaccessible using MPI alone: fine-grained parallelism, task parallelism
- Automatic vectorization (Single Instruction Multiple Data): vector processors are more prevalent and getting wider; compilers will vectorize automatically if possible
- Accelerators such as GPUs and the Intel Xeon Phi
- Multi-threaded code is important to use multi-core processors efficiently; multi-core CPUs are present in laptops, desktops, and supercomputers

Multi-threaded Programs
- OpenMP: most widely used for CPU-based parallelization and for targeting the Intel Xeon Phi
- OpenACC: primarily used in the development of GPU-based codes
- pthreads, C++11 multithreading features: in the C++ standard, but not yet fully supported everywhere
- CUDA, OpenCL
- Intel Threading Building Blocks (TBB), Cilk++

What is OpenMP?
- API for parallel programming on shared-memory systems using parallel threads
- Implemented through the use of compiler directives, a runtime library, and environment variables
- Supported in C, C++, and Fortran
- Maintained by the OpenMP Architecture Review Board (http://www.openmp.org/)
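As a minimal sketch of those three ingredients (not from the original slides), the example below uses a compiler directive, two runtime library calls, and an environment variable; the file name hello_omp.cpp and the build commands are illustrative assumptions.

// hello_omp.cpp -- minimal OpenMP sketch: directive + runtime library + environment variable
// Typical build (compiler-dependent): g++ -fopenmp hello_omp.cpp   or   icc -qopenmp hello_omp.cpp
#include <cstdio>
#include <omp.h>

int main() {
    // Compiler directive: create a team of threads
    #pragma omp parallel
    {
        // Runtime library routines: query thread id and team size
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}

// Environment variable:  OMP_NUM_THREADS=4 ./a.out   runs the parallel region with 4 threads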

Shared Memory
[Diagram: several processors (P) sharing a single memory.]
Examples: your laptop; multicore, multiple-memory NUMA systems such as HokieOne (SGI UV); one node on BlueRidge

OpenMP Constructs
OpenMP language extensions fall into several categories:
- Parallel control structures: govern the flow of control in the program (parallel directive)
- Work sharing: distributes work among threads (do/parallel do and sections directives)
- Data environment: specifies variables as shared or private (shared and private clauses)
- Synchronization: coordinates thread execution (critical, atomic, and barrier directives)
- Runtime environment: omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE
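To tie these categories together, here is a small sketch (not from the slides) that combines a work-sharing loop, shared and private data, a critical section, and the runtime routines listed above; the loop bound and the thread count of 4 are arbitrary choices.

// omp_constructs.cpp -- illustrative sketch of the construct categories above
#include <cstdio>
#include <omp.h>

int main() {
    const int N = 1000;
    double sum = 0.0;                  // shared accumulator

    omp_set_num_threads(4);            // runtime library routine

    // Parallel control + work sharing: loop iterations are divided among threads.
    // Data environment: sum is shared, local is private to each thread.
    #pragma omp parallel
    {
        double local = 0.0;            // private partial sum
        #pragma omp for
        for (int i = 0; i < N; i++)
            local += i * 0.5;

        // Synchronization: only one thread at a time updates the shared sum
        #pragma omp critical
        sum += local;
    }

    printf("sum = %f\n", sum);
    return 0;
}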

Factors Affecting Multi-thread Performance
- Avoid the overhead of initializing new threads wherever possible
- Bind threads to physical hardware cores
- Cache coherence issues can cause serious performance degradation when memory is written by different cores
- Data for a calculation performed by a particular core should be local to that core
- Avoid synchronization; try to enforce thread safety without serializing code
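As a hedged illustration of two of these factors (thread binding and keeping written data local to a core), the sketch below pads per-thread counters so they never share a cache line; the 64-byte line size, the MAX_THREADS bound, and the OMP_PROC_BIND / OMP_PLACES settings are typical assumptions, not values from the slides.

// binding_and_false_sharing.cpp -- illustrative sketch, not from the original slides
// Typical run:  OMP_PROC_BIND=close OMP_PLACES=cores ./a.out   (binds threads to cores)
#include <cstdio>
#include <omp.h>

constexpr int CACHE_LINE = 64;          // assumed cache-line size in bytes
constexpr int MAX_THREADS = 64;         // assumed upper bound for this sketch

// Pad each thread's counter so two threads never write to the same cache line
struct alignas(CACHE_LINE) PaddedCounter {
    long value = 0;
};

int main() {
    static PaddedCounter counts[MAX_THREADS];   // static storage honours alignas

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++)
            counts[tid].value++;        // each core writes only its own cache line
    }

    long total = 0;
    for (int t = 0; t < omp_get_max_threads(); t++) total += counts[t].value;
    printf("total = %ld\n", total);
    return 0;
}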

Single Instruction Multiple Data (SIMD)
- Each clock cycle, a processor loads instructions and the data on which those instructions operate
- SIMD processors can apply a single instruction to multiple pieces of data in a single clock cycle
- Modern processors increasingly enable or rely on SIMD to achieve high performance: Intel Sandy Bridge / Ivy Bridge / Haswell, AMD Opteron, IBM Blue Gene/Q, and accelerators such as GPUs and the Intel Xeon Phi
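For concreteness, here is an illustrative sketch (not from the slides) of SIMD at the instruction level using AVX intrinsics, where a single 256-bit add instruction produces four double-precision results at once; it assumes an AVX-capable CPU and a suitable compiler flag such as -mavx.

// simd_add.cpp -- illustrative AVX sketch: one instruction operates on four doubles
// Assumes an AVX-capable CPU; compile with e.g. -mavx (exact flag depends on the compiler)
#include <immintrin.h>
#include <cstdio>

int main() {
    const int N = 8;                          // multiple of 4 for simplicity
    double A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 10.0 * i; }

    for (int i = 0; i < N; i += 4) {
        __m256d a = _mm256_loadu_pd(&A[i]);   // load 4 doubles
        __m256d b = _mm256_loadu_pd(&B[i]);   // load 4 doubles
        __m256d c = _mm256_add_pd(a, b);      // ONE add instruction -> 4 results
        _mm256_storeu_pd(&C[i], c);           // store 4 doubles
    }

    for (int i = 0; i < N; i++) printf("C[%d] = %g\n", i, C[i]);
    return 0;
}

Auto-vectorizing compilers generate similar instructions from plain loops, which is what the following slides explore.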

Auto-Vectorization Summary
Performance gains from auto-vectorization are not guaranteed:
- Certain algorithms vectorize while others do not
- Problem details can also impact performance
- The compiler and hardware combination affects the efficiency of vectorization
However:
- SIMD is becoming more prevalent and the speedup can be significant
- SIMD data-structure optimizations provide benefits on both CPUs and accelerators (GPU, Intel Xeon Phi)

Software Challenges for Multithreading
- Programming models for multi-threading are actively evolving
- Compiler support and performance for different implementations can vary widely
- Tradeoffs between portability and performance: C++11, OpenMP
- Architecture-specific programming models: Intel Threading Building Blocks, Cilk++, CUDA, OpenCL, etc.

Compiler Auto-Vectorization
- Many compilers can automatically generate vector instructions: Intel 13.0, gcc 4.7, llvm 3.4, pgi 14.0, IBM XL
- How you write your code has a huge impact on whether or not the compiler will generate vector instructions (and how optimal they will be)
- The performance of the various compilers will vary
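A hedged example of a trivially vectorizable loop and how one might ask the compiler to report on it (flag names vary by compiler and version; the icc flag below also appears later in these slides, while the gcc flags are assumptions about typical versions):

// vec_report.cpp -- a simple loop used to check compiler vectorization reports
// Possible builds (compiler- and version-dependent):
//   icc  -O2 -vec-report2 -c vec_report.cpp
//   g++  -O3 -fopt-info-vec -c vec_report.cpp      (older gcc used -ftree-vectorizer-verbose)
void scale(double *c, const double *a, int n) {
    for (int i = 0; i < n; i++)
        c[i] = 2.0 * a[i];          // contiguous access, no dependencies: should vectorize
}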

Programming Practices that Inhibit Auto-Vectorization
- Loops without a single point of entry and exit; branching prevents vectorization
- Data dependencies: read after write, write after read
- Aliasing may cause the compiler to assume data dependencies exist, for safety!
- Non-contiguous memory accesses
- Function calls within loops
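The sketch below (not from the slides) illustrates two of these inhibitors, a read-after-write dependency and pointer aliasing; the use of a restrict qualifier echoes the -restrict flag in the icc command shown a few slides later.

// inhibitors.cpp -- illustrative examples of patterns that block auto-vectorization

// Read-after-write (loop-carried) dependency: iteration i needs the result of i-1,
// so the iterations cannot be executed in SIMD lanes simultaneously.
void prefix_sum(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}

// Possible aliasing: the compiler must assume c may overlap a or b and be conservative.
void add_maybe_aliased(double *c, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// Promising no overlap (C99 restrict / the common __restrict extension in C++)
// removes the assumed dependency and lets the compiler vectorize freely.
void add_restrict(double *__restrict c, const double *__restrict a,
                  const double *__restrict b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}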

Data Structures and Auto-Vectorization
- Structure of arrays is preferred over array of structures
- Memory alignment has a big impact on how efficiently vectorization is performed
- Example task: add two vectors together to obtain a third vector: C[i] = A[i] + B[i]

Single Instruction Multiple Data (SIMD)

Data Structures and Auto-Vectorization
// Array of structures (AoS)
struct ArrayOfStruct {
    double A, B, C;
    void add() { C = A + B; }
};

/* some code */
ArrayOfStruct *AOS;
AOS = new ArrayOfStruct[SIZE];
for (int i = 0; i < SIZE; i++)
    AOS[i].add();

Data Structures and Auto-Vectorization
// Structure of arrays (SoA)
struct StructOfArrays {
    /* ... */
    double *A, *B, *C;
    void add() {
        for (int i = 0; i < size; i++)
            C[i] = A[i] + B[i];
    }
};

/* some code */
StructOfArrays SOA(SIZE);
SOA.add();   // Same calculation, different data layout

Data Structures and Auto-Vectorization
Compilers can often be prompted to print out information about whether vectorization is performed:
icc -vec-report2 -restrict VecAdd.cpp
For the array-of-structures loop:
for (int i = 0; i < SIZE; i++)
    AOS[i].add();
the compiler prints the following:
remark: loop was not vectorized: vectorization possible but seems inefficient.

Data Structures and Auto-Vectorization
For the structure-of-arrays loop:
for (int i = 0; i < size; i++)
    C[i] = A[i] + B[i];
the compiler prints the following:
remark: LOOP WAS VECTORIZED
(Structure of arrays is preferred for SIMD computations, including on accelerators such as GPUs.)

Data Structures and Auto-Vectorization
// Memory alignment and auto-vectorization
// Little things can make a big difference
double *A = new double[SIZE];
double *B = new double[SIZE];
double *C = new double[SIZE];

// Explicitly aligning memory is advantageous!
__declspec(align(16)) double A[SIZE];
__declspec(align(16)) double B[SIZE];
__declspec(align(16)) double C[SIZE];
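The __declspec(align(...)) syntax above is an Intel/Microsoft compiler extension; as a hedged alternative, the sketch below uses the standard C++11 alignas specifier together with the OpenMP 4.0 simd aligned clause. The 32-byte alignment (one 256-bit AVX register) and the array size are assumptions, not values from the slides.

// aligned_add.cpp -- illustrative sketch of portable alignment hints
// Compile with OpenMP SIMD support, e.g. -fopenmp or -fopenmp-simd
#include <cstddef>

constexpr std::size_t N = 1024;

// C++11 standard alignment for statically allocated arrays (32 bytes = one AVX register)
alignas(32) double A[N];
alignas(32) double B[N];
alignas(32) double C[N];

void add() {
    // OpenMP 4.0: request SIMD code generation and assert 32-byte alignment of the arrays
    #pragma omp simd aligned(A, B, C : 32)
    for (std::size_t i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}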

Data Structures and Auto-Vectorization
Compare the performance on an Intel Sandy Bridge CPU with the Intel 13.0 compiler and a 256-bit SIMD register (4 doubles per instruction). The aligned structure of arrays is a clear winner:
- Array of structures: 2.1 seconds
- Structure of arrays: 0.99 seconds (~2x speedup)
- Aligned structure of arrays: 0.6 seconds (~3.5x speedup)

Questions???