SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008

What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem. Implementing parallelism requires modifying a sequentially executing algorithm so that the workload is decomposed into independent tasks; this may be trivial or laborious.

Rationale for using SHARCNET: SHARCNET provides access to, and substantial user support for, large-scale parallel computing systems. These HPC systems are prohibitively expensive for individual groups/departments to operate. The majority of SHARCNET computing projects implement parallelism to achieve faster and/or more significant results.

Why implement parallelism? Faster computation (more processors): strong scaling, where the workload per processor is inversely proportional to the number of processors. Larger problems (more memory): weak scaling, where the workload per processor is constant; high-density memory is expensive. Parallelism is an essential feature in HPC...

Parallel Hardware Parallelism is abundant in computer hardware: it increases throughput, exists on many, if not all, levels, and is the technological trend (e.g. multi-core). Examples: parallel instruction scheduling, parallel data operators (vector units), multiple processing cores per network node (computer), and multiple computers (nodes) connected in a network (cluster).

Memory Locality A major constraint on developing parallel algorithms is how processing elements view the global address space and communicate with one another: in shared memory (SMP), multiple cores on the same computer can all directly access all of the memory; in distributed memory (a cluster), multiple computers sit on a network and each core can only directly access its local memory. It's possible to have combinations and multiple levels of shared/distributed memory, and technologies exist that can bridge the gap, e.g. MPI-2, distributed shared memory.

SMP Programming Shared-memory systems are usually programmed to take advantage of multiple processing cores by threading applications with OpenMP or POSIX threads (pthreads). Both are application programming interfaces (APIs). OpenMP is easier to implement and is generally supported by the compiler, whereas pthreads is more complex (lower-level) and is used as an external library. We will focus only on OpenMP.

Processes and Threads A process is an instance of a computer program that is being executed: it includes binary instructions, different memory storage areas, and hidden kernel data, and it performs serial execution of instructions. It may be possible to perform many operations simultaneously (when instructions are independent); to process these instructions in parallel, the process can spawn a number of threads.

Processes and Threads A thread is a light-weight process connected to the parent process and sharing its memory space; threads each have their own list of instructions and independent stack memory. This way of splitting up the execution stream is typically called the fork-join model: execution progresses serially, then a team of threads is created, works in parallel, and returns execution flow to the parent process.
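
As an illustration (not part of the original slides), here is a minimal C sketch of the fork-join model using OpenMP, which is introduced in more detail later in this workshop; it must be compiled with OpenMP support enabled.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial part: parent thread only\n");      /* before the fork */

    #pragma omp parallel                              /* fork: create a team of threads */
    {
        printf("parallel part: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                                 /* join: back to the parent thread */

    printf("serial part again: parent thread only\n");
    return 0;
}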

Distributed Memory Programming On distributed memory systems one uses a team of processes that run simultaneously on a network of computers, passing data to one another in the form of messages. This is facilitated by the Message Passing Interface (MPI). MPI is a portable API that includes an external library and runtime programs that are used to compile and execute MPI programs. One can run MPI programs on shared-memory systems, or even mix the two approaches (MPI & OpenMP/pthreads).

Network Properties Message passing performance depends on the time to send data across the network (latency) and the amount of data that can be passed in a given amount of time (bandwidth). Examples: gigabit ethernet: 3.0x10^-5 s latency, 1 Gb/s bandwidth; InfiniBand: 1.5x10^-6 s latency, 20 Gb/s bandwidth. Issues to be aware of: programs may not scale on lower-performance networks; most algorithms are latency-limited as opposed to bandwidth-limited (writing to disk is slower than the network!); network topology can dramatically affect performance (multiple hops increase latency, shared links decrease bandwidth).
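
To make these numbers concrete, the following illustrative C sketch (not from the original slides) evaluates the simple cost model time = latency + message size / bandwidth for the two networks above; the 8 KB message size is an arbitrary assumption.

#include <stdio.h>

int main(void)
{
    double msg_bytes = 8.0 * 1024.0;                     /* assumed 8 KB message        */
    double gige_lat  = 3.0e-5, gige_bw = 1.0e9 / 8.0;    /* 1 Gb/s expressed in bytes/s */
    double ib_lat    = 1.5e-6, ib_bw   = 20.0e9 / 8.0;   /* 20 Gb/s in bytes/s          */

    /* time = latency + size / bandwidth */
    printf("gigabit ethernet: %.2e s\n", gige_lat + msg_bytes / gige_bw);
    printf("infiniband:       %.2e s\n", ib_lat   + msg_bytes / ib_bw);
    return 0;
}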

Amdahl's Law In a parallel computational process, the degree of speedup is limited by the fraction of code that can be parallelized: Tparallel(NP) = [ (1 - P) + P / NP ] * Tserial + overhead, where Tparallel is the time to run the application in parallel, P is the fraction of the serial runtime that can be parallelized, NP is the number of parallel elements (cores, nodes, threads, etc.), Tserial is the time to run the application serially, and overhead is the overhead introduced by adding the parallelization. If P is small, the application will not scale to large NP!

Parallel Scaling To compare the performance of a parallelized code to the serial version, one uses the term scaling, which is just: Scaling(NP) = Tserial / Tparallel(NP). This is only meaningful for strong scaling; in the weak scaling case, one attempts to maintain a constant runtime while using additional compute elements to solve a larger problem. Perfect scaling over a range of NP is referred to as being linear.
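
As a worked example (not from the original slides), this small C sketch evaluates the Amdahl model and the corresponding scaling for an assumed P = 0.9 and Tserial = 100 s, ignoring the overhead term; it shows the scaling levelling off once the serial fraction dominates.

#include <stdio.h>

int main(void)
{
    double P = 0.9, Tserial = 100.0;   /* assumed example values */

    for (int np = 1; np <= 64; np *= 2) {
        /* Amdahl model with the overhead term omitted */
        double Tparallel = ((1.0 - P) + P / np) * Tserial;
        printf("NP = %2d  Tparallel = %6.2f s  scaling = %5.2f\n",
               np, Tparallel, Tserial / Tparallel);
    }
    return 0;
}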

Parallelization Overhead Parallelization may involve additional computational overhead as well as communications overhead. Optimal performance may require tuning for a particular system. It is usually best to maximize the amount of computation versus communication; asynchronous message passing can hide communications overhead.

Graininess of parallelism Fine-grained vs. coarse-grained parallelism refers to the amount of work that each compute element does before a synchronization point. Fine-grained parallelism (frequent communication) requires low latency!

Decomposition Approach There are three main approaches to decomposing a problem so that it can run in parallel: serial farming (embarrassingly or perfectly parallel), data parallelization, and task parallelization. The approach to use depends on the algorithm, and can be a combination of the three!

Serial Farming Algorithms that can be decomposed with little or no communication, e.g. parameter-space algorithms with independent results. No change is required to the serial algorithm; instead of running sequentially, all permutations are performed simultaneously. Can still suffer from I/O bottlenecks!

Data Parallelism Algorithms that can be decomposed into distinct work units, e.g. mesh-based solvers, linear algebra. Each compute element works only with its local data; this may require sending boundary data, and raises issues with synchronization and load balancing. Susceptible to large amounts of communication overhead, which impacts scaling performance.
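
As a sketch of the boundary-data exchange mentioned above (not from the original slides), the following hypothetical 1-D example uses MPI, introduced earlier, to swap ghost cells between neighbouring processes; the local mesh size and the update step are placeholders.

#include "mpi.h"

#define N 1000   /* assumed number of local mesh points per process */

int main(int argc, char *argv[])
{
    int rank, nprocs;
    double u[N + 2];   /* local data plus one ghost cell at each end */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* ... fill u[1..N] with this process's portion of the mesh ... */

    /* send my first point left and receive my right ghost cell, then vice versa */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  0, &u[N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0, &u[0],     1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... update interior points using the ghost cells ... */

    MPI_Finalize();
    return 0;
}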

Task Parallelism Algorithms that can be decomposed into distinct tasks which can be operated on in parallel, e.g. a real-time signal correlator: one thread does I/O, one does message passing, and the other threads process data. The communication strategy may not be as straightforward or structured as in data parallelism.
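
A minimal illustrative sketch of task parallelism (not from the original slides) using OpenMP sections, with each section standing in for one of the tasks described above:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel sections     /* each section is handled by a different thread */
    {
        #pragma omp section
        { printf("thread %d: reading/writing data (I/O task)\n", omp_get_thread_num()); }

        #pragma omp section
        { printf("thread %d: passing messages (communication task)\n", omp_get_thread_num()); }

        #pragma omp section
        { printf("thread %d: processing data (compute task)\n", omp_get_thread_num()); }
    }
    return 0;
}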

Important Consideration Don't over-complicate things (KISS rule): avoid unnecessarily implementing parallelism (use the best tool for the job); complexity may result in slower performance, worse scalability, and an increased likelihood of job failures; a simple, modular solution is elegant (like a mathematical proof) and easy to understand and maintain.

The Parallel Development Process Initially one must profile and understand the code. If P isn't ~1, is it worthwhile? A major code re-design may be necessary to implement parallelism; data structures may have to change and new methods of solving problems may have to be implemented. Once these changes are made, and for more straightforward problems, the parallelization process follows the same cycle used to optimize a program.

The Parallel Development Cycle 1. Identify hot-spots (functions/routines that use the most cycles); this requires profiling or instrumenting timing in the code. 2. Modify the code to parallelize the hot-spots; not all operations can be parallelized. 3. Check for accuracy and correctness, and debug any program errors that are introduced. 4. Tune for performance, and repeat the cycle as necessary.
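
One simple way to instrument timing around a suspected hot-spot (step 1) is a wall-clock timer; the following is an illustrative sketch using the POSIX clock_gettime call, not code from the workshop (on older systems it may need to be linked with -lrt).

#include <stdio.h>
#include <time.h>

/* return the elapsed wall-clock time between two timestamps, in seconds */
static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + 1.0e-9 * (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... call the routine suspected of being a hot-spot here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("hot-spot took %.6f s\n", elapsed(t0, t1));
    return 0;
}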

OpenMP: Open Multi-Processing Programmed by utilizing special compiler directives and functions in the OpenMP runtime library. In simple cases (loop-level parallelism), one doesn't have to make any code modifications, only add these directives; the compiler will only consider the directives if it is told to do so at compile time. It's easy to incrementally add parallelization!

OpenMP Blocks of code that are to be done in parallel are considered parallel regions, and must have only one entry and one exit point. Threaded programs are executed in the same fashion as serial programs; use the OMP_NUM_THREADS environment variable to set the number of threads, or else hardwire it in your code.
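
For example, with the GNU compiler an OpenMP program might be built and run as follows (the flag name varies between compilers; -fopenmp is the GCC flag, and program.c is a placeholder):

    gcc -fopenmp program.c -o program
    export OMP_NUM_THREADS=4
    ./program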

Data Scoping Scoping refers to how different variables should be accessed by the threads. Basic scoping types are: private: each thread gets a copy of the variable, stored on its private stack; values in private variables are not returned to the parent process. shared: all threads access the same variable; if variables are read-only, it is safe to declare them as shared. reduction: like private, but the value is reduced (sum, max, min, etc.) and returned to the parent process.

Simple OpenMP Example
#pragma omp parallel default(shared) private(i,x) reduction(+:pi_int)
{
#ifdef _OPENMP
    printf("thread %d of %d started\n", omp_get_thread_num(), omp_get_num_threads());
#endif
#pragma omp for
    for (i = 1; i <= n; ++i) {
        x = h * ( i - 0.5 );                 // center of interval
        pi_int += 4.0 / ( 1.0 + pow(x,2) );
    }
}

MPI: Message Passing Interface Programmed using functions/routines in the MPI library. Requires more work than implementing OpenMP: it involves new data structures and increased algorithmic complexity. Portable: runs on both shared and distributed memory. MPI programs are executed on top of an underlying MPI framework; typically each computing element (network node) runs an MPI daemon that facilitates the passing of messages. Usually executed with a special mpirun command, which takes the number of processes to be used as an argument.
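
As a typical example (command names vary between MPI implementations), an MPI program might be compiled and launched on 4 processes with:

    mpicc hello.c -o hello
    mpirun -np 4 ./hello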

MPI: The 6 basic functions MPI essentially boils down to only 6 different functions:
MPI_Init()
MPI_Comm_size(MPI_COMM_WORLD, &numprocs)
MPI_Comm_rank(MPI_COMM_WORLD, &rank)
MPI_Send(buffer, buffer_size, MPI_TYPE, to_rank, tag, mpi_comm)
MPI_Recv(buffer, buffer_size, MPI_TYPE, from_rank, tag, mpi_comm, status)
MPI_Finalize()
All other communication patterns fundamentally consist of sends and receives.

MPI Hello World
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int numprocs, myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello World from rank %d of %d\n", myrank, numprocs);

    MPI_Finalize();
    return 0;
}
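
To complete the six-function list with the send/receive pair, here is a hypothetical sketch (not from the original slides) in which rank 0 passes one integer to rank 1; it must be run with at least two processes.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int numprocs, myrank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        value = 42;                                          /* data to pass along */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}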

Parallel Libraries Many libraries use threading and MPI internally: Intel MKL, AMD ACML, etc. (threaded FFTs, linear algebra, random number generators); FFTW (threaded and MPI-parallel FFT routines); ScaLAPACK (MPI routines for doing linear algebra). These can relieve a great deal of the burden required to implement parallel programming, so use them! Be careful not to oversubscribe processors by calling threaded library routines from inside parallel regions.

Debugging Commercial parallel debuggers (SHARCNET supports DDT) can provide an efficient way to debug problems with MPI code. Most contemporary debuggers support debugging threaded applications. One can also run a separate debugger for each MPI process, which is sometimes helpful for capturing stack traces, etc. If all else fails, use a lot of print statements. Many MPI problems result from incorrect synchronization, e.g. deadlocking (a blocking receive posted with no matching send). Beware of libraries that are not thread-safe.