Parallel Programming Patterns Overview and Concepts



Reusing this material. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: you must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original.

Outline. Why parallel programming? Parallel decomposition patterns: geometric decomposition, task farm / worker queue, pipeline, loop parallelism.

Why parallel programming? Parallel programming is more difficult than normal (serial) programming, so why bother? We are reaching limitations in the speed of individual processors: there are limits to the size and speed of a single chip (heat!), developing new processor technology is very expensive, and ultimately we encounter fundamental limits such as the speed of light and the size of atoms. Processor designs are therefore moving away from a few fast cores towards many slower ones, and we increasingly need parallel applications to continue advancing science. Parallelism is not a silver bullet: there are many additional considerations, and careful thought is required to take advantage of parallel machines.

But I'm not a programmer. This course will not teach you how to program in parallel; it will provide you with an overview of some of the common ways this is done. There are two aspects to this: 1. how computational work can be split up and divided amongst processors/cores in an abstract sense (the topic of this lecture); 2. how this can actually be implemented in hardware and software (later lectures).

But I'm not a programmer. Understanding how programs run in parallel should help you: make better-informed choices about what software to use for your research; understand what problems can emerge (parallel performance or errors); make better use of high-performance computers (and even your laptop!); and get your research done more quickly. You may also decide to learn to program in parallel, in which case you will already have an overview of many key concepts.

Performance. A key aim is to solve problems faster: to improve the time to solution and to enable new scientific problems to be solved. To exploit parallel computers, we need to split the program up between different processors (cores). We distinguish between processors and cores when it matters, and otherwise use the terms interchangeably (more in the lecture on hardware). Ideally, we would like a program to run P times faster on P processors. In practice, not all parts of a program can be successfully split up, and splitting the program up may introduce additional overheads such as communication.
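The slides do not give a formula for this, but the usual way to quantify it is to assume that some fraction of the program stays serial. The short C sketch below is illustrative only; the 10% serial fraction is an assumed example value, not a figure from the course. It prints the speedup you could expect on P processors even before any communication overheads are counted.

    /* Illustrative only: if a fixed fraction of a program cannot be split up,
       P processors give less than P times speedup, even ignoring communication
       overheads. The 10% serial fraction below is an assumed example value. */
    #include <stdio.h>

    int main(void)
    {
        double serial_fraction = 0.1;   /* assumed: 10% of the work stays serial */
        for (int p = 1; p <= 128; p *= 2) {
            double speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / p);
            printf("P = %3d  ideal speedup = %3d  achievable speedup = %6.2f\n",
                   p, p, speedup);
        }
        return 0;
    }

Even with only 10% of the work left serial, 128 processors give a speedup of under 10 in this model, which is why the decomposition deserves careful thought.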

Parallel tasks. How we split a problem up into tasks to run in parallel is critical: 1. ideally, limit interaction (information exchange) between tasks, as this takes time and may require processors to wait for each other; 2. balance the workload so that all processors are equally busy - if they are all equally powerful, this gives the shortest time to solution. Tightly coupled problems require lots of interaction between their parallel tasks. Embarrassingly parallel problems require very little (or no) interaction between their parallel tasks, e.g. sequence alignment queries for multiple independent sequences. In reality, most problems sit somewhere between the two extremes.

Parallel Decomposition. How do we split problems up to solve them efficiently in parallel?

Decomposition. One of the most challenging, but also most important, decisions is how to split the problem up. How you do this depends upon a number of factors: the nature of the problem, the required amount and frequency of interaction (information exchange) between tasks, and support from implementation technologies. We are going to look at some frequently used parallel decomposition patterns.

1. Geometric Decomposition (Molecular Dynamics practical). Based on a geometric division of the spatial domain of a problem. Example applications: biomolecular simulation and weather simulation. [Figure: spatial domain decomposition, from "GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers", SoftwareX, Volumes 1-2, 2015, 19-25, http://dx.doi.org/10.1016/j.softx.2015.06.001 (reuse permitted under CC BY 4.0).]

1. Geometric Decomposition (Molecular Dynamics practical). The spatial domain is divided geometrically into cells. Simplest case: all cells have the same size and shape, with one cell per processor. More adaptable: variably-sized and shaped cells, with more cells than processors. Information exchange is required between cells: temperature, pressure, humidity etc. for weather simulation; atomic/molecular charges to compute forces in biomolecular simulation.

1. Geometric Decomposition (Molecular Dynamics practical). Splitting the problem up does have an associated cost: it requires the exchange of information between processors. The granularity therefore needs careful consideration; the aim is to minimise communication and maximise computation.

1. Geometric Decomposition (Molecular Dynamics practical). Data must be swapped between cells, but often only the information on the cell boundaries is needed. Many small messages result in far greater overhead, so instead all boundary values are exchanged periodically by swapping cell halos.
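As a concrete illustration, here is a minimal 1D halo swap written with MPI in C. It is a sketch only, not taken from the practical codes: the array size, the fill values and the use of MPI_Sendrecv for a non-periodic 1D domain are illustrative choices. Compile with mpicc and run with mpirun.

    /* Minimal sketch of a 1D halo swap: each process owns NLOCAL cells plus
       one halo cell at each end, and periodically exchanges its boundary
       values with its left and right neighbours. */
    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 8                         /* interior cells per process (illustrative) */

    int main(int argc, char *argv[])
    {
        int rank, size;
        double u[NLOCAL + 2];                /* u[0] and u[NLOCAL+1] are halo cells */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 1; i <= NLOCAL; i++)    /* fill the interior with some local data */
            u[i] = rank * 100.0 + i;
        u[0] = u[NLOCAL + 1] = -1.0;         /* halos stay -1 at the domain edges */

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* Send my first interior cell left and receive my right halo, then
           send my last interior cell right and receive my left halo. */
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                     &u[0],          1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d: left halo = %g, right halo = %g\n", rank, u[0], u[NLOCAL + 1]);

        MPI_Finalize();
        return 0;
    }

Two messages per neighbour per step, each carrying a whole halo, replace the many small per-cell messages the slide warns against.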

Load Imbalance (Fractal practical). The overall execution time is worse if some processors take longer than the rest. Each processor should have (roughly) the same amount of work, i.e. they should be load balanced. For many problems it is unlikely that all cells require the same amount of computation; expect hotspots, regions where more compute is needed, e.g. localised high-low pressure fronts (weather simulation) or cells containing complex protein segments (biomolecular simulation).

Load Imbalance (Fractal practical). The degree of load imbalance can be measured (see the lecture on measuring parallel performance). Techniques exist to deal with load imbalance: assign multiple cells to each processor; use variably-sized cells in the first place to compensate for hotspots; allow processors to dynamically steal work from others. Load balancing can be done once at the start of program execution (static) or throughout execution (dynamic).
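One concrete realisation of static versus dynamic balancing, used here purely as an illustration rather than as the method from the practicals, is OpenMP's schedule clause: schedule(static) fixes the split of cells up front, while schedule(dynamic) hands cells to threads as they become idle. The artificial cost function below, in which every tenth cell needs far more work, stands in for a hotspot. Compile with a C compiler and -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    #define NCELLS 1000

    /* Pretend "hotspot" cells (every 10th) need 100x more computation. */
    static double work_on_cell(int cell)
    {
        long steps = (cell % 10 == 0) ? 1000000 : 10000;
        double x = 0.0;
        for (long s = 0; s < steps; s++)
            x += (double)s / (steps + 1);
        return x;
    }

    int main(void)
    {
        double sum = 0.0;
        double t0 = omp_get_wtime();

        /* schedule(static): each thread gets a fixed block of cells up front.
           schedule(dynamic): threads grab a new cell whenever they go idle,
           which absorbs the hotspots. */
        #pragma omp parallel for schedule(dynamic) reduction(+:sum)
        for (int cell = 0; cell < NCELLS; cell++)
            sum += work_on_cell(cell);

        double t1 = omp_get_wtime();
        printf("sum = %g, time = %.3f s\n", sum, t1 - t0);
        return 0;
    }

Swapping dynamic for static in the directive and timing both runs shows the imbalance directly.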

2. Task farm (master / worker) (Fractal & Sequence Alignment practicals). Split a problem up into distinct, independent tasks. [Diagram: a master connected to Worker 1, Worker 2, Worker 3, ..., Worker n.] The master process sends a task to a worker; the worker process sends the results back to the master. The number of tasks is often much greater than the number of workers, and tasks get allocated to idle workers dynamically.
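A minimal task farm in MPI might look like the C sketch below. It is illustrative only: the task payload (an integer id), the message tags and the do_task function are assumptions, not the practical code. Rank 0 acts as the master and hands the next task to whichever worker returns a result first, which is exactly the dynamic allocation described above.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   20
    #define TAG_WORK 1
    #define TAG_STOP 2

    static double do_task(int task)          /* stand-in for real work */
    {
        return (double)task * task;
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                      /* master */
            int next = 0, active = 0;
            double result;
            MPI_Status status;

            /* Seed every worker with one task (or stop it if none are left). */
            for (int w = 1; w < size; w++) {
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                    next++; active++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                }
            }
            /* Collect results from whichever worker finishes first, then reassign. */
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                printf("result %g from worker %d\n", result, status.MPI_SOURCE);
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                              /* worker */
            int task;
            MPI_Status status;
            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
                if (status.MPI_TAG == TAG_STOP) break;
                double result = do_task(task);
                MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }

Because the master replies to whichever worker is idle, fast workers simply process more tasks than slow ones.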

Task farm considerations (Fractal & Sequence Alignment practicals). Communication is between the master and the workers; communication between the workers can complicate things. The master process can become a bottleneck when workers are left idle waiting for the master to send them a task or acknowledge receipt of results; a potential solution is to implement work stealing. Resilience: what happens if a worker stops responding? The master could maintain a list of tasks and redistribute that worker's work.

3. Pipelines. Some problems involve operating on many pieces of data in turn. The overall calculation can be viewed as data flowing through a sequence of stages and being operated on at each stage. [Diagram: Data -> Stage 1 -> Stage 2 -> Stage 3 -> Stage 4 -> Stage 5 -> Result.] Each stage runs on a processor, and each processor communicates with the processor holding the next stage; the flow of data is one way.

Example: pipeline with 4 processors. [Diagram: successive time steps showing numbered data items 1-4 entering the pipeline and moving through the four stages towards the result.] Each processor (one per colour) is responsible for a different task or stage of the pipeline; each processor acts on the data items (numbered) as they move through the pipeline.
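A pipeline maps naturally onto message passing, with one rank per stage. The C sketch below is illustrative (the stage_work function and the number of items are made up, not taken from the course): rank 0 produces each data item, every other rank receives it from the previous stage, processes it and passes it on, and the last rank collects the results, so the data flows one way exactly as in the diagram and several items are in flight at once, one per stage.

    #include <mpi.h>
    #include <stdio.h>

    #define NITEMS 8

    static double stage_work(double x, int stage)   /* stand-in for a real stage */
    {
        return x + stage;                            /* e.g. each stage adds its id */
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int item = 0; item < NITEMS; item++) {
            double x;
            if (rank == 0) {
                x = 10.0 * item;                     /* first stage: produce the data */
            } else {
                MPI_Recv(&x, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);         /* receive from the previous stage */
            }
            x = stage_work(x, rank);                 /* this stage's processing */
            if (rank < size - 1) {
                MPI_Send(&x, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
            } else {
                printf("result for item %d: %g\n", item, x);   /* last stage: output */
            }
        }

        MPI_Finalize();
        return 0;
    }

Run on 4 processes this reproduces the behaviour in the diagram above: while item 1 is in the last stage, items 2, 3 and 4 occupy the earlier ones.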

Examples of pipelines. CPU architectures: fetch, decode, execute, write back; the Intel Pentium 4 had a 20-stage pipeline. The Unix shell, e.g. cat datafile | grep energy | awk '{print $2, $3}'. The graphics/GPU pipeline. A generalisation of the pipeline (a workflow, or dataflow) is becoming more and more relevant to large, distributed scientific workflows. The pipeline can also be combined with other decompositions.

4. Loop Parallelism. Serial scientific applications are often dominated by computationally intensive loops. Some of these can be parallelised directly, e.g. 10 cores simultaneously perform 1000 iterations each instead of 1 core performing 10 000 iterations. Simple techniques exist to do this incrementally, i.e. in small steps whilst maintaining a working code. This makes the decomposition very easy to implement, and often a large restructuring of the code is not required. It tends to work best with small-scale parallelism, and is not suited to all architectures or to all loops.
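In shared-memory codes this is typically done with OpenMP: a single directive on the loop shares its iterations among the available threads, leaving the rest of the code untouched. The C sketch below is illustrative (the array update and the reduction are made-up work, not from the course); with 10 threads each would perform roughly 1000 of the 10 000 iterations. Compile with -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    #define N 10000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)              /* set up some input data */
            b[i] = (double)i;

        /* The directive divides the N iterations among the threads; the
           reduction clause combines each thread's partial sum safely. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i] + 1.0;             /* independent iterations */
            sum += a[i];
        }

        printf("sum = %g (using up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }

Removing the directive recovers the original serial loop, which is what makes the incremental, always-working approach possible.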

Summary. A variety of common decomposition patterns exist that provide well-known approaches to parallelising a serial problem; you can see examples of some of these during the practical sessions. There are many considerations when parallelising code: the granularity of the decomposition, the tradeoff between communication and computation, and load imbalance. Parallel applications implement clever variations on these decomposition schemes to optimise parallel performance; knowing some of this will help you understand what is going on.