Fractals exercise. Investigating task farms and load imbalance

Reusing this material
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: you must give appropriate credit, provide a link to the license and indicate if changes were made; and if you adapt or build on the material you must distribute your work under the same license as the original.
Note that this presentation contains images owned by others. Please seek their permission before reusing these images.

Aims
- Explore how the granularity of tasks impacts performance:
  - there is a trade-off between the amount of parallelism (number of parallel tasks) and the amount of communication (size of tasks)
- Consider issues surrounding load balance:
  - remember the runtime of the code is determined by the slowest-running task, so we want work to be distributed as evenly as possible
- The exercise introduces a Load Imbalance Factor (LIF), which illustrates how much faster your code could run if the load were evenly distributed.

What are fractals? Ideas behind the Mandelbrot and Julia sets

The Mandelbrot Set
The Mandelbrot set is the set of complex numbers C for which repeated iteration of the complex (i = √-1) function
Z_{n+1} = Z_n^2 + C, with the initial condition Z_0 = 0,
remains bounded. That is, C = x_0 + i y_0 belongs to the Mandelbrot set if Z_n does not diverge.
Writing Z_n = x_n + i y_n gives
Z_n^2 = (x_n^2 - y_n^2) + 2 i x_n y_n and |Z_n|^2 = x_n^2 + y_n^2.

The Mandelbrot Set cont.
Separating out the real and imaginary parts, Z_n = Z_n^r + i Z_n^i, gives:
Z_n^r = x_{n-1}^2 - y_{n-1}^2 + x_0
Z_n^i = 2 x_{n-1} y_{n-1} + y_0
Take the threshold value as |Z|^2 >= 4.0.
Set a maximum number of iterations, N_max, and assume that if Z has not diverged after N_max iterations it never will.
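
As a concrete illustration, here is a minimal sketch of this iteration for a single grid point in C. It is not the supplied source; the function name and return convention are assumptions made for the example.

```c
/* Iterate Z_{n+1} = Z_n^2 + C at the point C = x0 + i*y0, starting
 * from Z_0 = 0, and return the number of iterations taken to exceed
 * the threshold |Z|^2 >= 4.0, or nmax if the point never diverges. */
int mandelbrot_iterations(double x0, double y0, int nmax)
{
    double x = 0.0, y = 0.0;        /* Z_0 = 0 */
    for (int n = 0; n < nmax; n++) {
        double xn = x*x - y*y + x0; /* real part:      x_{n-1}^2 - y_{n-1}^2 + x_0 */
        double yn = 2.0*x*y + y0;   /* imaginary part: 2 x_{n-1} y_{n-1} + y_0     */
        x = xn;
        y = yn;
        if (x*x + y*y >= 4.0)       /* test |Z_n|^2 against the threshold */
            return n + 1;           /* diverged after n+1 iterations */
    }
    return nmax;                    /* treated as bounded: N = N_max */
}
```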

The Julia Set
Similar algorithm to the Mandelbrot set. Recall that for the Mandelbrot set
Z_{n+1} = Z_n^2 + C, with Z_0 = 0 and C = x_0 + i y_0.
For a Julia set the roles are swapped: C is a fixed parameter and the grid point provides the starting value,
Z_{n+1} = Z_n^2 + C, with Z_0 = x_0 + i y_0.
There are an infinite number of Julia sets, parameterised by the complex number C, for example C = 0.8 + i 0.156.
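
In code, the only difference from the Mandelbrot sketch above is the initialisation: the grid point becomes Z_0 and C is held fixed. A short sketch, using the example parameter quoted on the slide (names are again illustrative):

```c
/* Julia set variant: Z_0 = x0 + i*y0 is the grid point; C is fixed. */
int julia_iterations(double x0, double y0, int nmax)
{
    const double cr = 0.8, ci = 0.156; /* example parameter C from the slide */
    double x = x0, y = y0;             /* Z_0 = x_0 + i*y_0 */
    for (int n = 0; n < nmax; n++) {
        double xn = x*x - y*y + cr;
        double yn = 2.0*x*y + ci;
        x = xn;
        y = yn;
        if (x*x + y*y >= 4.0)
            return n + 1;
    }
    return nmax;
}
```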

Visualisation
To visualise a Mandelbrot/Julia set:
- Represent the complex plane as a 2D grid, where complex numbers correspond to points (x, y) on the grid.
- For each point on the grid, calculate the number of iterations N for the series to diverge (i.e. exceed the threshold).
- If it does not diverge, set N = N_max.
- Convert the value of N to a colour and plot it on the grid.
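
Putting the pieces together, the serial sweep over the grid might look like the following sketch. The grid size, iteration limit and coordinate bounds are illustrative assumptions, not values from the supplied code; it reuses mandelbrot_iterations() from the earlier sketch.

```c
#define NX   768            /* illustrative grid resolution */
#define NY   768
#define NMAX 5000           /* maximum iteration count N_max */

int mandelbrot_iterations(double x0, double y0, int nmax); /* from the sketch above */

/* Map every pixel (i, j) to a point in [xmin,xmax] x [ymin,ymax] of the
 * complex plane and store its iteration count N; a later pass converts
 * each N to a colour for plotting. */
void compute_grid(int image[NX][NY],
                  double xmin, double xmax, double ymin, double ymax)
{
    for (int i = 0; i < NX; i++) {
        for (int j = 0; j < NY; j++) {
            double x0 = xmin + (xmax - xmin) * i / (NX - 1);
            double y0 = ymin + (ymax - ymin) * j / (NY - 1);
            image[i][j] = mandelbrot_iterations(x0, y0, NMAX);
        }
    }
}
```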

Mandelbrot Set
[Image: the Mandelbrot set plotted on the complex plane, annotated to contrast regions that are very slow to compute with regions that are very quick to compute.]

Parallel implementation How do we parallelise computation of these fractals?

Parallelisation
Values at each coordinate depend only on the previous values at that coordinate:
- decompose the 2D grid into equally sized blocks
- no communication between blocks is needed
We don't know in advance how much work each block needs:
- the number of iterations varies across the blocks
- so work is dynamically assigned to workers as they become available
Implementation
- Split the grid into blocks: each block corresponds to a task.
- The master process hands out tasks to worker processes.
- Workers return completed tasks to the master.
A minimal MPI sketch of this master/worker exchange is given below.
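
The sketch below illustrates the task-farm structure in MPI (C). It is not the supplied source: the message tags, the integer task encoding and do_task() are assumptions made for the illustration.

```c
#include <mpi.h>

#define TAG_TASK 1   /* message carries a task id             */
#define TAG_DONE 2   /* message carries a completed result    */
#define TAG_STOP 3   /* no more work: worker should shut down */

/* Stand-in for computing one block of the fractal grid. */
static int do_task(int taskid) { return taskid; }

int main(int argc, char *argv[])
{
    int rank, size, ntasks = 16;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* master */
        int next = 0, result;

        /* prime each worker with one initial task */
        for (int w = 1; w < size && next < ntasks; w++, next++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_TASK, MPI_COMM_WORLD);

        /* tell any workers that never got a task to stop */
        for (int w = ntasks + 1; w < size; w++)
            MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);

        /* collect results; reassign each freed worker immediately */
        for (int done = 0; done < ntasks; done++) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &status);
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE,
                         TAG_TASK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, status.MPI_SOURCE,
                         TAG_STOP, MPI_COMM_WORLD);
            }
        }
    } else {                                       /* worker */
        int taskid, result;
        for (;;) {
            MPI_Recv(&taskid, 1, MPI_INT, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP)
                break;                             /* no more work */
            result = do_task(taskid);
            MPI_Send(&result, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Because each freed worker is reassigned immediately, a worker that happens to get quick tasks simply drains more of the pool, which is exactly what improves the load balance.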

Example: parallelisation on 4 CPUs
CPU 1 is the master; CPUs 2, 3 and 4 are the workers.
[Diagram: the 2D grid divided into numbered tasks handed out by the master. Colour represents which worker did each task, the number gives the task id, and tasks scan from left to right, moving upwards.]

Parallelisation cont.
Example: task farm run on 5 CPUs, i.e. 1 master and 4 workers, with a total of 16 tasks. In the supplied code shading represents the worker; here the worker id has been added as a number by hand:
4 4 4 4
3 3 1 3
1 4 2 1
1 2 3 4

Notes about the exercise

Exercise
- You are supplied with source code etc.
- Compile and run it on the machine.
- Visualise the results.
- Quantify the performance results.
- For a fixed number of workers:
  - improve the load balance by increasing the number of tasks (i.e. decreasing their size)
  - compute the LIF to estimate the minimum achievable runtime
  - is this minimum ever reached?

Exercise outcomes What do the timings tell us about HPC machines?

Example results (fixed number of workers)

Results cont.

16 workers and 16 tasks

-----Workload Summary (number of iterations)-----
Total Number of Workers: 16
Total Number of Tasks:   16
Total Worker Load:   498023053
Average Worker Load:  31126440
Maximum Worker Load: 156694685
Minimum Worker Load:     62822

Time taken by 16 workers was 1.929219 (secs)
Load Imbalance Factor: 5.034134

16 workers and 64 tasks

-----Workload Summary (number of iterations)-----
Total Number of Workers: 16
Total Number of Tasks:   64
Total Worker Load:   498023053
Average Worker Load:  31126440
Maximum Worker Load:  46743511
Minimum Worker Load:  10968369

Time taken by 16 workers was 0.586923 (secs)
Load Imbalance Factor: 1.501730
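
Note that the reported LIF in each run is exactly the maximum worker load divided by the average worker load, i.e. how much more work the busiest worker did than it would under perfect balance:

LIF = maximum worker load / average worker load

16 tasks: LIF = 156694685 / 31126440 ≈ 5.03
64 tasks: LIF =  46743511 / 31126440 ≈ 1.50

Dividing the measured runtime by the LIF gives the estimate of the minimum achievable runtime the exercise asks for: roughly 1.929 / 5.03 ≈ 0.38 secs for 16 tasks and 0.587 / 1.50 ≈ 0.39 secs for 64 tasks, so the 64-task run is already much closer to its balance limit.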

Key points to take away
TASK FARMS
- Also known as the master/worker pattern.
- Allow a master process to distribute work to a set of worker processes.
- Can be used for other types of task, but this complicates the situation and other patterns may be more suitable.
- The master process is responsible for creating, distributing and gathering the individual jobs.
- Load balance can be improved by using more tasks than workers, at the cost of some overhead.
- Load imbalance adversely affects performance, especially as the number of processors increases.

Key points to take away
TASKS
- Units of work.
- Vary in size and do not have to have consistent execution times.
- If execution times are known in advance, that knowledge can help with load balancing.
QUEUES
- The master generates a pool of tasks and puts them in a queue.
- Workers are assigned a task from the queue whenever they become idle.

Key points to take away
LOAD BALANCING
- How a system distributes work or tasks across workers (processes or threads).
- Successful load balancing avoids both idle processes and overloaded cores.
- Poor load balancing leads to under-utilised cores, reducing performance.

Key points to take away
COST
- Increasingly important: finite budgets require optimal use of the resources requested.
- Load balancing is just one way of ensuring optimal usage and avoiding wasted resources.
- More power and more resources do not necessarily mean improved performance.
- Always ask: is it necessary to run this on 4000 cores, or could it be run more efficiently on 2000?