Parallel Programming Presentation to Linux Users of Victoria, Inc. November 4th, 2015 http://levlafayette.com

1.0 What Is Parallel Programming?
1.1 Historically, software has been written for serial computation (discrete instructions, sequential execution, single processor).
1.2 Parallel programming is the set of procedures used for simultaneous computation (discrete parts, concurrent execution, multiple processors).
1.3 Flynn's Taxonomy classifies computer architectures by their streams of data and instructions (SISD, SIMD, MISD, MIMD). The latter three architectures allow for parallel programming.

2.0 Why Do We Need Parallel Programming?
2.1 The size of datasets is growing faster than the processing capability of unicore systems.
2.2 Modelling most real-world phenomena is a parallel task (weather and climate, astronomy, medicine, geological change, economics, robotics).
2.3 The heat-to-performance ratio limits unicore processing speeds, leading to cost issues.
2.4 Parallelisation allows tasks to be distributed across a wider network of non-local systems.

3.0 How Is Parallel Programming Made Possible in Hardware and Architecture?
3.1 In individual system units, hardware parallelisation is implemented in multiprocessor and multicore architectures.
3.2 In a tightly coupled distributed system (e.g., cluster computing) the system can be considered as a single unit.
3.3 In a loosely coupled distributed system (e.g., grid computing) the system is a network of independent units.
3.4 In a virtualised distributed system (e.g., cloud computing) the system is an on-demand network of independent units.

4.0 How Is Parallel Programming Implemented in Software?
4.1 Automatic parallelisation of a sequential program by a language's internal logic would be very beneficial and is much desired. Alas, it remains very elusive.
4.2 There is a very large variety of concurrent programming languages, libraries, APIs, etc., which implement parallelisation in a variety of ways.
4.3 The most common implementations are multithreaded programming in a shared-memory environment (e.g., OpenMP) and message passing in a distributed-memory system (e.g., OpenMPI). The former uses a fork-join procedure (see the sketch below); the latter builds a communications world.
4.4 A sequential program can be run in a batch system that invokes multiple instances simultaneously over different data sets ("data parallelism" versus "task parallelism").
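To make the fork-join model in 4.3 concrete, here is a minimal OpenMP "hello world" sketch in C (an illustrative sketch, not a file from the presentation): the master thread forks a team of threads at the parallel region and joins them again at its end.

    /* Minimal OpenMP fork-join sketch (illustrative; not the presenter's file).
       Compile with, e.g.: gcc -fopenmp hello_omp.c -o hello_omp */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("Before the parallel region: one thread\n");

        #pragma omp parallel              /* fork: a team of threads begins here */
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                 /* join: the team ends here */

        printf("After the parallel region: one thread again\n");
        return 0;
    }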

5.0 Embarrassing Pleasures
5.1 A computational problem that is relatively simple to separate into independent parallel tasks is "embarrassingly parallel" (or, more optimistically, "pleasingly parallel"). Usually the data or tasks can be processed independently through simple domain decomposition.
5.2 Example #1: Generation of random numbers in Octave using PBS job arrays.
5.3 Example #2: OpenMP "hello world" thread examples using C and Fortran.
5.4 Example #3: OpenMPI "hello world" task examples using C and Fortran (a generic C version is sketched below).
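For comparison with Example #3, a generic MPI "hello world" in C is sketched below (the presenter's actual example files are not reproduced here): each task reports its rank within the communications world.

    /* Generic MPI "hello world" sketch in C (illustrative; not the presenter's file).
       Compile and run with, e.g.: mpicc hello_mpi.c -o hello_mpi && mpiexec -n 4 ./hello_mpi */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                  /* build the communications world */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this task's identifier */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of tasks */
        printf("Hello world from task %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }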

6.0 Race Conditions, Locks, Profiling
6.1 General rule: programming is hard; parallel programming is really hard. Always start with a working serial program and then determine which parts can become parallel.
6.2 Less pleasingly parallel ("more complex") problems require communication between tasks or changes to a shared dataset. This can lead to race conditions (illustrated in the sketch below).
6.3 To avoid race conditions, barriers and locks can be put into place. These can slow down the program (barriers), lead to deadlocks through a mutual exclusion problem (e.g., the apocryphal Kansas railway statute), or lead to livelocks through a circular wait condition (the polite-people-in-a-corridor problem).
6.4 Example #4: Game theory example in C and Fortran, with MPI_Wait routines.
6.5 Profiling parallel programs needs to consider load balancing between concurrent tasks (which increases fine-grained complexity), reducing I/O (much worse than memory operations), and using profilers and debuggers (e.g., PDT/TAU, Valgrind, GDB).
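As an illustration of 6.2 and 6.3 (a minimal sketch, not one of the presentation's examples): the first loop below updates a shared counter with no synchronisation and so contains a race condition; the second uses an atomic update, a lightweight form of lock, to make the result deterministic.

    /* Race condition and lock-style fix with OpenMP (illustrative sketch).
       Compile with, e.g.: gcc -fopenmp race.c -o race */
    #include <stdio.h>

    int main(void)
    {
        const int N = 1000000;
        long unsafe = 0, safe = 0;

        /* Race condition: threads update the shared counter unsynchronised,
           so increments can be lost and the result varies from run to run. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            unsafe++;

        /* Fix: the atomic directive serialises each update (a critical
           section would also work, at a higher cost). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            #pragma omp atomic
            safe++;
        }

        printf("expected %d, unsafe %ld, safe %ld\n", N, unsafe, safe);
        return 0;
    }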

7.0 Parallel Computation Limits and Solutions
7.1 The benefit from parallelisation can be computed as a ratio: Speedup(N) = Time(serial) / Time(parallel).
7.2 However, parallel computation must include some serial component, and this component sets a limit on the advantage gained from parallelisation, i.e. Amdahl's Law: S(N) = 1 / ((1 - P) + P/N), where P is the proportion of the work that can be parallelised and N is the number of cores (worked through below).

Cores   Mallacoota   Sydney     Total Time
1       8 hours      +8 hours   16 hours
2       8 hours      +4 hours   12 hours
4       8 hours      +2 hours   10 hours
8       8 hours      +1 hour    9 hours
...     ...          ...        ...
Inf     8 hours      +nil       8 hours

7.3 The Gustafson-Barsis solution to Amdahl's Law is effectively to reduce the serial proportion by making the parallel tasks larger. Why stop at Sydney?
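As a check on the table above (an added worked example, not from the slides): the trip has 8 serial hours and 8 parallelisable hours, so P = 0.5 of the single-core time, and Amdahl's Law reproduces the two-core row:

\[ S(N) = \frac{1}{(1-P) + P/N}, \qquad P = \frac{8}{16} = 0.5 \]
\[ S(2) = \frac{1}{0.5 + 0.25} = \frac{4}{3}, \qquad \frac{16\ \text{hours}}{4/3} = 12\ \text{hours} \]
\[ \lim_{N \to \infty} S(N) = \frac{1}{1-P} = 2, \qquad \text{so the total time never falls below 8 hours.} \]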

8.0 Further Examples of Parallel Computation
8.1 Example #5: R with the SNOW library. R is usually a single-core application; however, using the Rmpi and SNOW libraries, MPI functionality can be provided. The R script creates a random data set and samples, fits the model to the samples, and finally computes the mean squared difference, with a test to ensure that the results are equal.
8.2 Example #6: Octave with parallel and MPI examples. Like R, Octave is by default a unicore application but can be extended with packages to provide multicore functionality, specifically the parallel and MPI packages. These two examples create a vector from a basic function and calculate Pi.
8.3 Example #7: Python with multiprocessing package examples. Three examples (courtesy of Praetorian) generate N random integers and calculate their sum. The first, timed, example has no multicore functionality; the second and third use subprocesses from Python, with the latter making use of a pool (a C analogue is sketched below).
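The R, Octave, and Python sources for Examples #5-#7 are not reproduced here. As a rough C analogue of Example #7 (an assumption about its shape, not the presenter's code), the sketch below generates N random integers and sums them in parallel, using an OpenMP reduction in place of Python's multiprocessing pool.

    /* Parallel sum of random integers (illustrative C analogue of Example #7).
       Compile with, e.g.: gcc -fopenmp parallel_sum.c -o parallel_sum
       (rand_r is POSIX and available on Linux systems.) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const int N = 10000000;
        long long total = 0;

        /* Each thread sums its own share of the iterations; the reduction
           clause combines the partial sums without a race condition. */
        #pragma omp parallel reduction(+:total)
        {
            unsigned int seed = 42u + omp_get_thread_num();   /* per-thread seed */
            #pragma omp for
            for (int i = 0; i < N; i++)
                total += rand_r(&seed) % 100;                 /* random integer 0..99 */
        }

        printf("Sum of %d random integers: %lld\n", N, total);
        return 0;
    }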