An Introduction to Parallel Programming

An Introduction to Parallel Programming. Ing. Andrea Marongiu (a.marongiu@unibo.it). Includes slides from the Multicore Programming Primer course at the Massachusetts Institute of Technology (MIT) by Prof. Saman Amarasinghe and Dr. Rodric Rabbah, and from the Parallel Programming for Multicore course at UC Berkeley by Prof. Kathy Yelick.

Technology Trends: Microprocessor Capacity. Moore's Law: 2X transistors per chip every 1.5 years. Microprocessors have become smaller, denser, and more powerful. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. (Slide source: Jack Dongarra)

How to make effective use of transistors? Make the processor go faster: give it a faster clock (more operations per second), or give the processor more ability, e.g., exploit instruction-level parallelism. But it gets very expensive to keep doing this.

How to make effective use of transistors? More instruction-level parallelism is hard to find: very complex designs are needed, which take longer to design and debug, for small gains. Clock frequency scaling is slowing drastically: pushing the envelope produces too much power and heat, and signals cannot cross the chip fast enough, so it is better to design small local units with short paths. It is also difficult and very expensive for memory speed to keep up.

How to make effective use of transistors? INTRODUCE MULTIPLE PROCESSORS. Advantages: effective use of billions of transistors; easier to reuse a basic unit many times; potential for very easy scaling (just keep adding cores for higher peak performance); each individual processor can be less powerful, which means it is cheaper to buy and run (less power).

How to make effective use of transistors? INTRODUCE MULTIPLE PROCESSORS. Disadvantages: parallelization is not a limitless path to infinite performance! Algorithms and computer hardware place limits on performance. With one task and many processors, we need to think about how to share the task among them, we need to coordinate carefully, and we need a new way of writing our programs.

Vocabulary in the multi-era. AMP (Asymmetric MP): each processor has local memory; tasks are statically allocated to one processor. SMP (Symmetric MP): processors share memory; tasks are dynamically scheduled to any processor.

Vocabulary in the multi-era. Heterogeneous: specialization among processors, often with different instruction sets; usually an AMP design. Homogeneous: all processors have the same instruction set and can run any task; usually an SMP design.

Modern many-cores: locally homogeneous, globally heterogeneous.

Multicore design paradigm. Two primary patterns of multicore architecture design:
Shared memory - one copy of data shared among many cores; atomicity, locking and synchronization are essential for correctness; many scalability issues.
Distributed memory - cores primarily access local memory; explicit data exchange between cores; data distribution and communication orchestration are essential for performance.

Shared Memory Programming. Processors 1..n ask for X, and there is only one place to look. Communication happens through shared variables. Race conditions are possible, so use synchronization to protect against conflicts, and change how data is stored to minimize synchronization.

Example: data parallelism. Perform the same computation but operate on different data. A single process can fork multiple concurrent threads; each thread encapsulates its own execution path and has local state plus shared resources; threads communicate through shared resources such as global memory.

Pthreads

Types of Parallelism. Data parallelism: perform the same computation but operate on different data. Task (control) parallelism: perform different functions. With Pthreads, a new thread is spawned with pthread_create(/* thread id */, /* attributes */, /* function to run */, /* args to function */); a runnable sketch follows.
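To make the pthread_create skeleton above concrete, here is a minimal runnable data-parallel sketch (not from the original slides; the chunk struct and scale_chunk names are made up): each thread performs the same computation on a different portion of a shared array. Compile with something like gcc -pthread.

/* Minimal data-parallel Pthreads sketch: NTHREADS threads each scale their
 * own chunk of a shared array. Illustrative example, not from the slides. */
#include <pthread.h>
#include <stdio.h>

#define N        1024
#define NTHREADS 4

static double A[N];

struct chunk { int begin; int end; };           /* made-up helper type */

static void *scale_chunk(void *arg)
{
    struct chunk *c = arg;
    for (int i = c->begin; i < c->end; i++)
        A[i] *= 2.0;                            /* same computation, different data */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct chunk parts[NTHREADS];

    for (int i = 0; i < N; i++) A[i] = i;

    for (int t = 0; t < NTHREADS; t++) {
        parts[t].begin = t * (N / NTHREADS);
        parts[t].end   = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, scale_chunk, &parts[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);             /* wait for all threads */

    printf("A[10] = %f\n", A[10]);              /* expect 20.0 */
    return 0;
}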

Distributed Memory Programming. Processors 1..n ask for X, and there are n places to look: each processor's memory has its own X, and the Xs may vary. For Processor 1 to look at Processor 2's X: Processor 1 has to request X from Processor 2; Processor 2 sends a copy of its own X to Processor 1; Processor 1 receives the copy and stores it in its own memory.

Message Passing. Architectures with distributed memories use explicit communication to exchange data. Data exchange requires synchronization (cooperation) between senders and receivers.

Example (a step-by-step message exchange, shown as figures on the original slides).
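The example slides above were figures; a minimal message-passing sketch in the same spirit, written here with MPI (an illustrative choice, the slides do not prescribe a specific library), has rank 0 send its copy of x and rank 1 receive it into its own memory:

/* Minimal message-passing sketch using MPI (an assumed library choice).
 * Rank 0 owns x and sends a copy; rank 1 receives it into its own memory. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double x;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 3.14;                                   /* Processor 0's own X */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received x = %f\n", x);      /* copy stored locally */
    }

    MPI_Finalize();
    return 0;
}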

Understanding Performance. What factors affect the performance of parallel programs? Coverage, or the extent of parallelism in the algorithm; granularity of partitioning among processors; locality of computation and communication.

Coverage. Not all programs are embarrassingly parallel: programs have sequential parts and parallel parts.

Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Amdahl's Law Potential program speedup is defined by the fraction of code that can be parallelized

Amdahl's Law Speedup = old running time / new running time = 100 seconds / 60 seconds = 1.67 (parallel version is 1.67 times faster)

Amdahl's Law. If a fraction p of a program can be parallelized and the rest is serial, the speedup on n processors is S(n) = 1 / ((1 - p) + p/n).

Amdahl's Law. Suppose that 75% of a program can be parallelized. What is the speedup that could be achieved with 256 processors? S(256) = 1 / (0.25 + 0.75/256) ≈ 3.95. And for 65536 processors? S(65536) = 1 / (0.25 + 0.75/65536) ≈ 4.00. Even with 65536 processors, the speedup barely improves beyond 3.95!

Amdahl's Law. Suppose that n = 32. Which p is required for S = 12.55? Solving 12.55 = 1 / ((1 - p) + p/32) gives p ≈ 0.95: 95% of the program must be parallelizable!
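To check these numbers, here is a tiny sketch (not part of the original slides) that evaluates the formula from the previous slides:

/* Evaluate Amdahl's Law S(n) = 1 / ((1 - p) + p/n) for the cases above.
 * Small illustrative sketch, not part of the original slides. */
#include <stdio.h>

static double amdahl(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    printf("p = 0.75, n = 256   -> S = %.2f\n", amdahl(0.75, 256));     /* ~3.95 */
    printf("p = 0.75, n = 65536 -> S = %.2f\n", amdahl(0.75, 65536));   /* ~4.00 */
    printf("p = 0.95, n = 32    -> S = %.2f\n", amdahl(0.95, 32));      /* ~12.55 */
    return 0;
}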

Amdahl's Law. If the portion of the program that can be parallelized is small, then the speedup is limited: the non-parallel portion limits the performance.

Amdahl's Law. Speedup tends to 1/(1 - p) as the number of processors tends to infinity. Parallel programming is worthwhile when programs have a lot of work that is parallel in nature.

Overhead of parallelism. Given enough parallel work, this is the biggest barrier to getting the desired speedup. Parallelism overheads include: the cost of starting a thread or process, the cost of communicating shared data, the cost of synchronizing, and extra (redundant) computation. Each of these can take a very large amount of time if a parallel program is not carefully designed. Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.

Understanding Performance. Coverage or extent of parallelism in the algorithm; granularity of partitioning among processors; locality of computation and communication.

Granularity. Granularity is a qualitative measure of the ratio of computation to communication. Computation stages are typically separated from periods of communication by synchronization events.

Granularity. Fine-grain parallelism: low computation to communication ratio; small amounts of computational work between communication stages; less opportunity for performance enhancement; high communication overhead. Coarse-grain parallelism: high computation to communication ratio; large amounts of computational work between communication events; more opportunity for performance increase; harder to load balance efficiently.

Load Balancing. Processors that finish early have to wait for the processor with the largest amount of work to complete. This leads to idle time and lowers utilization, and it is particularly urgent with barrier synchronization: the slowest core dictates the overall execution time.

Static Load Balancing. The programmer makes decisions and assigns a fixed amount of work to each processing core a priori. This works well for homogeneous multicores: all cores are the same and each core has an equal amount of work. It works less well for heterogeneous multicores: some cores may be faster than others, so the work distribution is uneven.

Dynamic Load Balancing. When one core finishes its allocated work, it takes on work from the core with the heaviest workload. Ideal for codes where work is uneven, and for heterogeneous multicores.
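A common way to get dynamic load balancing on a shared-memory machine is a shared work queue from which idle threads grab the next task. The sketch below is an illustrative assumption, not taken from the slides (NTASKS, process_task and the other names are made up): faster threads simply end up taking more tasks.

/* Dynamic load balancing sketch: threads repeatedly grab the next task index
 * from a mutex-protected shared counter. Illustrative only. */
#include <pthread.h>
#include <stdio.h>

#define NTASKS   64
#define NTHREADS 4

static int next_task = 0;                       /* shared work-queue cursor */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static double results[NTASKS];

static void process_task(int t)                 /* uneven amount of work per task */
{
    double acc = 0.0;
    for (long i = 0; i < 100000L * (t % 7 + 1); i++)
        acc += i * 1e-9;
    results[t] = acc;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        int t = next_task++;                    /* grab the next unclaimed task */
        pthread_mutex_unlock(&queue_lock);
        if (t >= NTASKS)
            break;
        process_task(t);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("done, results[0] = %f\n", results[0]);
    return 0;
}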

Communication and Synchronization. In parallel programming, processors need to communicate partial results or synchronize for correct processing. In shared memory systems, communication takes place implicitly by concurrently operating on shared variables, while synchronization primitives must be explicitly inserted in the code. In distributed memory systems, communication primitives (send/receive) must be explicitly inserted in the code, while synchronization is implicitly achieved through the message exchange.
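For the shared-memory case, here is a minimal sketch of an explicitly inserted synchronization primitive (a Pthreads mutex protecting a shared accumulator; the variable names are illustrative, not from the slides):

/* Shared-memory synchronization sketch: without the mutex, the concurrent
 * updates to `total` would race; with it, the final value is deterministic. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long total = 0;                          /* shared variable */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_up(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&total_lock);        /* explicit synchronization */
        total += 1;
        pthread_mutex_unlock(&total_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, add_up, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("total = %ld (expected %d)\n", total, NTHREADS * NITERS);
    return 0;
}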

Communication Cost Model

Types of Communication. Cores exchange data or control messages (for example, via DMA). Control messages are often short; data messages are relatively much larger.

Overlapping messages and computation. Computation and communication concurrency can be achieved with pipelining; think of instruction pipelining in superscalar processors. Without overlap, either the CPU sits idle waiting for data or the memory/communication engine sits idle.

Overlapping messages and computation. Computation and communication concurrency can be achieved with pipelining; think of instruction pipelining in superscalar processors. This is essential for performance on Cell and similar distributed memory multicores.
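A common pattern for this overlap is double buffering with nonblocking communication: start the transfer of the next block, compute on the current one, then wait. Below is a minimal sketch using nonblocking MPI calls (an illustrative assumption; the Cell uses DMA rather than MPI, and compute_block is a made-up stand-in for real work).

/* Double-buffering sketch: rank 1 overlaps the receive of block i+1 with the
 * computation on block i. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define BLOCK   1024
#define NBLOCKS 8

static void compute_block(double *buf, int n)   /* stand-in for real work */
{
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                            /* producer: send all blocks */
        double block[BLOCK] = {0};
        for (int b = 0; b < NBLOCKS; b++)
            MPI_Send(block, BLOCK, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                     /* consumer: double buffering */
        double buf[2][BLOCK];
        MPI_Request req;

        /* Prefetch the first block. */
        MPI_Irecv(buf[0], BLOCK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        for (int b = 0; b < NBLOCKS; b++) {
            int cur = b % 2, nxt = (b + 1) % 2;

            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* current block has arrived */
            if (b + 1 < NBLOCKS)                /* start the next transfer... */
                MPI_Irecv(buf[nxt], BLOCK, MPI_DOUBLE, 0, 0,
                          MPI_COMM_WORLD, &req);
            compute_block(buf[cur], BLOCK);     /* ...while computing on this one */
        }
        printf("rank 1 processed %d blocks\n", NBLOCKS);
    }

    MPI_Finalize();
    return 0;
}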

Understanding Performance. Coverage or extent of parallelism in the algorithm; granularity of partitioning among processors; locality of computation and communication.

Locality in communication (message passing)

Locality of memory accesses (shared memory)

Locality of memory accesses (shared memory)

Memory access latency in shared memory architectures. Uniform Memory Access (UMA): centrally located memory; all processors are equidistant (equal access times). Non-Uniform Memory Access (NUMA): memory is physically partitioned but accessible by all; processors share the same address space; placement of data affects performance.

Distributed Shared Memory, a.k.a. Partitioned Global Address Space (PGAS). Each processor has a local memory node that is globally visible to every processor in the system. Local accesses are fast, remote accesses are slow (NUMA).

Distributed Shared Memory. Concurrent threads operate on a partitioned shared space, similar to the shared memory model. Memory partition Mi has affinity to thread Thi, which helps exploit locality. (The original slide's figure legend distinguished simple statements, as in shared memory, from synchronization.)

Summary. Coverage or extent of parallelism in the algorithm; granularity of partitioning among processors; locality of computation and communication. So how do I parallelize my program?

4 common steps to creating a parallel program: decomposition, assignment, orchestration, and mapping.

Decomposition. Identify concurrency and decide at what level to exploit it. Break up the computation into tasks to be divided among processors; tasks may become available dynamically, and the number of tasks may vary with time. There must be enough tasks to keep processors busy; the number of tasks available at a time is an upper bound on the achievable speedup.

Guidelines for TASK decomposition. Algorithms start with a good understanding of the problem being solved. Programs often naturally decompose into tasks; two common decompositions are function calls and distinct loop iterations. It is easier to start with many tasks and later fuse them than to start with too few tasks and later try to split them.

Guidelines for TASK decomposition. Flexibility: the program design should afford flexibility in the number and size of tasks generated; tasks should not be tied to a specific architecture (parameterized tasks rather than fixed tasks). Efficiency: tasks should have enough work to amortize the cost of creating and managing them, and should be sufficiently independent that managing dependencies does not become the bottleneck. Simplicity: the code has to remain readable and easy to understand and debug.

Guidelines for DATA decomposition. Data decomposition is often implied by task decomposition, and programmers need to address both task and data decomposition to create a parallel program. Which decomposition to start with? Data decomposition is a good starting point when the main computation is organized around manipulation of a large data structure and similar operations are applied to different parts of that data structure.

Common DATA decompositions. Array data structures: decomposition of arrays along rows, columns, or blocks (a small row-block sketch follows). Recursive data structures: for example, decomposition of trees into sub-trees.
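As a small illustration of a row-block decomposition (the helper name row_range is made up, not from the slides), each of P workers gets a contiguous band of rows, with any remainder spread over the first few workers:

/* Row-block decomposition sketch: worker p out of P gets rows [begin, end).
 * The first N % P workers get one extra row when N is not a multiple of P. */
#include <stdio.h>

static void row_range(int N, int P, int p, int *begin, int *end)
{
    int base = N / P, extra = N % P;
    *begin = p * base + (p < extra ? p : extra);
    *end   = *begin + base + (p < extra ? 1 : 0);
}

int main(void)
{
    int N = 10, P = 4;
    for (int p = 0; p < P; p++) {
        int b, e;
        row_range(N, P, p, &b, &e);
        printf("worker %d: rows [%d, %d)\n", p, b, e);  /* 3, 3, 2, 2 rows */
    }
    return 0;
}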

Algorithmic Considerations. Determine which tasks can be done in parallel with each other, and which must be done in some order. Conceptualize a decomposition as a task dependency graph: a directed graph with nodes corresponding to tasks and edges indicating dependencies, i.e., that the result of one task is required for processing the next. (The slide's example graph has nodes sqr(A[0]), sqr(A[1]), ..., sqr(A[n]) all feeding into a final sum node.) A given problem may be decomposed into tasks in many different ways, and tasks may be of the same, different, or indeterminate sizes.
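A minimal sketch of that task graph (illustrative, not from the slides): the sqr(A[i]) tasks are independent and run in parallel across threads, while the final sum depends on all of them, a dependency expressed here by pthread_join.

/* Task-dependency sketch for sum of squares: the per-chunk square tasks are
 * independent; the final sum runs only after all of them have joined. */
#include <pthread.h>
#include <stdio.h>

#define N        1000
#define NTHREADS 4

static double A[N];
static double partial[NTHREADS];

static void *square_chunk(void *arg)
{
    long t = (long)arg;
    int begin = t * N / NTHREADS, end = (t + 1) * N / NTHREADS;
    double s = 0.0;
    for (int i = begin; i < end; i++)
        s += A[i] * A[i];                       /* the sqr(A[i]) tasks */
    partial[t] = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < N; i++) A[i] = 1.0;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, square_chunk, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);             /* sum depends on all squares */

    double sum = 0.0;
    for (int t = 0; t < NTHREADS; t++)
        sum += partial[t];
    printf("sum of squares = %f (expected %d)\n", sum, N);
    return 0;
}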

Where's the parallelism? (A sequence of figure-only slides working through where the parallelism lies in an example.)

Further readings (online resource) http://heather.cs.ucdavis.edu/~matloff/158/pln/parprocbook.pdf