Distributed Systems CS 15-440/640


Distributed Systems CS 15-440/640: Programming Models. Borrowed and adapted from our good friends at CMU-Doha, Qatar: Majd F. Sakr, Mohammad Hammoud and Vinay Kolar

Objectives: Discussion on Programming Models, covering MapReduce, why parallelism, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, and the Message Passing Interface (MPI)

Amdahl's Law. We parallelize our programs in order to run them faster. How much faster will a parallel program run? Suppose that the sequential execution of a program takes T1 time units and the parallel execution on p processors takes Tp time units. Suppose also that, out of the entire execution of the program, a fraction s is not parallelizable while the remaining fraction 1-s is parallelizable. Then the speedup given by Amdahl's formula is: speedup = T1 / Tp = 1 / (s + (1-s)/p)

Amdahl's Law: An Example. Suppose that 80% of your program can be parallelized and that you use 4 processors to run your parallel version of the program. The speedup you can get according to Amdahl's formula is: speedup = 1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5. Although you use 4 processors, you cannot get a speedup of more than 2.5 times; equivalently, the parallel running time is at best 40% of the serial running time.
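
As a quick illustration, the short C program below evaluates Amdahl's formula for a few processor counts; the function name amdahl_speedup and the chosen processor counts are our own, not part of the slides.

#include <stdio.h>

/* Amdahl's law: speedup = 1 / (s + (1 - s) / p), where s is the
 * serial (non-parallelizable) fraction and p the number of processors. */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    double s = 0.2;                    /* 80% of the program is parallelizable */
    int procs[] = {1, 2, 4, 8, 16};

    for (int i = 0; i < 5; i++)
        printf("p = %2d -> speedup = %.2f\n", procs[i], amdahl_speedup(s, procs[i]));
    return 0;                          /* p = 4 prints 2.50, matching the example above */
}

Note how the speedup flattens out as p grows: with s = 0.2 it can never exceed 1/s = 5, no matter how many processors are added.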

Ideal Vs. Actual Cases. Amdahl's argument is too simplified to be applied to real cases. When we run a parallel program there are, in general, a communication overhead and a workload imbalance among the processes. (Figures: 1. Parallel speed-up, an ideal case: a serial part of 20 time units is followed by a parallelizable part of 80 time units that is split evenly, 20 units each, across processes 1 through 4. 2. Parallel speed-up, an actual case: the same program, but each process additionally incurs communication overhead and the load is unbalanced across processes.)

Guidelines. In order to benefit efficiently from parallelization, we ought to follow these guidelines: 1. maximize the fraction of our program that can be parallelized; 2. balance the workload of the parallel processes; 3. minimize the time spent on communication.

Objectives: Discussion on Programming Models, covering MapReduce, why parallelism, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, and the Message Passing Interface (MPI)

Parallel Computer Architectures. We can categorize the architecture of parallel computers in terms of two aspects: whether the memory is physically centralized or distributed, and whether or not the address space is shared. Centralized memory with a shared address space gives the SMP (Symmetric Multiprocessor) architecture, also described as UMA (Uniform Memory Access). Distributed memory with a shared address space gives the NUMA (Non-Uniform Memory Access) architecture. Distributed memory with individual address spaces gives the MPP (Massively Parallel Processors) architecture. Centralized memory with individual address spaces does not correspond to a practical architecture (N/A).

Symmetric Multiprocessor. The Symmetric Multiprocessor (SMP) architecture uses shared system resources that can be accessed equally from all processors. (Figure: four processors, each with its own cache, connected through a bus or crossbar switch to shared memory and I/O.) A single OS controls the SMP machine, and it schedules processes and threads on processors so that the load is balanced.

Massively Parallel Processors. The Massively Parallel Processors (MPP) architecture consists of nodes, each with its own processor, memory and I/O subsystem. (Figure: several nodes, each containing a processor, cache, bus, memory and I/O, connected to one another by an interconnection network.) An independent OS runs at each node.

Non-Uniform Memory Access. Non-Uniform Memory Access (NUMA) machines are built on a hardware model similar to MPP. NUMA, however, typically provides a shared address space to applications, using a hardware/software directory-based coherence protocol. The memory latency varies according to whether memory is accessed directly (local) or through the interconnect (remote), hence the name non-uniform memory access. As in an SMP machine, a single OS controls the whole system.

Objectives: Discussion on Programming Models, covering MapReduce, why parallelize our programs, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, and the Message Passing Interface (MPI)

Models of Parallel Programming. What is a parallel programming model? A programming model is an abstraction provided by the hardware to programmers. It determines how easily programmers can express their algorithms as parallel units of computation (i.e., tasks) that the hardware understands, and how efficiently those parallel tasks can be executed on the hardware. Main goal: utilize all the processors of the underlying architecture (e.g., SMP, MPP, NUMA) and minimize the elapsed time of your program.

Traditional Parallel Programming Models. (Figure: the parallel programming models split into two families, Shared Memory and Message Passing.)

Shared Memory Model. In the shared memory programming model, the abstraction is that parallel tasks can access any location of the memory. Parallel tasks can therefore communicate by reading and writing common memory locations. This is similar to threads of a single process, which share a single address space. Multi-threaded programs (e.g., OpenMP programs) are the best fit for the shared memory programming model.

Shared Memory Model (illustration). (Figure, with time on the vertical axis and Si = serial part, Pj = parallel part: on the left, a single thread executes S1, then P1 through P4 one after another, then S2. On the right, a process executes S1, spawns threads that run P1 through P4 concurrently in a shared address space, joins them, and then executes S2.)

Shared Memory Example. Sequential version:

for (i=0; i<8; i++) a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++) if (a[i] > 0) sum = sum + a[i];
print sum;

Parallel version (pseudocode; each of the two threads handles local_iter = 4 iterations):

begin parallel                     // spawn a child thread
  private int start_iter, end_iter, i;
  shared int local_iter = 4;
  shared double sum = 0.0, a[], b[], c[];
  shared lock_type mylock;

  start_iter = getid() * local_iter;
  end_iter = start_iter + local_iter;
  for (i=start_iter; i<end_iter; i++) a[i] = b[i] + c[i];
  barrier;
  for (i=start_iter; i<end_iter; i++)
    if (a[i] > 0) { lock(mylock); sum = sum + a[i]; unlock(mylock); }
  barrier;                         // necessary
end parallel                       // kill the child thread
print sum;
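
For concreteness, here is a rough OpenMP rendering of the same computation. This is our own sketch, not code from the slides; the array size N and the initialization of b and c are assumptions made so the example runs, and the two loops of the slide are fused since each iteration only touches its own a[i]. Compile with a flag such as -fopenmp.

#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N], c[N], sum = 0.0;

    /* assumed initialization, just so the example is self-contained */
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 1.0; }

    /* the threads share a, b, c and sum; iterations are divided among them,
     * and the reduction clause plays the role of the lock()/unlock() pair
     * in the slide's pseudocode */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
        if (a[i] > 0) sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}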

Traditional Parallel Programming Models. (Figure, repeated: the two families of parallel programming models, Shared Memory and Message Passing; the next slides turn to Message Passing.)

Message Passing Model. In the message passing programming model, parallel tasks have their own local memories. One task cannot access another task's memory, so to communicate data the tasks have to rely on explicit messages sent to each other. This is similar to the abstraction of processes, which do not share an address space. MPI programs are the best fit for the message passing programming model.

Message Passing Model (illustration). (Figure, with time on the vertical axis and S = serial part, P = parallel part: on the left, a single thread executes S1, then P1 through P4, then S2. On the right, four processes, Process 0 through Process 3 running on Node 1 through Node 4, each execute S1, one of the parallel parts P1 through P4, and S2, exchanging data by transmission over the network.)

Message Passing Example. Sequential version:

for (i=0; i<8; i++) a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++) if (a[i] > 0) sum = sum + a[i];
print sum;

Parallel version (pseudocode, two processes P0 and P1):

id = getpid();
local_iter = 4;
start_iter = id * local_iter;
end_iter = start_iter + local_iter;

if (id == 0) send_msg(P1, b[4..7], c[4..7]);
else recv_msg(P0, b[4..7], c[4..7]);

for (i=start_iter; i<end_iter; i++) a[i] = b[i] + c[i];

local_sum = 0;
for (i=start_iter; i<end_iter; i++)
  if (a[i] > 0) local_sum = local_sum + a[i];

if (id == 0) {
  recv_msg(P1, &local_sum1);
  sum = local_sum + local_sum1;
  print sum;
} else
  send_msg(P0, local_sum);
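
Expressed in real MPI, the same idea might look roughly like the sketch below. This is our own illustration rather than the slide's code; the array contents are assumed, error checking is omitted, and the final combination uses MPI_Reduce as an idiomatic shortcut for the explicit send/recv of the partial sums. Run with exactly two processes.

#include <stdio.h>
#include <mpi.h>

#define N 8

int main(int argc, char **argv)
{
    int rank, half = N / 2;
    double a[N], b[N], c[N], local_sum = 0.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 1.0; }   /* assumed data */
        /* ship the second half of b and c to process 1 */
        MPI_Send(&b[half], half, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&c[half], half, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&b[half], half, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&c[half], half, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    int start = rank * half, end = start + half;
    for (int i = start; i < end; i++) {
        a[i] = b[i] + c[i];
        if (a[i] > 0) local_sum += a[i];
    }

    /* combine the partial sums on process 0 */
    MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}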

Shared Memory Vs. Message Passing. A comparison between the shared memory and message passing programming models: communication is implicit (via loads and stores) in shared memory but uses explicit messages in message passing; synchronization is explicit in shared memory but implicit (via the messages themselves) in message passing; hardware support is typically required for shared memory but not for message passing; development effort is lower for shared memory and higher for message passing; tuning effort is higher for shared memory and lower for message passing.

Objectives: Discussion on Programming Models, covering MapReduce, why parallelize our programs, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, and the Message Passing Interface (MPI)

SPMD and MPMD. When we run multiple processes with message passing, there is a further categorization regarding how many different programs cooperate in the parallel execution. We distinguish between two models: 1. the Single Program Multiple Data (SPMD) model, and 2. the Multiple Programs Multiple Data (MPMD) model.

SPMD. In the SPMD model there is only one program, and each process uses the same executable, working on different sets of data. (Figure: the same executable, a.out, runs on Node 1, Node 2 and Node 3.)

MPMD. The MPMD model uses different programs for different processes, but the processes collaborate to solve the same problem. MPMD has two styles: master/worker and coupled analysis. (Figures: 1. MPMD, master/worker: a.out runs on one node as the master while b.out runs on the other nodes as workers. 2. MPMD, coupled analysis: a.out, b.out and c.out run on Node 1, Node 2 and Node 3 respectively, where a.out is the structural analysis, b.out the fluid analysis and c.out the thermal analysis.)
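
As a usage note, and assuming an Open MPI or MPICH style launcher, an MPMD job such as the coupled analysis above is typically started with a colon-separated command line, along the lines of: mpirun -np 1 ./a.out : -np 1 ./b.out : -np 1 ./c.out. All three executables then join the same MPI_COMM_WORLD, which is what lets them cooperate on one problem.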

3 Key Points. To summarize, keep the following three points in mind: the purpose of parallelization is to reduce the time spent on computation; ideally, the parallel program is p times faster than the sequential program, where p is the number of processes involved in the parallel execution, but this is not always achievable; and message passing is the tool that consolidates what parallelization has separated, so it should not be regarded as the parallelization itself.

Objectives: Discussion on Programming Models, covering MapReduce, why parallelize our programs, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, and the Message Passing Interface (MPI)

Message Passing Interface. In this part, the following concepts of MPI will be described: basics, point-to-point communication, and collective communication.

What is MPI? The Message Passing Interface (MPI) is a message passing library standard for writing message passing programs. The goal of MPI is to establish a portable, efficient, and flexible standard for message passing. By itself, MPI is NOT a library, but rather the specification of what such a library should be. MPI is not an IEEE or ISO standard, but it has in fact become the industry standard for writing message passing programs on HPC platforms.

Reasons for Using MPI. Standardization: MPI is the only message passing library that can be considered a standard, and it is supported on virtually all HPC platforms. Portability: there is no need to modify your source code when you port your application to a different platform that supports the MPI standard. Performance opportunities: vendor implementations should be able to exploit native hardware features to optimize performance. Functionality: over 115 routines are defined. Availability: a variety of implementations are available, both vendor and public domain.

Programming Model. MPI is an example of a message passing programming model. MPI is now used on just about any common parallel architecture, including MPP, SMP clusters, workstation clusters and heterogeneous networks. With MPI, the programmer is responsible for correctly identifying the parallelism and for implementing parallel algorithms using MPI constructs.

Communicators and Groups. MPI uses objects called communicators and groups to define which collection of processes may communicate with each other to solve a certain problem. Most MPI routines require you to specify a communicator as an argument. The communicator MPI_COMM_WORLD is often used when calling communication subroutines; it is the predefined communicator that includes all of your MPI processes.

Ranks. Within a communicator, every process has its own unique integer identifier, referred to as its rank, assigned by the system when the process initializes. A rank is sometimes called a task ID. Ranks are contiguous and begin at zero. Ranks are used by the programmer to specify the source and destination of messages, and they are often also used conditionally by the application to control program execution (e.g., if rank == 0 do this / if rank == 1 do that).
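
A minimal sketch of how ranks are typically obtained and used for rank-conditional execution (our own illustration, not code from the slides):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank, starting at 0 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    if (rank == 0)
        printf("I am the master among %d processes\n", size);
    else
        printf("I am worker %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}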

Multiple Communicators. It is possible that a problem consists of several sub-problems, each of which can be solved concurrently. This type of application is typically found in the category of MPMD coupled analysis. We can create a new communicator for each sub-problem as a subset of an existing communicator; MPI allows you to achieve that by using MPI_COMM_SPLIT.

Example of Multiple Communicators. Consider a problem with a fluid dynamics part and a structural analysis part, where each part can be computed in parallel. (Figure: MPI_COMM_WORLD contains eight processes with ranks 0 through 7. The first four are grouped into Comm_Fluid, within which they get ranks 0 through 3, and the last four are grouped into Comm_Struct, within which they also get ranks 0 through 3; each process therefore has one rank in MPI_COMM_WORLD and another in its sub-communicator.)
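
To illustrate, here is a minimal sketch using MPI_Comm_split. Following the figure, the first half of the world ranks form the fluid communicator and the second half the structural one; the variable names and the run size of eight processes are our own choices.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank, world_size, sub_rank;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* color 0 = fluid dynamics group, color 1 = structural analysis group;
     * with 8 processes, world ranks 0-3 form Comm_Fluid and 4-7 form Comm_Struct */
    int color = (world_rank < world_size / 2) ? 0 : 1;

    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);

    printf("world rank %d -> %s rank %d\n",
           world_rank, color == 0 ? "Comm_Fluid" : "Comm_Struct", sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}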