
Cloud Computing, CS 15-319. Programming Models - Part I. Lecture 4, Jan 25, 2012. Majd F. Sakr and Mohammad Hammoud.

Today. Last 3 sessions: administrivia and introduction to cloud computing; introduction to cloud computing and the cloud software stack; course project and Amazon AWS. Today's session: Programming Models - Part I. Announcement: the project update is due today.

Objectives. Discussion on programming models: MapReduce; Pregel, Dryad, and GraphLab; why parallelism; parallel computer architectures; traditional models of parallel programming; examples of parallel processing; and the Message Passing Interface (MPI).

Amdahl's Law. We parallelize our programs in order to run them faster. How much faster will a parallel program run? Suppose that the sequential execution of a program takes T1 time units and the parallel execution on p processors takes Tp time units. Suppose also that, of the entire execution of the program, a fraction s is not parallelizable while the remaining fraction 1 - s is parallelizable. Then the speedup (Amdahl's formula) is:

Speedup = T1 / Tp = 1 / (s + (1 - s) / p)

Amdahl's Law: An Example. Suppose that 80% of your program can be parallelized and that you use 4 processors to run the parallel version. The speedup you can get according to Amdahl's formula is:

Speedup = 1 / (0.2 + 0.8 / 4) = 1 / 0.4 = 2.5

Although you use 4 processors, you cannot get a speedup of more than 2.5 times; equivalently, the parallel running time is at best 40% of the serial running time.
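
A minimal C sketch of this calculation (the function name and the printed example are illustrative, not from the lecture):

#include <stdio.h>

/* Amdahl's formula: speedup = 1 / (s + (1 - s) / p), where s is the
   non-parallelizable fraction and p is the number of processors. */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    /* The example above: 80% parallelizable (s = 0.2), 4 processors. */
    printf("speedup = %.2f\n", amdahl_speedup(0.2, 4));   /* prints 2.50 */
    return 0;
}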

Ideal vs. Actual Cases. Amdahl's argument is too simplified to apply directly to real cases: when we run a parallel program, there are in general a communication overhead and a workload imbalance among the processes. [Figure: 1. Parallel speed-up, an ideal case: a serial run (20 + 80 time units) split across 4 processes, where the non-parallelizable part (20) stays serial and the parallelizable part (80) divides evenly across the processes. 2. Parallel speed-up, an actual case: the same split, but with additional communication overhead and load imbalance among the processes.]

Guidelines. In order to benefit efficiently from parallelization, we ought to follow these guidelines: 1. maximize the fraction of the program that can be parallelized; 2. balance the workload of the parallel processes; 3. minimize the time spent on communication.

Objectives. Discussion on programming models: MapReduce; Pregel, Dryad, and GraphLab; why parallelism; parallel computer architectures; traditional models of parallel programming; examples of parallel processing; and the Message Passing Interface (MPI).

Parallel Computer Architectures. Parallel computer architectures divide into two classes: multi-chip multiprocessors and single-chip multiprocessors.

Multi-Chip Multiprocessors. We can categorize the architectures of multi-chip multiprocessor computers along two dimensions: whether the memory is physically centralized or distributed, and whether or not the address space is shared.
- Shared address space, centralized memory: SMP (Symmetric Multiprocessor) / UMA (Uniform Memory Access) architecture.
- Shared address space, distributed memory: Distributed Shared Memory (DSM) / NUMA (Non-Uniform Memory Access) architecture.
- Individual address spaces, distributed memory: MPP (Massively Parallel Processors) architecture.
- Individual address spaces, centralized memory: N/A (not used).

Symmetric Multiprocessors. A system with a Symmetric Multiprocessor (SMP) architecture uses a shared memory that can be accessed equally from all processors. [Figure: four processors, each with its own cache, connected through a bus or crossbar switch to a shared memory and I/O.] Usually, a single OS controls the entire SMP system.

Massively Parallel Processors. A system with a Massively Parallel Processors (MPP) architecture consists of nodes, each with its own processor, memory, and I/O subsystem. [Figure: four nodes, each containing a processor, cache, local bus, memory, and I/O, connected by an interconnection network.] Typically, an independent OS runs on each node.

Distributed Shared Memory. A Distributed Shared Memory (DSM) system is typically built on a hardware model similar to MPP. DSM provides a shared address space to applications using a hardware/software directory-based coherence protocol. The memory latency varies according to whether the memory is accessed directly (a local access) or through the interconnect (a remote access); hence the name NUMA. As in an SMP system, a single OS typically controls the whole DSM system.

Parallel Computer Architectures (recap): multi-chip multiprocessors and single-chip multiprocessors. The following slides turn to single-chip multiprocessors.

Moore's Law. As chip manufacturing technology improves, transistors get smaller and smaller, and it becomes possible to put more of them on a chip. This empirical observation is often called Moore's Law (the number of transistors doubles every 18 to 24 months). An obvious question is: what do we do with all these transistors? Option 1: add more cache to the chip. This option is serious; however, at some point increasing the cache size may only raise the hit rate from 99% to 99.5%, which does not improve application performance much. Option 2: add more processors (cores) to the chip. This option is more serious: it reduces complexity and power consumption while improving performance.

Chip Multiprocessors. The outcome is a single-chip multiprocessor, referred to as a Chip Multiprocessor (CMP). The CMP is currently considered the architecture of choice. Cores in a CMP might be coupled either tightly or loosely: cores may or may not share caches, and cores may communicate via message passing or via shared memory. Common CMP interconnects (referred to as Networks-on-Chip, or NoCs) include the bus, ring, 2D mesh, and crossbar. CMPs can be homogeneous or heterogeneous: homogeneous CMPs contain only identical cores, while heterogeneous CMPs have cores that are not identical.

Objectives. Discussion on programming models: MapReduce; Pregel, Dryad, and GraphLab; why parallelism; parallel computer architectures; traditional models of parallel programming; examples of parallel processing; and the Message Passing Interface (MPI).

Models of Parallel Programming. What is a parallel programming model? A programming model is an abstraction provided by the hardware to programmers. It determines how easily programmers can map their algorithms onto parallel units of computation (i.e., tasks) that the hardware understands, and how efficiently those parallel tasks can be executed on the hardware. Main goal: utilize all the processors of the underlying architecture (e.g., SMP, MPP, CMP) and minimize the elapsed time of the program.

Traditional Parallel Programming Models. The traditional parallel programming models are shared memory and message passing.

Shared Memory Model. In the shared memory programming model, the abstraction is that parallel tasks can access any location of the memory. Parallel tasks communicate by reading and writing common memory locations. This is similar to threads of a single process, which share a single address space. Multi-threaded programs (e.g., OpenMP programs) are the best fit for the shared memory programming model.

Shared Memory Model. [Figure: S_i = serial part, P_j = parallel part. A single-threaded process executes S1, P1, P2, P3, P4, and S2 in sequence. A multi-threaded process executes S1, spawns threads that run P1-P4 in parallel within a shared address space, joins them, and then executes S2.]

Shared Memory Example.

Sequential version:

for (i=0; i<8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
    if (a[i] > 0) sum = sum + a[i];
Print sum;

Parallel version:

begin parallel                    // spawn a child thread
private int start_iter, end_iter, i;
shared int local_iter = 4;
shared double sum = 0.0, a[], b[], c[];
shared lock_type mylock;

start_iter = getid() * local_iter;
end_iter   = start_iter + local_iter;
for (i = start_iter; i < end_iter; i++)
    a[i] = b[i] + c[i];
barrier;
for (i = start_iter; i < end_iter; i++)
    if (a[i] > 0) {
        lock(mylock);
        sum = sum + a[i];
        unlock(mylock);
    }
barrier;                          // necessary
end parallel                      // kill the child thread
Print sum;
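
As a concrete, runnable counterpart, here is a minimal OpenMP sketch of the same computation in C (a hedged illustration, not code from the lecture: the input values are made up, and the reduction clause takes the place of the explicit lock around sum). Compile with, e.g., gcc -fopenmp.

#include <stdio.h>

int main(void)
{
    double a[8], b[8], c[8], sum = 0.0;

    /* Illustrative input values; the slides do not specify them. */
    for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 1.0 - i; }

    /* The loop iterations are divided among the threads; the shared
       variable sum is combined through a reduction instead of a lock. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 8; i++) {
        a[i] = b[i] + c[i];
        if (a[i] > 0)
            sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}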

Why Locks? Unfortunately, threads in a shared memory model need to synchronize. This is usually achieved through mutual exclusion, which requires that when there are multiple threads, only one thread at a time is allowed to write to a shared memory location (or, more generally, to execute the critical section). How can mutual exclusion be guaranteed for a critical section? Typically with a lock, which could be implemented as follows:

// In a high-level language
void lock (int *lockvar) {
    while (*lockvar == 1) {} ;    // wait until the lock looks free
    *lockvar = 1;                 // acquire the lock
}
void unlock (int *lockvar) {
    *lockvar = 0;                 // release the lock
}

In machine language, it looks like this:

lock:    ld  R1, &lockvar
         bnz R1, lock
         st  &lockvar, #1
         ret
unlock:  st  &lockvar, #0
         ret

Is this enough/correct?

The Synchronization Problem. Let us check whether this works. Consider two threads racing for the lock (time flows downward, and the threads interleave):

Thread 0:  lock: ld  R1, &lockvar     // reads 0: lock looks free
Thread 1:  lock: ld  R1, &lockvar     // also reads 0: lock still looks free
Thread 0:        bnz R1, lock
Thread 1:        bnz R1, lock
Thread 0:        st  &lockvar, #1
Thread 1:        st  &lockvar, #1

Both threads enter the critical section. The execution of ld, bnz, and st is not atomic (indivisible): several threads may be executing these instructions at the same time, which allows several threads to enter the critical section simultaneously.
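
The slides go on to fix this in software with Peterson's algorithm. As a point of comparison, here is a minimal sketch (not from the lecture) of how an atomic read-modify-write operation, exposed in C11 as atomic_exchange, makes the naive lock correct:

#include <stdatomic.h>

/* Test-and-set spinlock sketch using C11 atomics (illustrative).
   atomic_exchange performs the load and the store as one indivisible
   operation, so two threads can no longer both observe the lock as free. */
typedef atomic_int spinlock_t;

void spin_lock(spinlock_t *lockvar)
{
    while (atomic_exchange(lockvar, 1) == 1)
        ;   /* spin until the previous value was 0, i.e., we acquired it */
}

void spin_unlock(spinlock_t *lockvar)
{
    atomic_store(lockvar, 0);   /* release */
}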

Peterson's Algorithm. To solve this problem, let us consider a software solution referred to as Peterson's Algorithm [Tanenbaum, 1992]:

int turn;
int interested[n];                    // initialized to 0 (FALSE)

void lock (int process, int lvar) {   // process is 0 or 1
    int other = 1 - process;
    interested[process] = TRUE;
    turn = process;
    while (turn == process && interested[other] == TRUE) {} ;
}
// Post: turn != process or interested[other] == FALSE

void unlock (int process, int lvar) {
    interested[process] = FALSE;
}
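
For readers who want to try this, the following self-contained C sketch runs the same two-thread algorithm under pthreads (the names peterson_lock/peterson_unlock and the counter demo are illustrative, not from the lecture). Note that the plain-int version above assumes sequentially consistent memory; this sketch uses C11 _Atomic variables so that the loads and stores are not reordered on modern hardware.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define TRUE  1
#define FALSE 0

static atomic_int turn;
static atomic_int interested[2];   /* initialized to 0 (FALSE) */
static long counter;               /* shared data protected by the lock */

static void peterson_lock(int process)        /* process is 0 or 1 */
{
    int other = 1 - process;
    atomic_store(&interested[process], TRUE);
    atomic_store(&turn, process);
    while (atomic_load(&turn) == process &&
           atomic_load(&interested[other]) == TRUE)
        ;                                     /* busy-wait */
}

static void peterson_unlock(int process)
{
    atomic_store(&interested[process], FALSE);
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < 1000000; i++) {
        peterson_lock(id);
        counter++;                            /* critical section */
        peterson_unlock(id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};
    pthread_create(&t[0], NULL, worker, &ids[0]);
    pthread_create(&t[1], NULL, worker, &ids[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}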

No Race. Thread 0 acquires the lock first; Thread 1 arrives later (time flows downward):

Thread 0:  interested[0] = TRUE; turn = 0;
Thread 0:  while (turn == 0 && interested[1] == TRUE) {} ;
           // Since interested[1] is FALSE, Thread 0 enters the critical section.
Thread 1:  interested[1] = TRUE; turn = 1;
Thread 1:  while (turn == 1 && interested[0] == TRUE) {} ;
           // Since turn is 1 and interested[0] is TRUE, Thread 1 waits in the loop
           // until Thread 0 releases the lock.
Thread 0:  interested[0] = FALSE;             // Thread 0 releases the lock.
Thread 1:  // Now Thread 1 exits the loop and can acquire the lock.

No synchronization problem.

With Race. Both threads try to acquire the lock at about the same time (time flows downward):

Thread 0:  interested[0] = TRUE; turn = 0;
Thread 1:  interested[1] = TRUE; turn = 1;    // Thread 1 writes turn last, so turn ends up 1.
Thread 0:  while (turn == 0 && interested[1] == TRUE) {} ;
           // Although interested[1] is TRUE, turn is 1; hence Thread 0 enters the critical section.
Thread 1:  while (turn == 1 && interested[0] == TRUE) {} ;
           // Since turn is 1 and interested[0] is TRUE, Thread 1 waits in the loop
           // until Thread 0 releases the lock.
Thread 0:  interested[0] = FALSE;             // Thread 0 releases the lock.
Thread 1:  // Now Thread 1 exits the loop and can acquire the lock.

No synchronization problem.

Traditional Parallel Programming Models (recap): shared memory and message passing. The following slides turn to the message passing model.

Message Passing Model. In message passing, parallel tasks have their own local memories, and one task cannot access another task's memory. Hence, to communicate data, tasks have to rely on explicit messages sent to each other. This is similar to the abstraction of processes, which do not share an address space. Message Passing Interface (MPI) programs are the best fit for the message passing programming model.

Message Passing Model. [Figure: S = serial part, P = parallel part. A single-threaded process executes S1, P1, P2, P3, P4, and S2 in sequence. Under message passing, four processes on four nodes (Process 0-3 on Node 1-4) each execute S1, one parallel part (P1-P4), and S2, exchanging data by transmission over the network.]

Message Passing Example.

Sequential version:

for (i=0; i<8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
    if (a[i] > 0) sum = sum + a[i];
Print sum;

Parallel version (two processes; no mutual exclusion is required):

id = getpid();
local_iter = 4;
start_iter = id * local_iter;
end_iter   = start_iter + local_iter;

if (id == 0)
    send_msg (P1, b[4..7], c[4..7]);
else
    recv_msg (P0, b[4..7], c[4..7]);

for (i = start_iter; i < end_iter; i++)
    a[i] = b[i] + c[i];

local_sum = 0;
for (i = start_iter; i < end_iter; i++)
    if (a[i] > 0) local_sum = local_sum + a[i];

if (id == 0) {
    recv_msg (P1, &local_sum1);
    sum = local_sum + local_sum1;
    Print sum;
} else
    send_msg (P0, local_sum);
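
A hedged MPI sketch of the same two-process example in C (an illustration, not code from the lecture; the input values are made up). Run with exactly two ranks, e.g., mpirun -np 2 ./a.out.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    double a[8], b[8], c[8];
    int id;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);        /* id is 0 or 1 */

    int start = id * 4, end = start + 4;

    if (id == 0) {
        /* Rank 0 owns the data initially (illustrative values). */
        for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 1.0 - i; }
        MPI_Send(&b[4], 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&c[4], 4, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&b[4], 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&c[4], 4, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Each process works on its own half; no mutual exclusion is required. */
    double local_sum = 0.0;
    for (int i = start; i < end; i++) {
        a[i] = b[i] + c[i];
        if (a[i] > 0) local_sum += a[i];
    }

    if (id == 0) {
        double local_sum1;
        MPI_Recv(&local_sum1, 1, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("sum = %f\n", local_sum + local_sum1);
    } else {
        MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}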

Shared Memory vs. Message Passing. A comparison between the shared memory and message passing programming models:

Aspect               Shared Memory                  Message Passing
Communication        Implicit (via loads/stores)    Explicit messages
Synchronization      Explicit                       Implicit (via messages)
Hardware support     Typically required             None
Development effort   Lower                          Higher
Tuning effort        Higher                         Lower

Objectives. Discussion on programming models: MapReduce; Pregel, Dryad, and GraphLab; why parallelism; parallel computer architectures; traditional models of parallel programming; examples of parallel processing; and the Message Passing Interface (MPI).

SPMD and MPMD. When we run multiple processes with message passing, there is a further categorization based on how many different programs cooperate in the parallel execution. We distinguish between two models: 1. the Single Program Multiple Data (SPMD) model, and 2. the Multiple Programs Multiple Data (MPMD) model.

SPMD. In the SPMD model there is only one program: each process runs the same executable but works on a different set of data. [Figure: the same a.out running on Node 1, Node 2, and Node 3.]

MPMD. The MPMD model uses different programs for different processes, but the processes collaborate to solve the same problem. MPMD has two styles: master/worker and coupled analysis. [Figure: 1. MPMD, master/slave style: a master program (a.out) and worker programs (b.out) spread over Nodes 1-3. 2. MPMD, coupled-analysis style: a.out, b.out, and c.out on Nodes 1-3. Example: a.out = structural analysis, b.out = fluid analysis, c.out = thermal analysis.]

An Example: A Sequential Program.
1. Read array a() from the input file.
2. Set is=1 and ie=6.   // is = index start, ie = index end
3. Process from a(is) to a(ie).
4. Write array a() to the output file.
[Figure: array a(1..6) before and after processing; colored shapes indicate the initial values of the elements, black shapes the values after they are processed.]

An Example: The SPMD Version. Every process (0, 1, and 2) runs the same program:
1. Read array a() from the input file.
2. Get my rank.
3. If rank==0 then is=1, ie=2; if rank==1 then is=3, ie=4; if rank==2 then is=5, ie=6.
4. Process from a(is) to a(ie).
5. Gather the results to process 0.
6. If rank==0 then write array a() to the output file.
[Figure: each process holds the full array a(1..6) but processes only its own a(is..ie) range; the results are gathered on process 0.]
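
A hedged MPI sketch of this SPMD program in C (illustrative: the input data and the "processing" step are placeholders, and C indices are 0-based where the slide uses 1..6). Run with exactly three ranks, e.g., mpirun -np 3 ./a.out.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int a[6] = {1, 2, 3, 4, 5, 6};            /* step 1: "read" array a (placeholder data) */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* step 2: get my rank */

    int is = 2 * rank, ie = is + 2;           /* step 3: rank 0 -> a[0..1], 1 -> a[2..3], 2 -> a[4..5] */

    for (int i = is; i < ie; i++)             /* step 4: process my part (placeholder operation) */
        a[i] *= 10;

    int result[6];
    MPI_Gather(&a[is], 2, MPI_INT,            /* step 5: gather the results to process 0 */
               result, 2, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {                          /* step 6: process 0 writes the output */
        for (int i = 0; i < 6; i++)
            printf("%d ", result[i]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}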

Concluding Remarks. To summarize, keep the following three points in mind: 1. The purpose of parallelization is to reduce the time spent on computation. 2. Ideally, the parallel program is p times faster than the sequential program, where p is the number of processes involved in the parallel execution, but this is not always achievable. 3. Message passing is the tool for consolidating what parallelization has separated; it should not be regarded as the parallelization itself.

Next Class. Discussion on programming models: MapReduce; Pregel, Dryad, and GraphLab; why parallelism; parallel computer architectures; traditional models of parallel programming; examples of parallel processing; and the Message Passing Interface (MPI).