Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation


Issue with Parallelism

Parallel software is the problem. We need a significant performance improvement; otherwise, just use a faster uniprocessor, since it's easier!

Difficulties:
- Partitioning
- Coordination
- Communications overhead

Amdahl's Law

The sequential part can limit speedup.

Example: can we get a 90x speedup with 100 processors?
T_new = T_parallelizable/100 + T_sequential
Speedup = 1 / ((1 - F_parallelizable) + F_parallelizable/100) = 90
Solving: F_parallelizable = 0.999
The sequential part must be no more than 0.1% of the original execution time.
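A minimal C sketch of this calculation (an illustration, not part of the original slides): it solves Amdahl's Law for the parallel fraction needed to reach a target speedup and checks the speedup that fraction gives.

#include <stdio.h>

/* Amdahl's Law: speedup for parallel fraction f on p processors. */
static double speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

/* Parallel fraction needed to reach a target speedup on p processors,
 * obtained by solving target = 1 / ((1 - f) + f/p) for f. */
static double fraction_needed(double target, int p) {
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / (double)p);
}

int main(void) {
    double f = fraction_needed(90.0, 100);
    printf("parallel fraction needed: %.3f\n", f);            /* 0.999 */
    printf("speedup at that fraction: %.1f\n", speedup(f, 100)); /* 90.0 */
    return 0;
}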

Scaling Example

Workload: sum of 10 scalars, plus a 10×10 matrix sum. What is the speedup going from 10 to 100 processors?

Single processor: Time = (10 + 100) × t_add
10 processors: Time = 10 × t_add + (100/10) × t_add = 20 × t_add
  Speedup = 110/20 = 5.5 (55% of potential)
100 processors: Time = 10 × t_add + (100/100) × t_add = 11 × t_add
  Speedup = 110/11 = 10 (10% of potential)

Assumes the load can be balanced across processors.

Scaling Example (cont.)

What if the matrix is 100×100?

Single processor: Time = (10 + 10000) × t_add
10 processors: Time = 10 × t_add + (10000/10) × t_add = 1010 × t_add
  Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors: Time = 10 × t_add + (10000/100) × t_add = 110 × t_add
  Speedup = 10010/110 = 91 (91% of potential)

Assuming the load is balanced.
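The arithmetic on these two slides can be replayed with a short C sketch (an illustration, not from the original material); times are in units of t_add, with the 10 scalar additions treated as serial and the matrix sum as perfectly parallel.

#include <stdio.h>

/* Replays the scaling arithmetic: 10 serial scalar additions plus an
 * n-by-n matrix sum spread over p processors (all times in t_add). */
static void report(int n, int p) {
    double serial   = 10.0;
    double parallel = (double)n * n;
    double t1 = serial + parallel;        /* 1 processor                 */
    double tp = serial + parallel / p;    /* p processors, balanced load */
    printf("n=%3d p=%3d: time=%6.0f  speedup=%5.1f  (%3.0f%% of potential)\n",
           n, p, tp, t1 / tp, 100.0 * (t1 / tp) / p);
}

int main(void) {
    report(10, 10);     /*   20 t_add, speedup  5.5, 55% */
    report(10, 100);    /*   11 t_add, speedup 10.0, 10% */
    report(100, 10);    /* 1010 t_add, speedup  9.9, 99% */
    report(100, 100);   /*  110 t_add, speedup 91.0, 91% */
    return 0;
}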

Strong vs. Weak Scaling

Strong scaling: problem size fixed, as in the example above.
Weak scaling: problem size grows in proportion to the number of processors.
  10 processors, 10×10 matrix: Time = 20 × t_add
  100 processors, 32×32 matrix (about 1000 elements): Time = 10 × t_add + (1000/100) × t_add = 20 × t_add
Constant performance in this example.

Shared Memory

SMP: shared memory multiprocessor.
- Hardware provides a single physical address space for all processors
- Shared variables are synchronized using locks
- Memory access time: UMA (uniform) vs. NUMA (nonuniform)
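As a concrete illustration of synchronizing a shared variable with a lock on an SMP (the slides do not prescribe any particular API; this sketch assumes POSIX threads):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                                  /* shared variable  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

static void *worker(void *arg) {
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* only one thread updates at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000; without the lock, updates can be lost */
    return 0;
}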

Example: Sum Reduction

Sum 100,000 numbers on a 100-processor UMA machine.
- Each processor has an ID Pn, with 0 <= Pn <= 99
- Partition: 1000 numbers per processor
- Initial summation on each processor:

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];

Now we need to add these partial sums.
- Reduction: divide and conquer
- Half the processors add pairs, then a quarter, ...
- Need to synchronize between reduction steps

Example: Sum Reduction

half = 100;
repeat
    synch();
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           processor 0 gets the missing element */
    half = half / 2;   /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn + half];
until (half == 1);
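To see how the loop behaves, here is a purely sequential C sketch (not from the slides) that replays the same pattern for 100 partial sums of 1.0 each. It shows how half shrinks (50, 25, 12, 6, 3, 1) and how the odd cases are absorbed by processor 0.

#include <stdio.h>

int main(void) {
    double sum[100];
    for (int p = 0; p < 100; p++) sum[p] = 1.0;  /* stand-in partial sums */

    int half = 100;
    do {
        if (half % 2 != 0)                /* odd: "P0" picks up element half-1 */
            sum[0] += sum[half - 1];
        half = half / 2;
        for (int p = 0; p < half; p++)    /* every active "processor" adds a pair */
            sum[p] += sum[p + half];
        printf("half = %d\n", half);      /* 50, 25, 12, 6, 3, 1 */
    } while (half != 1);

    printf("total = %g (expected 100)\n", sum[0]);
    return 0;
}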

Message Passing

- Each processor has a private physical address space
- Hardware sends/receives messages between processors
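For illustration only (assuming an MPI installation, which the slides do not require), an explicit send/receive between two processes that do not share an address space might look like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit message: the data is copied into process 1's memory. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run with two processes, e.g. mpirun -np 2 ./a.out.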

Loosely Coupled Clusters

A network of independent computers:
- Each has private memory and its own OS
- Connected via the I/O system, e.g., Ethernet/switch, Internet

Suitable for applications with independent tasks: web servers, databases, simulations, ...
High availability, scalable, affordable.

Problems:
- Administration cost (prefer virtual machines)
- Low interconnect bandwidth compared with the processor/memory bandwidth of an SMP

Sum Reduction (Again)

Sum 100,000 numbers on 100 processors.
- First distribute 1000 numbers to each processor
- Then do the partial sums:

sum = 0;
for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];

Reduction:
- Half the processors send, the other half receive and add
- Then a quarter send, a quarter receive and add, ...

Sum Reduction (Again)

Given send() and receive() operations:

limit = 100; half = 100;   /* 100 processors */
repeat
    half = (half + 1) / 2;     /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half;              /* upper limit of senders */
until (half == 1);             /* exit with final sum */

- Send/receive also provide synchronization
- Assumes send/receive take a similar time to an addition
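In practice, message-passing libraries package this send/receive tree as a single collective operation. A sketch of the same reduction (assuming MPI, which the slides do not mention):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double local_sum = 0.0, total = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... each rank would compute local_sum over its own 1000 numbers ... */
    local_sum = rank;   /* placeholder value so the example runs */

    /* The library performs the send/receive reduction tree internally. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %g\n", total);

    MPI_Finalize();
    return 0;
}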

Cache Coherence Problem

Suppose two CPU cores share a physical address space, with write-through caches.

Time step | Event                | CPU A's cache | CPU B's cache | Memory
    0     |                      |               |               |   0
    1     | CPU A reads X        |       0       |               |   0
    2     | CPU B reads X        |       0       |       0       |   0
    3     | CPU A writes 1 to X  |       1       |       0       |   1

After step 3, CPU B's cache still holds the stale value 0 for X.

Coherence Defined

Informally: reads return the most recently written value.

Formally:
- P writes X, then P reads X (with no intervening writes): the read returns the written value.
- P1 writes X, then P2 reads X (sufficiently later): the read returns the written value.
  (cf. CPU B reading X after step 3 in the example above)
- P1 writes X and P2 writes X: all processors see the writes in the same order, and end up with the same final value for X.

Memory Consistency

When are writes seen by other processors? ("Seen" means a read returns the written value.) It cannot be instantaneous.

Assumptions:
- A write completes only when all processors have seen it
- A processor does not reorder writes with other accesses

Consequence:
- If P writes X and then writes Y, all processors that see the new Y also see the new X
- Processors can reorder reads, but not writes
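A small C sketch (illustrative, not from the slides) of the consequence above: if the write to the data (X) is ordered before the write to the flag (Y), a reader that sees the new flag must also see the new data. C11 release/acquire atomics are used here to express that ordering explicitly.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;          /* "X": written first            */
static atomic_int flag = 0;   /* "Y": written second, the flag */

static void *producer(void *arg) {
    data = 42;                                               /* write X */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* then Y  */
    return NULL;
}

static void *consumer(void *arg) {
    /* Spin until the new Y is seen ... */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    /* ... then the new X must be visible too. */
    printf("data = %d\n", data);   /* prints 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}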


Multiprocessor Caches

Caches provide, for shared items:
- Migration: reduces latency and the bandwidth demand on shared memory
- Replication: reduces latency and contention for reads of a shared item