Chapter 5: Thread-Level Parallelism Part 1

Chapter 5: Thread-Level Parallelism Part 1

Introduction
  What is a parallel or multiprocessor system?
  Why parallel architecture?
  Performance potential
  Flynn classification
  Communication models
Architectures
  Centralized shared-memory
  Distributed shared-memory
Parallel programming
Synchronization
Memory consistency models

What is a parallel or multiprocessor system? Multiple processor units working together to solve the same problem. Key architectural issue: the communication model.

Why parallel architectures? Absolute performance. Technology and architecture trends: the end of Dennard scaling, the ILP wall, and Moore's law ⇒ multicore chips. Connect multicores together for even more parallelism.

Performance Potential

Amdahl's Law is pessimistic. Let s be the serial part, and let p be the part that can be parallelized n ways (normalized so s + p = 1).

    Serial:       SSPPPPPP                           time 8
    6 processors: SSP, with the other five P's
                  running concurrently on the
                  remaining processors               time 3

    Speedup = 8/3 = 2.67

In general the parallel time is T(n) = s + p/n, so Speedup(n) = 1/(s + p/n). As n → ∞, Speedup(n) → 1/s. Pessimistic.
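A quick numeric check of the formula above, as a minimal sketch in C (the function name amdahl is just for illustration):

    #include <stdio.h>

    /* Speedup(n) = 1 / (s + p/n), with p = 1 - s */
    double amdahl(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int main(void) {
        /* The slide's example: s = 2/8, n = 6 => 8/3 = 2.67 */
        printf("%.2f\n", amdahl(2.0 / 8.0, 6));
        /* As n grows, the speedup saturates at 1/s = 4 */
        printf("%.2f\n", amdahl(2.0 / 8.0, 1000000));
        return 0;
    }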

Performance Potential (Cont.)

Gustafson's Corollary: Amdahl's law holds if we run the same problem size on larger machines. But in practice, we run larger problems and ''wait'' the same time.

Performance Potential (Cont.)

Gustafson's Corollary (Cont.) Assume that for larger problem sizes the serial time is fixed (at s) and the parallel time is proportional to problem size (the truth is more complicated).

    Old serial:   SSPPPPPP                           time 8
    6 processors: SSPPPPPP   (each processor runs
                    PPPPPP    its own row of
                    PPPPPP    parallel work)
                    PPPPPP
                    PPPPPP
                    PPPPPP                           time 8
    Hypothetical serial:
      SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP   time 38

    Speedup = (8 + 5*6)/8 = 4.75

The hypothetical serial time grows as T'(n) = s + n*p, so the scaled speedup (s + n*p)/(s + p) grows without bound as n → ∞. How does your algorithm ''scale up''?
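The scaled-speedup calculation, as a companion sketch to the Amdahl one above (again in C, with illustrative names):

    #include <stdio.h>

    /* Scaled speedup: serial time s, parallel time p per processor, n processors */
    double gustafson(double s, double p, int n) {
        return (s + n * p) / (s + p);
    }

    int main(void) {
        /* The slide's example: s = 2, p = 6 time units, n = 6 => 38/8 = 4.75 */
        printf("%.2f\n", gustafson(2.0, 6.0, 6));
        return 0;
    }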

Flynn classification Single-Instruction Single-Data (SISD) Single-Instruction Multiple-Data (SIMD) Multiple-Instruction Single-Data (MISD) Multiple-Instruction Multiple-Data (MIMD)

Communication models Shared-memory Message passing Data parallel

Communication Models: Shared-Memory

[Figure: processors P connected through an interconnect to a shared memory built from modules M]

Each node is a processor that runs a process. There is one shared memory, accessible by any processor, and the same address on two different processors refers to the same datum. Therefore, processes write and read memory to store and recall data, and to communicate and synchronize (coordinate).

Communication Models: Message Passing

[Figure: nodes, each a processor P with its own memory M, connected by an interconnect]

Each node is a complete computer. Each processor runs its own program (as in shared memory), but its memory is local to its node and unrelated to the memory of other nodes. Messages are added for internode communication, with send and receive operations that work like mail.

Communication Models: Data Parallel

[Figure: nodes, each a processor P with memory M, connected by an interconnect]

Conceptually, one virtual processor per datum. Write sequential programs with a ''conceptual PC'' and let the parallelism be within the data (e.g., matrices): C = A + B. Typically a SIMD architecture, but MIMD can be as effective.

Architectures

All communication models can usually be synthesized on all hardware. The key question: which communication model does the hardware support best? Virtually all small-scale systems and multicores are shared-memory.

Which is the Best Communication Model to Support?

Shared-memory: used in small-scale systems; easier to program for dynamic data structures; lower-overhead communication for small data; implicit movement of data with caching. Hard to build?

Message-passing: communication is explicit, so harder to program? Larger communication overheads (OS intervention?). Easier to build?

Shared-Memory Architecture

The model:

[Figure: PROC units connected through an INTERCONNECT to MEMORY]

For now, assume the interconnect is a bus: a centralized architecture.

Centralized Shared-Memory Architecture

[Figure: PROC units connected by a BUS to a single MEMORY]

Centralized Shared-Memory Architecture (Cont.)

For higher bandwidth (throughput) and for lower latency, give each processor a cache. Problem?

Cache Coherence Problem

[Figure: PROC 1 through PROC n, each with a CACHE holding a copy of location A, connected by a BUS to memory containing A]

Cache Coherence Solutions

Snooping: every cache watches (snoops) the shared bus and keeps its copy of A coherent.

[Figure: the same bus-based system, with each CACHE snooping the BUS]

Problem with the centralized architecture?

Distributed Shared-Memory (DSM) Architecture

Use a higher-bandwidth interconnection network.

[Figure: PROC 1 through PROC n, each with a CACHE, connected through a GENERAL INTERCONNECT to the MEMORY modules]

Memory remains equally distant from every processor: a uniform memory access (UMA) architecture.

Distributed Shared-Memory (DSM) - Cont. For lower latency, move memory into each node, close to its processor: a Non-Uniform Memory Access (NUMA) architecture.

Non-Bus Interconnection Networks

[Figure: example interconnection networks]

Distributed Shared-Memory - Coherence Problem

Directory scheme: each block of memory has a home node that keeps a directory entry recording which caches hold copies.

[Figure: nodes, each with PROC, MEM, and CACHE, connected by a SWITCH/NETWORK]

A level of indirection!
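One common way to picture a directory entry (a sketch only; the slides do not give an entry format, so the state names and the bit-vector width are assumptions):

    #include <stdint.h>

    #define NPROC 32    /* assumed machine size */

    /* One directory entry per memory block, kept at the block's home node */
    struct dir_entry {
        enum { UNCACHED, SHARED, MODIFIED } state;
        uint32_t sharers;   /* bit i set => processor i's cache holds a copy */
    };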

Parallel Programming Example

Add two matrices: C = A + B

Sequential program (Read, Print, A, B, C, and N are assumed declared elsewhere):

    int main(int argc, char *argv[]) {
        Read(A);
        Read(B);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];
        Print(C);
        return 0;
    }

Parallel Program Example (Cont.)
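A minimal sketch of what the parallel version might look like in a shared-memory style, using pthreads and a block-row partition (NUM_THREADS, add_rows, and the fixed N are illustrative assumptions; Read and Print are the helpers from the sequential version):

    #include <pthread.h>

    #define N 512
    #define NUM_THREADS 4   /* assumes N is divisible by NUM_THREADS */

    double A[N][N], B[N][N], C[N][N];

    /* Each thread adds one contiguous block of rows */
    void *add_rows(void *arg) {
        long id = (long)arg;
        int rows = N / NUM_THREADS;
        for (int i = id * rows; i < (id + 1) * rows; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];
        return NULL;
    }

    int main(void) {
        pthread_t t[NUM_THREADS];
        /* Read(A); Read(B); as in the sequential version */
        for (long id = 0; id < NUM_THREADS; id++)
            pthread_create(&t[id], NULL, add_rows, (void *)id);
        for (int id = 0; id < NUM_THREADS; id++)
            pthread_join(t[id], NULL);
        /* Print(C); */
        return 0;
    }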

The Parallel Programming Process

Synchronization

Communication: exchange data.
Synchronization: exchange data to order events.
  Mutual exclusion or atomicity
  Event ordering or producer/consumer
    Point-to-point: flags
    Global: barriers

Example: Mutual Exclusion

Each processor needs to occasionally update a counter.

    Processor 1                 Processor 2
    Load  reg1, Counter         Load  reg2, Counter
    reg1 = reg1 + tmp1          reg2 = reg2 + tmp2
    Store Counter, reg1         Store Counter, reg2

If the two load-add-store sequences interleave, one of the updates can be lost.

Mutual Exclusion Primitives

Hardware instructions. Test&Set atomically tests for 0 and sets to 1; Unset is simply a store of 0.

    while (Test&Set(L) != 0)
        ;                   /* spin */
    /* critical section */
    Unset(L);

Problem?

Mutual Exclusion Primitives

Alternative? Test&Test&Set: spin on ordinary reads of L and attempt the atomic Test&Set only when the lock looks free, as sketched below.
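A minimal sketch of a test-and-test-and-set spin lock using C11 atomics (stdatomic.h is my choice here; the slides use the abstract Test&Set and Unset):

    #include <stdatomic.h>

    atomic_int L = 0;   /* 0 = unlocked, 1 = locked */

    void lock(void) {
        for (;;) {
            /* "test": spin on plain loads, which hit in the local cache */
            while (atomic_load(&L) != 0)
                ;
            /* "test&set": one atomic exchange when L looks free */
            if (atomic_exchange(&L, 1) == 0)
                return;
        }
    }

    void unlock(void) {
        atomic_store(&L, 0);   /* Unset */
    }

Waiting processors spin on their cached copy and generate bus traffic only when the lock is actually released.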

Mutual Exclusion Primitives

Fetch&Add:

    Fetch&Add(var, data) {   /* atomic action */
        temp = var;
        var = temp + data;
        return temp;
    }

E.g., let X = 57, with P1: a = Fetch&Add(X, 3) and P2: b = Fetch&Add(X, 5).
If P1 before P2: a = 57, b = 60, X = 65.
If P2 before P1: b = 57, a = 62, X = 65.
If P1 and P2 are concurrent: atomicity guarantees one of the two orders above, so X = 65 either way.
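C11 exposes this primitive directly as atomic_fetch_add; a tiny sketch checking the P1-before-P2 case (run sequentially here for simplicity):

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int X = 57;

    int main(void) {
        int a = atomic_fetch_add(&X, 3);   /* returns the old value */
        int b = atomic_fetch_add(&X, 5);
        printf("a=%d b=%d X=%d\n", a, b, atomic_load(&X));   /* a=57 b=60 X=65 */
        return 0;
    }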

Example: Point-to-Point Event Ordering

The producer wants to indicate to the consumer that the data is ready.

    Processor 1 (producer)      Processor 2 (consumer)
    A[1] = ...                  ... = A[1]
    A[2] = ...                  ... = A[2]
    ...                         ...
    A[n] = ...                  ... = A[n]

A flag can provide the ordering, as sketched below.
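A sketch of the flag in C11 atomics (the release/acquire pairing and the stand-in loop bodies are my assumptions; the slides show only the array writes and reads):

    #include <stdatomic.h>

    #define n 100            /* illustrative size */
    double A[n + 1];
    atomic_int ready = 0;    /* the flag */

    /* Processor 1: producer */
    void producer(void) {
        for (int i = 1; i <= n; i++)
            A[i] = i;        /* stand-in for the real data */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    /* Processor 2: consumer */
    void consumer(void) {
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;                /* spin until the flag is set */
        double sum = 0;
        for (int i = 1; i <= n; i++)
            sum += A[i];     /* stand-in for '... = A[i]' */
        (void)sum;
    }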

Global Event Ordering: Barriers

Example: all processors produce some data, and we want to tell all processors that it is ready; in the next phase, all processors consume the data produced previously. Use barriers, as sketched below.
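A sketch with a POSIX barrier (pthread_barrier_t; the per-phase functions are hypothetical stubs):

    #include <pthread.h>

    #define NUM_THREADS 4

    pthread_barrier_t bar;   /* init once: pthread_barrier_init(&bar, NULL, NUM_THREADS) */

    /* hypothetical per-phase work */
    void produce_my_data(void) { }
    void consume_produced_data(void) { }

    void *worker(void *arg) {
        produce_my_data();             /* phase 1: every thread produces */
        pthread_barrier_wait(&bar);    /* no thread passes until all have arrived */
        consume_produced_data();       /* phase 2: safe to consume */
        return NULL;
    }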