Chap. 4 Multiprocessors and Thread-Level Parallelism


Uniprocessor Performance
[Figure: performance (vs. VAX-11/780) of uniprocessors, 1978-2006. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006.]
VAX: 25%/year, 1978 to 1986
RISC + x86: 52%/year, 1986 to 2002
RISC + x86: ??%/year, 2002 to present

From ILP to TLP & DLP
(Almost) all microprocessor companies are moving to multiprocessor systems.
Single processors gain performance by exploiting instruction-level parallelism (ILP).
Multiprocessors exploit either thread-level parallelism (TLP) or data-level parallelism (DLP).
What's the problem?

From ILP to TLP & DLP (cont.)
We've got tons of infrastructure for single-processor systems: algorithms, languages, compilers, operating systems, architectures, etc. These don't exactly scale well.
Multiprocessor design is not as simple as creating a chip with 1000 CPUs: task scheduling/division, communication, and memory issues all arise.
Even the programming is hard: just moving from 1 to 2 CPUs is extremely difficult.

Why Multiprocessors?
Slowdown in uniprocessor performance arising from diminishing returns in exploiting ILP, combined with growing concern over power.
Growth in data-intensive applications: databases, file servers, ...
Growing interest in servers and server performance; increasing desktop performance is less important (outside of graphics).
Improved understanding of how to use multiprocessors effectively, especially servers, where there is significant natural TLP.

Multiprocessing: Flynn's Taxonomy of Parallel Machines
How many instruction streams? How many data streams?
SISD: Single Instruction stream, Single Data stream. A uniprocessor.
SIMD: Single Instruction, Multiple Data streams. Each processor works on its own data, but all execute the same instructions in lockstep, e.g. a vector processor or MMX. => Data-level parallelism.

Flynn's Taxonomy (cont.)
MISD: Multiple Instruction, Single Data stream. Not used much.
MIMD: Multiple Instruction, Multiple Data streams. Each processor executes its own instructions and operates on its own data. This is your typical off-the-shelf multiprocessor (made using a bunch of normal processors); it includes multi-core processors, clusters, and SMP servers. => Thread-level parallelism.
MIMD is popular because it is flexible (it can run N independent programs, or work on one multithreaded program together) and cost-effective (the same processor is used in desktops and MIMD machines).

Back to Basics
A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.
Parallel Architecture = Computer Architecture + Communication Architecture
Two classes of multiprocessors with respect to memory:
1. Centralized-memory multiprocessor: fewer than a few dozen processor chips (and fewer than 100 cores) in 2006; small enough to share a single, centralized memory.
2. Physically distributed-memory multiprocessor: a larger number of chips and cores than (1); bandwidth demands force memory to be distributed among the processors.

[Figures: block diagrams of a centralized shared-memory multiprocessor and a distributed-memory multiprocessor.]

Centralized-Memory Machines
Also called symmetric multiprocessors (SMP): uniform memory access (UMA), i.e. all memory locations have similar latencies.
Data sharing happens through memory reads and writes: P1 can write data to a physical address A, and P2 can then read physical address A to get that data (a minimal sketch of this style of sharing follows below).
Problem: memory contention. All processors share the one memory, so memory bandwidth becomes the bottleneck. Used only for smaller machines, most often 2, 4, or 8 processors.

Shared Memory Pros and Cons
Pros: communication happens automatically; it is a more natural way of programming; it is easier to write correct programs and gradually optimize them; there is no need to manually distribute data (but it can help if you do).
Cons: needs more hardware support; it is easy to write correct but inefficient programs (remote accesses look the same as local ones).
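As a concrete illustration of sharing through memory reads and writes, here is a minimal sketch in C using POSIX threads. The thread roles, variable names (shared_data, ready), and the use of a mutex are illustrative assumptions, not part of the slides.

/* Minimal sketch: two threads communicating through ordinary loads/stores
 * to shared memory (SMP-style sharing). Names are illustrative. */
#include <pthread.h>
#include <stdio.h>

static int shared_data = 0;                 /* the "physical address A" of the example */
static int ready = 0;                       /* flag P1 sets after writing the data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg) {          /* plays the role of P1 */
    pthread_mutex_lock(&lock);
    shared_data = 42;                       /* write data to the shared location */
    ready = 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg) {          /* plays the role of P2 */
    int value = -1, done = 0;
    while (!done) {
        pthread_mutex_lock(&lock);
        if (ready) { value = shared_data; done = 1; }   /* read the same location */
        pthread_mutex_unlock(&lock);
    }
    printf("P2 read %d\n", value);
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, producer, NULL);
    pthread_create(&p2, NULL, consumer, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return 0;
}

Compile with -pthread. Note that no explicit data transfer is coded; communication happens simply because both threads address the same memory location.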

Distributed-Memory Machines
Two kinds:
Distributed shared-memory (DSM): all processors can address all memory locations, and data sharing works as in an SMP. Also called NUMA (non-uniform memory access): latencies of different memory locations can differ (local access is faster than remote access).
Message-passing: a processor can directly address only its local memory; to communicate with other processors it must explicitly send and receive messages. Also called multicomputers or clusters. Most accesses are local, so there is less memory contention, and such machines can scale to well over 1000 processors.

Message-Passing Machines
A cluster of computers, each with its own processor and memory, plus an interconnect to pass messages between them.
Producer-consumer scenario: P1 produces data D and uses a SEND to send it to P2; the network routes the message to P2; P2 then calls a RECEIVE to get the message (a minimal MPI sketch of this scenario follows below).
Two types of send primitives: synchronous (P1 stops until P2 confirms receipt of the message) and asynchronous (P1 sends its message and continues).
Standard libraries for message passing exist; the most common is MPI, the Message Passing Interface.
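Here is a minimal sketch in C of the producer-consumer scenario above using the standard MPI primitives MPI_Send and MPI_Recv; the message value and tag are illustrative assumptions.

/* Minimal sketch: rank 0 (P1) produces data D and SENDs it; rank 1 (P2)
 * RECEIVEs it. Run with something like: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* P1: producer */
        data = 42;                          /* the data D (illustrative value) */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 /* P2: consumer */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("P2 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}

MPI also exposes the synchronous/asynchronous distinction from the slide: MPI_Ssend completes only once the receiver has started receiving, while MPI_Isend returns immediately and lets the sender continue.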

Message Passing Pros and Cons
Pros: simpler and cheaper hardware; explicit communication makes programmers aware of costly (communication) operations.
Cons: explicit communication is painful to program and requires manual optimization. If you want a variable to be local and accessible via loads/stores, you must declare it as such; if other processes need to read or write this variable, you must explicitly code the needed sends and receives.

Challenges of Parallel Processing
The first challenge is the percentage of the program that is inherently sequential (limited parallelism available in programs).
Suppose we want an 80x speedup from 100 processors. What fraction of the original program can be sequential?
a. 10%  b. 5%  c. 1%  d. <1%

Amdahl's Law Answers
Speedup_overall = 1 / ((1 - Fraction_parallel) + Fraction_parallel / Speedup_parallel)
80 = 1 / ((1 - Fraction_parallel) + Fraction_parallel / 100)
80 x (1 - Fraction_parallel) + 0.8 x Fraction_parallel = 1
80 - 79.2 x Fraction_parallel = 1
Fraction_parallel = 79 / 79.2 = 99.75%
So at most 0.25% of the original program can be sequential: answer d (<1%).

Challenges of Parallel Processing (cont.)
The second challenge is the long latency to remote memory (the high cost of communication); the delay ranges from about 50 to 1000 clock cycles.
Suppose a 32-CPU multiprocessor running at 2 GHz with a 200 ns remote memory access time, where all local accesses hit in the memory hierarchy and the base CPI is 0.5. (A remote access costs 200 ns / 0.5 ns per cycle = 400 clock cycles.)
What is the performance impact if 0.2% of instructions involve a remote access?
a. 1.5x  b. 2.0x  c. 2.5x

CPI Equation
CPI = Base CPI + Remote request rate x Remote request cost
CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3
The multiprocessor with no remote communication is 1.3 / 0.5 = 2.6 times faster than the one in which 0.2% of instructions involve a remote access (a tiny sketch verifying this arithmetic follows below).

Challenges of Parallel Processing
1. Insufficient application parallelism is attacked primarily via new algorithms that have better parallel performance.
2. The impact of long remote latency is reduced both by the architect and by the programmer. For example, reduce the frequency of remote accesses either by caching shared data (hardware) or by restructuring the data layout to make more accesses local (software).
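As a quick check of the arithmetic above, here is a tiny C sketch; the parameter values (0.5 base CPI, 0.2% remote rate, 200 ns remote latency, 2 GHz clock) are the slide's assumptions.

/* Minimal sketch: effective CPI with remote accesses, using the slide's numbers. */
#include <stdio.h>

int main(void) {
    double base_cpi = 0.5;
    double remote_rate = 0.002;            /* 0.2% of instructions go remote */
    double remote_cost = 200e-9 / 0.5e-9;  /* 200 ns / 0.5 ns per cycle = 400 cycles */
    double cpi = base_cpi + remote_rate * remote_cost;
    printf("Effective CPI = %.2f, slowdown vs. all-local = %.2fx\n", cpi, cpi / base_cpi);
    return 0;
}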

Cache Coherence Problem
Shared memory is easy with no caches: P1 writes, P2 can read, and only one copy of the data exists (in memory).
Caches store their own copies of the data, and those copies can easily become inconsistent.
Classic example: adding to a sum. P1 loads allsum, adds its mysum, and stores the new allsum; P1's cache now has dirty data, but memory is not updated. P2 loads allsum from memory, adds its mysum, and stores allsum; P2's cache also has dirty data. Eventually both P1's and P2's cached data go to memory, and regardless of the write-back order, the final value ends up wrong (a minimal sketch of this race follows below).

Small-Scale Shared Memory
Caches serve to increase bandwidth versus the bus/memory and to reduce the latency of access; they are valuable for both private data and shared data. But what about cache consistency?
Reads and writes of a single memory location X by two processors A and B, assuming write-through caches:

Time  Event                    Cache A  Cache B  X (memory)
0                                                 1
1     CPU A reads X            1                  1
2     CPU B reads X            1        1         1
3     CPU A stores 0 into X    0        1         0
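Here is a minimal sketch in C of the "adding to a sum" race described above, using POSIX threads. The variable names allsum and mysum follow the slide; everything else is an illustrative assumption. The slide's scenario has stale cached copies; in software the same lost update appears whenever the load-add-store sequence is not synchronized.

/* Minimal sketch: two threads each perform allsum += mysum without
 * synchronization, so one update can be lost (the race on the slide). */
#include <pthread.h>
#include <stdio.h>

static int allsum = 0;                      /* shared running sum */

static void *add_mysum(void *arg) {
    int mysum = *(int *)arg;
    int tmp = allsum;                       /* load allsum            */
    tmp += mysum;                           /* add this thread's part */
    allsum = tmp;                           /* store the new allsum   */
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    int sum1 = 10, sum2 = 20;
    pthread_create(&p1, NULL, add_mysum, &sum1);
    pthread_create(&p2, NULL, add_mysum, &sum2);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    /* Expected 30, but the result may be 10 or 20 if the updates interleave badly. */
    printf("allsum = %d\n", allsum);
    return 0;
}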

Example Cache Coherence Problem
[Figure: processors P1, P2, and P3, each with a private cache, share a memory and I/O devices over a bus. Memory initially holds u = 5; P1 and P3 read u into their caches, P3 then writes u = 7, and subsequent reads by P1 and P2 ask "u = ?".]
Processors see different values for u after event 3.
With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, so processes accessing main memory may see a very stale value.
This is unacceptable for programming, and it happens frequently!

Cache Coherence Definition
A memory system is coherent if:
1. A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P. (Preserves program order.)
2. If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2's read returns the value written by P1's write. (Any write to an address must eventually be seen by all processors.)
3. Writes to the same location are serialized: two writes to location X are seen in the same order by all processors. (Preserves causality.)

Maintaining Cache Coherence
Hardware schemes:
Shared caches: trivially enforce coherence, but are not scalable (the shared L1 cache quickly becomes a bottleneck).
Snooping: every cache with a copy of the data also has a copy of the sharing status of the block, but no centralized state is kept. Needs a broadcast network (like a bus) to enforce coherence.
Directory: the sharing status of a block of physical memory is kept in just one location, the directory. Can enforce coherence even with a point-to-point network.

Snoopy Cache-Coherence Protocols
[Figure: processors P1 ... Pn, each with a cache holding state, address, and data, attached to a shared bus with memory and I/O devices; each cache both issues cache-memory transactions and snoops the bus.]
The cache controller snoops all transactions on the shared medium (bus or switch). A transaction is relevant if it is for a block the cache contains; the controller then takes action to ensure coherence (invalidate, update, or supply the value), depending on the state of the block and the protocol.
Either get exclusive access before a write (write invalidate) or update all copies on a write (write update). A minimal sketch of write invalidate follows below.

Example: Write-Through Invalidate
[Figure: the u = 5 example again; when P3 writes u = 7, the other cached copies of u must be invalidated before step 3.]
Write update uses more of the broadcast medium's bandwidth, so all recent microprocessors use write invalidate.
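To make the write-invalidate idea concrete, here is a minimal software model in C of a write-through invalidate protocol for a single block: every write goes to "memory" and is broadcast on a modeled bus, and every other cache holding the block invalidates its copy. The data structures and function names are illustrative assumptions, not a real hardware protocol.

/* Minimal sketch: write-through, write-invalidate snooping for one block. */
#include <stdio.h>
#include <stdbool.h>

#define NCACHES 3

typedef struct { bool valid; int data; } CacheLine;

static CacheLine cache[NCACHES];   /* one block per processor's cache */
static int memory_x = 5;           /* the shared location (u = 5 initially) */

/* Broadcast an invalidation to every cache except the writer (bus snoop). */
static void bus_invalidate(int writer) {
    for (int i = 0; i < NCACHES; i++)
        if (i != writer) cache[i].valid = false;
}

static int proc_read(int p) {
    if (!cache[p].valid) {                 /* miss: fetch from memory */
        cache[p].data = memory_x;
        cache[p].valid = true;
    }
    return cache[p].data;
}

static void proc_write(int p, int value) {
    cache[p].data = value;  cache[p].valid = true;
    memory_x = value;                      /* write-through to memory */
    bus_invalidate(p);                     /* write-invalidate broadcast */
}

int main(void) {
    printf("P1 reads u: %d\n", proc_read(0));   /* 5 */
    printf("P3 reads u: %d\n", proc_read(2));   /* 5 */
    proc_write(2, 7);                           /* P3 writes 7; P1's copy is invalidated */
    printf("P1 reads u: %d\n", proc_read(0));   /* re-fetches and sees 7 */
    return 0;
}

Because the stale copy is invalidated before the next read, every processor now sees u = 7, which is exactly the behavior the example figure calls for.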

Write-invalidate example with write-back caches, for a single location X initially 0 in memory:

Processor activity         Bus activity          Cache A  Cache B  X (memory)
                                                                    0
CPU A reads X              Cache miss for X      0                  0
CPU B reads X              Cache miss for X      0        0         0
CPU A stores 1 into X      Invalidation for X    1                  0
CPU B reads X              Cache miss for X      1        1         1