EE382 Processor Design. Processor Issues for MP

EE382 Processor Design, Winter 1998
Chapter 8 Lectures: Multiprocessors, Part I

Processor Issues for MP
- Initialization
- Interrupts
- Virtual memory / TLB coherency
- Physical memory coherency
- Synchronization
- Consistency
Emphasis on physical memory and the system interconnect.

Outline
- Partitioning: granularity, overhead and efficiency
- Multi-threaded MP
- Shared bus: coherency, synchronization, consistency
- Scalable MP: cache directories, interconnection networks
- Trends and tradeoffs

Additional references:
- Hennessy and Patterson, CAQA, Chapter 8
- Culler, Singh, Gupta, Parallel Computer Architecture: A Hardware/Software Approach, http://http.cs.berkeley.edu/~culler/book.alpha/index.html

Representative System
[Figure: block diagram of two CPUs, each with pipelines, registers, an L1 Icache, and an L1 Dcache, plus L2 cache, connected through a chipset to memory and the I/O bus(es)]

Shared-Memory MP
- Consider systems with a single memory address space
- Contrasted to multi-computers:
  - separate memory address spaces
  - message passing for communication and synchronization
  - example: a network of workstations

Shared-Memory MP
- Types of shared-memory MP:
  - multithreaded or shared-resource MP
  - shared-bus MP (broadcast protocols)
  - scalable MP (networked protocols)
- Issues:
  - partitioning of the application into p parallel tasks
  - scheduling of tasks to minimize dependency T_w
  - communications and synchronization

Partitioning
- If a uniprocessor executes a program in time T_1 with O_1 operations, and a p-way parallel processor executes it in time T_p with O_p operations, then O_p > O_1 due to task overhead.
- Also Sp = T_1/T_p < p, where p is the number of processors in the system; p is also the amount of parallelism (or the degree of partitioning) available in the program.

Granularity
[Figure: speedup Sp versus grain size, from fine to coarse. At fine grain Sp is overhead limited; at coarse grain Sp is limited by parallelism and load balance.]

Task Scheduling
- Static: at compile time
- Dynamic: run-time system, load balancing
- Load balancing: clustering of tasks with inter-processor communication; scheduling with compiler assistance

Overhead
- Limits Sp to less than p with p processors
- Efficiency = Sp/p = T_1/(T_p * p)
- Lee's equal work hypothesis: Sp < p/ln(p) (see the worked instance below)
- Task overhead due to communications delays, context switching, and cold-cache effects
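To make the speedup and efficiency bounds concrete, here is a small worked instance in LaTeX; the value p = 16 is our own illustrative choice, not from the slides:

    S_p \;=\; \frac{T_1}{T_p} \;<\; \frac{p}{\ln p} \;=\; \frac{16}{\ln 16} \;\approx\; 5.8,
    \qquad
    \text{Efficiency} \;=\; \frac{S_p}{p} \;<\; \frac{5.8}{16} \;\approx\; 0.36 .

So under Lee's hypothesis a 16-processor system would deliver at most roughly a third of its peak, before accounting for communications delays, context switching, and cold-cache effects.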

Multi-threaded MP
- Multiple processors sharing many execution units:
  - each processor has its own state
  - function units, caches, TLBs, etc. are shared
- Types:
  - time-multiplex multiple processors onto a pipelined processor so that there are no pipeline breaks, etc.
  - switch context on any processor delay (cache miss, etc.)
- Optimizes multi-thread throughput, but limits single-thread performance
- See Study 8.1 on p. 537: processors share the D-cache

Shared-Bus MP
- Processors with their own D-cache require a cache coherency protocol.
- The simplest protocols have processors snoop on writes to memory that occur on a shared bus.
- If the write is to a line held in the snooper's own cache, that line is either invalidated or updated.

Coherency, Synchronization, and Consistency
- Coherency: the property that the value returned by a read is the value of the latest write. Required for process migration even without sharing.
- Synchronization: instructions that control access to critical sections of data shared by multiple processors.
- Consistency: rules for allowing memory references to be reordered in ways that may lead to differences in the memory state observed by multiple processors.

Shared-Bus Cache Coherency Protocols
- Write-invalidate, simple: 3 states (V, I, D)
- Berkeley (write-invalidate): 4 states (V, S, D, I)
- Illinois (write-invalidate): 4 states (M, E, S, I)
- Dragon (write-update): 5 states (M, E, S, D, I)
- Simpler protocols have somewhat more memory bus traffic.

MESI Protocol
[Figure: MESI state-transition diagram]

Coherence Overhead for Parallel Processing
- Results for 4 parallel programs with 16 CPUs and 64KB caches
- Coherence traffic is a substantial portion of bus demand
- Large blocks can lead to false sharing
- (Hennessy and Patterson, CAQA, Fig. 8.15)
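The MESI diagram itself is a figure in the slides; as a rough substitute, the C sketch below (our own simplification, not the lecture's figure) shows how a write-invalidate controller might move one line among the Modified, Exclusive, Shared, and Invalid states. Real controllers also handle bus arbitration, write-back of Modified data, and atomicity of these transitions.

    /* Simplified MESI transitions for a single cache line (illustrative only). */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* Transition on a request from the local processor. */
    mesi_t on_local(mesi_t s, int is_write, int others_have_copy) {
        if (is_write)
            return MODIFIED;          /* write: gain ownership, other copies invalidate */
        if (s == INVALID)             /* read miss */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                     /* read hit: state unchanged */
    }

    /* Transition when another processor's bus request is snooped. */
    mesi_t on_snoop(mesi_t s, int is_write) {
        if (s == INVALID)
            return INVALID;
        if (is_write)
            return INVALID;           /* remote write invalidates our copy */
        return SHARED;                /* remote read: flush data if MODIFIED, then share */
    }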

Synchronization Primitives
Communicating sequential processes: each process brackets its access to shared data.

  Process A                        Process B
  acquire semaphore                acquire semaphore
  access shared data               access shared data
  (read/modify/write)              (read/modify/write)
  release semaphore                release semaphore

Synchronization Primitives
- Acquiring the semaphore generally requires an atomic read-modify-write operation on a location
  - ensures that only one process enters the critical section
  - examples: Test&Set, Locked-Exchange, Compare&Exchange, Fetch&Add, Load-Locked/Store-Conditional
- Looping on a semaphore with a test-and-set or similar instruction is called a spin lock
- Techniques to minimize overhead for spin contention: Test + Test&Set, exponential backoff (see the sketch below)
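A minimal spin-lock sketch in C11 combining Test + Test&Set with exponential backoff. The names spinlock_t, spin_lock, and spin_unlock and the backoff limit are our own; this illustrates the technique and is not code from the lecture.

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

    static void spin_lock(spinlock_t *l) {
        unsigned backoff = 1;
        for (;;) {
            /* Test&Set: an atomic read-modify-write (here an atomic exchange). */
            if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
                return;                                   /* acquired */
            /* Test: spin on ordinary reads so the line can stay Shared in the
             * local cache instead of bouncing on every probe. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) {
                for (volatile unsigned i = 0; i < backoff; i++)
                    ;                                     /* crude delay loop */
                if (backoff < 1024)
                    backoff <<= 1;                        /* exponential backoff */
            }
        }
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }

The acquire/release orderings on the lock operations also act as the fences that the consistency models below require around a critical section.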

Memory Consistency Problem
Can the tests at L1 and L2 below both succeed?

  Process A              Process B
  A = 0;                 B = 0;
  ...                    ...
  A = 1;                 B = 1;
  L1: if (B == 0)        L2: if (A == 0)

Memory Consistency Model
- Rules for allowing memory references by a program executing on one processor to be observed in a different order by a program executing on another processor
- Memory Fence operations explicitly control the ordering of memory references

Memory Consistency Models (Part I)
- Sequential consistency (strong ordering)
  - All memory ops execute in some sequential order
  - Memory ops of each processor appear in program order
- Processor consistency (Total Store Ordering)
  - Writes are buffered and stored in order
  - Reads are performed in order, but can bypass writes
  - The processor flushes the store buffer when a synchronization instruction is executed
- Weak consistency
  - Memory references are generally allowed in any order
  - Programs enforce ordering when required for shared data by executing Memory Fence instructions:
    - all memory references of previous instructions complete before the fence
    - no memory references of subsequent instructions issue before the fence
  - Synchronization instructions act like fences
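Here is a hedged C11 rendering of the two-process example above (the thread setup and the names saw_A0/saw_B0 are ours). Under sequential consistency at most one of the two tests can succeed; on a machine with store buffers both can, unless a fence separates each write from the following read, which is what the atomic_thread_fence calls sketch.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int A, B;                 /* both start at 0 */
    int saw_B0, saw_A0;

    int procA(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);       /* A = 1;       */
        atomic_thread_fence(memory_order_seq_cst);                 /* memory fence */
        saw_B0 = (atomic_load_explicit(&B, memory_order_relaxed) == 0);   /* L1 */
        return 0;
    }

    int procB(void *arg) {
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_relaxed);       /* B = 1;       */
        atomic_thread_fence(memory_order_seq_cst);                 /* memory fence */
        saw_A0 = (atomic_load_explicit(&A, memory_order_relaxed) == 0);   /* L2 */
        return 0;
    }

    int main(void) {
        thrd_t ta, tb;
        thrd_create(&ta, procA, NULL);
        thrd_create(&tb, procB, NULL);
        thrd_join(ta, NULL);
        thrd_join(tb, NULL);
        /* With the fences in place, saw_B0 and saw_A0 cannot both be 1. */
        printf("L1 succeeded: %d, L2 succeeded: %d\n", saw_B0, saw_A0);
        return 0;
    }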

Memory Consistency Models (Part II)
- Release consistency
  - Distinguishes between acquire and release of a semaphore, before and after access to shared data
  - Acquire semaphore: ensure the semaphore is acquired before any reads or writes by subsequent instructions (which may access shared data)
  - Release semaphore: ensure any writes by previous instructions (which may access shared data) are visible before the semaphore is released
  - (Hennessy and Patterson, CAQA, Fig. 8.39)

Pentium Processor Example
- 2-level cache hierarchy
  - inclusion enforced
  - snoops on the system bus need only interrogate the L2
- Cache policy
  - write-back supported
  - write-through optional, selected by page or line
  - write buffers used
- Cache coherence: MESI at both levels
- Memory consistency: processor ordering
- Issues:
  - writes that hit an E line on-chip
  - writes that hit an E or M line while the write buffer is occupied
[Figure: CPU with pipelines, data cache, and cache write buffers, connected to the L2 cache and the system bus]
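Release consistency maps naturally onto acquire/release orderings in C11. The sketch below, with our own names sem, acquire, and release and a simple binary semaphore (not the lecture's code), restates the two rules above as memory-order annotations.

    #include <stdatomic.h>

    atomic_int sem = 1;      /* 1 = free, 0 = held */
    int shared_data;

    void acquire(void) {
        int expected;
        do {
            expected = 1;
            /* Acquire: later reads/writes of shared data may not be performed
             * before the semaphore has actually been obtained. */
        } while (!atomic_compare_exchange_weak_explicit(&sem, &expected, 0,
                     memory_order_acquire, memory_order_relaxed));
    }

    void release(void) {
        /* Release: earlier writes to shared data become visible to other
         * processors before the semaphore is seen as free. */
        atomic_store_explicit(&sem, 1, memory_order_release);
    }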

Shared-Bus Performance Models
- Null binomial model
  - resubmissions don't automatically occur, e.g., multithreaded MP
  - see Study 8.1, page 537
- Resubmissions model
  - requests remain on the bus until serviced
  - see pp. 413-415 and the cache example posting on the web
- Bus traffic usually limits the number of processors
  - a bus optimized for MP supports 10-20 processors, but at high cost for small systems
  - a bus that incrementally extends a uniprocessor is limited to 2-4 processors