Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace slows down, multiprocessor architectures become attractive:
- Small-scale multiprocessors rather than large-scale parallel computers
- Challenges of parallel processing: the compiler problem, low ILP in applications, and Amdahl's law
- Single-chip multiprocessors: integrate a small number of processors and memory on one chip

Flynn's taxonomy:
- SISD (single instruction stream, single data stream)
- SIMD (single instruction stream, multiple data streams): special purpose, for example media processors
- MISD (multiple instruction streams, single data stream): no machine to date
- MIMD (multiple instruction streams, multiple data streams): offers flexibility and builds on the cost/performance advantages of microprocessors

Centralized shared memory:
- Small number of processors
- Single centralized main memory, connected to a bus
- UMA (uniform memory access); a popular organization
Distributed shared memory:
- Large number of processors
- Memory is distributed among the processors
- Large memory bandwidth, easily scalable
- Processors communicate through an interconnection network

Shared address space:
- DSM (distributed shared memory)
- NUMA (non-uniform memory access)
Private address space:
- Multicomputers: communication is done by passing messages, so these are called message-passing machines
- Remote procedure call (RPC)

Shared-memory communication:
- Well-understood mechanism, easy programming
- Lower overhead and better use of bandwidth when communicating small items
- Hardware-controlled caching of remote data
- Supporting message passing on top of shared memory is easier
Message-passing communication:
- Simple hardware
- Explicit communication, which encourages optimization at the user level
- Supporting shared memory on top of message-passing hardware is more difficult

Challenges of parallel processing:
- Insufficient parallelism: low ILP and Amdahl's law
- Long latency of remote accesses: can be reduced with the assistance of hardware and software mechanisms

Cache coherence. Coherence defines what value can be returned by a read:
- A read by a processor P from a location X that follows a write by P to X, with no writes to X by another processor between the write and the read, always returns the value written by P.
- A read by a processor from location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated and no other writes to X occur between the two accesses.
- Writes to the same location are serialized: if P1 writes 1 to X and then P2 writes 2 to X, no processor can read 2 and later read 1.
Consistency determines when a written value will be returned by a read.

Two classes of coherence protocols:
- Directory based: the sharing status is kept in one location called the directory; used in DSM machines
- Snooping: on a shared-memory bus, cache controllers monitor (snoop on) the bus to determine whether they have a copy of a block that is requested; the popular approach

Write-invalidate protocol:
- Invalidates all other copies on a write
- Works on cache blocks, so one invalidation covers multiple writes to the same word
- The preferred choice in bus-based multiprocessors
Write-update protocol:
- Updates all other copies on a write
- Works on words, so multiple writes to the same cache block cause multiple updates
- Less delay between a write by one processor and a read by another

Snooping with a write-through cache: the most recent value is always in memory.
Snooping with a write-back cache:
- The most recent value may be in a cache; preferable due to reduced memory bandwidth
- The cache that holds the dirty block provides that block in response to a read request; the owner of a block is either memory or a cache
- Each block carries a valid bit, a dirty bit, and a shared bit; a write is not placed on the bus if the block is not shared (sketched below)
- Snooping overhead can be reduced by duplicating the tags or by using a multi-level cache with inclusion
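As a rough sketch of these state bits (the C structure and the bus primitive are hypothetical, not from the slides), a write-back snooping controller might decide whether a write hit needs the bus like this:

    /* Hypothetical per-block state bits for a write-back snooping cache. */
    struct cache_line {
        unsigned valid  : 1;   /* block holds valid data              */
        unsigned dirty  : 1;   /* block modified since it was fetched */
        unsigned shared : 1;   /* another cache may hold a copy       */
    };

    void bus_invalidate(void);  /* assumed bus primitive */

    /* On a write hit: the write goes on the bus only if the block
       may be shared; otherwise write locally and mark the block dirty. */
    void write_hit(struct cache_line *line) {
        if (line->shared) {
            bus_invalidate();
            line->shared = 0;
        }
        line->dirty = 1;
    }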

Software alternatives to coherence:
- Exclude shared data from the caches: only private data is cached, and shared data is marked as uncacheable
- Include shared data under a coherence protocol: the accepted requirement today
Directory-based cache coherence protocol:
- Provides scalability
- Associates an entry in the directory with each memory block; directory entries can be distributed along with the memory
- Each block is in the shared, uncached, or exclusive state
- The directory keeps a bit vector indicating which processors have a copy of the block (see the sketch below)
- Nodes play the roles of local node, home node, and remote node
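A minimal sketch of such a directory entry, assuming up to 64 processors tracked by the bit vector (the names dir_entry and add_sharer are illustrative):

    #include <stdint.h>

    /* One directory entry per memory block. */
    enum dir_state { UNCACHED, SHARED, EXCLUSIVE };

    struct dir_entry {
        enum dir_state state;
        uint64_t       sharers;   /* bit i set => processor i has a copy */
    };

    /* Record that processor p obtained a shared copy of the block. */
    void add_sharer(struct dir_entry *e, int p) {
        e->sharers |= (uint64_t)1 << p;
        e->state    = SHARED;
    }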

Hardware synchronization primitives atomically read and modify a memory location. Single-instruction primitives (atomic read-and-update):
- Exchange: swaps a value in a register with a value in memory
- Test-and-set: tests a value and sets it if the value passes the test
- Fetch-and-increment: returns the value of a memory location and atomically increments it
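These primitives map directly onto standard C11 atomics; a minimal sketch using only <stdatomic.h>:

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int lock_var = 0;
    atomic_int counter  = 0;

    int main(void) {
        /* Exchange: swap 1 into memory and return the old value. */
        int old = atomic_exchange(&lock_var, 1);

        /* Fetch-and-increment: return the value, then add 1 atomically. */
        int seen = atomic_fetch_add(&counter, 1);

        printf("old=%d seen=%d\n", old, seen);
        return 0;
    }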

Paired-instruction primitive: load-linked (load-locked) and store-conditional.
- The store-conditional fails if the memory location specified by the load-linked is changed before the store-conditional executes
- Failure cases: a context switch, or a write by another processor
- The store-conditional returns 1 if it succeeds
- Load-linked is implemented with a link register that keeps track of the address specified in the load-linked instruction
- The link register is cleared if an interrupt occurs or if the corresponding cache block is invalidated
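Portable C exposes no load-linked/store-conditional, but a compare-and-swap retry loop plays the same role; a sketch of fetch-and-increment built that way (fetch_and_increment is an illustrative name):

    #include <stdatomic.h>

    atomic_int value = 0;

    /* Retry loop analogous to an ll/sc pair: read, compute, and attempt
       the store; retry if another processor intervened in between. */
    int fetch_and_increment(void) {
        int old = atomic_load(&value);
        while (!atomic_compare_exchange_weak(&value, &old, old + 1)) {
            /* 'old' has been reloaded with the current value; retry. */
        }
        return old;
    }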

Spin locks: locks that a processor continuously tries to acquire.

            li   R2, #1
    lockit: exch R2, 0(R1)      ; atomic exchange with the lock
            bnez R2, lockit     ; spin if the lock was already held

If cache coherence is supported, we can cache the lock and spin on a local copy, since each exchange requires a write operation on the bus:

    lockit: lw   R2, 0(R1)      ; read the lock
            bnez R2, lockit     ; spin while it is held
            li   R2, #1
            exch R2, 0(R1)      ; try to acquire it
            bnez R2, lockit     ; retry if another processor won the race
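The same test-and-test-and-set idea in C11 atomics (a sketch; spin_lock and spin_unlock are illustrative names): plain loads spin on the locally cached copy, and the write-generating exchange runs only when the lock looks free.

    #include <stdatomic.h>

    void spin_lock(atomic_int *l) {
        for (;;) {
            while (atomic_load(l) != 0)
                ;                              /* spin on cached copy  */
            if (atomic_exchange(l, 1) == 0)    /* single write attempt */
                return;
        }
    }

    void spin_unlock(atomic_int *l) {
        atomic_store(l, 0);
    }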

Spin lock with load-linked/store-conditional:

    lockit: ll   R2, 0(R1)      ; load linked
            bnez R2, lockit     ; spin while the lock is held
            li   R2, #1
            sc   R2, 0(R1)      ; store conditional
            beqz R2, lockit     ; retry if the store conditional failed

Simple, but it leads to a lot of contention as well as traffic; the fairness of the bus makes things worse.

A barrier forces all processes to wait until every process has reached the barrier, and then releases all of them. It can be implemented with two spin locks, one protecting a counter and one (the flag release) holding the processes:

    lock(counterlock);
    if (count == 0) release = 0;   /* first arrival resets the flag */
    count++;
    unlock(counterlock);
    if (count == total) {          /* last arrival */
        count = 0;
        release = 1;               /* let everyone go */
    } else {
        spin(release == 1);        /* wait for the last arrival */
    }

A fast process can trap slow processes in the barrier by resetting the flag release: if it races ahead into the next barrier and sets release back to 0 while slow processes are still spinning, they never observe release == 1.

The sense-reversing barrier avoids this problem: each process toggles a private local_sense, so consecutive barriers spin on different values of release:

    local_sense = !local_sense;
    lock(counterlock);
    count++;
    unlock(counterlock);
    if (count == total) {
        count = 0;
        release = local_sense;
    } else {
        spin(release == local_sense);
    }
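A runnable C translation of the sense-reversing barrier using pthreads (TOTAL and the names are illustrative; volatile mirrors the slide's pseudocode, while production code would use C11 atomics):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TOTAL 4   /* illustrative thread count */

    static pthread_mutex_t counterlock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int count = 0, release_flag = 0;

    static void barrier(int *local_sense) {
        *local_sense = !*local_sense;        /* reverse the private sense */
        pthread_mutex_lock(&counterlock);
        count++;
        if (count == TOTAL) {                /* last arrival releases all */
            count = 0;
            pthread_mutex_unlock(&counterlock);
            release_flag = *local_sense;
        } else {
            pthread_mutex_unlock(&counterlock);
            while (release_flag != *local_sense)
                ;                            /* spin on the release flag */
        }
    }

    static void *worker(void *arg) {
        int local_sense = 0;
        barrier(&local_sense);
        printf("thread %ld passed the barrier\n", (long)(intptr_t)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t[TOTAL];
        for (intptr_t i = 0; i < TOTAL; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < TOTAL; i++)
            pthread_join(t[i], NULL);
        return 0;
    }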

Reducing synchronization contention:
- Software implementations:
  - Exponential back-off: wait exponentially longer between successive attempts (see the sketch below)
  - Combining tree: an n-ary tree structure in which multiple requests are combined locally; when k processes have arrived at a node, that node signals the next level of the tree
- Hardware primitives, to remove the unneeded contention after a release:
  - Queuing lock: keep a list of waiting processes and hand the lock to one of them explicitly when its turn comes
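A sketch of exponential back-off wrapped around the test-and-set loop (the initial delay and the cap are illustrative choices, not from the slides):

    #include <stdatomic.h>

    void backoff_lock(atomic_int *l) {
        int delay = 1;
        while (atomic_exchange(l, 1) != 0) {
            for (volatile int i = 0; i < delay; i++)
                ;                      /* busy-wait before retrying */
            if (delay < 1024)
                delay *= 2;            /* exponential back-off, capped */
        }
    }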

Sequential consistency: the most straightforward memory consistency model.
- Requires that the result of any execution be the same as if the accesses executed by each processor were kept in order and the accesses among the different processors were interleaved
- A simple implementation delays the next memory access until the previous one has completed
- In particular, we cannot use a write buffer with read bypassing, as the sketch below shows
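The classic two-flag litmus test shows why. Under sequential consistency at least one load must see the other processor's store, but a write buffer with read bypassing allows both loads to return 0 (a sketch with a deliberate data race on plain ints to expose the hardware ordering):

    #include <pthread.h>
    #include <stdio.h>

    int flag1 = 0, flag2 = 0;   /* plain ints: no ordering enforced */
    int r1, r2;

    void *t1(void *arg) { flag1 = 1; r1 = flag2; return NULL; }
    void *t2(void *arg) { flag2 = 1; r2 = flag1; return NULL; }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Sequential consistency forbids r1 == 0 && r2 == 0: whichever
           load executes last must see the other thread's store. With a
           write buffer and read bypassing, each load can bypass the
           buffered store, and both can be 0. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }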

The programmer's view: release and acquire.
- release = unlock, acquire = lock
- A write is made visible by pairing: write(x) ... release(s) ... acquire(s) ... read(x)
Memory fence: a fixed point in a computation that ensures no read or write is moved across the fence.
- Read fence / write fence
- Under sequential consistency, every read is a read fence and every write is a write fence
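This release/acquire pairing maps directly onto C11 memory orders; a minimal sketch:

    #include <stdatomic.h>

    int x = 0;                       /* ordinary shared data     */
    atomic_int s = 0;                /* synchronization variable */

    /* Producer: write(x), then release(s). */
    void producer(void) {
        x = 42;
        atomic_store_explicit(&s, 1, memory_order_release);
    }

    /* Consumer: acquire(s), then read(x). */
    int consumer(void) {
        while (atomic_load_explicit(&s, memory_order_acquire) == 0)
            ;                        /* spin until released */
        return x;                    /* guaranteed to observe 42 */
    }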

TSO (processor consistency, or total store ordering):
- Eliminates the W → R ordering requirement
- Allows the buffering of writes with bypassing by reads
- Must check whether a pending write in the buffer is to the same address as a read miss
PSO (partial store ordering):
- Additionally relaxes the W → W ordering
- Allows pipelining or overlapping of write operations

Weak ordering:
- Relaxes R → R and R → W as well
- A read or write completes before any synchronization operation executed in program order by the processor after that read or write
- A synchronization operation always completes before any reads or writes that occur in program order after the operation
- Can take advantage of nonblocking reads
Release consistency:
- Distinguishes acquire synchronization (S_A) from release synchronization (S_R)
- Further removes the W → S_A, R → S_A, S_R → R, and S_R → W orderings