Limitations of parallel processing


Your professor du jour: Steve Gribble, gribble@cs.washington.edu, 323B Sieg Hall. All material in this lecture is in Hennessy and Patterson, Chapter 8, pages 635-640, 645, 646, and 654-665.

Limitations of parallel processing
- Multiprocessors are used to speed up computation and to solve larger problems.
- But most workloads include sequential and parallel phases, and synchronization increases sequentiality.
- Amdahl's law will limit speedup: e.g., with 100 processors and 10% sequential work, what is the speedup?
- Communication is expensive (50-10,000 clock cycles), so the communication-to-computation ratio limits speedup; embarrassingly parallel applications are the favorable extreme.
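Amdahl's law answers the question on the slide: speedup = 1 / (s + (1 - s)/N), where s is the sequential fraction and N the number of processors. Below is a minimal sketch (the program and its names are mine, not from the lecture) that evaluates the slide's example of 100 processors and a 10% sequential fraction, giving a speedup of only about 9.2.

#include <stdio.h>

/* Evaluate Amdahl's law: speedup = 1 / (s + (1 - s)/N).
 * Function and variable names are illustrative, not from the lecture. */
static double amdahl_speedup(double sequential_fraction, int n_processors)
{
    return 1.0 / (sequential_fraction +
                  (1.0 - sequential_fraction) / n_processors);
}

int main(void)
{
    /* The slide's example: 100 processors, 10% sequential -> about 9.17. */
    printf("speedup = %.2f\n", amdahl_speedup(0.10, 100));
    return 0;
}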

What is the maximum speedup for N processors?
- If a program takes X seconds on a uniprocessor, what is the fastest it can run on an N-way multiprocessor? X/N seconds, i.e., a speedup of N?

SMP (Symmetric MultiProcessors)
- Single shared-bus systems.
- (Slide diagram: processors with their caches, an interleaved memory, and an I/O adapter all attached to the shared bus.)

Replication of shared data
- P2 reads A; P3 reads A.
- (Slide diagram: P1-P4 and memory; after the reads, copies of the block holding A sit in the caches of P2 and P3 as well as in memory.)

The coherence problem
- Now what happens if P4 writes A??
- (Slide diagram: unless something is done, the copies of A cached by P2 and P3 are now stale.)
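The slides' scenario written as a small pthreads sketch (aside from the shared location A, the thread roles and names are mine): two reader threads pull a copy of A into their processors' caches, then a writer changes it. The coherence mechanisms described on the following slides are what keep those cached copies from silently going stale.

/* Sketch of the slides' scenario: P2 and P3 read A, then P4 writes A.
 * On real hardware the cache coherence protocol (together with the
 * atomics used here) keeps the cached copies from going stale. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int A = 1;               /* the shared location "A" */

static void *reader(void *id)           /* plays the role of P2 or P3 */
{
    int v = atomic_load(&A);            /* pulls a copy of A into this CPU's cache */
    printf("P%ld read A = %d\n", (long)id, v);
    return NULL;
}

static void *writer(void *arg)          /* plays the role of P4 */
{
    (void)arg;
    atomic_store(&A, 2);                /* coherence must invalidate/update other copies */
    return NULL;
}

int main(void)
{
    pthread_t p2, p3, p4;
    pthread_create(&p2, NULL, reader, (void *)2L);
    pthread_create(&p3, NULL, reader, (void *)3L);
    pthread_join(p2, NULL);
    pthread_join(p3, NULL);
    pthread_create(&p4, NULL, writer, NULL);
    pthread_join(p4, NULL);
    printf("after P4's write: A = %d\n", atomic_load(&A));
    return 0;
}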

Cache coherence
- Determines what values may validly be returned by a read:
- If P1 writes to X and P1 then reads from X, P1 must read its own write.
- If P1 writes to X and P2 reads from X a short while later, P2 must read P1's write.
- If P1 writes value A to X and P2 writes value B to X at about the same time, writes must be serialized: if any processor reads A then B, all processors must read A then B.

Snooping vs. directory-based
- Directory-based coherence (distributed-memory machines): a special structure called a directory maintains the sharing status of cache blocks, and cache controllers must explicitly communicate with the directory.
- Snooping cache coherence (mostly SMPs): all caches maintain the sharing status of cache blocks, with no centralized state; all cache controllers listen on the shared-memory bus to update the status of cache blocks, which gives automatic serialization of writes (needed for coherence).
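To make "maintains sharing status of cache blocks" concrete, here is a hedged sketch of one directory entry; the field names, widths, and state names are mine, and real directories differ in detail.

/* Sketch of one directory entry for a directory-based protocol.
 * Field names, widths, and states are illustrative only. */
#include <stdint.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;    /* how the block may currently be cached               */
    uint64_t       sharers;  /* bit i set => processor i holds a copy (<= 64 CPUs)  */
    int            owner;    /* which cache holds the dirty copy when DIR_MODIFIED  */
};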

Snooping cache coherence techniques
- Two main flavors:
- Write-invalidate protocols: P4 sends an invalidate message out on the memory bus, and P2 and P3 invalidate the copies in their caches.
- Write-update (write-broadcast) protocols: P4's write is write-through and sent on the memory bus, and the new value of A is snooped by the caches of P2 and P3.

Write-update
- (Slide diagram: P4's write broadcasts the new value of A to the caches of P2 and P3 and to memory.)
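A hedged sketch of the two flavors from the snooping cache's side, i.e., what a controller does when it observes another processor's write to a block it currently holds; the types and helper names are invented for illustration.

/* Conceptual sketch: what a snooping cache does when it observes another
 * processor's write to a block it currently holds. All names are invented. */
#include <stdbool.h>
#include <stdint.h>

struct cache_line {
    bool     valid;
    uint64_t tag;
    uint8_t  data[64];      /* one cache block (64 bytes assumed) */
};

/* Write-invalidate flavor: drop our copy; our next read of it will miss. */
static void snoop_remote_write_invalidate(struct cache_line *line)
{
    line->valid = false;
}

/* Write-update (write-broadcast) flavor: fold the broadcast value into our copy. */
static void snoop_remote_write_update(struct cache_line *line,
                                      unsigned offset, uint8_t new_byte)
{
    if (line->valid)
        line->data[offset] = new_byte;
}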

Write-invalidate
- (Slide diagram: after P4's write, the lines holding A in the caches of P2 and P3 are marked invalid.)

False sharing and write-invalidate
- False sharing occurs when two processors read/write different words in the same cache block.
- From the point of view of the cache coherence protocol, it looks as if the entire block is shared data, so the block bounces back and forth between the processors.
- Worst case: P1 is constantly reading word X, P2 is constantly writing word Y, and both X and Y are in the same cache block.
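A minimal sketch of the slide's worst case using two pthreads (the struct and thread names are mine, and the 64-byte block size is an assumption): x and y are different words but fall in the same cache block, so every write to y bounces the block away from the reader of x. As the comment notes, padding each field to its own block removes the effect.

/* Sketch of the worst case: x and y are different words but live in the
 * same cache block, so the writer of y keeps stealing the block from the
 * reader of x. Assumes 64-byte blocks; padding each field to its own
 * block (e.g., with alignas(64) on each member) removes the false sharing. */
#include <pthread.h>
#include <stdatomic.h>

static struct {
    _Atomic long x;         /* constantly read by one thread (the slide's P1)   */
    _Atomic long y;         /* constantly written by the other (the slide's P2) */
} shared;                   /* adjacent words: same cache block */

static void *read_x(void *arg)
{
    (void)arg;
    long sink = 0;
    for (long i = 0; i < 10000000; i++)
        sink += atomic_load(&shared.x);
    return (void *)sink;
}

static void *write_y(void *arg)
{
    (void)arg;
    for (long i = 0; i < 10000000; i++)
        atomic_store(&shared.y, i);
    return NULL;
}

int main(void)
{
    pthread_t p1, p2;
    pthread_create(&p1, NULL, read_x, NULL);
    pthread_create(&p2, NULL, write_y, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return 0;
}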

Effects on false sharing
- What happens with a larger cache block size? A larger cache? A larger miss penalty? More processors?
- Ways to reduce false sharing: compiler optimization (layout of data in memory, block padding) and cache-conscious programming.
- False sharing can massively inflate or deflate the constant factors hidden in big-O notation.

State machine implementation of snooping protocols
- Associate states with each cache block (minimally 3):
- Invalid (can reuse the existing valid bit in the cache block).
- Shared (possibly many copies, all up to date; needs another bit).
- Modified (dirty data, existing in only one cache; use the dirty bit).
- A fourth state (and sometimes more) is added for performance: Exclusive (clean, only copy).
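A minimal sketch of the per-block state listed on this slide, showing how the states map onto the existing valid and dirty bits plus one extra bit of state; the names and struct layout are mine, not the lecture's.

/* Per-block coherence state for a snooping protocol, mapped onto the bits
 * mentioned on the slide. Names and layout are illustrative only. */
#include <stdint.h>

enum block_state {
    BLK_INVALID,    /* valid bit clear                                  */
    BLK_SHARED,     /* valid, clean; possibly many up-to-date copies    */
    BLK_MODIFIED,   /* valid, dirty; the only copy anywhere             */
    BLK_EXCLUSIVE   /* optional 4th state: valid, clean, the only copy  */
};

struct cache_block_meta {
    uint64_t         tag;
    enum block_state state;    /* needs one bit beyond valid + dirty */
};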

State transitions for a given cache block
- Events incurred by the processor associated with the cache: read miss, write miss, write on a clean block.
- Events incurred by snooping on the bus as a result of other processors' actions, e.g.:
- A read miss by Q might make P's block transition from dirty to clean.
- A write miss by Q might make P's block transition from dirty/clean to invalid (write-invalidate protocol).

Simple 3-state protocol (CPU activity)
- (Slide diagram: state machine over the invalid, shared, and modified states, driven by the local processor's reads and writes.)

Simple 3-state protocol (snooped bus activity)
- (Slide diagram: state machine over the invalid, shared, and modified states, driven by transactions snooped from the bus.)

Illinois protocol: more advanced
- Adds a 4th state, exclusive (E), to enhance performance.
- On a write to a block in the E state, there is no need to send an invalidation message (this occurs often for private variables).
- On a read miss when no cache has the block in the dirty state:
- Who sends the data, memory or a cache (if any)? Answer: a cache, for that particular protocol; other protocols might use memory.
- If more than one cache has the block, which one supplies it? Answer: the first to grab the bus (tri-state devices).
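To make the two state machines above concrete, here is a hedged sketch of the transitions for one cache block under a simple 3-state write-invalidate protocol; the helper functions and the exact transition set are mine and may differ from the lecture's diagrams.

/* Hedged sketch of a 3-state write-invalidate protocol for one cache block.
 * The bus_* helpers are empty stand-ins for real bus transactions. */
enum coh_state { COH_INVALID, COH_SHARED, COH_MODIFIED };

static void bus_read_miss(void)    { /* place a read miss on the bus        */ }
static void bus_write_miss(void)   { /* request an exclusive copy           */ }
static void bus_invalidate(void)   { /* tell other caches to drop the block */ }
static void write_back_block(void) { /* flush dirty data toward memory      */ }

/* Transitions driven by the local processor (the "CPU activity" machine). */
static enum coh_state cpu_access(enum coh_state s, int is_write)
{
    switch (s) {
    case COH_INVALID:
        if (is_write) { bus_write_miss(); return COH_MODIFIED; }  /* write miss */
        bus_read_miss(); return COH_SHARED;                       /* read miss  */
    case COH_SHARED:
        if (is_write) { bus_invalidate(); return COH_MODIFIED; }  /* write on clean block */
        return COH_SHARED;                                        /* read hit   */
    case COH_MODIFIED:
    default:
        return COH_MODIFIED;                                      /* read/write hit */
    }
}

/* Transitions driven by snooped bus traffic from another processor Q
 * (the "snooped bus activity" machine). */
static enum coh_state snoop_event(enum coh_state s, int q_is_write)
{
    if (s == COH_MODIFIED) write_back_block();              /* supply the dirty data         */
    if (q_is_write) return COH_INVALID;                     /* Q's write miss: drop our copy */
    return (s == COH_INVALID) ? COH_INVALID : COH_SHARED;   /* Q's read miss: dirty -> clean */
}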

Performance of snoopy protocols
- Performance depends on the length of a write run.
- Write run: a sequence of write references by one processor to a shared address (or shared block), uninterrupted by either an access from another processor or a replacement.
- Long write runs: better to have write-invalidate. Short write runs: better to have write-update.
- There have been proposals to make the choice between protocols at run time.

Wrinkle #1: snooping interfering with the CPU
- Each bus transaction causes a check on the cache tags, which could interfere with the CPU's own cache accesses, stalling the processor.
- Fix #1: an extra set of tags for snooping activity. On a cache miss, the processor must arbitrate for and update both sets; if the snoop finds a match, it must also update both tags for invalidates or to update the shared bit.
- Fix #2: exploit a multilevel cache and the cache inclusion property (every entry in the L1 cache is also in the L2 cache). The snoop uses the L2 cache while the CPU uses the L1 cache; on a snoop hit, the snoop may need to update the L1 cache as well.
- Fix #1 and fix #2 can be combined.

Wrinkle #2: atomicity of operations
- The protocols as described assume that operations are atomic, e.g., that detecting a write miss, acquiring the bus, and receiving the response happen as a single atomic action.
- When operations are not atomic, there is the possibility that the protocol can deadlock.
- To fix this, the protocol must be augmented to deal with non-atomic writes without adding deadlock; a topic for a different course.