
Thread-Level Parallelism. ECE 154B, Dmitri Strukov

Introduction
Thread-level parallelism:
- Have multiple program counters and resources
- Uses the MIMD model
- Targeted for tightly-coupled shared-memory multiprocessors
- Threads can be used for data-level parallelism, but the overheads may outweigh the benefit
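
To ground the programming model, here is a small illustrative POSIX-threads sketch (not from the lecture): each thread has its own program counter, registers, and stack, while all threads share a single address space, which is the MIMD shared-memory setting the rest of these slides assume. The array sizes and the eight-way work split are arbitrary.

```c
/* Build with: cc -O2 -pthread tlp_sum.c */
#include <pthread.h>
#include <stdio.h>

#define N   8
#define LEN 1000000

static double data[LEN];
static double partial[N];          /* one result slot per thread (shared memory) */

static void *sum_chunk(void *arg) {
    long id = (long)arg;
    long lo = id * (LEN / N), hi = lo + LEN / N;
    double s = 0.0;
    for (long i = lo; i < hi; i++) s += data[i];
    partial[id] = s;               /* visible to the main thread via shared memory */
    return NULL;
}

int main(void) {
    pthread_t t[N];
    for (long i = 0; i < LEN; i++) data[i] = 1.0;
    for (long id = 0; id < N; id++)
        pthread_create(&t[id], NULL, sum_chunk, (void *)id);   /* separate PC, registers, stack */
    double total = 0.0;
    for (long id = 0; id < N; id++) { pthread_join(t[id], NULL); total += partial[id]; }
    printf("total = %f\n", total);
    return 0;
}
```

Each call to pthread_create launches a new instruction stream over the same data; the coherence machinery discussed in the rest of the slides is what lets those streams safely share data[] and partial[].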

Natural Extension of Memory System
[Figure: three memory organizations, each with processors P1..Pn and caches ($): a shared-cache design with a switch and interleaved first-level cache and main memory; a centralized-memory design ("dance hall", UMA) with per-processor caches connected to memory over an interconnection network; and a distributed-memory design (NUMA) with a memory attached to each processor node and an interconnection network between nodes.]

SMP vs DSM
Symmetric multiprocessors (SMP):
- Small number of cores
- Share a single memory with uniform memory latency
Distributed shared memory (DSM):
- Memory distributed among processors
- Non-uniform memory access/latency (NUMA)
- Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks

Example Cache Coherence Problem
[Figure: processors P1, P2, and P3 with private caches ($) connected over a bus to memory and I/O devices. The numbered events show P1 and P3 reading u (value 5) from memory, P3 then writing u = 7, and P1 and P2 subsequently reading u.]
Things to note:
- Processors see different values for u after event 3
- With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back the value, and when
- Processes accessing main memory may see a very stale value
- Unacceptable to programs, and frequent!

Cache Coherence: Snoopy Schemes vs. Directory Schemes
Snooping solution:
- Send all requests for data to all processors
- Processors snoop to see if they have a copy and respond accordingly
- Requires broadcast, since caching information is at the processors
- Works well with a bus (natural broadcast medium)
- Dominates for small-scale machines (most of the market)
Directory-based schemes:
- Keep track of what is being shared in a (logically) centralized place
- Distributed memory → distributed directory for scalability (avoids bottlenecks)
- Send point-to-point requests to processors via the network
- Scales better than snooping
Other design choices: write-through vs. write-back protocols, update vs. invalidation protocols

Invalidation vs. Update (Broadcast)
Write-invalidate protocol:
- Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
- Read miss:
  - Write-through: memory is always up to date
  - Write-back: snoop in caches to find the most recent copy
Write-broadcast (update) protocol:
- Write to shared data: broadcast on the bus; processors snoop and update any copies
- Read miss: memory is always up to date
Bandwidth vs. latency tradeoff

An Example Snoopy Protocol
Invalidation protocol, write-back cache.
Each block of memory is in one state:
- Clean in all caches and up to date in memory (Shared), OR
- Dirty in exactly one cache (Exclusive), OR
- Not in any caches
Each cache block is in one state (track these):
- Shared: block can be read
- Exclusive: cache has the only copy; it is writeable and dirty
- Invalid: block contains no data
Read misses cause all caches to snoop the bus.
Writes to a clean line are treated as misses.

Snoopy Coherence Protocols

Snoopy-Cache State Machine I
State machine for CPU requests, for each cache block; states are Invalid, Shared (read only), and Exclusive (read/write):
- Invalid, CPU read: place read miss on bus; go to Shared
- Invalid, CPU write: place write miss on bus; go to Exclusive
- Shared, CPU read hit: no action
- Shared, CPU read miss: place read miss on bus; stay Shared
- Shared, CPU write: place write miss on bus; go to Exclusive
- Exclusive, CPU read hit or CPU write hit: no action
- Exclusive, CPU read miss: write back block, place read miss on bus; go to Shared
- Exclusive, CPU write miss: write back cache block, place write miss on bus; stay Exclusive

Snoopy-Cache State Machine II
State machine for bus requests, for each cache block (Appendix I gives details of the bus requests):
- Shared, write miss for this block: go to Invalid
- Exclusive, write miss for this block: write back block (abort memory access); go to Invalid
- Exclusive, read miss for this block: write back block (abort memory access); go to Shared

Snoopy-Cache State Machine III
Combined state machine for CPU requests and bus requests, for each cache block: the CPU-request transitions of machine I and the bus-request transitions of machine II operate on the same per-block state (Invalid, Shared, Exclusive).
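
To make the combined machine concrete, here is a minimal C sketch of the per-block MSI transitions described on the last three slides. It is only a sketch: bus messages are reduced to printf calls, data transfer between caches is not modeled, and the names (cache_line_t, snoop, and so on) are illustrative rather than taken from the lecture.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;

typedef struct {
    block_state_t state;
    unsigned long tag;   /* which memory block this line currently holds */
    int           data;  /* block contents, one word for simplicity      */
} cache_line_t;

/* CPU-request half of the machine (state machine I) */
void cpu_read(cache_line_t *l, unsigned long addr) {
    if (l->state != INVALID && l->tag == addr) return;       /* read hit         */
    if (l->state == EXCLUSIVE)                                /* evict dirty line */
        printf("write back block %lx\n", l->tag);
    printf("place read miss on bus for %lx\n", addr);
    l->state = SHARED;  l->tag = addr;
}

void cpu_write(cache_line_t *l, unsigned long addr, int value) {
    if (l->state == EXCLUSIVE && l->tag == addr) { l->data = value; return; }  /* write hit */
    if (l->state == EXCLUSIVE)                                /* evict dirty line */
        printf("write back block %lx\n", l->tag);
    printf("place write miss on bus for %lx\n", addr);        /* also for a write to a shared line */
    l->state = EXCLUSIVE;  l->tag = addr;  l->data = value;
}

/* Bus-request half of the machine (state machine II): what this cache
 * does when it snoops another processor's miss for a block it holds.  */
void snoop(cache_line_t *l, bool other_is_write, unsigned long addr) {
    if (l->state == INVALID || l->tag != addr) return;        /* not our block */
    if (l->state == EXCLUSIVE)
        printf("write back block %lx (abort memory access)\n", l->tag);
    if (other_is_write)
        l->state = INVALID;                                   /* write miss for this block */
    else if (l->state == EXCLUSIVE)
        l->state = SHARED;                                    /* read miss for this block  */
}

/* First steps of the example on the next slide */
int main(void) {
    cache_line_t p1 = { INVALID, 0, 0 }, p2 = { INVALID, 0, 0 };
    cpu_write(&p1, 0xA1, 10);   /* P1: write 10 to A1 -> P1 Exclusive        */
    cpu_read(&p2, 0xA1);        /* P2: read A1 -> read miss on the bus       */
    snoop(&p1, false, 0xA1);    /* P1 snoops: writes back, drops to Shared   */
    cpu_write(&p2, 0xA1, 20);   /* P2: write 20 to A1 -> write miss on bus   */
    snoop(&p1, true, 0xA1);     /* P1 snoops: invalidates its copy           */
    return 0;
}
```

The main() mirrors the beginning of the example that follows: P1's write makes it the exclusive owner, P2's read forces P1 to write back and drop to Shared, and P2's write then invalidates P1's copy.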

Example
P1: Write 10 to A1
- P1: Excl. A1 10; Bus: WrMs P1 A1
P1: Read A1
- P1: Excl. A1 10 (read hit, no bus action)
P2: Read A1
- P2: Shar. A1; Bus: RdMs P2 A1
- P1: Shar. A1 10; Bus: WrBk P1 A1 10; Memory: A1 10
- P2: Shar. A1 10; Bus: RdDa P2 A1 10; Memory: A1 10
P2: Write 20 to A1
- P1: Inv.; P2: Excl. A1 20; Bus: WrMs P2 A1; Memory: A1 10
P2: Write 40 to A2
What happens if P1 reads A1 at this time?

Coherence Protocols: Extensions
Shared-memory bus and snooping bandwidth is the bottleneck for scaling symmetric multiprocessors. Options:
- Duplicating tags
- Place the directory in the outermost cache
- Use crossbars or point-to-point networks with banked memory

True vs. False Sharing Misses
Coherence influences the cache miss rate through coherence misses:
- True sharing misses:
  - Write to a shared block (transmission of invalidation)
  - Read of an invalidated block
- False sharing misses:
  - Read of an unmodified word in an invalidated block
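
Since false sharing is easiest to see in code, here is a small illustrative pthread sketch (not from the lecture), assuming a 64-byte cache block: the two counters sit in the same block, so each thread's writes invalidate the other core's copy even though the threads never touch the same word.

```c
/* Build with: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Both counters land in one cache block: every increment by one thread
 * invalidates the block in the other thread's cache (false sharing).   */
struct { long a; long b; } pair;

/* Alternative layout for contrast: padding pushes b into its own block
 * (assuming 64-byte blocks), so the coherence misses disappear.         */
struct { long a; char pad[64]; long b; } padded_pair;

static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) pair.a++; return NULL; }
static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) pair.b++; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld, b = %ld\n", pair.a, pair.b);
    return 0;
}
```

No thread ever reads the other's counter, so none of these misses are true sharing; the block simply bounces between the two caches. Timing this against the padded layout typically shows a large slowdown from the false sharing alone.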

Performance Study: Commercial Workload
True sharing misses stay roughly constant as the L3 cache size grows, increase with processor count, and decrease with larger L3 block size.
- Performed on an AlphaServer (Alpha 21164) with a three-level memory hierarchy (1998)
- OLTP, DSS, and AltaVista benchmarks

Another Example for Snooping (built up one request at a time on the original slides; completed table shown)

Request               Action
P0: read 120          P0.B0: (S, 120, 0020), returns 0020
P0: write 120 ← 80    P0.B0: (M, 120, 0080); P3.B0: (I, 120, 0020)
P1: read 110          P0.B2: (S, 110, 0030); P1.B2: (S, 110, 0030), returns 0030; M: 110 ← 0030 (writeback)
P0: write 108 ← 48    P0.B1: (M, 108, 0048); P3.B1: (I, 108, 0008)
P0: write 130 ← 78    P0.B2: (M, 130, 0078)
P3: write 130 ← 78    P0.B2: (I, 130, 0078); P3.B2: (M, 130, 0078), returns 0078; M: 110 ← 0078 (writeback)

Directory Protocols
The directory keeps track of every block:
- Which caches have each block
- Dirty status of each block
One option: implement in the shared L3 cache
- Keep a bit vector of size = # cores for each block in L3
- Not scalable beyond a shared L3
Alternative: implement in a distributed fashion:

Directory Protocols
For each block, maintain its state in the directory:
- Shared: one or more nodes have the block cached, and the value in memory is up to date (directory keeps the set of node IDs)
- Uncached: no node has a copy
- Modified: exactly one node has a copy of the cache block, and the value in memory is out of date (directory keeps the owner node ID)
The directory maintains block states and sends invalidation messages.
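
One way to picture a directory entry, as a hedged C sketch (type and field names are illustrative, not from the lecture): the state from this slide plus a presence-bit vector with one bit per node, sized here for up to 64 nodes.

```c
#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;    /* Uncached / Shared / Modified (exclusive)      */
    uint64_t    sharers;  /* bit i set => node i has a cached copy         */
    int         owner;    /* owning node id, meaningful only when Modified */
} dir_entry_t;
```

With one such entry per memory block, directory storage grows with (#nodes × #blocks) presence bits, which is why a flat bit vector stops scaling for very large machines and motivates the distributed organization mentioned on the previous slide.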

Directory Protocols
[Figures: FSM for an individual cache block; FSM for a memory block in the directory]

Messages

Directory Protocols
For an uncached block:
- Read miss: the requesting node is sent the requested data and is made the only sharing node; the block is now shared
- Write miss: the requesting node is sent the requested data and becomes the sharing node; the block is now exclusive
For a shared block:
- Read miss: the requesting node is sent the requested data from memory, and the node is added to the sharing set
- Write miss: the requesting node is sent the value, all nodes in the sharing set are sent invalidate messages, the sharing set then contains only the requesting node, and the block is now exclusive

Directory Protocols
For an exclusive block:
- Read miss: the owner is sent a data fetch message, the block becomes shared, the owner sends the data to the directory, the data is written back to memory, and the sharer set contains the old owner and the requestor
- Data write back: the block becomes uncached and the sharer set is empty
- Write miss: a message is sent to the old owner to invalidate its copy and send the value to the directory, the requestor becomes the new owner, and the block remains exclusive
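
As a rough sketch of the home node's side of this protocol (illustrative C, with messages reduced to printf calls and data movement omitted), the handlers below follow the uncached, shared, and exclusive cases from the last two slides; dir_entry_t repeats the earlier directory-entry sketch.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 64

/* Repeated from the earlier directory-entry sketch */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;
typedef struct { dir_state_t state; uint64_t sharers; int owner; } dir_entry_t;

void dir_read_miss(dir_entry_t *e, int req) {
    switch (e->state) {
    case DIR_UNCACHED:                         /* requester becomes the only sharer       */
        printf("data value reply to node %d\n", req);
        e->state = DIR_SHARED;
        e->sharers = 1ULL << req;
        break;
    case DIR_SHARED:                           /* supply from memory, add to sharer set   */
        printf("data value reply to node %d\n", req);
        e->sharers |= 1ULL << req;
        break;
    case DIR_MODIFIED:                         /* fetch from owner, write back to memory  */
        printf("fetch from owner node %d (data written back)\n", e->owner);
        printf("data value reply to node %d\n", req);
        e->state = DIR_SHARED;
        e->sharers = (1ULL << e->owner) | (1ULL << req);
        break;
    }
}

void dir_write_miss(dir_entry_t *e, int req) {
    switch (e->state) {
    case DIR_UNCACHED:                         /* requester becomes exclusive owner       */
        printf("data value reply to node %d\n", req);
        break;
    case DIR_SHARED:                           /* invalidate every other sharer           */
        for (int n = 0; n < MAX_NODES; n++)
            if (n != req && ((e->sharers >> n) & 1ULL))
                printf("invalidate to node %d\n", n);
        printf("data value reply to node %d\n", req);
        break;
    case DIR_MODIFIED:                         /* old owner sends value and invalidates   */
        printf("fetch/invalidate to owner node %d\n", e->owner);
        printf("data value reply to node %d\n", req);
        break;
    }
    e->state = DIR_MODIFIED;                   /* requester is now the exclusive owner    */
    e->owner = req;
    e->sharers = 1ULL << req;
}

void dir_write_back(dir_entry_t *e) {          /* owner evicts its dirty copy             */
    e->state = DIR_UNCACHED;
    e->sharers = 0;
}
```

Walked by hand, these handlers reproduce the CPU 0 / CPU 1 / CPU 2 sequence on the following slides, with the sharers field playing the role of the directory's presence-bit vector.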

Directory-Based Protocol
[Figure: CPU 0, CPU 1, and CPU 2, each with a cache, a local memory, and a directory slice, connected by an interconnection network. Each directory entry holds a state and a bit vector with one presence bit per CPU.]
Initially the tracked block is Uncached (U, bit vector 0 0 0) and its memory copy holds the value 7.

CPU 0 reads:
- CPU 0 misses and sends a read miss over the network to the home directory
- The directory sets the state to Shared (S, 1 0 0) and supplies the value 7 from memory; CPU 0's cache now holds 7

CPU 2 reads:
- CPU 2 sends a read miss
- The directory adds CPU 2 to the sharer set (S, 1 0 1) and supplies 7 from memory; caches 0 and 2 both hold 7

CPU 0 writes 6 to the block:
- CPU 0 sends a write miss
- The directory sends an invalidate to CPU 2 and makes CPU 0 the exclusive owner (E, 1 0 0)
- CPU 0's cache holds 6; memory still holds the stale value 7

CPU 1 reads:
- CPU 1 sends a read miss
- The directory tells owner CPU 0 to switch to Shared; CPU 0 writes the value 6 back, updating memory to 6
- The block is now Shared (S, 1 1 0) and caches 0 and 1 both hold 6

CPU 2 writes 5 to the block:
- CPU 2 sends a write miss
- The directory invalidates the sharers and makes CPU 2 the exclusive owner (E, 0 0 1); CPU 2's cache holds 5 while memory holds 6

CPU 0 writes 4 to the block:
- CPU 0 sends a write miss
- The directory sends a "take away" (fetch/invalidate) message to owner CPU 2; CPU 2 writes 5 back to memory and invalidates its copy
- CPU 0 becomes the exclusive owner (E, 1 0 0); its cache receives the block and ends up holding 4 while memory holds 5

Acknowledgements
Some of the slides contain material developed and copyrighted by D. Patterson (UCB), B. Chen (U Arkansas), and instructor material for the textbook.