Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996


Review: Miss Rates for Snooping Protocol

The 4th C: coherency misses. With more processors, coherency misses increase while capacity misses decrease.

[Figure: miss rate (0-20%) vs. processor count (1, 2, 4, 8, 16) for fft, lu, barnes, ocean, volrend]

Benchmarks:
- Fast Fourier Transform (FFT): matrix transposes + computation
- LU: factorization of a dense 2D matrix (linear algebra)
- Barnes: Barnes-Hut n-body algorithm solving a galaxy evolution problem
- Ocean: simulates the influence of eddy and boundary currents on large-scale ocean flow; dynamic arrays per grid
- VolRend: parallel volume rendering (scientific visualization)

Review: Block Size

[Figure: miss rate (0-14%) vs. block size (16, 32, 64, 128 bytes) for fft, lu, barnes, ocean, volrend]

Since cache blocks hold multiple words, coherency traffic may arise for unrelated variables that happen to share a block. This false sharing arises from the use of an invalidation-based coherency algorithm: a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written.
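To make the mechanism concrete, here is a minimal C11 sketch; the struct names and the 64-byte block size are assumptions, not from the lecture. Two logically unrelated counters fall into the same block and ping-pong between caches, while padding separates them:

  #include <stdatomic.h>

  /* Hypothetical illustration: two counters that are logically unrelated
   * but fall in the same cache block, so a write by one processor
   * invalidates the other processor's copy (false sharing). */
  struct shared_bad {
      atomic_long count_a;   /* written by processor 0 */
      atomic_long count_b;   /* written by processor 1: same block! */
  };

  /* Padding the first counter out to a full block keeps the two in
   * separate blocks, eliminating the coherency traffic. 64 bytes is an
   * assumed block size; the lecture's data uses 16-128 byte blocks. */
  struct shared_good {
      atomic_long count_a;
      char pad[64 - sizeof(atomic_long)];
      atomic_long count_b;
  };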

Review: % Misses Caused by Coherency Traffic vs. Block Size

[Figure: % of misses due to coherency traffic (0-90%) vs. block size (16, 32, 64, 128 bytes) for fft, lu, barnes, ocean, volrend]

FFT communicates data in large blocks, and its communication adapts to the block size (it is a parameter to the code), so it makes effective use of large blocks. Ocean has competing effects that favor different block sizes: when accessing the boundary of each subgrid, accesses in one direction match the array layout and take advantage of large blocks, while in the other dimension they do not. These two effects largely cancel each other out, leading to an overall decrease in both the coherency misses and the capacity misses.

Review: Miss Rates for Snooping Protocol

[Figure: miss rate (0-20%) vs. processor count (1, 2, 4, 8, 16) for fft, lu, barnes, ocean, volrend]

Cache size is 64KB, 2-way set associative, with 32B blocks. With the exception of VolRend, the misses in these applications are generated by accesses to data that is potentially shared. Except for Ocean, data is heavily shared; in Ocean only the boundaries of the subgrids are shared, though the entire grid is treated as a shared data object. Since the boundaries change as we increase the processor count (for a fixed-size problem), different amounts of the grid become shared. The anomalous increase in miss rate for Ocean when moving from 1 to 2 processors arises from conflict misses in accessing the subgrids.

% Misses Caused by Coherency Traffic vs. Processors

[Figure: % of misses due to coherency traffic (0-80%) vs. processor count (1, 2, 4, 8, 16) for fft, lu, barnes, ocean, volrend]

The percentage of cache misses caused by coherency transactions typically rises when a fixed-size problem is run on more processors. The absolute number of coherency misses increases in all these benchmarks, including Ocean; in Ocean, however, it is difficult to separate these misses from others, since the amount of sharing of the grid varies with processor count. Invalidations increase significantly; in FFT the miss rate arising from coherency misses grows from nothing to almost 7%.

Review: Bus Traffic as Block Size Increases

[Figure: bytes per data reference (0.0-7.0) vs. block size (16, 32, 64, 128 bytes) for fft, lu, barnes, ocean, volrend]

Bus traffic climbs steadily as the block size is increased. For VolRend the increase is more than a factor of 10, although the low miss rate keeps the absolute traffic small. The factor-of-3 increase in traffic for Ocean is the best argument against larger block sizes. Remember that our protocol treats ownership misses the same as other misses, slightly increasing the penalty for large cache blocks: in both Ocean and FFT this effect accounts for less than 10% of the traffic.

Synchronization

Why synchronize? We need to know when it is safe for different processes to use shared data.

Issues for synchronization:
- An uninterruptible instruction to fetch and update memory (an atomic operation)
- User-level synchronization operations built from this primitive
- For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization

Uninterruptible Instruction to Fetch and Update Memory

Atomic exchange: interchange a value in a register for a value in memory.
- 0 => synchronization variable is free
- 1 => synchronization variable is locked and unavailable

Set the register to 1 and swap; the new value in the register determines success in getting the lock (see the C sketch after this slide):
- 0 if you succeeded in setting the lock (you were first)
- 1 if another processor had already claimed access
The key is that the exchange operation is indivisible.

Test-and-set: tests a value and sets it if the value passes the test.

Fetch-and-increment: returns the value of a memory location and atomically increments it.
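As a concrete illustration, here is a minimal lock built on atomic exchange, using C11 atomics rather than the hardware primitive the slide describes; the names lock_t, acquire, and release are mine:

  #include <stdatomic.h>

  /* 0 => free, 1 => locked, following the slide's convention. */
  typedef atomic_int lock_t;

  void acquire(lock_t *lock) {
      /* Atomically swap in 1; the old value tells us whether we won:
       * 0 means the lock was free and we now hold it,
       * 1 means another processor had already claimed it. */
      while (atomic_exchange(lock, 1) == 1)
          ;  /* spin and retry the exchange */
  }

  void release(lock_t *lock) {
      atomic_store(lock, 0);  /* mark the lock free again */
  }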

Uninterruptible Instruction to Fetch and Update Memory (cont.)

It is hard to have a read and a write in one instruction: use two instead.

Load linked (or load locked) + store conditional:
- Load linked returns the initial value
- Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise

Example: atomic swap with LL & SC:

  try:  mov  R3,R4     ; move exchange value
        ll   R2,0(R1)  ; load linked
        sc   R3,0(R1)  ; store conditional
        beqz R3,try    ; branch if store fails
        mov  R4,R2     ; put loaded value in R4

Example: fetch & increment with LL & SC:

  try:  ll   R2,0(R1)  ; load linked
        addi R2,R2,#1  ; increment (OK if reg-reg)
        sc   R2,0(R1)  ; store conditional
        beqz R2,try    ; branch if store fails
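C has no LL/SC, but the retry structure of the fetch-and-increment loop can be approximated with compare-and-swap; this is an assumption-level translation, not the lecture's code:

  #include <stdatomic.h>

  /* Fetch-and-increment with the same retry shape as the LL/SC loop:
   * read the old value, try to install old+1, and loop if another
   * store intervened (the analogue of sc returning 0). */
  long fetch_and_increment(atomic_long *addr) {
      long old = atomic_load(addr);                 /* like ll   */
      while (!atomic_compare_exchange_weak(addr, &old, old + 1))
          ;  /* CAS failed: 'old' was refreshed, retry like beqz */
      return old;
  }

A design note: CAS can suffer the ABA problem that LL/SC avoids, but for a monotonically incremented counter this does not matter.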

User-Level Synchronization Operations Using This Primitive

Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

          li   R2,#1
  lockit: exch R2,0(R1)  ; atomic exchange
          bnez R2,lockit ; already locked?

What about an MP with cache coherency? We want to spin on the cached copy to avoid full memory latency, and we are likely to get cache hits for such variables. Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic. Solution: start by simply repeatedly reading the variable; only when it changes do we try the exchange ("test and test&set", also sketched in C below):

  try:    li   R2,#1
  lockit: lw   R3,0(R1)  ; load variable
          bnez R3,lockit ; not free => spin
          exch R2,0(R1)  ; atomic exchange
          bnez R2,try    ; already locked?
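The same test-and-test-and-set idea as a C11 sketch (the function name is mine):

  #include <stdatomic.h>

  /* Test-and-test-and-set, as in the slide: spin on a plain read
   * (which hits in the local cache) and only attempt the
   * traffic-generating exchange once the lock looks free. */
  void acquire_tts(atomic_int *lock) {
      for (;;) {
          while (atomic_load(lock) != 0)
              ;  /* spin on cached copy; no invalidations generated */
          if (atomic_exchange(lock, 1) == 0)
              return;  /* the exchange saw 0: we hold the lock */
          /* another processor won the race; resume read-only spinning */
      }
  }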

Steps for Invalidate Protocol

($ = cache state: Sh = shared, Ex = exclusive, Inv = invalid; r = register value returned by exch, l = lock value)

  Step  P0          $    P1          $    P2          $    Bus/directory activity
  1.    Has lock    Sh   spins       Sh   spins       Sh   None
  2.    Lock <- 0   Ex               Inv              Inv  P0 invalidates lock
  3.                Sh   miss        Sh   miss        Sh   WB P0; P2 gets bus
  4.                Sh   waits       Sh   lock = 0    Sh   P2 cache filled
  5.                Sh   lock = 0    Sh   exch        Sh   P2 cache miss (write invalidate)
  6.                Inv  exch        Inv  r=0; l=1    Ex   P2 cache filled; invalidate
  7.                Inv  r=1; l=1    Ex   locked      Inv  WB P2; P1 cache filled
  8.                Inv  spins       Ex               Inv  None

For Large-Scale MPs, Synchronization Can Be a Bottleneck

Example: 20 processors spin on a lock held by one processor; each bus transaction takes 50 cycles.

  Read miss by all waiting processors to fetch the lock:          1000
  Write miss by the releasing processor, plus invalidates:          50
  Read miss by all waiting processors:                            1000
  Write miss by all waiting processors, one successful lock,
    plus invalidation of all copies:                              1000
  Total time for one processor to acquire & release the lock:     3050

Each time a processor gets the lock, it drops out of the competition, so the average cost per pass is about 3050/2 = 1525 cycles; 20 x 1525 = 30,500 cycles for 20 processors to pass through the lock.

The problem is contention for the lock and serialization of lock access: once the lock is free, all processors compete to see who gets it.

Alternative: create a list of waiting processors and hand the lock down the list; this is called a queuing lock. It requires special hardware to recognize the first lock access and the lock release.

Another mechanism: fetch-and-increment. It can be used to create a barrier: wait until everyone reaches the same point (see the sketch below).
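One plausible barrier built on fetch-and-increment, as the slide suggests, is a sense-reversing barrier. This is a sketch, with all names assumed:

  #include <stdatomic.h>

  typedef struct {
      atomic_int count;   /* arrivals so far in this episode */
      atomic_int sense;   /* flips each time the barrier opens */
      int total;          /* number of participating processors */
  } barrier_t;

  void barrier_wait(barrier_t *b) {
      int my_sense = !atomic_load(&b->sense);  /* sense this episode will flip to */
      /* fetch-and-increment counts this arrival atomically */
      if (atomic_fetch_add(&b->count, 1) == b->total - 1) {
          atomic_store(&b->count, 0);          /* last arrival resets... */
          atomic_store(&b->sense, my_sense);   /* ...and releases everyone */
      } else {
          while (atomic_load(&b->sense) != my_sense)
              ;  /* spin until the barrier opens */
      }
  }

Note that the spin is on a read, so as with test-and-test-and-set, waiters hit in their local caches until the release.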

Another MP Issue: Memory Consistency Models

What is consistency? When must a processor see a new value? For example:

  P1: A = 0;             P2: B = 0;
      ...                    ...
      A = 1;                 B = 1;
  L1: if (B == 0) ...    L2: if (A == 0) ...

Is it impossible for both if statements L1 and L2 to be true? What if the write invalidate is delayed and the processor continues? Memory consistency models define the rules for such cases.

Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses of different processors were arbitrarily interleaved => the assignments above complete before the ifs. SC implementation: delay all memory accesses until all invalidates are done.
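The example can be rendered as two C11 threads; with the default sequentially consistent atomics the "both ifs true" outcome is forbidden, while relaxed ordering permits it. A sketch, with identifiers mine:

  #include <stdatomic.h>

  atomic_int A, B;   /* both initially 0 */

  /* With memory_order_seq_cst (the default for atomic_store/load),
   * the machine behaves as if all accesses were interleaved into one
   * total order, so at least one thread must see the other's write:
   * r1 and r2 cannot both end up 0. */
  void p1(int *r1) {
      atomic_store(&A, 1);
      *r1 = atomic_load(&B);   /* L1: if (B == 0) ... */
  }

  void p2(int *r2) {
      atomic_store(&B, 1);
      *r2 = atomic_load(&A);   /* L2: if (A == 0) ... */
  }

  /* Under relaxed ordering, both loads may return 0 -- the software
   * analogue of a delayed write invalidate:
   *   atomic_store_explicit(&A, 1, memory_order_relaxed);
   *   *r1 = atomic_load_explicit(&B, memory_order_relaxed);
   */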

Memory Consistency Models (cont.)

Schemes exist for faster execution than sequential consistency. This is not really an issue for most programs, because they are synchronized: a program is synchronized if all accesses to shared data are ordered by synchronization operations:

  write (x)
  ...
  release (s)  {unlock}
  ...
  acquire (s)  {lock}
  ...
  read (x)

Only programs willing to be nondeterministic are not synchronized. Since most programs are synchronized, there are several relaxed models for memory consistency, characterized by their attitude toward the orderings RAR, WAR, RAW, and WAW to different addresses.
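The write/release/acquire/read pattern maps directly onto C11 release-acquire atomics; a minimal sketch (names assumed):

  #include <stdatomic.h>

  int x;               /* ordinary shared data */
  atomic_int s;        /* the synchronization variable, initially 0 */

  void producer(void) {
      x = 42;                                             /* write(x)   */
      atomic_store_explicit(&s, 1, memory_order_release); /* release(s) */
  }

  void consumer(void) {
      while (atomic_load_explicit(&s, memory_order_acquire) == 0)
          ;                                               /* acquire(s) */
      int v = x;                    /* read(x): guaranteed to see 42 */
      (void)v;
  }

Because all accesses to x are ordered by the release/acquire pair, the program gets sequentially consistent results even on hardware with a relaxed memory model.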

Key Issues for MPs

Measuring performance:
- Not just time on one problem size, but how performance scales with P
- For a fixed-size problem (same memory per processor) and a scaled-up problem (fixed execution time)
- Take care to compare against the best uniprocessor algorithm, not just the parallel program on 1 processor (unless it is the best)

Multilevel caches, coherency, and inclusion:
- Invalidation at the L2 cache forces invalidation at higher levels if the caches adhere to the inclusion property
- But larger L2 blocks lead to several L1 blocks getting invalidated

Nonblocking caches and prefetching:
- More latency to hide, so nonblocking caches are even more important
- Make sense if there is available memory bandwidth; must balance bus utilization against false sharing (conflict with other processors)
- Want prefetch to be coherent ("nonbinding" to the local copy)

Virtual memory to get a shared-memory MP: Distributed Virtual Memory (DVM); pages are the units of coherency.