EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6)

EEC 581 Computer Architecture, Lec 11: Synchronization and Memory Consistency Models (4.5 & 4.6)
Chansu Yu, Electrical and Computer Engineering, Cleveland State University

Acknowledgement: Part of the class notes are from David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://www-inst.eecs.berkeley.edu/~cs252

Review
Caches contain all the information on the state of cached memory blocks.
Snooping: caches watch a shared medium; practical for smaller MPs, invalidating other cached copies on a write.
Sharing cached data raises two issues: coherence (which values are returned by a read) and consistency (when a written value will be returned by a read).

Outline
Review
Synchronization
Relaxed Consistency Models
Fallacies and Pitfalls
Cautionary Tale
Conclusion

Synchronization
Why synchronize? We need to know when it is safe for different processes to use shared data.
Example:
  lock(L);          /* wait until L is free */
  critical section;
  unlock(L);
Issues for synchronization:
  An uninterruptible instruction to fetch and update memory (atomic operation).
  For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization.

Software Solution (?)
Procedure lock(L):
  wait: lw  R1, L          ;
        bne R1, #0, wait   ; if L = 1 (locked), wait
        sw  L, #1          ; now, lock it myself
Procedure lock(L) is wrong! Is it wrong only in an MP, or in both UP and MP? (Another process or processor can grab the lock between the lw and the sw.)
Required capability: atomically read and modify a memory location.
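As an aside (not from the slides), the same race can be written down in C11 atomics: naive_lock below re-creates the broken load-then-store sequence, while tas_lock uses an atomic exchange, which is exactly the atomic read-and-modify capability the slide calls for. The names naive_lock, tas_lock, and unlock are illustrative.

#include <stdatomic.h>

atomic_int L = 0;                       /* 0 = free, 1 = locked */

void naive_lock(void)                   /* WRONG: read and write are two separate steps */
{
    while (atomic_load(&L) != 0)        /* lw R1, L ; bne R1, #0, wait */
        ;                               /* another processor can take the lock right here */
    atomic_store(&L, 1);                /* sw L, #1 -- both contenders may reach this store */
}

void tas_lock(void)                     /* correct: one atomic read-modify-write */
{
    while (atomic_exchange(&L, 1) != 0) /* test&set: returns the previous value of L */
        ;                               /* spin while the previous value was 1 */
}

void unlock(void)
{
    atomic_store(&L, 0);
}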

Basic Hardware Primitive
Atomic instruction:
  test&set R1, X   ; R1 = X and X = 1 (locked)
                   ; if X = 1: R1 = 1 and X = 1
                   ; if X = 0: R1 = 0 and X = 1
                   ; thus X = 1 and R1 = previous value of X
Procedure lock(L):
  wait: test&set R1, L        ; L = 1 and R1 = previous L
        bne      R1, #0, wait ; if L was 1 (locked), wait
Correct, but busy-waiting (spin-waiting).

Other Basic Hardware Primitives
Alternatives: swap, fetch-and-increment.
Recent alternative: a pair of instructions, load-locked (ll) and store-conditional (sc).
Procedure lock(L):
  wait: ll  R1, L            ;
        sc  #1, L            ; L = 1 only if there was no intervening write after the ll
        bne R1, #0, wait     ; if L was 1 (locked), wait
(A complete implementation also retries when the sc itself fails.)
Correct, but busy-waiting (spin-waiting).
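A hedged C sketch of the ll/sc idea: C11 has no portable load-locked/store-conditional pair, but a compare-and-swap loop serves the same purpose, since the store succeeds only if the location still holds the value that was observed. cas_lock and cas_unlock are illustrative names, not from the slides.

#include <stdatomic.h>

atomic_int lock_word = 0;               /* 0 = free, 1 = locked */

void cas_lock(void)
{
    int expected;
    do {
        expected = 0;                   /* like ll: we expect to observe "free" */
    } while (!atomic_compare_exchange_weak(&lock_word, &expected, 1));
                                        /* like sc: store 1 only if lock_word is still 0;
                                           retry otherwise, just as sc retries on failure */
}

void cas_unlock(void)
{
    atomic_store(&lock_word, 0);
}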

Busy-waiting
Spin locks: the processor continuously tries to acquire the lock, spinning around a loop until it succeeds.
What about an MP with cache coherency?
  We want to spin on a cached copy to avoid the full memory latency; such variables are likely to hit in the cache.
  Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic.

Busy-waiting Example
Assume P1 has locked the variable and is in the critical section.
Assume also that P2 and P3 want to get into the critical section.
P2 and P3 execute procedure lock(L), which means they write (hit or miss) repeatedly, invalidating each other's cached copies.

Better Solutions (test&set)
Procedure lock(L):
  wait: test&set R1, L        ; L = 1 and R1 = previous L
        bne      R1, #0, wait ; if L was 1 (locked), wait
Better: test first, then test&set:
Procedure lock(L):
  wait: lw       R1, L        ; spin on the cached copy (read only)
        bne      R1, #0, wait
        test&set R1, L        ; L = 1 and R1 = previous L
        bne      R1, #0, wait ; retry if someone took the lock in the meantime
Correct, and the spinning is now mostly on a local cached read rather than a write, so it does not flood the bus.

Better Solutions (ll/sc)
Procedure lock(L):
  wait: ll  R1, L
        sc  #1, L             ; L = 1 if no intervening write after the ll
        bne R1, #0, wait      ; if L was 1 (locked), wait
Better: test first, then sc:
Procedure lock(L):
  wait: ll  R1, L
        bne R1, #0, wait      ; spin on the cached copy (read only)
        sc  #1, L             ; L = 1 if no intervening write after the ll
        bne R1, #0, wait      ; retry if someone took the lock in the meantime
Correct, and the spinning is now mostly on a local cached read; see the C sketch below.
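The "test first, then test&set" version maps directly onto the test-and-test-and-set pattern. A minimal C11 sketch (ttas_lock and ttas_unlock are illustrative names):

#include <stdatomic.h>

atomic_int ttas_word = 0;               /* 0 = free, 1 = locked */

void ttas_lock(void)
{
    for (;;) {
        /* Spin on a plain read: it hits in the local cache and generates no
           invalidation traffic, unlike a repeated exchange. */
        while (atomic_load_explicit(&ttas_word, memory_order_relaxed) != 0)
            ;
        /* Only now attempt the write-generating atomic exchange. */
        if (atomic_exchange(&ttas_word, 1) == 0)
            return;                     /* previous value was 0: lock acquired */
    }
}

void ttas_unlock(void)
{
    atomic_store(&ttas_word, 0);
}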

Outline
Review
Synchronization
Relaxed Consistency Models
Fallacies and Pitfalls
Cautionary Tale
Conclusion

Another MP Issue: Memory Consistency Models
What is consistency? When must a processor see a newly written value?
  P1: A = 0;                   P2: B = 0;
      .....                        .....
      A = 1;                       B = 1;
  L1: if (B == 0) ...          L2: if (A == 0) ...
Is it impossible for both if statements L1 and L2 to be true?
What if the write invalidate is delayed and the processor continues?
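The slide's example can be run as two C11 threads. This is a sketch under the assumption of relaxed atomics, where the hardware or compiler may reorder each thread's store and load, so both if bodies can execute; the thread functions p1/p2 and the driver in main are illustrative.

#include <stdatomic.h>
#include <threads.h>
#include <stdio.h>

atomic_int A, B;                        /* the slide's shared flags, start at 0 */
atomic_int l1_taken, l2_taken;          /* record whether each branch executed  */

int p1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);       /*     A = 1    */
    if (atomic_load_explicit(&B, memory_order_relaxed) == 0)  /* L1: B == 0 ? */
        atomic_store(&l1_taken, 1);
    return 0;
}

int p2(void *arg)
{
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_relaxed);       /*     B = 1    */
    if (atomic_load_explicit(&A, memory_order_relaxed) == 0)  /* L2: A == 0 ? */
        atomic_store(&l2_taken, 1);
    return 0;
}

int main(void)
{
    thrd_t t1, t2;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    if (atomic_load(&l1_taken) && atomic_load(&l2_taken))
        puts("both branches taken: execution was not sequentially consistent");
    return 0;
}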

Another MP Issue: Memory Consistency Models (continued)
[Diagram: Proc 1 issues a write miss for A and later a read miss for B; Proc 2 issues a write miss for B and later a read miss for A. Whether each read hits or misses, and which value it returns, depends on when the invalidates complete.]
The example is the same as above: P1 writes A and then reads B, while P2 writes B and then reads A. When must each processor see the other's new value?

Another MP Issue: Memory Consistency Models
Memory consistency models: what are the rules for such cases?
Sequential consistency (SC): the result of any execution is the same as if the accesses of each processor were kept in program order and the accesses from different processors were arbitrarily interleaved. In the example above, SC keeps the assignments before the ifs.
SC implementation: delay every memory access until all invalidates for the preceding access are done.

Sequential Consistency Model
A processor cannot proceed until the invalidate is done.
[Diagram: the same A/B example; each write miss must complete its invalidates before the following read of the other variable is allowed to issue.]
Now, which condition is satisfied? (At most one of the two if bodies can execute.)
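For contrast, a sketch of the same two threads with sequentially consistent accesses: under SC the interleaving must keep each assignment before its if, so at least one of the loads must observe 1 and the two branches can never both be taken (p1_sc and p2_sc are illustrative names).

#include <stdatomic.h>

extern atomic_int A, B;                 /* the shared flags from the earlier sketch */

void p1_sc(void)
{
    atomic_store_explicit(&A, 1, memory_order_seq_cst);        /*     A = 1    */
    if (atomic_load_explicit(&B, memory_order_seq_cst) == 0) { /* L1: B == 0 ? */
        /* reachable only if P2 has not yet written B */
    }
}

void p2_sc(void)
{
    atomic_store_explicit(&B, 1, memory_order_seq_cst);        /*     B = 1    */
    if (atomic_load_explicit(&A, memory_order_seq_cst) == 0) { /* L2: A == 0 ? */
        /* reachable only if P1 has not yet written A */
    }
}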

Memory Consistency Model
We want execution faster than sequential consistency allows.
This is not an issue for most programs, because they are synchronized: a program is synchronized if all accesses to shared data are ordered by synchronization operations, e.g.
  write(x) ... release(s) {unlock} ... acquire(s) {lock} ... read(x)
Only programs willing to be nondeterministic are not synchronized: a data race means the outcome is a function of processor speed.

Relaxed Consistency Model
A processor can proceed while the invalidate is still in progress.
[Diagram: the same A/B example; the reads may issue before the preceding write's invalidates complete.]
Performance improves with out-of-order processors, but the results become unpredictable...

Relaxed Consistency Model
A processor can proceed while the invalidate is still in progress.
[Diagram: the same A/B example annotated with acquire(s) and release(s); SC enforces ordering on every access, while RC enforces it only at the acquire and release points.]

Memory Consistency Model (acquire/release)
Again, a program is synchronized if all accesses to shared data are ordered by synchronization operations:
  write(x) ... release(s) {unlock} ... acquire(s) {lock} ... read(x)
Acquire (before reading the shared variables): delay memory accesses that follow the acquire operation until the acquire completes.
Release (after writing the shared variables): do not grant access until the new values of all data written before it are visible.
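A minimal sketch of the write(x)/release(s)/acquire(s)/read(x) pattern using C11 release/acquire orderings; the variable names x and s follow the slide, while writer and reader are illustrative.

#include <stdatomic.h>

int x;                                  /* ordinary shared data                */
atomic_int s = 0;                       /* synchronization flag: 1 = published */

void writer(void)
{
    x = 42;                                              /* write(x)            */
    atomic_store_explicit(&s, 1, memory_order_release);  /* release(s) {unlock} */
}

int reader(void)
{
    while (atomic_load_explicit(&s, memory_order_acquire) == 0)
        ;                                                /* acquire(s) {lock}   */
    return x;                                            /* read(x): sees 42    */
}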

Relaxed Consistency Models
Key idea: allow reads and writes to complete out of order, but use synchronization operations to enforce ordering, so that a synchronized program behaves as if the processor were sequentially consistent.
By relaxing orderings, we may obtain performance advantages.
The model also specifies the range of legal compiler optimizations on shared data: unless synchronization points are clearly defined and programs are synchronized, the compiler cannot interchange a read and a write of two shared data items, because doing so might affect the semantics of the program.

Relaxed Consistency Models (which orderings are relaxed)
SC requires maintaining all four possible orderings: R->R, R->W, W->R (all writes completed before the next read), and W->W (all writes completed before the next write).
A relaxed model relaxes some of them:
  Relaxing W->R: because this retains the ordering among writes, many programs that operate correctly under sequential consistency also operate under this model without additional synchronization.
  Relaxing W->W.
  Relaxing R->W and R->R (e.g., PowerPC, Alpha).
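Where an ordering has been relaxed, an explicit fence can restore it at a chosen point. A sketch in C11 (needs_wr_order is an illustrative name) showing a W->R ordering enforced with atomic_thread_fence:

#include <stdatomic.h>

atomic_int X, Y;                        /* two shared locations */

int needs_wr_order(void)
{
    atomic_store_explicit(&X, 1, memory_order_relaxed);    /* W                              */
    atomic_thread_fence(memory_order_seq_cst);              /* enforce W->R at this point     */
    return atomic_load_explicit(&Y, memory_order_relaxed);  /* R: cannot move above the fence */
}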

Mark Hill's Observation
Instead, use speculation to hide latency while keeping the strict consistency model.
If the processor receives an invalidation for a memory reference before the reference is committed, it uses speculation recovery to back out the computation and restart with the invalidated memory reference.
1. An aggressive implementation of sequential consistency or processor consistency gains most of the advantage of the more relaxed models.
2. It adds little to the implementation cost of a speculative processor.
3. It allows the programmer to reason using the simpler programming models.

Answers to 1995 Questions about Parallelism
In the 1995 edition of this text, the chapter concluded with a discussion of two then-current controversial issues:
1. What architecture would very large scale, microprocessor-based multiprocessors use?
2. What was the role of multiprocessing in the future of microprocessor architecture?
Answer 1: Large-scale multiprocessors did not become a major and growing market; instead, clusters of single microprocessors or moderate SMPs did.
Answer 2: Astonishingly clear. For at least the next 5 years, future MPU performance will come from the exploitation of TLP through multicore processors rather than from exploiting more ILP.

Cautionary Tale
The key to the birth and development of ILP in the 1980s and 1990s was software, in the form of optimizing compilers that could exploit ILP.
Similarly, successful exploitation of TLP will depend as much on the development of suitable software systems as on the contributions of computer architects.
Given the slow progress on parallel software in the past 30+ years, it is likely that exploiting TLP broadly will remain challenging for years to come.

And in Conclusion
Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping implies uniform memory access).
A directory keeps an extra data structure to track the state of all cache blocks.
Distributing the directory yields a scalable shared-address multiprocessor: cache-coherent, non-uniform memory access (CC-NUMA).
MPs are highly effective for multiprogrammed workloads.
MPs proved effective for intensive commercial workloads, such as OLTP (assuming enough I/O to be CPU-limited), DSS applications (where query optimization is critical), and large-scale web-searching applications.