Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea


Programming With Locks Is Tricky

- Multicore processors are the way of the foreseeable future
  - Thread-level parallelism anointed as parallelism model of choice
- Just one problem: writing lock-based multi-threaded programs is tricky!
- More precisely:
  - Writing programs that are correct is "easy" (not really)
  - Writing programs that are highly parallel is "easy" (not really)
  - Writing programs that are both correct and parallel is difficult
  - And that's the whole point, unfortunately
- Selecting the "right" kind of lock for performance
  - Spin lock, queue lock, ticket lock, reader/writer lock, etc.
- Locking granularity issues

Goldibear and the 3 Locks

- Coarse-grain locks: correct, but slow
  - One lock for the entire database
  + Easy to make correct: no chance for unintended interference
  - Limits parallelism: no two critical sections can proceed in parallel
- Fine-grain locks: parallel, but difficult
  - Multiple locks, one per record
  + Fast: critical sections (to different records) can proceed in parallel
  - Difficult to make correct: easy to make mistakes
- Multiple locks: just right? (sorry, no fairytale ending)
  - Acct-to-acct transfer: must acquire both the id_from and id_to locks
  - Simultaneous transfers 241-to-37 and 37-to-241
  - Deadlock: circular wait for shared resources
  - Solution: always acquire multiple locks in the same order (see the sketch after this group of slides)
  - Just another thing to keep in mind when programming

More Lock Madness

- What if...
  - Some actions (e.g., deposits, transfers) require 1 or 2 locks and others (e.g., preparing statements) require all of them? Can these proceed in parallel?
- What if...
  - There are locks for global variables (e.g., an operation id counter)? When should operations grab this lock?
- What if... what if... what if...
- Lock-based programming is difficult... wait, it gets worse

And To Make It Worse

- Acquiring locks is expensive
  - By definition requires slow atomic instructions
  - Specifically, acquiring write permission to the lock
  - Ordering constraints (discussed soon) make it even slower
- ...and 99% of the time unnecessary
  - Most concurrent actions don't actually share data
  - You pay to acquire the lock(s) for no reason
- Fixing these problems is an area of active research
  - One proposed solution: Transactional Memory

Research: Transactional Memory (TM)

- Transactional Memory
  + Programming simplicity of coarse-grain locks
  + Higher concurrency (parallelism) of fine-grain locks
    - Critical sections only serialized if data is actually shared
  + No lock acquisition overhead
- Hottest thing since sliced bread (or was a few years ago)
  - No fewer than eight research projects: Brown, Stanford, MIT, Wisconsin, Texas, Rochester, Intel, Penn...

Transactional Memory: The Big Idea

- Big idea I: no locks, just shared data
  - Look ma, no locks
- Big idea II: optimistic (speculative) concurrency
  - Execute the critical section speculatively, abort on conflicts
  - "Better to beg for forgiveness than to ask for permission"
- Read set: set of shared addresses the critical section reads
  - Example: accts[37].bal, accts[241].bal
- Write set: set of shared addresses the critical section writes
  - Example: accts[37].bal, accts[241].bal
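To make the lock-ordering rule concrete, here is a minimal sketch in C with pthread mutexes. The per-account lock field, the transfer() helper, and MAX_ACCT are illustrative assumptions added to the slides' acct_t example, not part of the original slides.

#include <pthread.h>

#define MAX_ACCT 1024

struct acct_t { int bal; pthread_mutex_t lock; };
struct acct_t accts[MAX_ACCT];   /* shared by all threads; initialize
                                    each lock with pthread_mutex_init()
                                    before use */

void transfer(int id_from, int id_to, int amt) {
  if (id_from == id_to) return;  /* avoid locking the same mutex twice */
  /* Always lock the lower-numbered account first, so concurrent
     241-to-37 and 37-to-241 transfers cannot deadlock: both threads
     contend for lock 37 before either touches lock 241. */
  int lo = id_from < id_to ? id_from : id_to;
  int hi = id_from < id_to ? id_to : id_from;
  pthread_mutex_lock(&accts[lo].lock);
  pthread_mutex_lock(&accts[hi].lock);
  if (accts[id_from].bal >= amt) {
    accts[id_from].bal -= amt;
    accts[id_to].bal += amt;
  }
  /* Unlock order doesn't matter for deadlock avoidance. */
  pthread_mutex_unlock(&accts[hi].lock);
  pthread_mutex_unlock(&accts[lo].lock);
}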

Transactional Memory: Begin

- begin_transaction
  - Take a local register checkpoint
  - Begin locally tracking the read set (remember the addresses you read)
    - See if anyone else is trying to write it
  - Locally buffer all of your writes (invisible to other processors)
  + Local actions only: no lock acquire

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id_from, id_to, amt;

begin_transaction();
if (accts[id_from].bal >= amt) {
  accts[id_from].bal -= amt;
  accts[id_to].bal += amt;
}
end_transaction();

Transactional Memory: End

- end_transaction
  - Check the read set: is the data you read still valid (i.e., no writes to any of it)?
  - Yes? Commit the transaction: commit the writes
  - No? Abort the transaction: restore the checkpoint
  (same code as above)

Roadmap Checkpoint

- Thread-level parallelism (TLP)
- Shared memory model
  - Multiplexed uniprocessor
  - Hardware multithreading
  - Multiprocessing
- Synchronization
  - Lock implementation
  - Locking gotchas
- Cache coherence
  - Bus-based protocols
  - Directory protocols
- Memory consistency models
(figure: apps and system software above hardware with Mem, CPUs, and I/O)

Recall: Simplest Multiprocessor

- What if we don't want to share the L1 caches?
  - Bandwidth and latency issues
- Solution: use per-processor ("private") caches
  - Coordinate them with a Cache Coherence Protocol
(figure: two PCs and register files sharing a single data cache)

Shared-Memory Multiprocessors

- Conceptual model: the shared-memory abstraction
  - Familiar and feels natural to programmers
  - Life would be easy if systems actually looked like this...

Shared-Memory Multiprocessors

- ...but systems actually look more like this
  - Processors have caches
  - Memory may be physically distributed
  - Arbitrary interconnect
(figure: processors with caches, distributed memories M0-M3, interconnect)
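The begin_transaction()/end_transaction() calls above are an abstract interface, not a real library. As one concrete point of comparison (not the system the slides describe), GCC's transactional-memory extension can express the same transfer; its runtime re-executes the block on conflict, mirroring the checkpoint/abort behavior just described. A sketch, assuming gcc with -fgnu-tm:

#define MAX_ACCT 1024

struct acct_t { int bal; };
struct acct_t accts[MAX_ACCT];   /* shared across threads */

void transfer(int id_from, int id_to, int amt) {
  /* The block executes atomically; its loads and stores play the
     role of the read and write sets described above. */
  __transaction_atomic {
    if (accts[id_from].bal >= amt) {
      accts[id_from].bal -= amt;
      accts[id_to].bal += amt;
    }
  }
}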

Revisiting Our Motivating Example

- Two $100 withdrawals from account #241 at two ATMs
  - Each transaction maps to a thread on a different processor
  - Track accts[241].bal (address is in r3)
(figure: two processors each running the withdrawal critical section, locks not shown, sharing Mem)

No-Cache, No-Problem

- Scenario I: processors have no caches
  - No problem
(figure: each write goes straight to memory; the balance steps down to 400, then 300)

Cache Incoherence

- Scenario II(a): processors have write-back caches
  - Potentially 3 copies of accts[241].bal: memory, P0's cache, P1's cache
  - Can get incoherent (inconsistent)
(figure: per-step cache and memory contents; stale copies of 400 persist alongside a 300)

Write-Through Doesn't Fix It

- Scenario II(b): processors have write-through caches
  - This time only 2 (different) copies of accts[241].bal
  - No problem? What if another withdrawal happens on processor 0?
(figure: per-step cache and memory contents for the write-through case)

What To Do?

- No caches? Slow
- Make shared data uncachable? Faster, but still too slow
  - The entire accts database is technically "shared"
- Flush all other caches on writes to shared data? May as well not have caches
- Hardware cache coherence
  - Rough goal: all caches have the same data at all times
  + Minimal flushing, maximum caching: best performance

Bus-based Multiprocessor

- Simple multiprocessors use a bus
  - All processors see all requests at the same time, in the same order
- Memory
  - Single memory module, -or-
  - Banked memory modules
(figure: memory banks M0-M3 attached to a shared bus)
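The wrong totals in these scenarios are the classic "lost update". As a software-level illustration of the same symptom, the sketch below runs two unsynchronized withdrawals against a starting balance of 500 (an assumption consistent with the figures' 400-then-300 progression). The slides' deeper point is that stale cached copies can produce this outcome in hardware even when the program itself is properly locked.

#include <pthread.h>
#include <stdio.h>

static int balance = 500;   /* assumed starting balance */

static void *withdraw(void *arg) {
  (void)arg;
  int tmp = balance;   /* load    (ld r4,[r3] in the slides' terms) */
  tmp -= 100;          /* compute (sub r4,r4,100) */
  balance = tmp;       /* store   (st r4,[r3]) */
  return NULL;
}

int main(void) {
  pthread_t t0, t1;
  pthread_create(&t0, NULL, withdraw, NULL);
  pthread_create(&t1, NULL, withdraw, NULL);
  pthread_join(t0, NULL);
  pthread_join(t1, NULL);
  /* 300 if the withdrawals serialize, 400 if both read the stale 500;
     the outcome is timing-dependent (illustrative only). */
  printf("final balance: %d\n", balance);
  return 0;
}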

Hardware Cache Coherence

- Coherence: all copies have the same data at all times
- Coherence controller:
  - Examines bus traffic (addresses and data)
  - Executes the coherence protocol: what to do with the local copy when you see different things happening on the bus
- Three processor-initiated events
  - Ld: load, St: store, WB: write-back
- Two remote-initiated events
  - LdMiss: read miss from another processor
  - StMiss: write miss from another processor
(figure: CPU, coherence controller (CC), and D$ tags/data attached to the bus)

VI (MI*) Coherence Protocol

- VI (valid-invalid) protocol: aka "MI"
- Two states (per block in the cache)
  - V (valid): have the block
  - I (invalid): don't have the block
  + Can implement with a valid bit
- Protocol diagram (figure)
  - If you load/store a block: transition to V
  - If anyone else wants to read/write the block:
    - Give it up: transition to the I state
    - Write back if your own copy is dirty
- This is an invalidate protocol
- Update protocol: copy the data, don't invalidate
  - Sounds good, but wastes a lot of bandwidth
(figure: V/I state diagram with Ld/St, LdMiss, StMiss, and WB transitions)
(* M = modified, comes later)

VI Protocol (Write-Back Cache)

- An ld by processor 1 generates an "other load miss" event (LdMiss)
  - Processor 0 responds by sending its dirty copy, transitioning to I
(figure: per-step cache and memory contents, e.g., V:400 in P0's cache becoming I: after handing the block to P1)

VI -> MSI

- The VI protocol is inefficient
  - Only one cached copy allowed in the entire system
  - Multiple copies can't exist even if read-only
    - Not a problem in the example
    - Big problem in reality
- MSI (modified-shared-invalid)
  - Fixes the problem: splits the "V" state into two states
    - M (modified): local dirty copy
    - S (shared): local clean copy
  - Allows either
    - Multiple read-only copies (S-state), --OR--
    - A single read/write copy (M-state)
(figure: M/S/I state diagram with LdMiss, StMiss, and WB transitions)

MSI Protocol (Write-Back Cache)

- An ld by processor 1 generates an "other load miss" event (LdMiss)
  - P0 responds by sending its dirty copy, transitioning to S
- An st by processor 1 generates an "other store miss" event (StMiss)
  - P0 responds by transitioning to I
(figure: per-step cache and memory contents)

Cache Coherence and Cache Misses

- Coherence introduces two new kinds of cache misses
  - Upgrade miss: on stores to read-only blocks; delay to acquire write permission to a read-only block
  - Coherence miss: miss to a block evicted by another processor's requests
- Making the cache larger...
  - Doesn't reduce these types of misses
  - As the cache grows large, these sorts of misses dominate
- False sharing
  - Two or more processors sharing parts of the same block
  - But not the same bytes within that block (no actual sharing)
  - Creates pathological "ping-pong" behavior
  - Careful data placement may help, but is difficult
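The "careful data placement" fix for false sharing can be made concrete with C11 alignment. The struct names, the per-thread counters, and the 64-byte line size below are illustrative assumptions (64 bytes is typical of x86 caches):

#include <stdalign.h>

/* Bad: a and b land in the same cache block, so thread 0's stores
   to a keep invalidating thread 1's copy of b (and vice versa),
   even though no byte is actually shared. */
struct counters_bad {
  long a;   /* updated only by thread 0 */
  long b;   /* updated only by thread 1 */
};

/* Better: align each counter onto its own block; the coherence
   ping-pong disappears. */
struct counters_good {
  alignas(64) long a;
  alignas(64) long b;
};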

MSI -> MESI: Exclusive Clean Protocol

- Most modern protocols also include an E (exclusive) state
  - "I have the only cached copy, and it's a clean copy"
- Why is this state useful?
- A load takes I to E if no other processor is caching the block, otherwise to S
(figure: M/E/S/I state diagram with StMiss and WB transitions)

Exclusive Clean Protocol Optimization

- (No miss!) A store to a block held in E upgrades it to M silently: no other cache has a copy, so no bus transaction is needed
(figure: per-step cache and memory contents, with P0's copy going from E to M on a store without bus traffic)
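As a compact summary of the MSI transitions above, here is a toy C encoding of one cache's controller. It is illustrative only: the names are invented, real controllers also move data (noted in the comments), and adding the E state would introduce a fourth case.

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

/* Processor-initiated events: LD, ST, WB; remote events observed
   on the bus: REMOTE_LD_MISS, REMOTE_ST_MISS. */
typedef enum { LD, ST, WB, REMOTE_LD_MISS, REMOTE_ST_MISS } msi_event_t;

msi_state_t msi_next(msi_state_t cur, msi_event_t ev) {
  switch (cur) {
  case MSI_I:
    if (ev == LD) return MSI_S;              /* read miss: fetch shared copy */
    if (ev == ST) return MSI_M;              /* write miss: fetch writable copy */
    return MSI_I;                            /* remote traffic: nothing to do */
  case MSI_S:
    if (ev == ST) return MSI_M;              /* upgrade miss: get write permission */
    if (ev == REMOTE_ST_MISS) return MSI_I;  /* another processor will write */
    return MSI_S;                            /* loads (local or remote): stay shared */
  case MSI_M:
    if (ev == REMOTE_LD_MISS) return MSI_S;  /* supply dirty data, keep clean copy */
    if (ev == REMOTE_ST_MISS) return MSI_I;  /* supply dirty data, invalidate */
    if (ev == WB) return MSI_I;              /* eviction writes the block back
                                                (clean S evictions are silent and
                                                not modeled here) */
    return MSI_M;
  }
  return MSI_I;                              /* unreachable */
}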