Chapter 5. Thread-Level Parallelism


Instructor: Josep Torrellas, CS433
Copyright Josep Torrellas 1999, 2001, 2002, 2013

Progress Towards Multiprocessors
+ The rate of speed growth in uniprocessors has saturated
+ Wide-issue processors are very complex
+ Wide-issue processors consume a lot of power
+ Steady progress in parallel software (historically the major obstacle to parallel processing)

Flynn's Classification of Parallel Architectures
Classified according to the parallelism in the instruction (I) and data (D) streams:
+ Single I stream, single D stream (SISD): a uniprocessor
+ Single I stream, multiple D streams (SIMD): the same instruction is executed by multiple processors on different data
  - Each processor has its own data memory
  - A single control processor sends the same instruction to all processors
  - These processors are usually special purpose

+ Multiple I streams, single D stream (MISD): no commercial machine built
+ Multiple I streams, multiple D streams (MIMD):
  - Each processor fetches its own instructions and operates on its own data
  - The architecture of choice for general-purpose multiprocessors
  - Flexible: can be used in single-user mode or multiprogrammed
  - Can use off-the-shelf microprocessors

MIMD Machines
1. Centralized shared-memory architectures
   - Small numbers of processors (up to 16-32)
   - Processors share a centralized memory, usually connected by a bus
   - Also called UMA machines (Uniform Memory Access)
2. Machines with physically distributed memory
   - Support many processors
   - Memory is distributed among the processors
   - Scales the memory bandwidth if most accesses are to local memory

Figure 5.1

Figure 5.2

2. Machines with physically distributed memory (cont.)
   - Also reduces the memory latency
   - Of course, interprocessor communication is more costly and complex
   - Often each node is itself a cluster (a bus-based multiprocessor)
   - Two types, depending on the method used for interprocessor communication:
     1. Distributed shared memory (DSM), or scalable shared memory
     2. Message-passing machines, or multicomputers

DSMs:
- Memories are addressed as one shared address space: processor P1 writes address X, processor P2 reads address X
- "Shared memory" means that the same address on two processors refers to the same memory location, not that memory is centralized
- Also called NUMA (Non-Uniform Memory Access)
- Processors communicate implicitly via loads and stores
Multicomputers:
- Each processor has its own address space, disjoint from the others; it cannot be addressed by other processors

- The same physical address on two different processors refers to two different locations in two different memories
- Each processor-memory pair is effectively a separate computer
- Processes communicate explicitly by passing messages among them, e.g., messages to request data, send data, or perform some operation on remote data
- Synchronous message passing: the initiating processor sends a request and waits for a reply before continuing
- Asynchronous message passing does not wait

- Processors are notified of the arrival of a message by polling or by an interrupt
- Standard message-passing libraries exist, e.g., the Message Passing Interface (MPI); a minimal sketch follows below
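
To make the explicit message-passing style concrete, here is a minimal MPI sketch in C (an illustration added for this transcription, not part of the original slides): rank 0 sends one integer to rank 1 using the blocking MPI_Send/MPI_Recv pair.

    /* Minimal MPI example: explicit communication between two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            /* Send one int to rank 1, tag 0; blocks until the buffer is reusable */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocking receive: waits until the message arrives */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }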

Shared-Memory Communication
+ Compatibility with well-understood mechanisms from centralized multiprocessors
+ Ease of programming and compiler design for programs with irregular communication patterns
+ Lower communication overhead and better use of bandwidth for small communications
+ Reduced remote communication through automatic caching of data

Message-Passing Communication
+ Simpler hardware (no support needed for cache coherence in hardware)
± Communication is explicit: painful, but it forces programmers and compilers to pay attention to and optimize communication

Challenges in Parallel Processing
1) Serial sections
   e.g., to achieve a speedup of 80 with 100 processors, what fraction of the original computation can be sequential?

By Amdahl's law:

    Speedup = 1 / ((1 - f_parallel) + f_parallel/100) = 80

Solving gives f_parallel = 99.75%; only 0.25% of the original computation can be sequential.

2) Large latency of remote accesses (50-1,000 clock cycles)
   Example: a machine with a 0.5 ns clock cycle has a round-trip remote-access latency of 200 ns (400 cycles). 0.2% of instructions cause a remote miss that stalls the processor. The base CPI without misses is 0.5. What is the new CPI?

    CPI = 0.5 + 0.2% × (200 / 0.5) = 0.5 + 0.002 × 400 = 1.3
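
For completeness, the algebra behind the 99.75% figure (a worked restatement added here, not on the original slide), in LaTeX:

    \[
    \frac{1}{(1 - f_{\mathrm{parallel}}) + f_{\mathrm{parallel}}/100} = 80
    \;\Longrightarrow\;
    1 - 0.99\,f_{\mathrm{parallel}} = \frac{1}{80} = 0.0125
    \;\Longrightarrow\;
    f_{\mathrm{parallel}} = \frac{0.9875}{0.99} \approx 0.9975
    \]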

The Cache Coherence Problem
- Caches are critical to modern high-speed processors
- Multiple copies of a block can easily become inconsistent: processor writes, I/O writes, ...
[Figure: two processors (P1, P3) each cache A; memory holds A = 5; after one processor writes A = 7, the other cache and memory still hold the stale value A = 5]

Hardware Solutions
The schemes can be classified based on:
- Snoopy schemes vs. directory schemes
- Write-through vs. write-back (ownership-based) protocols
- Update vs. invalidation protocols

Snoopy Cache Coherence Schemes
A distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.

Write-Through Schemes
All processor writes result in:
- an update of the local cache, and
- a global bus write that updates main memory and invalidates/updates all other caches holding that item
Advantage: simple to implement
Disadvantages:
- Since ~15% of references are writes, this scheme consumes tremendous bus bandwidth, so only a few processors can be supported
- Need for dual-tagged caches in some cases

Write-Back/Ownership Schemes
- When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth
- Most bus-based multiprocessors nowadays use such schemes
- Many variants of ownership-based protocols exist:
  - Goodman's write-once scheme
  - Berkeley ownership scheme
  - Firefly update protocol

Invalidation vs. Update Strategies
1. Invalidation: on a write, all other caches with a copy are invalidated
2. Update: on a write, all other caches with a copy are updated
- Invalidation is bad when there is a single producer and many consumers of data
- Update is bad when:
  - one PE writes many times before the data is read by another PE
  - junk data accumulates in large caches (e.g., after process migration)
Overall, invalidation schemes are more popular as the default.

[State-transition diagram for the invalidation, ownership-based protocol: states Invalid, Shared, Dirty; processor reads and writes (P-read, P-write) and bus events (bus read, bus write miss, bus invalidate) drive the transitions]

Illinois Scheme
States: I (invalid), VE (valid-exclusive), VS (valid-shared), D (dirty)
Two features:
- The cache knows whether it has a valid-exclusive (VE) copy; in the VE state, no invalidation traffic is needed on write hits
- If some cache has a copy, a cache-to-cache transfer is used
Advantages:
- Closely approximates the traffic of a uniprocessor for sequential programs
- In large cluster-based machines, cuts down latency
Disadvantages:
- Complexity of the mechanism that determines exclusiveness
- Memory must wait until the sharing status is determined
A C sketch of these transitions follows after the diagram.

[State-transition diagram for the Illinois scheme: states Invalid, Shared, Dirty, Valid-Exclusive; a processor read miss goes to Valid-Exclusive if no one else has the block and to Shared if someone has it; a processor write moves the line to Dirty; bus reads and bus write misses downgrade or invalidate copies]
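
To make the transitions concrete, here is a small C sketch of the processor-side state machine of an Illinois-style (MESI-like) protocol. This is an illustration with my own naming, not the slides' code, and the bus-side actions are reduced to comments.

    /* Illinois-style (MESI-like) processor-side state transitions.
       Simplified sketch: the bus side (invalidations, cache-to-cache
       transfers) is elided into comments. */
    typedef enum { INVALID, VALID_SHARED, VALID_EXCLUSIVE, DIRTY } line_state_t;
    typedef enum { P_READ, P_WRITE } proc_event_t;

    line_state_t next_state(line_state_t s, proc_event_t e, int others_have_copy) {
        switch (s) {
        case INVALID:
            if (e == P_READ)                  /* read miss: snoop reveals sharers */
                return others_have_copy ? VALID_SHARED : VALID_EXCLUSIVE;
            return DIRTY;                     /* write miss: bus write-miss invalidates others */
        case VALID_SHARED:
            if (e == P_WRITE) return DIRTY;   /* must send a bus invalidation */
            return VALID_SHARED;
        case VALID_EXCLUSIVE:
            if (e == P_WRITE) return DIRTY;   /* key feature: NO bus traffic needed */
            return VALID_EXCLUSIVE;
        case DIRTY:
            return DIRTY;                     /* read and write hits stay dirty */
        }
        return INVALID;                       /* unreachable */
    }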

DEC Firefly Scheme
Classification: write-back, update, no dirty sharing
States:
- VE (valid-exclusive): the only copy, and clean
- VS (valid-shared): a shared, clean copy; write hits result in updates to memory and to the other caches, and the entry remains in this state
- D (dirty): dirty exclusive (the only copy)
Uses a special shared line (SL) on the bus to detect the sharing status of a cache line
Supports the producer-consumer model well
What about sequential processes migrating between CPUs?

[State-transition diagram for the Firefly scheme: states Valid-Exclusive, Shared, Dirty; the shared line (SL) sampled on bus accesses chooses between exclusive and shared states; write hits in Shared update main memory and the other caches; bus read/write-miss events trigger the remaining transitions]

Directory-Based Cache Coherence
Key idea: keep track, in a global directory (in main memory), of which processors are caching a location and of its state.

Basic Scheme (Censier and Feautrier)
[Figure: processors with private caches connected to memory through an interconnection network; the directory in memory holds K presence bits and a dirty bit per block]
- Assume K processors
- With each cache block in memory: K presence bits and 1 dirty bit
- With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
Read miss by PE_i:
- If the dirty bit is OFF: read the block from main memory; turn p[i] ON
- If the dirty bit is ON: recall the line from the dirty PE (its cache state becomes shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to PE_i

Write miss by PE_i:
- If the dirty bit is OFF: supply the data to PE_i; send invalidations to all PEs caching that block and clear their p[k] bits; turn the dirty bit ON; turn p[i] ON
- If the dirty bit is ON: recall the data from the owner PE, which invalidates itself; (update memory); clear the bit of the previous owner; forward the data to PE_i; turn p[i] ON (the dirty bit stays ON the whole time)
Write hit by PE_i to valid (not owned) data in its cache:
- Access the memory directory; send invalidations to all PEs caching the block; clear their p[k] bits; supply the data to PE_i; turn the dirty bit ON; turn p[i] ON
A C sketch of these handlers follows below.
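
The sketch below paraphrases the slide's pseudocode for the directory's miss handlers in C; the types, the fixed K, and the helper functions (recall_from_owner, invalidate, supply_data) are all illustrative stand-ins, not a real protocol implementation.

    /* Censier-Feautrier-style directory entry and miss handlers (sketch). */
    #include <stdbool.h>

    #define K 16                       /* number of processors (illustrative) */

    typedef struct {
        bool present[K];               /* K presence bits */
        bool dirty;                    /* 1 dirty bit */
    } dir_entry_t;

    /* Hypothetical helpers standing in for real coherence actions */
    void recall_from_owner(int block) { (void)block; /* owner writes back */ }
    void invalidate(int pe, int block) { (void)pe; (void)block; }
    void supply_data(int pe, int block) { (void)pe; (void)block; }

    void read_miss(dir_entry_t *d, int block, int i) {
        if (d->dirty) {                /* recall the line; memory is updated */
            recall_from_owner(block);  /* owner's cache state -> shared */
            d->dirty = false;
        }
        d->present[i] = true;          /* turn p[i] ON */
        supply_data(i, block);
    }

    void write_miss(dir_entry_t *d, int block, int i) {
        if (d->dirty) {
            recall_from_owner(block);  /* previous owner invalidates itself */
        } else {
            for (int k = 0; k < K; k++)   /* invalidate all sharers */
                if (d->present[k]) invalidate(k, block);
        }
        for (int k = 0; k < K; k++)    /* clear all bits, then set p[i] */
            d->present[k] = (k == i);
        d->dirty = true;               /* PE_i is now the owner */
        supply_data(i, block);
    }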

Synchronization
- Typically built with user-level software routines that rely on hardware-based synchronization primitives
- Small machines: an uninterruptible instruction that atomically retrieves and changes a value; software synchronization mechanisms are then constructed on top of it
- Large-scale machines: powerful hardware-supported synchronization primitives
- Key: the ability to atomically read and modify a memory location

- Users are not expected to use the hardware mechanisms directly; instead, systems programmers build a synchronization library: locks, etc.
Examples of Primitives
1) Atomic exchange: interchanges a value in a register for a value in memory
   e.g., lock = 0 means free, 1 means taken; a processor tries to get the lock by exchanging a 1 in a register with the lock's memory location

   - If the value returned = 1: some other processor had already grabbed it
   - If the value returned = 0: you got it, and no one else can, since the location now holds 1
   - Can there be any races (e.g., two processors both thinking they got it)? Not if the exchange is atomic; consider what could happen if the read and the write were 2 separate instructions
2) Test-and-set: test a value and set it, e.g., test for zero and set to 1
3) Fetch-and-increment: return the value and increment it; 0 means the synchronization variable is unclaimed
A hedged C11 sketch of these three primitives follows below.
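
These three primitives map directly onto C11 atomics; the sketch below (my illustration, not from the slides) exercises each one on a lock or counter variable.

    /* The three synchronization primitives expressed with C11 atomics. */
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int lock = 0;               /* 0 = free, 1 = taken */
    atomic_int counter = 0;

    int main(void) {
        /* 1) Atomic exchange: write 1, get the old value back atomically */
        int old = atomic_exchange(&lock, 1);
        printf("exchange returned %d (0 means we got the lock)\n", old);

        /* 2) Test-and-set: C11 provides it on an atomic_flag */
        static atomic_flag f = ATOMIC_FLAG_INIT;
        int was_set = atomic_flag_test_and_set(&f);
        printf("test-and-set returned %d\n", was_set);

        /* 3) Fetch-and-increment: returns the old value, then increments */
        int ticket = atomic_fetch_add(&counter, 1);
        printf("fetch-and-increment returned %d\n", ticket);
        return 0;
    }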

A Second Approach: 2 Instructions
- Implementing a single atomic read-write instruction may be complex in hardware
- Instead, have 2 instructions: the 2nd instruction returns a value from which it can be deduced whether the pair executed as if atomic
- e.g., MIPS load linked (LL) and store conditional (SC)
  - If the contents of the location read by the LL change before the SC to the same address, the SC fails
  - If the processor context-switches between the two, the SC also fails

- The SC returns a value indicating whether it failed or succeeded
- The LL returns the initial value
Atomic exchange:

    try:  mov  R3,R4      ; R3 = value to swap in
          ll   R2,0(R1)   ; load linked: R2 = current value of 0(R1)
          sc   R3,0(R1)   ; store conditional: R3 = 1 on success, 0 on failure
          beqz R3,try     ; SC failed, retry
          mov  R4,R2      ; at the end: R4 and 0(R1) have been atomically exchanged

- If another processor intervenes and modifies the value between the LL and the SC, the SC returns 0 (and fails)
- Can also be used for atomic fetch-and-increment:

    try:  ll   R2,0(R1)   ; R2 = current value
          addi R3,R2,#1   ; R3 = R2 + 1
          sc   R3,0(R1)   ; try to store the incremented value
          beqz R3,try     ; SC failed, retry
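
The same retry pattern is what a C11 compare-and-swap loop typically compiles down to on LL/SC machines (e.g., ARM, RISC-V); a hedged sketch of fetch-and-increment built that way:

    /* Fetch-and-increment via a compare-and-swap retry loop (illustrative).
       On LL/SC hardware, atomic_compare_exchange_weak usually becomes an
       ll/sc pair, so failure-and-retry mirrors the beqz loop above. */
    #include <stdatomic.h>

    int fetch_and_increment(atomic_int *p) {
        int old = atomic_load(p);
        /* retry until the store half succeeds; 'old' is refreshed on failure */
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;
        return old;
    }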

Implementing Locks
Given an atomic operation, use the coherence mechanisms of a multiprocessor to implement spin locks.
Spin locks: locks that a processor continuously tries to acquire, spinning in a loop
+ the processor grabs the lock immediately after it is freed
- spinning ties up the processor
1) If there is no cache coherence: keep the lock in memory; to get the lock, continuously exchange in a loop

To release the lock: write a 0 to it

            daddui R2,R0,#1  ; R2 = 1
    lockit: exch   R2,0(R1)  ; atomically swap R2 with the lock
            bnez   R2,lockit ; got a 1 back: lock was taken, retry

2) If there is cache coherence: try to cache the lock, so there is no need to access memory; the processor can spin in its cache
- Since there is locality in lock accesses (the processor that last acquired the lock is likely to acquire it next), the lock will already reside in that processor's cache

However: the processor cannot keep spinning with a write (each exchange is a write, which invalidates everyone and causes bus traffic to reload the lock). It should do only reads until it sees that the lock is available, and only then attempt an exchange: "test and test-and-set"

    lockit: ld     R2,0(R1)  ; spin: read-only test, hits in the local cache
            bnez   R2,lockit ; lock not free, keep reading
            daddui R2,R0,#1  ; lock looks free: R2 = 1
            exch   R2,0(R1)  ; try to grab it with one atomic exchange
            bnez   R2,lockit ; someone beat us to it, go back to spinning
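
In C11 the two acquire loops look as follows (a sketch of the idea, not the slides' code; the function names are mine):

    /* Naive exchange lock vs. test-and-test-and-set (illustrative). */
    #include <stdatomic.h>

    void lock_naive(atomic_int *l) {
        /* every iteration is a write: invalidates other caches each time */
        while (atomic_exchange(l, 1) != 0)
            ;
    }

    void lock_tts(atomic_int *l) {
        for (;;) {
            while (atomic_load(l) != 0)     /* read-only spin in own cache */
                ;
            if (atomic_exchange(l, 1) == 0) /* one write, only when it looks free */
                return;
        }
    }

    void unlock(atomic_int *l) {
        atomic_store(l, 0);                 /* release: write a 0 */
    }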

See Figure 5.24

Can do the same thing with LL/SC:

    lockit: ll     R2,0(R1)  ; read the lock
            bnez   R2,lockit ; not free, keep spinning
            daddui R2,R0,#1  ; R2 = 1
            sc     R2,0(R1)  ; try to claim it; 0 means the SC failed
            beqz   R2,lockit ; retry on failure

Problem with spin locks: they are not scalable; there is a lot of traffic when the lock is released if many processes are waiting. Example: 20 processors, each trying to lock the same variable.

- One processor releases the lock: the releasing write invalidates everybody (1 bus transaction)
- All 19 waiting processors take a read miss to reload the lock (19 bus transactions)
- All of them then attempt an exchange; each exchange invalidates the other processors (many more bus transactions)