
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming

Motivation behind Multiprocessors
- Limitations of ILP (as already discussed)
- Growing interest in servers and server performance
- Growth in data-intensive applications
- Increasing desktop performance relatively unimportant
- Effectiveness of multiprocessors for server applications
- Leveraging design investment by replication

Flynn's Classification of Parallel Architectures
- SISD (Single Instruction, Single Data stream): uniprocessors
- SIMD (Single Instruction, Multiple Data streams): suitable for data parallelism; Intel's multimedia extensions, vector processors; growing popularity in graphics applications
- MISD (Multiple Instruction, Single Data stream): no commercial multiprocessor to date
- MIMD (Multiple Instruction, Multiple Data streams): suitable for thread-level parallelism

MIMD
- Architecture of choice for general-purpose multiprocessors
- Offers flexibility
- Can leverage the design investment in uniprocessors
- Can use off-the-shelf processors: COTS (Commercial, Off-The-Shelf) processors
- Examples: clusters (commodity and custom), multicores

Shared-Memory Multiprocessors

Distributed-Memory Multiprocessors

Models for Memory and Communication
Memory architecture:
- shared memory: Uniform Memory Access (UMA), as in Symmetric (Shared-Memory) Multiprocessors (SMPs)
- distributed memory: Non-Uniform Memory Access (NUMA)
Communication architecture (programming):
- shared memory
- message passing

Other Ways to Categorize
Parallel programming:
- threads vs producer-consumer
- client-server vs peer-to-peer vs master-slave vs symmetric
- coarse vs fine grained
- SPMD vs MPMD
- tightly vs loosely coupled
Parallel programs:
- recursive vs iterative
- shared memory vs message passing
- data vs task parallel
- synchronous vs asynchronous

CACHES

Terminology
cache, virtual memory, memory stall cycles, direct mapped, valid bit, block address, write through, instruction cache, average memory access time, cache hit, page, miss penalty, fully associative, dirty bit, block offset, write back, data cache, hit time, cache miss, page fault, miss rate, n-way set associative, least-recently used, tag field, write allocate, unified cache, misses per instruction, block, locality, address trace, set, random replacement, index field, no-write allocate, write buffer, write stall

Memory Hierarchy

Four Memory-Hierarchy Questions
- Where can a block be placed in the upper level? (block placement)
- How is a block found if it is in the upper level? (block identification)
- Which block should be replaced on a miss? (block replacement)
- What happens on a write? (write strategy)

Where Can a Block Be Placed in a Cache?
- Only one place for each block: direct mapped; the block goes to (Block address) MOD (Number of blocks in cache)
- Anywhere in the cache: fully associative
- Restricted set of places: set associative; the block goes to set (Block address) MOD (Number of sets in cache)

Example

How is a Block Found if it is in the Cache?
- A tag in each cache block gives the block address; all possible tags are searched in parallel (associative memory)
- A valid bit tells whether a tag match is valid
- Fields in a memory address: tag, index, block offset
- No index field in fully associative caches
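
To make the field boundaries concrete, here is a minimal C sketch (not from the original slides; the block size and set count are made-up example values):

    /* Decompose an address into cache fields.
     * Assumed geometry: 64-byte blocks and 256 sets. */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64   /* bytes per block -> low bits are the offset */
    #define NUM_SETS   256  /* sets in the cache -> next bits are the index */

    int main(void) {
        uint64_t addr       = 0x12345678;
        uint64_t offset     = addr % BLOCK_SIZE;      /* byte within the block */
        uint64_t block_addr = addr / BLOCK_SIZE;      /* block address */
        uint64_t index      = block_addr % NUM_SETS;  /* which set to look in */
        uint64_t tag        = block_addr / NUM_SETS;  /* compared with stored tags */
        printf("tag=%llx index=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)index,
               (unsigned long long)offset);
        return 0;
    }

In a fully associative cache NUM_SETS is effectively 1, so the index disappears and the whole block address serves as the tag.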

Which Block Should be Replaced on a Miss?
- Random: easy to implement
- Least-recently used (LRU): rely on the past to predict the future; replace the block unused for the longest time
- First in, first out (FIFO): approximates LRU (evicts the oldest block, rather than the least recently used one); simpler to implement
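
As a sketch of how LRU victim selection can work (field and function names invented for illustration), each line records when it was last touched and the victim is the line with the oldest stamp:

    #include <stdint.h>

    #define WAYS 4  /* assumed associativity for this sketch */

    struct line { uint64_t tag; int valid; uint64_t last_used; };

    /* Pick the way to evict from one set: a free line if any,
     * otherwise the least-recently used one. */
    int pick_victim(const struct line set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid)
                return w;                    /* invalid line: no eviction needed */
            if (set[w].last_used < set[victim].last_used)
                victim = w;                  /* older access, better victim */
        }
        return victim;
    }

Real hardware avoids full per-line timestamps and uses cheaper approximations such as pseudo-LRU bits.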

Comparison of Replacement Policies
Data cache misses per 1000 instructions on five SPECint2000 and five SPECfp2000 benchmarks.

What Happens on a Write?
Reads dominate: writes are 7% of the overall memory traffic and 28% of the data cache traffic.
A write takes longer: on a read, fetching the cache line and checking validity can proceed in parallel, and the whole line can be read; a write must modify only the specified bytes, so it cannot proceed until the tag check completes.

Handling Writes
Write strategy:
- write through: write to the cache block and to the block in the lower-level memory
- write back: write only to the cache block; update the lower-level memory when the block is replaced
Block allocation strategy:
- write allocate: allocate a block on a write miss
- no-write allocate: do not allocate; the write has no effect on the cache
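
The two write strategies differ only in when the lower level is updated; a minimal sketch (helper names invented, data movement elided):

    #include <stdint.h>

    typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy;

    struct cline { int valid, dirty; uint64_t tag; };

    void handle_write_hit(struct cline *line, write_policy p) {
        /* both policies update the cached copy first */
        /* ... write the bytes into the cached block ... */
        if (p == WRITE_THROUGH) {
            /* propagate the write to the next level immediately,
               typically through a write buffer to hide the latency */
        } else {
            line->dirty = 1;  /* defer: memory is updated only when
                                 the block is eventually replaced */
        }
    }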

Example: Opteron Data Cache

Terminology
cache, virtual memory, memory stall cycles, direct mapped, valid bit, block address, write through, instruction cache, average memory access time, cache hit, page, miss penalty, fully associative, dirty bit, block offset, write back, data cache, hit time, cache miss, page fault, miss rate, n-way set associative, least-recently used, tag field, write allocate, unified cache, misses per instruction, block, locality, address trace, set, random replacement, index field, no-write allocate, write buffer, write stall

SYMMETRIC SHARED-MEMORY: CACHE COHERENCE

Quotes
"The turning away from conventional organization came in the middle 1960s, when the law of diminishing returns began to take effect in the effort to increase the operational speed of a computer... Electronic circuits are ultimately limited in their speed of operation by the speed of light... and many of the circuits were already operating in the nanosecond range."
W. Jack Bouknight et al., The Illiac IV System (1972)

"We are dedicating all of our future product development to multicore designs. We believe this is a key inflection point for the industry."
Intel President Paul Otellini, describing Intel's future direction at the Intel Developers Forum in 2005

Multiprocessor Cache Coherence
X is not in any cache initially; the caches are write-through.
Proposed definition: a memory system is coherent if any read of a data item returns the most recently written value of that data item.
Too simplistic!

Coherency: Take 2
A memory system is coherent if:
1. Writes and reads by one processor: a read by processor P of location X that follows a write by P to X, with no writes to X by another processor in between, always returns the value written by P.
2. Writes and reads by two processors: a read by one processor of a location written by another processor returns the written value, provided the read and write are sufficiently separated in time.
3. Writes by two processors are serialized: two writes to the same location by any two processors are seen in the same order by all processors.

What Needs to Happen for Coherence?
- Migration: data can move to a local cache, when needed
- Replication: data may be replicated in local caches, when needed
- Need a protocol to maintain the coherence property: specialized hardware

Coherence Protocols
Directory based: a central location (the directory) maintains the sharing state of each block of physical memory; slightly higher implementation cost, but scalable; mostly used for distributed-memory multiprocessors.
Snooping: each cache maintains the sharing state of the blocks it contains; uses a shared broadcast medium (e.g., a bus); each cache snoops on the medium to determine whether it has a copy of the requested block; mostly used for shared-memory multiprocessors.

Snooping Protocols: Handling Writes
- Write invalidate: a write requires exclusive access; any copy held by a reading processor is invalidated; if two processors attempt to write, one wins the race
- Write update (or write broadcast): update all the cached copies of a data item when it is written; consumes substantially more bandwidth than an invalidation-based protocol; not used in recent multiprocessors

Example of Invalidation Protocol: write-back caches

Write Invalidate Protocol: Observations
- Serialization is achieved through access to the broadcast medium
- Need to locate the data item upon a miss:
  - simple with write-through caches (write buffers may complicate this); write-through increases the memory bandwidth requirement
  - more complex with write-back caches: caches snoop for read addresses and supply a matching dirty block; no need to write back a dirty block if it is cached elsewhere; the preferred approach on most modern multiprocessors, due to lower memory bandwidth requirements
- Cache tags and valid bits can do double duty
- An additional bit indicates whether the block is shared
- Desirable to reduce contention on the cache between the processor and snooping

Invalidation-Based Coherence Protocol for Write-Back Caches with Allocate on Write
Idea: shared read, exclusive write.
Read hit: get the block from the local cache.
Read miss: get the block from memory or another processor's cache; write back the existing block if needed.
Write hit: write in the cache and mark the block exclusive (invalidate copies at other processors).
Write miss: get the block from memory or another processor's cache; write back the existing block if needed.
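
The protocol above amounts to a small per-block state machine. A simplified three-state (MSI-style) sketch in C, with invented event names (the MESI/MOESI mechanisms on the next slide add more states):

    typedef enum { INVALID, SHARED, MODIFIED } mstate;

    typedef enum {
        PR_READ, PR_WRITE,      /* requests from the local processor */
        BUS_READ, BUS_WRITE     /* requests snooped from other caches */
    } event;

    mstate next_state(mstate s, event e) {
        switch (s) {
        case INVALID:
            if (e == PR_READ)  return SHARED;    /* read miss: fetch the block */
            if (e == PR_WRITE) return MODIFIED;  /* write miss: fetch + invalidate others */
            return INVALID;
        case SHARED:
            if (e == PR_WRITE)  return MODIFIED; /* upgrade: broadcast an invalidate */
            if (e == BUS_WRITE) return INVALID;  /* another cache wants exclusive access */
            return SHARED;
        case MODIFIED:
            if (e == BUS_READ)  return SHARED;   /* supply the dirty block, keep a shared copy */
            if (e == BUS_WRITE) return INVALID;  /* supply the dirty block, then invalidate */
            return MODIFIED;
        }
        return INVALID;
    }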

Cache Coherence Mechanism (MESI or MOESI)
MESI extends the basic Modified/Shared/Invalid scheme with an Exclusive state (a clean, private copy); MOESI also adds an Owned state.

Write Invalidate Cache Coherence Protocol for Write-Back Caches

Interconnection Network, Instead of Bus

SYMMETRIC SHARED-MEMORY: PERFORMANCE

Performance Issues
- Cache misses: capacity, compulsory, conflict
- Cache coherence: true-sharing misses, false-sharing misses
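
False sharing is worth a concrete illustration (a sketch, not from the slides): two threads update different words that happen to live in the same cache block, so the block ping-pongs between the caches even though no value is actually shared.

    #include <pthread.h>
    #include <stdint.h>

    struct counters {
        volatile uint64_t a;   /* written only by thread 1 */
        volatile uint64_t b;   /* written only by thread 2; same 64-byte block as a */
    };

    static struct counters c;

    static void *bump_a(void *arg) {
        (void)arg;
        for (int i = 0; i < 10000000; i++)
            c.a++;  /* each write invalidates the other cache's copy of the block */
        return 0;
    }

    static void *bump_b(void *arg) {
        (void)arg;
        for (int i = 0; i < 10000000; i++)
            c.b++;
        return 0;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, 0, bump_a, 0);
        pthread_create(&t2, 0, bump_b, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;
    }

Padding each counter to its own block (e.g., with alignas(64)) removes the coherence traffic and typically speeds this loop up dramatically.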

Machine Performance: AlphaServer
AlphaServer 4100, 4 processors; each processor an Alpha 21164 (four-issue).
Three-level cache:
- L1: 8KB, direct-mapped, separate instruction and data caches, 32-byte block size, write through, on-chip
- L2: 96KB, 3-way set-associative, unified, 32-byte block size, write back, on-chip
- L3: 2MB, direct-mapped, unified, 64-byte block size, write back, off-chip
Latencies: L2: 7 cycles, L3: 21 cycles, memory: 80 cycles

Performance: Commercial Workload
- Online Transaction Processing (OLTP): modeled after TPC-B; client-server
- Decision Support System (DSS): modeled after TPC-D (since obsoleted); long-running queries against large, complex data structures
- Web index search: AltaVista, using a 200 GB memory-mapped database
I/O time is ignored (it is substantial for these applications).

Execution Time Breakdown

OLTP with Different Cache Sizes

OLTP: Contributing Causes of L3 Cycles

OLTP Mem. Access Cycles with Processor Count

OLTP Misses with L3 Cache Block Size

Multiprogramming and OS Workload
Models user and OS activities: the Andrew benchmark, which emulates software development (compiling, installing object files in a library, removing object files).
Memory hierarchy:
- L1 instruction cache: 32KB, 2-way set-associative, 64-byte block
- L1 data cache: 32KB, 2-way set-associative, 32-byte block
- L2: 1MB, unified, 2-way set-associative, 128-byte block
- Memory: 100 clock cycles access latency

Distribution of Execution Time

L1 Size and L1 Block Size

Kernel Data Cache Miss Rates

Memory Traffic Per Data Reference

DISTRIBUTED SHARED-MEMORY: DIRECTORY-BASED COHERENCE

Distributed Memory

Distributed Memory + Directories

Directory-Based Protocol States
- Shared: one or more processors have the block cached; memory has the up-to-date value
- Uncached: no processor has a copy of the cache block
- Modified: exactly one processor has a copy of the cache block; the memory copy is out of date; the processor with the copy is the block's owner
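
A sketch of what the directory tracks per memory block (field and function names invented for illustration):

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED_ST, MODIFIED_ST } dstate;

    struct dir_entry {
        dstate   state;
        uint64_t sharers;  /* bit i set => processor i has a copy */
    };

    /* On a write miss from processor p: invalidate all current
     * sharers and make p the block's owner. */
    void handle_write_miss(struct dir_entry *e, int p) {
        /* for each bit set in e->sharers: send an invalidate message
           to that processor (message sending elided in this sketch) */
        e->sharers = (uint64_t)1 << p;  /* now exactly one copy */
        e->state = MODIFIED_ST;         /* memory copy is out of date */
    }

A bit vector of sharers scales to modest processor counts; larger machines use coarser or pointer-based sharer representations.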

Possible Messages

States for Individual Caches

States for Directories

SYNCHRONIZATION

Basic Idea
- Hardware support for atomically reading and writing a memory location enables software to implement locks; a variety of synchronizations are possible with locks
- Multiple (equivalent) approaches are possible
- Synchronization libraries are built on top of the hardware primitives

Some Examples
- Atomic exchange: exchange a register and a memory value atomically; the returned value is the old memory contents, which tells whether the attempt (e.g., to acquire a lock) succeeded or failed
- Test-and-set: test a memory location and set its value if the test passes
- Fetch-and-increment: return the value of a memory location and atomically increment it
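
As a sketch, the atomic exchange above is enough to build a spin lock; in C11 atomics (illustrative, not from the slides):

    #include <stdatomic.h>

    static atomic_int lock = 0;  /* 0 = free, 1 = held */

    void acquire(void) {
        /* atomic exchange returns the old value:
           1 means someone else already held the lock, so retry */
        while (atomic_exchange(&lock, 1) == 1)
            ;  /* spin */
    }

    void release(void) {
        atomic_store(&lock, 0);
    }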

Paired Instructions
- Problem with single atomic operations: they complicate coherence
- Alternative: a pair of special load and store instructions: load linked (or load locked) and store conditional
- The pair can implement an atomic exchange

Other Primitives
Can also be built from the pair: atomic exchange, fetch-and-increment.
Implemented with a link register that tracks the address used by the LL instruction; an intervening write to that address causes the store conditional to fail.
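
C has no direct ll/sc, but a weak compare-exchange is the closest portable analogue and typically compiles to an ll/sc retry loop on architectures that have those instructions. A sketch of building atomic exchange this way (C11 also offers atomic_exchange directly; the point here is the pair-based construction):

    #include <stdatomic.h>

    int atomic_xchg(atomic_int *loc, int new_val) {
        int old = atomic_load(loc);
        /* retry if another processor wrote between the load and the
           store, just as a store conditional would fail */
        while (!atomic_compare_exchange_weak(loc, &old, new_val))
            ;  /* on failure, old is reloaded with the current value */
        return old;
    }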

Implementing Spin Locks: Uncached

Implementing Spin Locks: Cached
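
The cached variant is usually a test-and-test-and-set loop: spin reading the (locally cached) lock value and attempt the atomic exchange only when the lock looks free, so waiting generates no bus traffic. A C11 sketch (illustrative, not from the slides):

    #include <stdatomic.h>

    static atomic_int lk = 0;  /* 0 = free, 1 = held */

    void acquire_ttas(void) {
        for (;;) {
            while (atomic_load(&lk) == 1)
                ;  /* spin on the cached copy; no bus traffic while held */
            if (atomic_exchange(&lk, 1) == 0)
                return;  /* won the race */
            /* lost the race: another processor grabbed it; spin again */
        }
    }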

Cache Coherence Steps

Implementing Spin Locks: Linked Load/Store

MEMORY CONSISTENCY

Coherence vs Consistency
- Coherence defines the behavior of reads and writes to the same memory location
- Consistency defines the behavior of reads and writes with respect to accesses to other memory locations (we will return to consistency later)
Working assumptions:
- a write does not complete until all processors have seen the effect of that write
- a processor does not change the order of a write with respect to any other memory access

Memory Consistency
Related to accesses to multiple shared memory locations (coherence deals with accesses to a single shared location).

P1: A = 0;             P2: B = 0;
    ...                    ...
    A = 1;                 B = 1;
L1: if (B == 0) ...    L2: if (A == 0) ...

Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in program order and the accesses of different processors were arbitrarily interleaved; under it, P1 and P2 cannot both take their if branches.

Programmer's View
- Sequential consistency is easy to reason about
- Most real programs use synchronization; synchronization primitives are usually implemented in libraries
- Not using synchronization leads to data races
- Allowing sequential consistency for synchronization variables ensures correctness; a relaxed consistency model can be used for the rest

Relaxed Consistency Models
Idea: allow reads and writes to complete out of order, but use synchronization to enforce ordering.
Ways to relax consistency (weak consistency):
- Relaxing W→R ordering: processor consistency
- Relaxing W→W ordering: partial store order
- Relaxing R→W and R→R ordering: weak ordering, PowerPC consistency, release consistency
Two approaches to optimizing performance:
- use weak consistency and rely on synchronization for correctness
- use sequential or processor consistency together with speculative execution
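
A sketch of enforcing ordering under a relaxed model with explicit synchronization, using C11 release/acquire operations (illustrative; variable names invented):

    #include <stdatomic.h>

    static int payload;           /* ordinary, unsynchronized data */
    static atomic_int ready = 0;  /* synchronization variable */

    void producer(void) {
        payload = 42;  /* plain write, free to be reordered in general */
        atomic_store_explicit(&ready, 1,
            memory_order_release);  /* orders the write above before the flag */
    }

    void consumer(void) {
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;  /* acquire orders the read below after seeing the flag */
        /* guaranteed to observe payload == 42 here */
    }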

FALLACIES AND PITFALLS

Fallacies and Pitfalls
Pitfall: measuring the performance of multiprocessors by linear speedup versus execution time
- sequential vs parallel algorithms
- superlinear speedups and cache effects
- strong vs weak scaling
Pitfall: not developing software to take advantage of, or optimize for, a multiprocessor architecture