Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 UCB


Memory Consistency Model
- Defines memory correctness for parallel execution
- Execution appears to be that of some correct execution on a theoretical parallel computer with n sequential processors
- In particular, remote writes must become visible to a local processor in some correct sequence

Sequential Consistency Model
- Sequential consistency: memory reads and writes are globally serialized; assume that in every cycle only one processor can proceed by one step, and the result of a write appears on all other processors immediately
- Processors do not reorder their local reads and writes

Memory Consistency Examples
Initially A == 0 and B == 0

    P1                       P2
    A = 1;                   B = 1;
    L1: if (B == 0) ...      L2: if (A == 0) ...

- Is it impossible for both if conditions L1 and L2 to be true?
- Note: the number of possible total orderings is an exponential function of the number of instructions
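The sketch below recasts this example with C++11 atomics (a minimal illustration; the thread functions and the recorded results r1/r2 are additions of mine, not from the slides). With memory_order_seq_cst every execution is sequentially consistent, so at least one load must observe the other thread's store and r1 == 0 && r2 == 0 is impossible; weakening the operations to memory_order_relaxed allows exactly that outcome.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};
int r1 = -1, r2 = -1;

void p1() {
    A.store(1, std::memory_order_seq_cst);   // A = 1;
    r1 = B.load(std::memory_order_seq_cst);  // L1: if (B == 0) ...
}

void p2() {
    B.store(1, std::memory_order_seq_cst);   // B = 1;
    r2 = A.load(std::memory_order_seq_cst);  // L2: if (A == 0) ...
}

int main() {
    std::thread t1(p1), t2(p2);
    t1.join();
    t2.join();
    // Under sequential consistency at least one of r1, r2 must be 1.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```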

Performance of Sequential Consistency
- SC must delay every memory access until all invalidates from earlier accesses are done
- What if a write invalidate is delayed and the processor continues?
- What about memory-level parallelism?

Relaxed Memory Consistency
- Total store order (TSO): only writes are globally serialized; assume that in every cycle at most one write can proceed, and the write's result appears immediately. Processors may reorder local reads/writes that have no RAW dependence
- Processor consistency (PC): writes from one processor appear in the same order on all other processors. Processors may reorder local reads/writes that have no RAW dependence
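As a software-level illustration of what the relaxation permits (a sketch of my own, using C++11 atomics as a stand-in for the hardware model; not from the slides): with relaxed operations a processor's pending store may be overtaken by its later load, and an explicit full fence between them restores the sequentially consistent outcome.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1 = -1, r2 = -1;

void t1_body() {
    X.store(1, std::memory_order_relaxed);             // write X
    // Without this fence the store may still be buffered when the load
    // below executes, which is exactly what a TSO-like model allows.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = Y.load(std::memory_order_relaxed);            // read Y
}

void t2_body() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(t1_body), t2(t2_body);
    t1.join();
    t2.join();
    // With the fences r1 == 0 && r2 == 0 cannot occur; drop them and the
    // relaxed model permits it.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```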

Relaxed Memory Consistency
An architect has designed an SMP that uses processor consistency. Initially updated == 0 and flag == 0.

    P1                  P2               P3
    A = get_data();     if (updated)     if (flag)
    updated = 1;            flag = 1;        process_data(A);

A programmer found that P3 sometimes uses a stale value of A. Can you defend the architect?
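One way to see the issue in software terms (a minimal C++ sketch; get_data/process_data are replaced by a constant and a plain load, and using relaxed atomics to mimic the lack of cross-processor write ordering is my assumption): nothing orders P1's write of A before P3's read, even though P3 saw flag == 1. The defense of the architect is that the program is not properly synchronized for this model.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> A{0}, updated{0}, flag{0};

void p1() {
    A.store(42, std::memory_order_relaxed);        // A = get_data();
    updated.store(1, std::memory_order_relaxed);   // updated = 1;
}

void p2() {
    if (updated.load(std::memory_order_relaxed))   // if (updated)
        flag.store(1, std::memory_order_relaxed);  //     flag = 1;
}

void p3() {
    if (flag.load(std::memory_order_relaxed)) {    // if (flag)
        int a = A.load(std::memory_order_relaxed); //     process_data(A);
        (void)a;  // may still read 0 (stale): no ordering links P1's write to this read
    }
}

int main() {
    std::thread t1(p1), t2(p2), t3(p3);
    t1.join();
    t2.join();
    t3.join();
}
```

Promoting the stores of updated and flag to memory_order_release and the corresponding loads to memory_order_acquire makes the visibility transitive: a P3 that observes flag == 1 then also observes A == 42.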

Memory Consistency and ILP
- Speculate on loads, recover on violation
- With ILP and SC, what happens in this example?

    P1 code     P2 code     P1 exec          P2 exec
    A = 1       B = 1       spec. load B     spec. load A
    print B     print A     store A          store B
                            retire load B    flush at load A

- SC can be maintained, but it is expensive, so use TSO or PC
- Speculative execution and rollback can still improve performance
- Performance: ILP + a strong MC approaches that of a weak MC

Memory Consistency in Real Programs
- Not really an issue for many common programs; they are synchronized
- A program is synchronized if all accesses to shared data are ordered by synchronization operations:

    write(x)
    ...
    release(s) {unlock}
    ...
    acquire(s) {lock}
    ...
    read(x)
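For example (a minimal sketch, assuming a C++ std::mutex plays the role of the synchronization variable s; the shared variable x is from the slide, everything else is illustrative), whichever thread acquires the lock second is guaranteed to see the data written before the other thread's release, so the program's result does not depend on the machine's consistency model.

```cpp
#include <mutex>
#include <thread>

int x = 0;        // shared data
std::mutex s;     // synchronization variable "s"

void writer() {
    std::lock_guard<std::mutex> g(s);   // acquire(s)
    x = 1;                              // write(x)
}                                       // release(s) at scope exit

void reader() {
    std::lock_guard<std::mutex> g(s);   // acquire(s)
    int v = x;                          // read(x): if this acquire follows the
    (void)v;                            // writer's release, it must see x == 1
}

int main() {
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}
```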

Synchronization
Why synchronize? We need to know when it is safe for different processes to use shared data.
Issues for synchronization:
- An uninterruptible instruction to fetch and update memory (an atomic operation); synchronization operations need this kind of primitive
- For large-scale MPs, synchronization can be a bottleneck; we need techniques to reduce the contention and latency of synchronization

Uninterruptible Instructions to Fetch and Update Memory
- Atomic exchange: interchanges a value in a register with a value in memory
- Test-and-set: tests a value and sets it if the value passes the test
- Fetch-and-increment: returns the value of a memory location and atomically increments it
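The C++ sketch below (illustrative; the spin lock and counter are my examples, not from the slides) shows these primitives in action: atomic exchange serves as a test-and-set style spin lock, and fetch_add is a fetch-and-increment.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> lock_word{0};   // 0 = free, 1 = held
std::atomic<long> counter{0};

void acquire() {
    // Atomic exchange / test-and-set: write 1 and get the old value back.
    // If the old value was 1, someone else holds the lock, so spin.
    while (lock_word.exchange(1, std::memory_order_acquire) == 1)
        ;                        // spin
}

void release() {
    lock_word.store(0, std::memory_order_release);
}

void worker() {
    for (int i = 0; i < 1000; ++i) {
        // Fetch-and-increment: atomically adds 1 and returns the old value.
        counter.fetch_add(1, std::memory_order_relaxed);
        acquire();               // critical section protected by the spin lock
        release();
    }
}

int main() {
    std::vector<std::thread> ts;
    for (int i = 0; i < 4; ++i) ts.emplace_back(worker);
    for (auto& t : ts) t.join();
    std::printf("counter = %ld\n", counter.load());
}
```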

Parallel App: Commercial Workload
- Online transaction processing (OLTP) workload, like TPC-B or TPC-C
- Decision support system (DSS) workload, like TPC-D
- Web index search (AltaVista)

    Benchmark     % Time User Mode   % Time Kernel   % Time I/O (CPU idle)
    OLTP          71%                18%             11%
    DSS (range)   82-94%             3-5%            4-13%
    DSS (avg)     87%                4%              9%
    AltaVista     > 98%              < 1%            < 1%

Alpha 4100 SMP
- 4 CPUs, each an Alpha 21164 @ 300 MHz
- L1: 8 KB, direct mapped, write through
- L2: 96 KB, 3-way set associative
- L3: 2 MB (off chip), direct mapped
- Memory latency: 80 clock cycles
- Cache-to-cache transfer: 125 clock cycles

OLTP Performance as L3 Cache Size Varies
[Figure: execution time breakdown (Idle, PAL Code, Memory Access, L2/L3 Cache Access, Instruction Execution) for L3 cache sizes of 1 MB, 2 MB, 4 MB, and 8 MB.]

L3 Miss Breakdown
[Figure: L3 misses broken down into Instruction, Capacity/Conflict, Cold, False Sharing, and True Sharing, for cache sizes of 1 MB, 2 MB, 4 MB, and 8 MB.]

Memory CPI as CPU Count Increases
[Figure: memory CPI broken down into Instruction, Conflict/Capacity, Cold, False Sharing, and True Sharing, for 1, 2, 4, 6, and 8 processors; L3 is 2 MB, 2-way set associative.]

OLTP Performance as L3 Cache Block Size Varies
[Figure: misses broken down into Instruction, Capacity/Conflict, Cold, False Sharing, and True Sharing, for block sizes of 32, 64, 128, and 256 bytes.]

SGI Origin 2000: A Pure NUMA
- 2 CPUs per node
- Scales up to 2048 processors
- Designed for scientific computation rather than commercial processing
- Scalable bandwidth is crucial to the Origin

Parallel App: Scientific/Technical
FFT kernel: 1D complex-number FFT
- Two matrix-transpose phases => all-to-all communication
- Sequential time for n data points: O(n log n)
- Example is a 1-million-point data set
LU kernel: dense matrix factorization
- Blocking helps the cache miss rate; 16x16 blocks
- Sequential time for an n x n matrix: O(n^3)
- Example is a 512 x 512 matrix

FFT Kernel
[Figure: average cycles per memory reference (Hit, Miss to local memory, Remote miss to home, 3-hop miss to remote cache) for 8, 16, 32, and 64 processors.]

LU Kernel
[Figure: average cycles per memory reference (Hit, Miss to local memory, Remote miss to home, 3-hop miss to remote cache) for 8, 16, 32, and 64 processors.]

Parallel App: Scientific/Technical
Barnes app: Barnes-Hut n-body algorithm solving a problem in galaxy evolution
- n-body algorithms rely on forces dropping off with distance; bodies far enough away can be ignored (e.g., gravity falls off as 1/d^2)
- Sequential time for n data points: O(n log n)
- Example is 16,384 bodies
Ocean app: Gauss-Seidel multigrid technique to solve a set of elliptic partial differential equations
- Red-black Gauss-Seidel colors points in the grid so that points are consistently updated from the previous values of adjacent neighbors (see the sketch below)
- Multigrid solves the finite-difference equations by iteration over a hierarchy of grids
- Communication occurs when a boundary is accessed by an adjacent subgrid
- Sequential time for an n x n grid: O(n^2)
- Input: 130 x 130 grid points, 5 iterations
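A compact serial sketch of one red-black sweep (illustrative C++; the 5-point averaging stencil, boundary handling, and grid initialization are assumptions of mine, not details taken from the Ocean code): all red points are updated from their black neighbors, then all black points from the freshly updated red ones, which keeps the update order consistent and lets each subgrid be swept in parallel.

```cpp
#include <vector>

// One red-black Gauss-Seidel sweep over an n x n grid with a 5-point stencil.
// Interior points only; boundary values are held fixed.
void red_black_sweep(std::vector<std::vector<double>>& u) {
    const int n = static_cast<int>(u.size());
    for (int color = 0; color < 2; ++color) {            // 0 = red, 1 = black
        for (int i = 1; i < n - 1; ++i) {
            for (int j = 1; j < n - 1; ++j) {
                if ((i + j) % 2 != color) continue;       // checkerboard coloring
                // Every neighbor of a point has the opposite color, so each
                // half-sweep reads only values from the other color class.
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                  u[i][j - 1] + u[i][j + 1]);
            }
        }
    }
}

int main() {
    // 130 x 130 grid, 5 iterations, matching the input size quoted above.
    std::vector<std::vector<double>> grid(130, std::vector<double>(130, 0.0));
    for (double& v : grid[0]) v = 1.0;                    // an arbitrary fixed boundary
    for (int it = 0; it < 5; ++it) red_black_sweep(grid);
}
```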

Barnes App
[Figure: average cycles per memory reference (Hit, Miss to local memory, Remote miss to home, 3-hop miss to remote cache) for 8, 16, 32, and 64 processors.]

Ocean App
[Figure: average cycles per reference (Cache hit, Local miss, Remote miss, 3-hop miss to remote cache) for 8, 16, 32, and 64 processors.]

Multiprocessor Conclusion
- Some optimism about the future: parallel processing is beginning to be understood in some domains
- More performance than can be achieved with a single-chip microprocessor
- MPs are highly effective for multiprogrammed workloads
- MPs have proved effective for intensive commercial workloads, such as OLTP (assuming enough I/O that the workload is CPU-limited), DSS applications (where query optimization is critical), and large-scale web-search applications
- On-chip MPs appear to be growing in importance: 1) in the embedded market, where natural parallelism often exists, they are an obvious alternative to a faster but less silicon-efficient CPU; 2) diminishing returns in high-end microprocessor design encourage designers to pursue on-chip multiprocessing