MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)


Chapter 5, Appendix F, Appendix I

OUTLINE
- Introduction (5.1)
- Multiprocessor Architecture
- Challenges in Parallel Processing
- Centralized Shared-Memory Architectures (5.2)
- Performance of SMPs (5.3)

INTRODUCTION

INTRODUCTION
CPE731, Dr. Iyad Jafar
[Diagram: the move to multiprocessors, driven by power and ILP limitations, technology improvements, and new architectures and organizations]

INTRODUCTION
Why multiprocessors?
- Increased costs of silicon and energy to exploit ILP
- Increasing desktop performance is less important
- Advantage of replication rather than unique design
- Improved understanding of how to use multiprocessors effectively (especially in servers!)
- Growing interest in high-end servers for cloud computing and SaaS
- Growth in data-intensive applications

INTRODUCTION
Multiprocessor: tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space
- 2-32 processors
- Single-chip system (multicore) or multiple multicore chips
Multiprocessors exploit thread-level parallelism
- Parallel programming: execute tightly coupled threads that collaborate on a single task
- Request-level parallelism: execute multiple independent processes
- Single program or multiple applications (multiprogramming)
Multicomputers?

INTRODUCTION
- To maximize the advantage of a multiprocessor with n processors, we need n threads
- Independent threads are created by the programmer or the operating system
- TLP may exploit DLP: a thread may run some iterations of a loop to exploit data-level parallelism
- Grain size must be sufficiently large to compensate for the thread overhead!
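The loop-splitting idea above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides; all names (`scaled_sum`, `parallel_scaled_sum`) are made up.

```python
# Sketch: n threads each take a contiguous chunk of a loop's iteration
# space, so thread-level parallelism exploits the loop's data-level
# parallelism. Illustrative only; grain size matters in practice.
from threading import Thread

def scaled_sum(data, scale, out, tid, nthreads):
    # Each thread handles one contiguous slice of the iterations.
    chunk = (len(data) + nthreads - 1) // nthreads
    lo, hi = tid * chunk, min((tid + 1) * chunk, len(data))
    out[tid] = sum(scale * x for x in data[lo:hi])

def parallel_scaled_sum(data, scale, nthreads=4):
    out = [0] * nthreads
    threads = [Thread(target=scaled_sum, args=(data, scale, out, t, nthreads))
               for t in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(out)

print(parallel_scaled_sum(list(range(1000)), 2))  # 2 * (0+...+999) = 999000
```

Note that if the per-thread work (the grain) is too small, thread creation and join overhead dominates, which is exactly the grain-size caveat on the slide.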

MULTIPROCESSOR ARCHITECTURE
Symmetric Shared-Memory Multiprocessors (SMPs)
- Centralized shared-memory multiprocessors
- Small number of cores
- Share a single memory with uniform latency (UMA)

MULTIPROCESSOR ARCHITECTURE
Distributed Shared-Memory Multiprocessors (DSMs)
- Larger number of processors
- Memory distributed among processors
- Non-uniform memory access/latency (NUMA)
- Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks

MULTIPROCESSOR ARCHITECTURE
The term "shared memory" in both architectures means that threads communicate with each other through the same address space, i.e., any processor can reference any memory location as long as it has access rights. In DSM, the distributed memory adds communication complexity and overhead.

CHALLENGES
Limited Parallelism in Programs
Example: Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential? How can this be addressed?
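One way to work the exercise above is to apply Amdahl's law, Speedup = 1 / ((1 - f) + f/n), where f is the parallel fraction and n the processor count, and solve for f. The sketch below is illustrative (the function name is an assumption, not from the slides).

```python
# Amdahl's law, solved for the parallel fraction needed to reach a
# target speedup on n processors:
#   speedup = 1 / ((1 - f) + f/n)  =>  f = (1 - 1/speedup) / (1 - 1/n)
def required_parallel_fraction(target_speedup, n):
    return (1 - 1 / target_speedup) / (1 - 1 / n)

f = required_parallel_fraction(80, 100)
print(f"parallel fraction:   {f:.4%}")      # ~99.75%
print(f"sequential fraction: {1 - f:.4%}")  # ~0.25%
```

So only about 0.25% of the computation may be sequential, which is why limited parallelism in programs is listed as a fundamental challenge.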

CHALLENGES
Communication Overhead
Example: Suppose we have an application running on a 32-processor multiprocessor, which takes 200 ns to handle a reference to remote memory. For this application, assume that all references except those involving communication hit in the local memory hierarchy, which is slightly optimistic. Processors stall on a remote request, and the processor clock rate is 3.3 GHz. If the base CPI (assuming all references hit in the cache) is 0.5, how much faster is the multiprocessor if there is no communication versus if 0.2% of the instructions involve a remote communication reference?

CHALLENGES
Communication Overhead Example:
CPI_comm = CPI_ideal + miss penalty
         = 0.5 + remote request rate × remote request cost
         = 0.5 + 0.002 × (200 ns / 0.3 ns)
         ≈ 0.5 + 1.3 = 1.8
Speedup = 1.8 / 0.5 ≈ 3.6
The multiprocessor with all local references is about 3.6× faster.
How to address this? (SW and HW)
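The same calculation can be checked in a few lines; the function name below is illustrative, not from the slides.

```python
# Effective CPI with remote communication, and the resulting speedup
# of the all-local case over the communicating case.
def multiprocessor_slowdown(base_cpi, remote_rate, remote_ns, clock_ghz):
    cycle_ns = 1 / clock_ghz                      # ~0.3 ns at 3.3 GHz
    remote_cycles = remote_ns / cycle_ns          # ~660 cycles per remote ref
    cpi = base_cpi + remote_rate * remote_cycles  # ~1.8
    return cpi / base_cpi                         # all-local speedup, ~3.6

print(multiprocessor_slowdown(0.5, 0.002, 200, 3.3))
```

Even a 0.2% remote reference rate more than triples the effective CPI, which motivates both software (locality) and hardware (caching shared data) remedies.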

SMP ARCHITECTURES

SMP ARCHITECTURES
[Figure: Intel Nehalem (Nov 2008)]

SMP ARCHITECTURES
SMPs support caching of both private and shared data
- Reduces latency, bandwidth demand, and contention
- Caching private data is not a problem! Like a uniprocessor!
- Caching shared data raises issues for memory system behavior:
  - Coherence: what values can be returned by a read
  - Consistency: when a written value will be returned by a read

CACHE COHERENCE
[Diagram: P1, P2, and P3 each read X = 5 from memory into their caches; P3 then writes X = 8, leaving stale copies of X = 5 in the other caches]

CACHE COHERENCE
A memory system is coherent if:
- Program order is preserved: a read by processor P of location X that follows a write by P to X, with no writes to X by another processor occurring between the write and the read by P, always returns the value written by P
- Coherent view of memory: a read by a processor of location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
- Write serialization: two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1
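The write-serialization condition can be stated mechanically: no processor may observe a later write and then an earlier one. The checker below is an illustrative sketch (its name and representation are assumptions, not from the slides).

```python
# Check write serialization for a single location: given the global
# order in which values were written, a processor's observed reads
# must never go "backwards" in that order.
def respects_write_order(write_order, observed_reads):
    # write_order: values in the order they were written, e.g. [1, 2]
    # observed_reads: the values one processor read over time
    rank = {v: i for i, v in enumerate(write_order)}
    seen = [rank[v] for v in observed_reads]
    return all(a <= b for a, b in zip(seen, seen[1:]))

print(respects_write_order([1, 2], [1, 1, 2]))  # True: monotone in write order
print(respects_write_order([1, 2], [2, 1]))     # False: 2 then stale 1
```

The second call shows exactly the forbidden behavior from the slide: reading 2 and later reading 1.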

BASIC SCHEMES FOR ENFORCING COHERENCE
A program running on multiple processors will have copies of the same data in several caches. In a coherent multiprocessor, caches provide migration and replication:
- Migration: move data to a local cache and use it transparently; reduces latency and bandwidth demand
- Replication: copy data to individual caches for simultaneous reads; reduces latency and bus contention
Use a HW protocol to keep caches coherent instead of a SW approach.

BASIC SCHEMES FOR ENFORCING COHERENCE
Directory-based protocols
- The sharing status of a shared block is kept in one (or more) location, i.e., a directory
- In an SMP: centralized directory
- In a DSM: distributed directories
Snooping-based protocols
- Every cache that has a copy of a shared block keeps track of its sharing status
- In an SMP, caches are accessible via some broadcast medium; each cache monitors (snoops) the medium to determine whether it has a copy of the requested block
- Can be used in a multichip multiprocessor on top of a directory protocol within each multicore

SNOOPING COHERENCE PROTOCOLS
Write-update protocol (broadcast)
- A write to a cached shared item updates all cached copies via the medium
- Less popular; consumes bandwidth!
Write-invalidate protocol
- A write to a shared cached item invalidates all other cached copies (exclusive access)

BASIC IMPLEMENTATION TECHNIQUES
A bus or broadcast medium
- Perform invalidates by acquiring the bus first, then broadcasting the address
- Other processors snoop and check their caches for the broadcast address
- Invalidations by different processors are serialized by bus arbitration
Locating shared items on a miss
- Simple in write-through! Write-back is more difficult!
- However, with write-back, caches can snoop on read requests as well and provide the data if they have it in the dirty state
- Write buffers?

BASIC IMPLEMENTATION TECHNIQUES
Tracking state
- Use cache tags, valid and dirty bits to implement snooping
- 1 bit tracks the sharing state of each block
  - Exclusive/Modified: the processor has a modified copy of the block; no need to send invalidates on successive writes by the same processor
  - Shared: the block is in more than one private cache
Finite-state controller in each core
- Responds to requests from the core and the medium
- Changes the state of a cached block: invalid, modified, or shared
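The per-core controller described above can be sketched as a tiny write-invalidate MSI simulation. This is an illustrative sketch of the general technique, not the slides' exact protocol; all class and function names are made up.

```python
# Minimal write-invalidate MSI sketch: caches snoop a broadcast "bus".
INVALID, SHARED, MODIFIED = "I", "S", "M"

class Cache:
    def __init__(self):
        self.state = {}  # block address -> MSI state

    def snoop(self, op, addr):
        # React to another core's broadcast request for addr.
        st = self.state.get(addr, INVALID)
        if op == "write":                 # remote write: invalidate our copy
            self.state[addr] = INVALID
        elif op == "read" and st == MODIFIED:
            self.state[addr] = SHARED     # supply dirty data, downgrade to S

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def broadcast(self, src, op, addr):   # bus arbitration serializes these
        for c in self.caches:
            if c is not src:
                c.snoop(op, addr)

def read(cache, bus, addr):
    if cache.state.get(addr, INVALID) == INVALID:   # read miss
        bus.broadcast(cache, "read", addr)
        cache.state[addr] = SHARED

def write(cache, bus, addr):
    if cache.state.get(addr, INVALID) != MODIFIED:  # need exclusive access
        bus.broadcast(cache, "write", addr)         # invalidate other copies
    cache.state[addr] = MODIFIED                    # no broadcast on rewrites

c0, c1 = Cache(), Cache()
bus = Bus([c0, c1])
read(c0, bus, 0x40); read(c1, bus, 0x40)   # both end up Shared
write(c1, bus, 0x40)                       # c1 -> Modified, c0 -> Invalid
print(c0.state[0x40], c1.state[0x40])      # I M
```

Note how a second write by c1 would hit the Modified state and skip the broadcast, matching the "no invalidates on successive writes by the same processor" point on the slide.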

EXAMPLE PROTOCOL (INVALIDATE & WB)
Why write-back?


EXTENSIONS TO THE MSI PROTOCOL
The previous protocol is called MSI. Many extensions exist that add states and/or transactions to improve performance.
MESI protocol (Intel i7: MESIF)
- Exclusive state added to indicate that the cache line is the same as main memory and is the only cached copy
- When the state changes on a read miss, there is no need to write the block back to memory
MOESI protocol (AMD Opteron)
- MSI and MESI update memory whenever changing a block's state to Shared; in MOESI, a block can be changed from Modified to Owned without writing to memory
- MOESI adds the Owned state to indicate that a block is owned by that cache and out-of-date in memory
- The owner should update the block in memory on a miss

LIMITATIONS
Centralized memory can become a bottleneck as the number of processors or their memory demands increase
- A high-bandwidth connection to the L3 cache allowed 4 to 8 cores; however, it is not likely to scale!
- Multiple buses and interconnection networks such as crossbars or small point-to-point networks
- Banked memory or cache

LIMITATIONS
Snooping bandwidth can become a problem, since each processor must examine every miss
- Snooping may interfere with cache operation; duplicate the cache tags
- Centralized directory in the outermost cache; does not eliminate the bottleneck at the bus

PERFORMANCE OF SMPS
Performance is determined by:
- Traffic caused by cache misses of processors
- Communication traffic
Both are affected by processor count, cache size, and block size.
Coherence adds a fourth C to the 3Cs miss classification. Types of coherence misses:
- True sharing misses
- False sharing misses: with a single valid bit per block, writing one word in a block invalidates the entire block

PERFORMANCE OF SMPS
Coherence Misses Example: Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit.
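The slide's event table is not reproduced in this transcription, but the distinction it drills can be sketched: a miss on a shared block is a true sharing miss when the word accessed was actually written by another processor, and a false sharing miss when only a different word in the same block was written. The helper below is illustrative (its name is an assumption).

```python
# Classify a coherence miss on a block shared between two processors.
def classify_miss(accessed_word, words_written_remotely):
    # words_written_remotely: words in this block written by the other
    # processor since we last held a valid copy.
    if accessed_word in words_written_remotely:
        return "true sharing miss"
    return "false sharing miss"

# x1 and x2 live in the same cache block; the other processor wrote only x1.
print(classify_miss("x1", {"x1"}))  # true sharing miss
print(classify_miss("x2", {"x1"}))  # false sharing miss
```

The second case is the pure overhead of invalidating a whole block for a word the reader never needed, which single-valid-bit-per-block protocols cannot avoid.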

PERFORMANCE OF SMPS
1998 Study
- Processor: AlphaServer 4100 with four Alpha 21164 processors, 4 IPC at 300 MHz
- Workloads: TPC-B, online transaction processing (OLTP); TPC-D, decision support system (DSS); AltaVista, web index search

PERFORMANCE OF SMPS
OLTP has the poorest performance due to memory hierarchy problems. Consider evaluating OLTP while varying the L3 cache size, block size, and number of processors.

PERFORMANCE OF SMPS
Biggest improvement when moving from 1 to 2 MB L3?

PERFORMANCE OF SMPS
Instruction and capacity misses drop, but true sharing, false sharing, and compulsory misses are unaffected!

PERFORMANCE OF SMPS
Increase in true sharing misses!

PERFORMANCE OF SMPS
Reduce true sharing misses!