Chapter 9 Multiprocessors


ECE200 Computer Organization, Chapter 9: Multiprocessors
David H. Albonesi and the University of Rochester; Henk Corporaal, TU Eindhoven, Netherlands; Jari Nurmi, Tampere University of Technology, Finland
University of Rochester Spring 2003, TU Eindhoven 2004, TUT Spring 2004

What we'll cover today
- Multiprocessor motivation
- Classification of parallel computation
- Multiprocessor organizations
- Shared memory multiprocessors
- Cache coherence
- Synchronization
- Interconnection network basics

Multiprocessor motivation, part 1
Many scientific applications take too long to run on a single-processor machine: modeling of weather patterns, astrophysics, chemical reactions, ocean currents, etc. Many of these are parallel applications which largely consist of loops that operate on independent data. Such applications can make efficient use of a multiprocessor machine, with each loop iteration running on a different processor and operating on independent data.

Multiprocessor motivation, part 2
Many multi-user environments require more compute power than is available from a single-processor machine: an airline reservation system, a department store chain's inventory system, a file server for a large department, a web server for a major corporation, etc. These workloads largely consist of parallel transactions that operate on independent data. Such applications can make efficient use of a multiprocessor machine, with each transaction running on a different processor and operating on independent data.

Classification: Flynn Categories
- SISD (Single Instruction, Single Data): uniprocessors.
- MISD (Multiple Instruction, Single Data): systolic arrays / stream-based processing.
- SIMD (Single Instruction, Multiple Data): examples are the Illiac-IV and the CM-2 (Thinking Machines Corp). Simple programming model, low overhead. Now applied as sub-word parallelism!
- MIMD (Multiple Instruction, Multiple Data): examples are the Sun Enterprise 5000, Cray T3D, and SGI Origin. Flexible, and can use off-the-shelf micros.

Communication Models
- Shared memory: processors communicate through a shared address space. Easy on small-scale machines; the model of choice for uniprocessors and small-scale MPs; lower latency.
- Message passing: processors have private memories and communicate via messages. Focuses attention on costly non-local operations.
Either SW model can be supported on either HW base.

Shared Address Model Summary
Each processor can name every physical location in the machine, and each process can name all the data it shares with other processes. Data transfer is via load and store, with data sizes of byte, word, ..., or blocks (a minimal sketch follows below). A virtual memory manager may be used to map virtual addresses to local or remote physical addresses. The memory hierarchy model applies: communication now moves data to the local processor.

Multiprocessor organizations
Shared memory multiprocessors: all processors share the same memory address space, and there is a single copy of the OS (although some parts may be parallel). Relatively easy to program and to port sequential code to, but difficult to scale to large numbers of processors. [Figure: uniform memory access (UMA) machine block diagram.]
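The slides describe communication through ordinary loads and stores; the following minimal C sketch (not part of the original slides) shows one thread storing to a shared location and another loading it, with a C11 atomic flag standing in for the visibility ordering that memory consistency, discussed later, is about. The names producer, consumer, and ready are illustrative only.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int shared_data;         /* a location every processor can name */
    atomic_int ready = 0;    /* flag that publishes the store */

    void *producer(void *arg) {
        shared_data = 42;            /* data transfer via a store */
        atomic_store(&ready, 1);     /* make the data visible */
        return NULL;
    }

    void *consumer(void *arg) {
        while (atomic_load(&ready) == 0)
            ;                        /* wait for the producer's store */
        printf("consumer loaded %d\n", shared_data);  /* via a load */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }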

Multiprocessor organizations
Distributed memory multiprocessors: processors have their own memory address spaces, and message passing is used to access another processor's memory. There are multiple copies of the OS, usually running on commodity hardware and a commodity network (e.g., Ethernet). More difficult to program, but the hardware is easier to scale and more inherently fault resilient.

Multiprocessor variants
Non-uniform memory access (NUMA) shared memory multiprocessors: all memory can be addressed by all processors, but access to a processor's own local memory is faster than access to another processor's remote memory. Looks like a distributed machine, but the interconnection network is usually custom-designed switches and/or buses.

Multiprocessor variants
Distributed shared memory (DSM) multiprocessors: the commodity hardware of a distributed memory multiprocessor, but all processors have the illusion of shared memory. The operating system handles accesses to remote memory transparently on behalf of the application, relieving the application developer of the burden of memory management across the network.

Multiprocessor variants
Shared memory machines connected together over a network (operating as a distributed memory or DSM machine). [Figure: shared-memory nodes, each with a network controller, joined by a network.]

Message Passing Model
Explicit message send and receive operations: a send specifies the local buffer plus the receiving process on the remote computer, and a receive specifies the sending process on the remote computer plus the local buffer in which to place the data. Operations are typically blocking, but may use DMA. Message structure: header, data, trailer. (A send/receive sketch follows the comparison below.)

Communication Models Comparison
- Shared memory: compatibility with well-understood mechanisms; ease of programming for complex or dynamic communication patterns; shared-memory applications; efficient for small items; supports hardware caching.
- Message passing: simpler hardware; explicit communication; improved synchronization; easier for sender-initiated communication.
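As a concrete illustration of explicit send and receive, here is a minimal MPI sketch in C. MPI is an assumption on my part; the slides do not name a message-passing library. Note how the send names the receiver (rank 1) and a local buffer, while the blocking receive names the sender (rank 0) and the buffer in which to place the data.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            data = 42;
            /* send: local buffer + receiving process on the remote node */
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* blocking receive: sending process + local buffer for the data */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }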

Shared memory multiprocessors
Major design issues:
- Cache coherence: ensuring that stores to cached data are seen by other processors
- Synchronization: the coordination among processors accessing shared data
- Memory consistency: the definition of when a processor must observe a write from another processor

Cache coherence problem
Two writeback caches can become incoherent: (1) one cache reads the block; (2) a second cache reads the block; (3) the first cache writes the block, leaving old, out-of-date copies of the block in the other cache and in memory.

Potential HW Coherency Solutions
- Snooping solution (snoopy bus): send all requests for data to all processors; processors snoop to see if they have a copy and respond accordingly. Requires broadcast, since the caching information is at the processors. Works well with a bus (a natural broadcast medium) and dominates for small-scale machines (most of the market).
- Directory-based schemes: keep track of what is being shared in one centralized place; with distributed memory, the directory is distributed for scalability (avoids bottlenecks). Requests are sent point-to-point to processors via the network. Scales better than snooping, and actually existed BEFORE snooping-based schemes.

Cache coherence protocols
A coherence protocol ensures that cached blocks that are written to are observable by all processors. It assigns a state field to every cached block and defines actions for performing reads and writes to blocks in each state that ensure coherence. The actions are much more complicated than described here in a real machine with a split-transaction bus.

MESI coherence protocol
Commonly used (or a variant thereof) in shared memory multiprocessors. The idea is to ensure that when a cache wants to write to a block, the other remote caches invalidate their copies first. Each cached block is in one of four states (2 bits stored with each block):
- Invalid: the contents are not valid
- Shared: other processors' caches may have the same copy; memory has the same copy
- Exclusive: no other processor has a copy; memory has the same copy
- Modified: no other processor has a copy; memory has an old copy

MESI coherence protocol: actions on a load that results in a hit
- Local actions: read the block.
- Remote actions: none.

Actions on a load that results in a miss
- Local actions: request the block from the bus; if it is not in a remote cache, set the state to Exclusive; if it is also in a remote cache, set the state to Shared.
- Remote actions: look up the tags to see if the block is present; if so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Shared.

MESI coherence protocol: actions on a store that results in a hit
- Local actions: check the state of the block; if Shared, send an Invalidation bus command to all remote caches; write the block and change the state to Modified.
- Remote actions: upon receipt of an Invalidation command on the bus, look up the tags to see if the block is present; if so, change the state of the block to Invalid.

Actions on a store that results in a miss
- Local actions: simultaneously request the block from the bus and send an Invalidation command; after the block is received, write the block and set the state to Modified.
- Remote actions: look up the tags to see if the block is present; if so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Invalid.
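The transitions just described can be condensed into a few lines of C. This is a minimal sketch of the MESI transitions for a single block, under the simplifying assumption that bus arbitration, data transfer, and writebacks happen elsewhere; the function names are illustrative, not part of any real controller.

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* Local load. 'other_copy_exists' models the snoop response
       from remote caches on a miss. */
    mesi_t local_load(mesi_t state, int other_copy_exists) {
        if (state != INVALID)
            return state;                 /* hit: read block, no change */
        /* miss: request the block from the bus */
        return other_copy_exists ? SHARED : EXCLUSIVE;
    }

    /* Local store (hit or miss). */
    mesi_t local_store(mesi_t state) {
        if (state == SHARED || state == INVALID) {
            /* send an Invalidation command on the bus; on a miss
               (INVALID), simultaneously request the block */
        }
        return MODIFIED;                  /* we hold the only valid copy */
    }

    /* Remote cache snooping a bus request for a block it may hold. */
    mesi_t snoop(mesi_t state, int requester_is_storing) {
        if (state == INVALID)
            return INVALID;               /* not present: no action */
        /* if MODIFIED, provide the block to the requester (not shown) */
        return requester_is_storing ? INVALID : SHARED;
    }

Replaying the walkthrough that follows: local_load(INVALID, 0) yields Exclusive; the second cache's load turns both copies Shared via snoop; and local_store(SHARED) sends the invalidation and leaves the writer Modified.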

Cache coherence problem revisited
With MESI, the earlier scenario stays coherent: (1) the first cache reads the block, and its copy becomes Exclusive; (2) the second cache reads the block, and both copies become Shared; (3) before writing, the first cache sends an Invalidate command on the bus, which invalidates the remote copy; (4) the first cache writes the block, and its copy becomes Modified while the remote copy remains Invalid.

Synchronization
For parallel programs to share data, we must make sure that accesses to a given memory location are ordered. Example: a database of available inventory at a department store is simultaneously accessed from different store computers; only one computer must win the race to reserve a particular item.
Solution:
- The architecture defines a special atomic swap instruction in which a memory location is tested for 0 and, if so, is set to 1.
- Software associates a lock variable with each piece of data whose accesses need to be ordered (e.g., a particular class of merchandise) and uses the atomic swap instruction to try to set it.
- Software acquires the lock before modifying the associated data (e.g., reserving the merchandise).
- Software releases the lock by setting it to 0 when done.

Uninterruptible Instructions to Fetch and Update Memory
- Atomic exchange: interchange a value in a register for a value in memory, where 0 means the synchronization variable is free and 1 means it is locked and unavailable. Set the register to 1 and swap; the new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock (you were first), 1 if another processor had already claimed access. The key is that the exchange operation is indivisible.
- Test-and-set: tests a value and sets it if the value passes the test (also compare-and-swap, as on the previous slide).
- Fetch-and-increment: returns the value of a memory location and atomically increments it (0 means the synchronization variable is free).
[Figure: synchronization flowchart — attempt the atomic swap, spinning until the lock is obtained.]
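Here is a minimal sketch of the lock described on these slides, using C11 atomics in place of a specific ISA's atomic-swap instruction (an assumption, since the slides are ISA-neutral): 0 means free, 1 means held, and the acquirer spins exactly as in the flowchart.

    #include <stdatomic.h>

    typedef atomic_int lock_t;   /* 0 = free, 1 = locked */

    void acquire(lock_t *lock) {
        /* atomic exchange: write 1, get the old value back;
           an old value of 0 means we were first and own the lock */
        while (atomic_exchange(lock, 1) != 0)
            ;   /* spin: another processor holds the lock */
    }

    void release(lock_t *lock) {
        atomic_store(lock, 0);   /* set to 0 to free the lock */
    }

    /* Fetch-and-increment, as described above: atomically add 1
       and return the location's old value. */
    int fetch_and_increment(atomic_int *loc) {
        return atomic_fetch_add(loc, 1);
    }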

Synchronization and coherence example
[Figure: synchronization and coherence example.]

Bus (shared) or Network (switched)?
A network is claimed to be more scalable: there is no bus arbitration, and connections are point-to-point, but there is router overhead.

Shared vs. Switched Media
- Shared media: nodes share a single interconnection medium (e.g., Ethernet, a bus). Only one message is sent at a time. Inexpensive: one medium is used by all processors. Limited bandwidth: the medium becomes the bottleneck. Needs arbitration.
- Switched media: allow direct communication between source and destination nodes (e.g., ATM). Multiple users at a time. More expensive: the medium must be replicated. Higher total bandwidth. No arbitration, but added latency to go through the switch.

Network design parameters
The network design space is large:
- topology, degree
- routing algorithm: path, path control, collision resolution, network support, deadlock handling, livelock handling
- virtual layer support
- flow control, buffering
- QoS guarantees
- error handling
- etc., etc.

Switch Topology
Networks have a topology that indicates how the nodes are connected. The topology determines:
- Degree: number of links from a node
- Diameter: maximum number of links crossed between nodes
- Average distance: number of links to a random destination
- Bisection: minimum number of links that separate the network into two halves
- Bisection bandwidth: link bandwidth x bisection

Common Topologies (N = number of nodes, n = dimension)

Type        Degree   Diameter        Ave Dist       Bisection
1D mesh     2        N-1             N/3            1
2D mesh     4        2(N^(1/2)-1)    2N^(1/2)/3     N^(1/2)
3D mesh     6        3(N^(1/3)-1)    3N^(1/3)/3     N^(2/3)
nD mesh     2n       n(N^(1/n)-1)    nN^(1/n)/3     N^((n-1)/n)
Ring        2        N/2             N/4            2
2D torus    4        N^(1/2)         N^(1/2)/2      2N^(1/2)
Hypercube   log2 N   n = log2 N      n/2            N/2
2D tree     3        2 log2 N        ~2 log2 N      1
Crossbar    N-1      1               1              N^2/2
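As a quick worked check of the table's formulas (my own example, not from the slides), the following C snippet evaluates diameter and bisection for a 2D mesh and a hypercube at N = 64 nodes: the mesh's diameter grows with the square root of N (14 links here) while the hypercube's grows only logarithmically (6 links), at the cost of a higher degree.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double N = 64;
        /* 2D mesh: degree 4, diameter 2(sqrt(N)-1), bisection sqrt(N) */
        printf("2D mesh:   diameter %.0f, bisection %.0f\n",
               2 * (sqrt(N) - 1), sqrt(N));
        /* hypercube: degree log2(N), diameter log2(N), bisection N/2 */
        printf("hypercube: diameter %.0f, bisection %.0f\n",
               log2(N), N / 2);
        return 0;
    }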

Butterfly or Omega Network
An N-node butterfly is built recursively from two N/2 butterflies. [Figure: 8 x 8 butterfly switch.] All paths are of equal length, and there is a unique path from any input to any output; try to avoid conflicts. (A routing sketch follows below.)

Multistage Fat Tree
A multistage fat tree (as in the CM-5) avoids congestion at the root node: packets are randomly assigned to different paths on the way up to spread the load. Increasing the degree decreases congestion.
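The unique input-to-output path in a butterfly can be expressed as destination-tag routing, sketched below in C. The details (which destination bit each stage consumes, and whether 0 selects the upper output) vary between designs, so take the bit ordering here as an assumption for illustration.

    #include <stdio.h>

    /* Destination-tag routing in a log2(N)-stage butterfly: at each
       stage the switch examines one bit of the destination address
       and takes the upper (0) or lower (1) output. */
    void route(int dst, int stages) {
        for (int s = stages - 1; s >= 0; s--)   /* MSB first, by assumption */
            printf("stage %d: take output %d\n",
                   stages - 1 - s, (dst >> s) & 1);
    }

    int main(void) {
        route(5, 3);   /* 8 x 8 butterfly, destination 101: lower, upper, lower */
        return 0;
    }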

Example Networks

Name       Number    Topology  Bits  Clock    Link  Bis. BW  Year
nCUBE/ten  1-1024    10-cube   1     10 MHz   1.2   640      1987
iPSC/2     16-128    7-cube    1     16 MHz   2     345      1988
MP-1216    32-512    2D grid   1     25 MHz   3     1,300    1989
Delta      540       2D grid   16    40 MHz   40    640      1991
CM-5       32-2048   fat tree  4     40 MHz   20    10,240   1991
CS-2       32-1024   fat tree  8     70 MHz   50    50,000   1992
Paragon    4-1024    2D grid   16    100 MHz  200   6,400    1992
T3D        16-1024   3D torus  16    150 MHz  300   19,200   1993

Link and bisection bandwidths are in MBytes/second. There is no standard topology! However, for on-chip networks, mesh and torus are in favor!