Multiprocessors - Flynn's Taxonomy (1966)

Multiprocessors - Flynn's Taxonomy (1966)

Single Instruction stream, Single Data stream (SISD)
- Conventional uniprocessor
- Although ILP is exploited, a single program counter means a single instruction stream
- The data is not streaming

Single Instruction stream, Multiple Data stream (SIMD)
- Popular for some applications, such as image processing
- Vector processors can be construed as SIMD
- MMX extensions to the ISA reflect the SIMD philosophy
- Also apparent in multimedia processors (e.g., the Equator MAP-1000)
- Data-parallel programming paradigm

Multiprocessors CSE 471 1

Flynn's Taxonomy (cont'd)

Multiple Instruction stream, Single Data stream (MISD)
- Until recently, no processor really fit this category
- Streaming processors: each processor executes a kernel on a stream of data
- Maybe VLIW?

Multiple Instruction stream, Multiple Data stream (MIMD)
- The most general category. Covers:
  - Shared-memory multiprocessors
  - Message-passing multicomputers (including networks of workstations cooperating on the same problem; grid computing)

Shared-memory Multiprocessors

Shared memory = single shared address space (an extension of the uniprocessor model; communication via load/store)

Uniform Memory Access (UMA)
- With a shared bus, it is the basis for SMPs (Symmetric MultiProcessing)
- Cache coherence enforced by snoopy protocols
- Number of processors limited by:
  - Electrical constraints on the load of the bus
  - Contention for the bus
- SMPs form the basic building blocks of clusters (but within a cluster, access to the memory of other nodes is not UMA)

SMP (Symmetric MultiProcessors, aka Multis)

[Diagram: single shared-bus system. Processors with caches connect over a shared bus to interleaved memory and an I/O adapter.]

Shared-memory Multiprocessors (cont'd)

Non-Uniform Memory Access (NUMA)
- NUMA-CC: cache coherent (directory-based protocols or SCI)
- NUMA without cache coherence (coherence enforced by software)
- COMA: Cache-Only Memory Architecture
- Clusters
- Distributed Shared Memory (DSM)
  - Most often a network of workstations; the shared address space is the virtual address space
  - The O.S. enforces coherence on a page-by-page basis

UMA Dance-Hall Architectures & NUMA

Replace the bus with an interconnection network:
- Crossbar
- Mesh
- Perfect shuffle and variants

Better to improve locality with NUMA, where each processing element (PE) consists of:
- A processor
- A cache hierarchy
- Memory for local (private) data and shared data

Cache coherence via directory schemes

UMA - Dance-Hall Schematic

[Diagram: processors with caches on one side of the interconnect; main memory modules on the other.]

NUMA

[Diagram: each processor has its own cache and local memory; nodes communicate over the interconnect.]

Shared Bus

The number of devices is limited (bus length, electrical constraints). A longer bus allows a larger number of devices, but it also becomes slower because of:
- Length
- Contention

Ultra-simplified analysis of contention:
Let Q = processor time between L2 misses, and T = bus transaction time.
For one processor, bus utilization is P = T/(T+Q).
For n processors sharing the bus, the probability that the bus is busy is B(n) = 1 - (1-P)^n.
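The contention model above can be sketched in a few lines; the cycle counts below are made-up illustrative values, not measurements:

```python
def bus_busy_probability(t_bus, q_compute, n_procs):
    """Ultra-simplified shared-bus contention model from the slide:
    P = T/(T+Q) is one processor's bus utilization; with n processors,
    the bus is busy with probability B(n) = 1 - (1-P)^n."""
    p = t_bus / (t_bus + q_compute)
    return 1.0 - (1.0 - p) ** n_procs

# Example: a 20-cycle bus transaction every 180 cycles of compute (P = 0.1).
print(round(bus_busy_probability(20, 180, 1), 3))   # 0.1
print(round(bus_busy_probability(20, 180, 8), 3))   # 0.57
```

Note how quickly the bus saturates: even at 10% utilization per processor, eight processors keep the bus busy more than half the time.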

Crossbars and Indirect Interconnection Networks

- Maximum concurrency between n processors and m banks of memory (or cache)
- Complexity grows as O(n^2)
- Logically a set of n multiplexers, but queuing is also needed for contending requests
- Small crossbars are the building blocks of indirect interconnection networks
- Each node is at the same distance from every other node

An O(n log n) Network: the Butterfly

[Diagram: an 8-node butterfly network, processors 0-7, with the path from processor 4 to processor 6 marked in bold.]

To go from processor i (xyz in binary) to processor j (uvw), start at i and, at each stage k, follow the high link if the k-th bit of the destination address is 0, or the low link if it is 1. For example, the path from processor 4 (100) to processor 6 (110) is marked in bold.
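The routing rule depends only on the destination address, so it can be sketched as a small helper (the 'high'/'low' labels follow the slide's convention; this is an illustration, not a simulator):

```python
def butterfly_route(dst, stages=3):
    """Butterfly routing rule from the slide: at stage k, take the 'high'
    link if bit k (most significant first) of the destination address is 0,
    else the 'low' link. Returns the sequence of link choices."""
    return ['high' if (dst >> (stages - 1 - k)) & 1 == 0 else 'low'
            for k in range(stages)]

# Destination processor 6 (110): bits 1,1,0 give low, low, high --
# the bold path in the slide's example from processor 4.
print(butterfly_route(6))  # ['low', 'low', 'high']
```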

Direct Interconnection Networks

- Nodes are at various distances from each other
- Characterized by their dimension
- Various routing mechanisms, e.g., highest dimension first
- 2D meshes and tori
- 3D cubes
- More dimensions: hypercubes
- Fat trees

Mesh

[Diagram: a 2D mesh interconnect.]

Message-passing Systems

- Processors communicate by messages
- Primitives are of the form send and receive
- The user (programmer) has to insert the messages
- Message-passing libraries (e.g., MPI)
- Communication can be:
  - Synchronous: the sender must wait for an ack from the receiver (e.g., in RPC)
  - Asynchronous: the sender continues without waiting for a reply
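The synchronous/asynchronous distinction can be sketched with two threads and queues. This is a toy model, not a real MPI binding; the names send_async, send_sync, and receiver are made up for the illustration:

```python
import queue
import threading

# Toy sketch of the two communication styles: a synchronous send blocks
# until the receiver acknowledges the message; an asynchronous send just
# enqueues it and continues.
channel, acks, received = queue.Queue(), queue.Queue(), []

def send_async(msg):
    channel.put((msg, False))            # fire and forget

def send_sync(msg):
    channel.put((msg, True))
    assert acks.get(timeout=5) == msg    # block until the receiver acks

def receiver(n_msgs):
    for _ in range(n_msgs):
        msg, needs_ack = channel.get()
        received.append(msg)
        if needs_ack:
            acks.put(msg)                # ack only synchronous sends

t = threading.Thread(target=receiver, args=(2,))
t.start()
send_async('hello')   # returns immediately
send_sync('world')    # returns only after the ack arrives
t.join()
print(received)       # ['hello', 'world']
```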

Shared-memory vs. Message-passing

An old debate that is no longer all that important: many systems are built to support a mixture of both paradigms.
- send and receive can be supported by the O.S. in shared-memory systems
- load/store in a virtual address space can be used in a message-passing system (the message-passing library can use small messages to that effect, e.g., passing a pointer to a memory area on another computer)

The Pros and Cons

Shared-memory pros:
- Ease of programming (SPMD: Single Program Multiple Data paradigm)
- Good for communication of small items
- Less O.S. overhead
- Hardware-based cache coherence

Message-passing pros:
- Simpler hardware (more scalable)
- Explicit communication (both good and bad; some programming languages have primitives for it), easier for long messages
- Use of message-passing libraries

Caveat about Parallel Processing

Multiprocessors are used to:
- Speed up computations
- Solve larger problems

Speedup = time to execute on 1 processor / time to execute on N processors
- Speedup is limited by the communication/computation ratio and by synchronization

Efficiency = speedup / number of processors
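The two definitions above translate directly into code; the timings below are hypothetical:

```python
def speedup(t_serial, t_parallel):
    """Speedup = time on 1 processor / time on N processors."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """Efficiency = speedup / number of processors."""
    return speedup(t_serial, t_parallel) / n_procs

# Hypothetical run: 100 s on one processor, 16 s on 8 processors.
print(speedup(100, 16))        # 6.25
print(efficiency(100, 16, 8))  # 0.78125
```

An efficiency below 1 reflects exactly the communication and synchronization costs the slide warns about.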

Amdahl's Law for Parallel Processing

Recall Amdahl's law: if a fraction x of your program is sequential, speedup is bounded by 1/x.
- At best, linear speedup (if there is no sequential section)

What about superlinear speedup?
- Theoretically impossible
- It occurs in practice because adding a processor may mean adding more overall memory and cache capacity (e.g., fewer page faults!)

Be careful about the sequential fraction x: it may become lower as the data set increases. Speedup and efficiency should therefore take both the number of processors and the size of the input set as parameters.
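A quick numerical check of the 1/x bound, using the standard full form of Amdahl's law (the slide states only the bound):

```python
def amdahl_speedup(seq_fraction, n_procs):
    """Amdahl's law: with sequential fraction x and the rest perfectly
    parallelized over n processors, speedup = 1 / (x + (1 - x)/n).
    As n grows, this is bounded above by 1/x."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n_procs)

# With 10% sequential code, speedup can never exceed 10x.
print(round(amdahl_speedup(0.10, 8), 2))     # 4.71
print(round(amdahl_speedup(0.10, 1000), 2))  # 9.91
```

Note how close 1000 processors get to the 10x ceiling while wasting almost all of their capacity, which is why the sequential fraction dominates scalability.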

Chip MultiProcessors (CMPs)

Multiprocessors vs. multicores:
- Multiprocessors have a private cache hierarchy (on chip)
- Multicores have a shared L2 (on chip)

How many processors?
- Typically today: 2 to 4
- Tomorrow: 8 to 16
- Next decade: ???

Interconnection:
- Today: crossbar
- Tomorrow: ???

Biggest problems:
- Programming (parallel programming languages)
- Applications that require parallelism
