COSC 6385 Computer Architecture - Multi Processor Systems, Fall 2006

Classification of Parallel Architectures
Flynn's Taxonomy:
- SISD: Single instruction, single data (the classical von Neumann architecture)
- SIMD: Single instruction, multiple data
- MISD: Multiple instructions, single data (non-existent; listed only for completeness)
- MIMD: Multiple instructions, multiple data (the most common and general parallel machine)

Single Instruction Multiple Data
- Also known as array processors
- A single instruction stream is broadcast to multiple processors, each having its own data stream
- Still used in graphics cards today
(Diagram: one control unit broadcasts the instruction stream to several processors, each with its own data stream.)

Multiple Instructions Multiple Data (I)
- Each processor has its own instruction stream and input data
- Very general case: every other scenario can be mapped to MIMD
- Further breakdown of MIMD is usually based on the memory organization:
  - Shared memory systems
  - Distributed memory systems

Shared memory systems (I)
- All processes have access to the same address space (e.g. a PC with more than one processor)
- Data exchange between processes by writing/reading shared variables
- Shared memory systems are easy to program; the current standard in scientific programming is OpenMP (see the sketch below)
- Two kinds of shared memory systems are available today:
  - Symmetric multiprocessors (SMP)
  - Non-uniform memory access (NUMA) architectures
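The slide names OpenMP as the current standard; a minimal sketch of the idea (assuming a C compiler with OpenMP support, e.g. gcc -fopenmp) could look like the following: all threads read and write the same shared array, so no explicit data exchange is needed.

```c
/* Minimal OpenMP sketch: the array "a" and the variable "sum" live in the
   shared address space, so threads exchange data simply by reading and
   writing them. Compile e.g. with: gcc -fopenmp shared.c */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    static double a[1000000];
    double sum = 0.0;

    /* Each thread fills part of the shared array */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * i;

    /* Reduction over a shared variable */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```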

Symmetric multi-processors (SMPs)
- All processors share the same physical main memory
- Bandwidth per processor is the limiting factor for this type of architecture
- Typical size: 2-32 processors

NUMA architectures (I)
- Some memory is closer to a certain processor than other memory
- The whole memory is still addressable from all processors
- Depending on which data item a processor retrieves, the access time can vary strongly

NUMA architectures (II)
- Reduce the memory bottleneck compared to SMPs
- More difficult to program efficiently, e.g. first-touch policy: a data item is located in the memory of the processor that uses it first (see the sketch below)
- To reduce the effects of non-uniform memory access, caches are often used
  - ccNUMA: cache-coherent non-uniform memory access architectures
- Largest example as of today: SGI Origin with 512 processors
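A short sketch of the first-touch idea mentioned above, assuming an operating system with first-touch page placement (e.g. Linux): the data is initialized with the same parallel loop schedule that later uses it, so each page is faulted in, and therefore placed, near the thread that will access it.

```c
/* First-touch sketch (assumes first-touch page placement by the OS):
   initialize the arrays with the same static loop schedule that the
   later computation uses, so each page is first written -- and therefore
   placed -- in the NUMA memory local to the thread that will use it. */
void init_and_compute(double *a, double *b, long n)
{
    /* First touch: each thread writes "its" part of a and b */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
    }

    /* Later computation with the same schedule touches mostly local
       memory, avoiding remote NUMA accesses */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}
```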

Distributed memory machines (I)
- Each processor has its own address space
- Communication between processes by explicit data exchange, e.g.:
  - Sockets
  - Message passing (see the sketch below)
  - Remote procedure call / remote method invocation
(Diagram: nodes with private memories connected by a network interconnect.)
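As an illustration of explicit data exchange, here is a minimal message-passing sketch using MPI (which is named later in these slides); the value and message tag are arbitrary.

```c
/* Minimal MPI sketch of explicit data exchange between two processes,
   each with its own address space. Run e.g. with: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42.0;
        /* Data leaves process 0's address space only via an explicit send */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```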

Distributed memory machines (II)
- Performance of a distributed memory machine strongly depends on the quality and the topology of the network interconnect
- Off-the-shelf technology: e.g. Fast Ethernet, Gigabit Ethernet
- Specialized interconnects: Myrinet, InfiniBand, Quadrics, 10G Ethernet

Distributed memory machines (III)
Two classes of distributed memory machines:
- Massively parallel processing systems (MPPs)
  - Tightly coupled environment
  - Single system image (specialized OS)
- Clusters
  - Off-the-shelf hardware and software components such as Intel P4, AMD Opteron, etc.
  - Standard operating systems such as Linux, Windows, BSD UNIX

MIMD (III)
Important metrics (see the ping-pong sketch below):
- Latency: minimal time to send a very short message from one processor to another
  - Units: ms, μs
- Bandwidth: amount of data that can be transferred from one processor to another in a certain time frame
  - Units: Bytes/s, KB/s, MB/s, GB/s; bits/s, Kb/s, Mb/s, Gb/s, baud
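A sketch of how these two metrics are commonly measured with a ping-pong benchmark; the message size and repetition count below are illustrative choices, not values from the lecture. Running it with a 1-byte message approximates the latency, running it with a large message yields the bandwidth.

```c
/* Ping-pong sketch for the metrics above: measure the one-way message
   time between rank 0 and rank 1 and derive the bandwidth for the
   chosen message size. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int reps = 1000;
    const int nbytes = 1 << 20;     /* 1 MB message (use 1 for latency) */
    char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */

    if (rank == 0)
        printf("one-way time %g s, bandwidth %g MB/s\n",
               t, nbytes / t / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```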

Hybrid systems
- E.g. clusters of multi-processor nodes
(Diagram: multi-processor nodes connected by a network interconnect.)

Grids
- Evolution of distributed memory machines and distributed computing
- Several (parallel) machines connected by wide-area links (typically the Internet)
- Machines are in different administrative domains

Shared memory vs. distributed memory machines
- Shared memory machines:
  - Compiler directives (e.g. OpenMP)
  - Threads (e.g. POSIX threads)
- Distributed memory machines:
  - Message passing (e.g. MPI, PVM)
  - Distributed shared memory (e.g. UPC, Co-Array Fortran)
  - Remote procedure calls (RPC/RMI)
- Message passing is widely used in (scientific) parallel programming:
  - Good price/performance ratio of message-passing hardware
  - Very general concept

Performance Metrics (I)
- Speedup: how much faster does a problem run on p processors compared to 1 processor?
  S(p) = T_total(1) / T_total(p)
  Optimal: S(p) = p (linear speedup)
- Parallel efficiency: speedup normalized by the number of processors
  E(p) = S(p) / p
  Optimal: E(p) = 1.0

Performance Metrics (II)
Example: Application A takes 35 min. on a single processor, 27 min. on two processors, and 18 min. on four processors.
  S(2) = 35/27 = 1.29    E(2) = 1.29/2 = 0.645
  S(4) = 35/18 = 1.94    E(4) = 1.94/4 = 0.485
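A small sketch that reproduces the arithmetic of this example using the formulas from the previous slide.

```c
/* Speedup and parallel efficiency from measured run times, using the
   timings of the example above. */
#include <stdio.h>

int main(void)
{
    double t1   = 35.0;               /* time on 1 processor (min)      */
    double tp[] = { 27.0, 18.0 };     /* times on 2 and 4 processors    */
    int    p[]  = { 2, 4 };

    for (int i = 0; i < 2; i++) {
        double S = t1 / tp[i];        /* speedup    S(p) = T(1)/T(p)    */
        double E = S / p[i];          /* efficiency E(p) = S(p)/p       */
        printf("p=%d: S=%.2f  E=%.3f\n", p[i], S, E);
    }
    return 0;
}
```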

Amdahl's Law (I)
Most applications have a (small) sequential fraction, which limits the speedup.
  T_total = T_sequential + T_parallel = f*T_total(1) + (1-f)*T_total(1)
  f: fraction of the code which can only be executed sequentially
  S(p) = T_total(1) / (f*T_total(1) + ((1-f)/p)*T_total(1)) = 1 / (f + (1-f)/p)

(Figure: Example for Amdahl's Law; speedup for 1 to 64 processors plotted for sequential fractions f = 0, 0.05, 0.1, and 0.2.)
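A short sketch that generates the values behind such a plot from Amdahl's formula; the processor counts and sequential fractions mirror the figure.

```c
/* Amdahl's Law: S(p) = 1 / (f + (1-f)/p), printed for the sequential
   fractions used in the figure above. */
#include <stdio.h>

static double amdahl(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double fracs[] = { 0.0, 0.05, 0.1, 0.2 };

    for (int p = 1; p <= 64; p *= 2) {
        printf("p=%2d:", p);
        for (int i = 0; i < 4; i++)
            printf("  f=%.2f -> S=%6.2f", fracs[i], amdahl(fracs[i], p));
        printf("\n");
    }
    return 0;
}
```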

Amdahl's Law (II)
- Amdahl's Law assumes that the problem size is constant
- In most applications, the sequential part is independent of the problem size, while the part that can be executed in parallel is not

Performance Metrics (III)
Scaleup: ratio of the execution time of a problem of size n on 1 processor to the execution time of a problem of size n*p on p processors
  Sc(p) = T_total(1, n) / T_total(p, n*p)
Optimally, the execution time remains constant as the problem grows with the number of processors, e.g.
  T_total(p, n) = T_total(2p, 2n)
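An illustrative sketch of checking scaleup from measured timings; the numbers below are invented purely for illustration.

```c
/* Scaleup sketch: Sc(p) = T(1, n) / T(p, n*p). Ideally the execution
   time stays constant as problem size and processor count grow together,
   so Sc(p) = 1.0. Timings below are illustrative only. */
#include <stdio.h>

int main(void)
{
    double t1        = 100.0;                   /* T(1, n)   */
    double tscaled[] = { 102.0, 108.0, 120.0 }; /* T(p, n*p) */
    int    procs[]   = { 2, 4, 8 };

    for (int i = 0; i < 3; i++)
        printf("p=%d: Sc = %.2f (1.0 is ideal)\n",
               procs[i], t1 / tscaled[i]);
    return 0;
}
```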

Cache Coherence
- A more detailed look at shared memory systems: caches sit between main memory and the CPUs
- Question: what happens if both caches hold a copy of the same data item, CPU 1 modifies the data, and CPU 2 tries to read it?
  - Don't want a CPU to read old values
- What if both CPUs try to write the same item?
  - Don't want the CPUs to end up with different views of the data; requires write serialization
(Diagram: two CPUs, each with its own cache, attached to a shared memory.)

Cache Coherence Protocols
Two basic schemes:
- Directory based: the status (shared/not shared) of a cache block is stored in one memory location
- Snooping: every cache holding a block also holds the sharing status of that block
  - All caches sit on a common bus
  - Caches snoop the bus in order to determine whether a copy of a block has been modified
  - Dominant in today's PCs

Cache Coherence Protocols (II)
Two ways of implementing cache coherence using bus snooping:
- Write invalidate: a write operation by a CPU invalidates the copies of the same data item in the other caches
  - The other CPUs have to re-fetch the data item
- Write update: a write operation by a CPU broadcasts the new value on the bus to all other caches
  - Have to track which data items are held by other caches in order to minimize data transfers

Write invalidate vs. write update
- Multiple writes to the same word (without any read operation in between) require multiple broadcasts in the write-update protocol, but only one invalidate operation in the write-invalidate protocol
- A multi-word write requires one invalidate operation per cache block in the write-invalidate protocol, but a broadcast of every word in the write-update protocol
- A read after a write by different processors might be more efficient with the write-update protocol
- Because of bus and memory bandwidth issues, write invalidate is the dominant protocol

Cache Coherence Protocols (III)
- A bus connects the different caches
- In order to perform an invalidate operation, a processor needs to acquire bus access
  - This automatically serializes multiple writes
  - All writes complete entirely (e.g. it is not possible that the first part of a multi-word write contains data from Proc. A and the second part from Proc. B if both write exactly the same memory locations)
- An extra bit in the cache tag indicates whether a cache block is shared
  - A write operation changes the bit from shared to exclusive (see the sketch below)
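A much-simplified sketch of the write-invalidate bookkeeping described in the last three slides; the state names are illustrative, and a real protocol (e.g. MESI) has more states and explicit bus transactions.

```c
/* Simplified write-invalidate sketch: each cache block carries a sharing
   state; a write moves the local copy to exclusive while the snooping
   caches invalidate theirs. Bus acquisition (not modeled here) is what
   serializes concurrent writes. */
#include <stdio.h>

enum state { INVALID, SHARED, EXCLUSIVE };

struct block { enum state st; };

/* Writing cache: after acquiring the bus and broadcasting an invalidate,
   mark the local copy exclusive. */
void local_write(struct block *b)
{
    b->st = EXCLUSIVE;
}

/* Every other cache snooping the invalidate on the bus drops its copy. */
void snoop_invalidate(struct block *b)
{
    if (b->st != INVALID)
        b->st = INVALID;   /* copy must be re-fetched before the next read */
}

int main(void)
{
    struct block cacheA = { SHARED }, cacheB = { SHARED };

    local_write(&cacheA);        /* CPU A writes the block      */
    snoop_invalidate(&cacheB);   /* CPU B invalidates its copy  */

    printf("A state: %d (exclusive), B state: %d (invalid)\n",
           cacheA.st, cacheB.st);
    return 0;
}
```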