Parallel Architecture. Sathish Vadhiyar


Motivations of Parallel Computing
- Faster execution times: from days or months down to hours or seconds, e.g., climate modelling, bioinformatics
- Large amounts of data dictate parallelism
- Parallelism is more natural for certain kinds of problems, e.g., climate modelling
- Computer architecture trends: CPU speeds have saturated, and memory bandwidths remain slow

Classification of Architectures: Flynn's Classification
- Classifies machines in terms of parallelism in the instruction and data streams
- Single Instruction Single Data (SISD): serial computers
- Single Instruction Multiple Data (SIMD): vector processors and processor arrays. Examples: CM-2, Cray C90, Cray Y-MP, Hitachi S-3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures: Flynn's Classification (continued)
- Multiple Instruction Single Data (MISD): not popular
- Multiple Instruction Multiple Data (MIMD): the most popular; IBM SP and most other supercomputers, clusters, computational grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures: Based on Memory
- Shared memory, two types: UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access)
- Examples: HP Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification 2: Shared Memory vs Message Passing
- Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory
- [Diagram: processor-memory (PM) pairs connected by an interconnect, vs. processors (P) connected through an interconnect to a common main memory]
- The alternative is sometimes referred to as a message passing machine or a distributed memory machine

Shared Memory Machines
- The shared memory can itself be distributed among the processor nodes
- Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time
- Terms: NUMA (Non-Uniform Memory Access) vs UMA (Uniform Memory Access) architecture

SHARED MEMORY AND CACHES

Shared Memory Architecture: Caches
[Diagram: P1 and P2 both read X and cache X = 0; P1 then writes X = 1, updating its own cache; when P2 reads X again it gets a cache hit on its stale copy X = 0. Wrong data!]

Cache Coherence Problem
- If each processor in a shared memory multiprocessor has a data cache, there is a potential data consistency problem: the cache coherence problem
- It arises when a shared variable is modified in a private cache
- Objective: processes should not read stale data
- Solution: hardware cache coherence mechanisms

Cache Coherence Protocols
- Write update: propagate the updated cache line to the other processors on every write
- Write invalidate: invalidate the other copies on a write; each processor fetches the updated cache line the next time it reads the now-stale data
- Which is better?

Invalidation Based Cache Coherence
[Diagram: P1 and P2 both cache X = 0. P1 writes X = 1 and an invalidate for X is sent; P2's copy of X is invalidated, so P2's next read of X fetches the updated value X = 1.]

Cache Coherence Using Invalidate Protocols
- Three states are associated with a data item:
- Shared: the item is cached by two or more caches and is consistent with memory
- Invalid: another processor (say P0) has updated the data item, so this copy is stale
- Dirty: the state of the data item in P0, which holds the only up-to-date copy
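To make the transitions concrete, here is a minimal C sketch of an invalidate-based (MSI-style) state machine for a single cache line. The state names follow the slide; the event set and transition rules are a simplified assumption for illustration, not a complete protocol.

```c
#include <stdio.h>

/* States of one cache line in an invalidate-based (MSI-style) protocol. */
typedef enum { INVALID, SHARED, DIRTY } line_state;

/* Events seen by a cache controller for that line. */
typedef enum {
    LOCAL_READ,   /* this processor reads the line  */
    LOCAL_WRITE,  /* this processor writes the line */
    REMOTE_WRITE, /* another processor writes: we receive an invalidate */
    REMOTE_READ   /* another processor reads: a dirty copy must be shared back */
} line_event;

/* Simplified transition function: returns the next state of the line. */
line_state next_state(line_state s, line_event e) {
    switch (e) {
    case LOCAL_READ:   return (s == INVALID) ? SHARED : s; /* a miss fetches a shared copy */
    case LOCAL_WRITE:  return DIRTY;   /* invalidates are sent to the other caches */
    case REMOTE_WRITE: return INVALID; /* our copy becomes stale */
    case REMOTE_READ:  return (s == DIRTY) ? SHARED : s; /* dirty data is written back */
    }
    return s;
}

int main(void) {
    const char *names[] = { "INVALID", "SHARED", "DIRTY" };
    line_state s = INVALID;
    s = next_state(s, LOCAL_READ);   /* INVALID -> SHARED */
    s = next_state(s, LOCAL_WRITE);  /* SHARED  -> DIRTY  */
    s = next_state(s, REMOTE_WRITE); /* DIRTY   -> INVALID: another cache wrote */
    printf("final state: %s\n", names[s]);
    return 0;
}
```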

Implementations of Cache Coherence Protocols: Snoopy
- Snoopy: for bus-based architectures, i.e., a shared bus interconnect where all cache controllers monitor all bus activity
- Only one operation goes through the bus at a time, so cache controllers can be built to take corrective action and enforce coherence in the caches
- Memory operations are propagated over the bus and snooped

Implementations of Cache Coherence Protocols: Directory-Based
- Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
- A central directory maintains the state of each cache block and the processors associated with it

Implementation of Directory Based Protocols
- Uses presence bits to record which processors hold a cache line
- Two schemes:
- Full bit vector scheme: O(M x P) storage for P processors and M cache lines, most of which is unnecessary (sketched below)
- Modern processors use a sparse or tagged directory scheme: limited cache lines and limited presence bits
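As an illustration of the full bit vector scheme, here is a minimal C sketch of a directory entry. The field names and the fixed 64-processor word are assumptions for the example, not a description of any particular machine.

```c
#include <stdint.h>
#include <stdbool.h>

/* Full-bit-vector directory entry for one cache block (assuming P <= 64).
 * The directory keeps one such entry per memory block: O(M x P) bits total. */
typedef struct {
    uint64_t presence;  /* bit i set => processor i holds a copy */
    bool     dirty;     /* true => exactly one processor holds a modified copy */
} dir_entry;

/* Record that processor p fetched a shared copy of the block. */
static inline void dir_add_sharer(dir_entry *e, int p) {
    e->presence |= (uint64_t)1 << p;
}

/* On a write by processor p: all other sharers must be invalidated,
 * so the presence vector collapses to just the writer. */
static inline void dir_write(dir_entry *e, int p) {
    e->presence = (uint64_t)1 << p;
    e->dirty = true;
}

/* Does processor p currently hold a copy? */
static inline bool dir_has_sharer(const dir_entry *e, int p) {
    return (e->presence >> p) & 1;
}
```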

False Sharing
- Cache coherence operates at the granularity of cache lines: an entire cache line is invalidated on a write
- Modern cache lines are typically 64 bytes in size
- Consider a Fortran program dealing with a matrix, with each thread or process accessing one row of the matrix
- Since Fortran stores arrays in column-major order, the elements of a row are interleaved in memory, so different threads write to the same cache lines: false sharing

False Sharing: Solutions
- Reorganize the code so that each processor accesses a contiguous set of rows (see the sketch below)
- This can still lead to overlapping of cache lines at block boundaries if the matrix size is not divisible by the number of processors
- In such cases, employ padding
- Padding: dummy elements added to make the matrix size divisible
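The following C/OpenMP sketch shows the padding idea on a simpler structure, per-thread counters; the 64-byte line size and the struct layout are assumptions for illustration.

```c
#include <stdio.h>
#include <omp.h>

#define NTHREADS 4
#define LINE 64  /* assumed cache line size in bytes */

/* Without the pad array, all four counters would share one cache line,
 * and every increment by one thread would invalidate that line in the
 * other threads' caches (false sharing). */
struct padded_counter {
    long value;
    char pad[LINE - sizeof(long)];  /* padding: dummy bytes so consecutive
                                       counters land on distinct cache lines */
};

int main(void) {
    struct padded_counter c[NTHREADS] = {{0}};

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++)
            c[id].value++;   /* no false sharing: each counter has its own line */
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; t++) total += c[t].value;
    printf("total = %ld\n", total);
    return 0;
}
```

Compile with a flag such as gcc -fopenmp. Removing the pad array puts all four counters on one cache line and typically slows the loop down dramatically on a multicore machine.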

INTERCONNECTION NETWORKS

Interconnects
- Used in both shared memory and distributed memory architectures
- In shared memory: used to connect processors to memory
- In distributed memory: used to connect the different processors
- Components:
- Interface (e.g., PCI or PCIe): connects a processor to a network link
- Network link: connects to a communication network (a network of connections)

Communication Network
- Consists of switching elements to which processors are connected through ports
- Switch: a network of switching elements
- Switching elements are connected to each other using a pattern of connections; this pattern defines the network topology
- In shared memory systems, memory units are also connected to the communication network

Parallel Architecture: Interconnections
- Routing techniques: how the route taken by a message from source to destination is decided
- Network topologies:
- Static: point-to-point communication links among processing nodes
- Dynamic: communication links are formed dynamically by switches

Network Topologies
- Static:
- Bus
- Completely connected
- Star
- Linear array, ring (1-D torus)
- Mesh; k-d mesh: d dimensions with k nodes in each dimension
- Hypercube: a 2-(log p) mesh, i.e., log p dimensions with 2 nodes in each dimension
- Trees (e.g., a campus network)
- Dynamic: communication links are formed dynamically by switches
- Crossbar
- Multistage
- For more details and an evaluation of the topologies, refer to the book by Grama et al.

Network Topologies (continued)
- Bus, ring: used in small-scale shared memory systems
- Crossbar switch: used in some small-scale shared memory machines and in small or medium-scale distributed memory machines

Crossbar Switch
- Consists of a 2D grid of switching elements
- Each switching element has 2 input ports and 2 output ports
- An input port is dynamically connected to an output port through the switching logic

Multistage Network: Omega Network
- Reduces switching complexity: an Omega network consists of log p stages, each with p/2 switching elements
- Contention: a crossbar is non-blocking; in an Omega network contention can occur even for multiple communications between disjoint source-destination pairs
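The stage-to-stage wiring of an Omega network is commonly the perfect shuffle, a left rotation of the node's address bits. A small C sketch, assuming p is a power of two:

```c
#include <stdio.h>

/* Perfect-shuffle wiring used between Omega network stages:
 * the destination is the source address rotated left by one bit
 * (p is assumed to be a power of two with log2(p) = bits). */
unsigned shuffle(unsigned src, unsigned bits) {
    unsigned msb = (src >> (bits - 1)) & 1;        /* top bit of the address */
    return ((src << 1) | msb) & ((1u << bits) - 1);
}

int main(void) {
    unsigned p = 8, bits = 3;  /* 8 nodes -> 3 address bits, 3 stages */
    for (unsigned src = 0; src < p; src++)
        printf("%u -> %u\n", src, shuffle(src, bits));
    return 0;
}
```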

Mesh, Torus, Hypercube, Fat Tree
- Commonly used network topologies in distributed memory architectures
- Hypercubes are multidimensional networks: a hypercube with p nodes has log p dimensions

Mesh, Torus, Hypercubes
[Diagrams: 2-D mesh; torus; hypercube (binary n-cube) for n = 2 and n = 3]
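In a binary n-cube, two nodes are neighbours exactly when their labels differ in one bit, so a node's neighbours are obtained by flipping each address bit in turn. A minimal sketch:

```c
#include <stdio.h>

/* Print the neighbours of node 'id' in an n-dimensional hypercube:
 * flipping each of the n address bits yields one neighbour per dimension. */
void hypercube_neighbours(unsigned id, unsigned n) {
    for (unsigned d = 0; d < n; d++)
        printf("node %u, dimension %u: neighbour %u\n", id, d, id ^ (1u << d));
}

int main(void) {
    hypercube_neighbours(5, 3);  /* node 101 in a 3-cube: 100, 111, 001 */
    return 0;
}
```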

Fat Tree Networks
- A binary tree: processors are arranged at the leaves; the other nodes correspond to switches
- Fundamental property: the number of links from a node to its children equals the number of links from the node to its parent
- Edges therefore become "fatter" as we traverse up the tree

Fat Tree Networks (continued)
- Any pair of processors can communicate without contention: a non-blocking network
- Also called Constant Bisection Bandwidth (CBB) networks
- A two-level fat tree has a diameter of four

Evaluating Interconnection Topologies: Diameter and Connectivity
- Diameter: maximum distance between any two processing nodes
- Fully connected: 1
- Star: 2
- Ring: p/2
- Hypercube: log p
- Connectivity: multiplicity of paths between two nodes, i.e., the minimum number of arcs that must be removed from the network to break it into two disconnected networks
- Linear array: 1
- Ring: 2
- 2-D mesh: 2
- 2-D mesh with wraparound: 4
- d-dimensional hypercube: d

Evaluating Interconnection Topologies: Bisection Width
- Bisection width: minimum number of links that must be removed from the network to partition it into two equal halves
- Ring: 2
- P-node 2-D mesh: sqrt(P)
- Tree: 1
- Star: 1
- Completely connected: P^2/4
- Hypercube: P/2

Evaluating Interconnection topologies channel width number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes channel rate performance of a single physical wire channel bandwidth channel rate times channel width bisection bandwidth maximum volume of communication between two halves of network, i.e. bisection width times channel bandwidth 33