Non-Uniform Memory Access (NUMA) Architecture and Multicomputers


Non-Uniform Memory Access (NUMA) Architecture and Multicomputers
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI), Instituto Superior Técnico
February 29, 2016
CPD (DEI / IST) Parallel and Distributed Computing 5 2016-02-29 1 / 22

Outline
Distributed memory systems
Non-Uniform Memory Access (NUMA) architecture
Memory coherence and consistency
Multicomputers
Message Passing
Network topologies
Latency reduction

Shared-Memory Systems
Also known as the Uniform Memory Access (UMA) architecture: Symmetric Shared-Memory Multiprocessors (SMP).
[Figure: four processors, each with a private cache, sharing a single main memory and I/O]
Limited scalability!

Distributed-Memory Systems
Or the Non-Uniform Memory Access (NUMA) architecture, and Multicomputers.
[Figure: several nodes, each with a processor, cache, local main memory, and I/O, connected through an interconnection network]
Higher access time to remote data.

Distributed-Memory Systems
Two programming models for distributed memory systems:
Non-Uniform Memory Access (NUMA) architecture: memories are physically separated, but can be addressed as one logically shared address space (also known as Distributed Shared Memory, DSM).
Multicomputers: each processor has its own private address space, not accessible by the other processors. Data sharing requires explicit messages among processors (also known as Message Passing Multiprocessors).

NUMA (DSM)
all processors share the same address space
a given memory address returns the same data for any processor
data is always accessed through load/store instructions, wherever it may reside
access time is highly dependent on whether data is in local or remote memory (NUMA)

NUMA (DSM)
Distributed shared memory can be implemented through a distributed virtual memory scheme:
the logical address space is divided into pages
a page table keeps the state of each page (similar to a directory):
not-present: page resides in a remote memory
shared: page available in local memory, but more copies exist
exclusive: page only available in local memory
access to a not-present page causes a page fault, and a page request is sent to the remote processor
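The page-state scheme above can be sketched in code. This is a toy model with invented names (PageState, DSMNode), not an actual DSM implementation; a plain dictionary stands in for fetching a page over the interconnect.

```python
from enum import Enum

class PageState(Enum):
    NOT_PRESENT = 0   # page resides only in a remote memory
    SHARED = 1        # local copy exists; other copies may too
    EXCLUSIVE = 2     # the only copy is local

class DSMNode:
    """Toy node: 'remote' stands in for the network to remote memories."""
    def __init__(self, remote):
        self.remote = remote   # page number -> page contents (remote side)
        self.local = {}        # locally cached page contents
        self.table = {}        # page number -> PageState (the 'directory')

    def read(self, page):
        state = self.table.get(page, PageState.NOT_PRESENT)
        if state is PageState.NOT_PRESENT:
            # page fault: send a page request to the remote processor
            self.local[page] = self.remote[page]
            self.table[page] = PageState.SHARED
        return self.local[page]

node = DSMNode(remote={7: b"payload"})
assert node.read(7) == b"payload"          # first access faults and fetches
assert node.table[7] is PageState.SHARED   # page is now cached locally
```

The same load/store interface serves local and remote data; only the latency differs, which is exactly the NUMA property.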

NUMA Memory Coherence
Memory coherence:
a read of M[X] by processor P that follows a write by P to M[X], with no write to M[X] by another processor in between, always returns the value written by P.
a read of M[X] by P_i that follows a write of M[X] by P_j always returns the value written by P_j, if the read and write are sufficiently separated in time and no other writes to M[X] occur between the two accesses.
writes to the same location are serialized: two writes to the same location by any two processors are seen in the same order by all processors.
There is no single bus to serialize requests!
Synchronous write: block the processor until confirmation of the write operation.
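Without a shared bus, write serialization must be built explicitly. A minimal sketch, assuming a single "home node" per memory location that applies writes one at a time and acknowledges each one (all names here are invented for illustration):

```python
import threading, queue

home = queue.Queue()   # stands in for the location's home node
log = []               # order in which writes are applied

def home_node():
    # applies writes one at a time: this is what serializes them
    while True:
        item = home.get()
        if item is None:
            break
        value, ack = item
        log.append(value)
        ack.set()      # confirmation sent back to the writer

def synchronous_write(value):
    ack = threading.Event()
    home.put((value, ack))
    ack.wait()         # block the processor until the write is confirmed

t = threading.Thread(target=home_node)
t.start()
writers = [threading.Thread(target=synchronous_write, args=(v,))
           for v in range(4)]
for w in writers: w.start()
for w in writers: w.join()
home.put(None)
t.join()
assert sorted(log) == [0, 1, 2, 3]   # every write applied, in one global order
```

Because all writes funnel through one queue, every processor that asks the home node observes them in the same order, satisfying the third coherence condition above.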

Multicomputers - Message Passing
each processor has its own private resources and address space, not accessible by the other processors, hence it acts like an independent computer
data sharing requires explicit message requests through the interconnection network
Asymmetrical multicomputer: one processor (front-end) is the access point to all processors (back-end) and manages the distribution of the workload
Symmetrical multicomputer: every processor is at the same level
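The explicit-message model can be illustrated as follows. This simulates two "processors" with threads and queues purely for illustration; a real multicomputer would communicate over the interconnection network, typically through a library such as MPI.

```python
import threading, queue

def worker(my_inbox, peer_inbox, results, idx):
    local_data = (idx + 1) * 10   # private address space: never shared
    peer_inbox.put(local_data)    # explicit send to the other processor
    received = my_inbox.get()     # explicit receive (blocks until it arrives)
    results[idx] = local_data + received

inbox = [queue.Queue(), queue.Queue()]
results = [None, None]
threads = [threading.Thread(target=worker,
                            args=(inbox[i], inbox[1 - i], results, i))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
assert results == [30, 30]   # 10 + 20 computed on both sides
```

Note that neither worker ever reads the other's variables: all sharing goes through send and receive, which is the defining property of the multicomputer model.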

NUMA vs Multicomputers
Advantages of NUMA (DSM):
programming is simpler
no special compiler or libraries are necessary
Advantages of Multicomputers (Message Passing):
simpler hardware
explicit communication
possible to emulate DSM

Types of Distributed Memory Systems
Network of Workstations (NOW): computers whose primary purpose is not parallel computing; uses free CPU power; potential diversity of architectures and operating systems
Cluster: co-located collection of COTS computers and switches dedicated to running parallel jobs
Grid: large collection of widely distributed clusters and NOWs
Commercial: custom hardware / network to provide a good balance between processing and communication speed

Interconnection Networks
Packet switching: messages are sent one at a time over a shared medium that connects all processors; typically, each message is divided into packets.
Circuit switching: a switched medium supports point-to-point connections between pairs of processors.

Circuit Switching Networks
Metrics for circuit switching network topology evaluation:
Bisection bandwidth: bandwidth available between the two halves of the network
Diameter: largest number of switches in the path between any two nodes
Cost: required hardware and wires
Scalability: how the above parameters grow with the number of processors

Circuit Switching Networks
Basic topologies: Bus, Ring, 2D Grid / Mesh, 2D Torus

Circuit Switching Networks
More sophisticated topologies: 4D Hypercube, Tree / Fat Tree, Butterfly

Circuit Switching Networks
For a network with n processors:

Type       | Switches      | Diameter   | Bisection | Ports / switch
2D Mesh    | n             | 2(√n − 1)  | √n        | 4
Tree       | 2n − 1        | 2 log n    | 1         | 3
Hypercube  | n             | log n      | n/2       | log n
Butterfly  | n(log n + 1)  | log n      | n         | 4
one2one    | n²            | 1          | (n/2)²    | n
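Two rows of the table above can be cross-checked numerically. The sketch below builds an actual 4D hypercube (n = 16), measures its diameter by breadth-first search, and counts the links crossing the natural bisection; both match the formulas log n and n/2.

```python
from collections import deque
from math import log2

def hypercube_edges(dim):
    # nodes are bit strings; neighbors differ in exactly one bit
    n = 1 << dim
    return [(u, u ^ (1 << b)) for u in range(n) for b in range(dim)
            if u < u ^ (1 << b)]

def diameter(n, edges):
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v); adj[v].append(u)
    best = 0
    for src in range(n):             # BFS from every node
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

dim, n = 4, 16
edges = hypercube_edges(dim)
assert diameter(n, edges) == int(log2(n))   # diameter = log n = 4
# bisection: links crossing the cut on the top bit number n/2 = 8
crossing = [e for e in edges if (e[0] >> (dim - 1)) != (e[1] >> (dim - 1))]
assert len(crossing) == n // 2
```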

Latency Reduction Techniques
Memory coherence rules guarantee the correct behavior of parallel applications. They are particularly relevant for:
transactional systems
distributed processing
When developing parallel applications, the programmer can in some cases relax these rules to achieve better performance:
asynchronous message send / receive

Asynchronous Send
Synchronous send / receive: the sender blocks until the message is delivered.
Asynchronous send (write): write to a temporary buffer, so the sender can proceed immediately.

Asynchronous Receive
Synchronous send / receive: the receiver blocks until the message arrives.
Asynchronous receive (read): execute the receive command before the data is needed, so communication overlaps with computation.
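The latency-hiding idea behind both techniques can be sketched with threads and a queue. This is illustrative only: real codes would use non-blocking primitives such as MPI's Isend/Irecv, and the names below are invented.

```python
import threading, queue

channel = queue.Queue()

def async_send(data):
    # copy into a temporary buffer; the sender does not wait for the receiver
    threading.Thread(target=channel.put, args=(data,), daemon=True).start()

def async_recv():
    # post the receive early; returns a handle redeemed when data is needed
    slot = {}
    def fetch(): slot["data"] = channel.get()
    t = threading.Thread(target=fetch, daemon=True)
    t.start()
    return t, slot

handle, slot = async_recv()       # issue the receive before data is needed
async_send([1, 2, 3])             # returns immediately
unrelated_work = sum(range(100))  # overlap: compute while message is in flight
handle.join()                     # only now wait for the data
assert slot["data"] == [1, 2, 3]
```

The computation between posting and completing the transfer is what hides the communication latency; the trade-off is that the program must not touch the buffer until the handle completes.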

Review
Distributed memory systems
Non-Uniform Memory Access (NUMA) architecture
Memory coherence and consistency
Multicomputers
Message Passing
Network topologies
Latency reduction

Next Class
Dependency Graphs
Parallel Programming Methodology