Non-Uniform Memory Access (NUMA) Architecture and Multicomputers


Parallel and Distributed Computing (Lecture 5)
Department of Computer Science and Engineering (DEI), Instituto Superior Técnico
September 26, 2011

Outline
- Distributed memory systems
- Non-Uniform Memory Access (NUMA) architecture
- Multicomputers
- Memory coherence and consistency
- Network topologies
- Latency reduction

Shared-Memory Systems
Also known as the Uniform Memory Access (UMA) architecture: Symmetric Shared-Memory Multiprocessors (SMP).
[Figure: four processors, each with a private cache, connected to a single shared main memory and I/O system]
Limited scalability!

Distributed-Memory Systems
Or Non-Uniform Memory Access (NUMA) architecture; multicomputers.
[Figure: several nodes, each with a processor, cache, local main memory and I/O, connected through an interconnection network]
Larger access time to remote data.

Distributed-Memory Systems
Two programming models for distributed memory systems:
- Non-Uniform Memory Access (NUMA) architecture: memories are physically separated, but can be addressed as one logically shared address space (also known as Distributed Shared Memory, DSM).
- Multicomputers: each processor has its own private address space, not accessible by the other processors. Data sharing requires explicit messages among processors (also known as Message-Passing Multiprocessors, MPM).

NUMA (DSM)
- All processors share the same address space.
- A given memory address returns the same data for any processor.
- Data is always accessed through load/store instructions, wherever it may reside.
- Access time is highly dependent on whether the data is in local or remote memory (hence Non-Uniform Memory Access).
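
As a concrete illustration, on Linux the libnuma API lets a program choose the node on which a buffer is allocated, while accesses remain ordinary loads and stores. A minimal sketch (node choice and buffer size are illustrative; on a machine with a single node both allocations behave identically; link with -lnuma):

```c
/* Minimal sketch of NUMA-aware allocation with Linux's libnuma.
 * Both buffers are read and written with plain loads/stores --
 * only the access latency differs. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    size_t size = 1 << 20;                      /* 1 MiB, illustrative */
    double *local  = numa_alloc_local(size);    /* memory on this CPU's node */
    double *remote = numa_alloc_onnode(size, numa_max_node()); /* another node */

    local[0]  = 1.0;    /* fast: local memory */
    remote[0] = 2.0;    /* slower: crosses the interconnect */
    printf("%f %f\n", local[0], remote[0]);

    numa_free(local, size);
    numa_free(remote, size);
    return 0;
}
```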

NUMA (DSM)
Distributed shared memory can be implemented through a distributed virtual memory scheme:
- The logical address space is divided into pages.
- A page table keeps the state of each page (similar to a directory):
  - not present: page is in a remote memory
  - shared: page is available in local memory, but more copies exist
  - exclusive: page is only available in local memory
- An access that cannot be satisfied locally causes a page fault, and a page request is sent to the remote processor.
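
A sketch of this page-table logic in C, for illustration only: the table layout and the helpers request_page() and invalidate_copies() are hypothetical stand-ins for the real remote protocol.

```c
/* Illustrative sketch of a DSM page-fault handler (not a real
 * implementation): the helpers and table layout are hypothetical. */
typedef enum { NOT_PRESENT, SHARED, EXCLUSIVE } page_state_t;

typedef struct {
    page_state_t state;
    int owner;                       /* processor currently holding the page */
} page_entry_t;

extern page_entry_t page_table[];    /* one entry per logical page */
extern void request_page(int owner, int page);   /* fetch a copy remotely  */
extern void invalidate_copies(int page);         /* drop all remote copies */

void page_fault_handler(int page, int is_write)
{
    page_entry_t *e = &page_table[page];

    if (e->state == NOT_PRESENT)                 /* page lives remotely */
        request_page(e->owner, page);

    if (is_write) {
        invalidate_copies(page);                 /* writes need exclusivity */
        e->state = EXCLUSIVE;
    } else if (e->state != EXCLUSIVE) {
        e->state = SHARED;                       /* copies may exist elsewhere */
    }
}
```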

NUMA Memory Coherence
Memory coherence:
- A read of M[X] by processor P that follows a write by P to M[X], with no write to M[X] by another processor in between, always returns the value written by P.
- A read of M[X] by P_i that follows a write of M[X] by P_j always returns the value written by P_j, if the read and write are sufficiently separated in time and no other writes to M[X] occur between the two accesses.
- Writes to the same location are serialized: two writes to the same location by any two processors are seen in the same order by all processors.
There is no single bus to serialize requests!
Synchronous write: block the processor until confirmation of the write operation.
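
These rules are what makes the familiar flag-based publication pattern reliable: rule 2 guarantees that a poll of the flag eventually observes the producer's write. A sketch with C11 atomics and threads (assumes a libc with C11 <threads.h>); here the release/acquire pair plays the role of the synchronous-write confirmation described above.

```c
/* Sketch: P0 publishes data and then sets a flag; P1 spins on the
 * flag and then reads the data. Coherence rule 2 guarantees the spin
 * eventually sees the flag write; release/acquire orders the data
 * write before it. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int data;
static atomic_int ready;

static int producer(void *arg) {        /* "processor P0" */
    (void)arg;
    data = 42;                          /* write M[X] */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return 0;
}

static int consumer(void *arg) {        /* "processor P1" */
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                               /* spin until the write is visible */
    printf("%d\n", data);               /* prints 42 */
    return 0;
}

int main(void) {
    thrd_t p, c;
    thrd_create(&p, producer, NULL);
    thrd_create(&c, consumer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}
```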

Multicomputers
- Each processor has its own private resources and address space, not accessible by the other processors; hence each node acts like an independent computer.
- Data sharing requires explicit message requests through the interconnection network.
- Asymmetrical multicomputer: one processor (the front-end) is the access point to all processors (the back-end) and manages the distribution of the workload.
- Symmetrical multicomputer: every processor is at the same level.
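
A minimal sketch of the asymmetrical model in MPI: rank 0 acts as the front-end that distributes work and collects results, and every other rank is a back-end worker. The work items here are placeholders.

```c
/* Asymmetrical multicomputer sketch with MPI: rank 0 is the
 * front-end, all other ranks are back-end workers. Compile with
 * mpicc; run with at least two processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                              /* front-end */
        for (int w = 1; w < size; w++) {
            int task = 100 * w;                   /* placeholder work item */
            MPI_Send(&task, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        for (int w = 1; w < size; w++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, w, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("result from worker %d: %d\n", w, result);
        }
    } else {                                      /* back-end worker */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = task + rank;                     /* stand-in computation */
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```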

NUMA vs Multicomputers
Advantages of NUMA (DSM):
- programming is simpler
- no special compiler support or libraries are necessary
Advantages of Multicomputers:
- simpler hardware
- explicit communication
- it is possible to emulate DSM

Types of Distributed Memory Systems
- Network of Workstations (NOW): computers whose primary purpose is not parallel computing; uses spare CPU power; potential diversity of architectures and operating systems.
- Cluster: co-located collection of COTS (commercial off-the-shelf) computers and switches, dedicated to running parallel jobs.
- Grid: large collection of widely distributed clusters and NOWs.
- Commercial: custom hardware / network to provide a good balance between processing and communication speed.

Interconnection Networks
- Packet switching: messages are sent one at a time over a shared medium that connects all processors; typically, each message is divided into packets. [Figure: four processors attached to a single shared medium]
- Circuit switching: a switched medium supports point-to-point connections between all pairs of processors. [Figure: four processors connected through a switched medium]

Circuit Switching Networks
Metrics for evaluating circuit-switching network topologies:
- Bisection bandwidth: bandwidth available between the two halves of the network
- Diameter: largest number of switches in the path between any two nodes
- Cost: required hardware and wires
- Scalability: how the above parameters grow with the number of processors

Circuit Switching Networks
[Figures: bus, ring, 2D grid / mesh, and 2D torus topologies]

Circuit Switching Networks
More sophisticated topologies:
[Figures: 4D hypercube and tree / fat tree topologies]

Circuit Switching Networks
[Figure: butterfly network topology]

Circuit Switching Networks
For a network with n processors:

Type         Switches        Diameter      Bisection   Ports / switch
2D Mesh      n               2(√n − 1)     √n          4
Tree         2n − 1          2 log n       1           3
Hypercube    n               log n         n/2         log n
Butterfly    n(log n + 1)    log n         n           4
One-to-one   n               1             (n/2)²      n
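
As a quick sanity check of the table (assuming n is a power of two, and a perfect square for the mesh), the formulas can be evaluated numerically, here for n = 16:

```c
/* Evaluate a few rows of the topology table for n = 16.
 * log2() is the base-2 logarithm from <math.h>; link with -lm. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 16;
    printf("2D Mesh:   diameter %.0f, bisection %.0f\n",
           2 * (sqrt(n) - 1), sqrt(n));           /* 6, 4  */
    printf("Tree:      diameter %.0f, bisection 1\n",
           2 * log2(n));                          /* 8     */
    printf("Hypercube: diameter %.0f, bisection %.0f\n",
           log2(n), n / 2);                       /* 4, 8  */
    printf("Butterfly: switches %.0f, diameter %.0f\n",
           n * (log2(n) + 1), log2(n));           /* 80, 4 */
    return 0;
}
```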

Latency Reduction Techniques
Memory coherence rules are a guarantee of the correct behavior of parallel applications. They are particularly relevant for:
- transactional systems
- distributed processing
When developing parallel applications, the programmer can in some cases relax these rules to achieve better performance, e.g., with asynchronous message sends / receives.

Asynchronous Send
[Figure: timeline comparing a synchronous send / receive with an asynchronous send]
Asynchronous send (write): write to a temporary buffer and continue without waiting for the receiver.

Asynchronous Receive
[Figure: timeline comparing a synchronous send / receive with an asynchronous receive]
Asynchronous receive (read): execute the command before the data is needed, so communication overlaps with computation.
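
Both patterns map naturally onto MPI's nonblocking primitives: MPI_Isend returns before the receiver is ready, and MPI_Irecv can be posted before the data is needed, so communication overlaps computation until the matching MPI_Wait. A sketch (message contents are placeholders):

```c
/* Asynchronous send/receive sketch with MPI nonblocking calls.
 * Run with two processes: rank 0 sends, rank 1 receives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, msg = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        msg = 123;                              /* placeholder payload */
        MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... computation that does not modify msg can run here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);      /* buffer reusable after this */
    } else if (rank == 1) {
        MPI_Irecv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... computation that does not read msg can run here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);      /* data guaranteed present now */
        printf("received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}
```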

Review
- Synchronization
- Distributed memory systems
- Non-Uniform Memory Access (NUMA) architecture
- Multicomputers
- Memory coherence and consistency
- Network topologies
- Latency reduction

Next Class
- Dependency graphs
- Parallel programming methodology