Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors Prof. Robert van Engelen

Overview
The PMS model
Shared memory multiprocessors
Basic shared memory systems: SMP, Multicore, and COMA
Distributed memory multicomputers: MPP systems
Network topologies for message-passing multicomputers
Distributed shared memory
Pipeline and vector processors
Comparison
Taxonomies

PMS Architecture Model
A simple PMS model:
Processor (P): a device that performs operations on data
Memory (M): a device that stores data
Switch (S): a device that facilitates transfer of data between devices
Arcs denote connectivity
An example computer system with CPU and peripherals: each component has different performance characteristics

Shared Memory Multiprocessor
Processors access shared memory via a common switch, e.g. a bus
Problem: a single bus results in a bottleneck
Shared memory has a single address space
Architecture sometimes referred to as a dance hall

Shared Memory: the Bus Contention Problem
Each processor competes for access to shared memory: fetching instructions, loading and storing data
Performance of a single bus: speedup S limited by bus contention
Access to memory is restricted to one processor at a time
This limits the speedup and scalability with respect to the number of processors
Assume that each instruction requires 0 < m < 1 memory operations (the average fraction of loads or stores per instruction), F instructions are performed per unit of time, and a maximum of W words can be moved over the bus per unit of time; then S_P < W / (m F) regardless of the number of processors P
In other words, the parallel efficiency is limited unless P < W / (m F)
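A worked example with assumed numbers (not part of the original slides): suppose the bus can move W = 10^9 words per unit of time, each processor executes F = 10^9 instructions per unit of time, and m = 0.4. Then S_P < W / (m F) = 10^9 / (0.4 * 10^9) = 2.5, so no matter how many processors share the bus the speedup stays below 2.5, and the parallel efficiency starts to drop once P exceeds 2.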

Shared Memory: Work-Memory Ratio
Work-memory ratio (FP:M ratio): ratio of the number of floating point operations to the number of distinct memory locations referenced in the innermost loop
Same location is counted just once in the innermost loop
Assumes effective use of registers (and cache) in the innermost loop
Assumes no reuse across outer loops (register/cache use saturated in the inner loop)
Example: for (i=0; i<1000; i++) x = x + i; references 2 distinct memory locations and performs one float add per iteration: FP:M = 1000/2 = 500
Example: for (i=0; i<N; i++) x = x + a[i]*b[i]; references 2N+1 distinct memory locations and performs 2N FP operations: FP:M = 1 when N is large
Note that FP:M = 1/m - 1, so efficient utilization of shared memory multiprocessors requires P < (FP:M+1) W / F
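Continuing with the assumed W = 10^9 and F = 10^9 from the example above: the dot-product loop has FP:M of roughly 1, so the bound is P < (FP:M+1) W / F = 2 and the bus saturates with only two processors, while the first loop with FP:M = 500 would tolerate P < 501 under the same idealized assumptions.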

Shared Memory Multiprocessor with Local Cache
Add a local cache to improve performance when W / F is small
With today's systems we have W / F << 1
Problem: how to ensure cache coherence?

Shared Memory: Cache Coherence
Thread 1 modifies shared data: a cache coherence protocol ensures that processors obtain newly altered data when shared data is modified by another processor
Because caches operate on cache lines, more data than the shared object alone can be affected, which may lead to false sharing
Thread 0 reads the modified shared data: the cache coherence protocol ensures that thread 0 obtains the newly altered data
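A minimal POSIX-threads sketch of false sharing and its usual fix, padding each thread's counter to its own cache line (illustrative only, not from the original slides; the 64-byte line size, struct layout, and iteration count are assumptions):

/* Two threads update independent counters. With the pad removed, both
 * counters share one cache line, so the coherence protocol bounces the
 * line between the two caches even though no value is logically shared. */
#include <pthread.h>
#include <stdio.h>

#define LINE 64                       /* assumed cache line size in bytes */
#define ITER 100000000L

struct padded { long value; char pad[LINE - sizeof(long)]; };

static struct padded counter[2];      /* padding keeps each counter on its own line */

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITER; i++)
        counter[id].value++;          /* each thread touches only its own counter */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", counter[0].value, counter[1].value);
    return 0;
}

Removing the pad field typically slows this program down noticeably on a multicore machine, which is the false-sharing effect described above.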

COMA
Cache-only memory architecture (COMA)
Large cache per processor to replace shared memory
A data item is either in one cache (non-shared) or in multiple caches (shared)
Switch includes an engine that provides a single global address space and ensures cache coherence

Distributed Memory Multicomputer
Massively parallel processor (MPP) systems with P > 1000
Communication via message passing
Nonuniform memory access (NUMA)
Network topologies: mesh, hypercube, cross-bar switch
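A minimal message-passing sketch in C with MPI (illustrative; the program structure and names are assumptions, not taken from the course material): each rank owns its local data and must communicate explicitly, since there is no shared address space.

/* Each rank sends one value from its local memory to rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank * 1.0;              /* data in this node's local memory */
    if (rank != 0) {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            double remote;
            MPI_Recv(&remote, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %g from rank %d\n", remote, src);
        }
    }
    MPI_Finalize();
    return 0;
}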

Distributed Shared Memory
Distributed shared memory (DSM) systems use physically distributed memory modules and a global address space that gives the illusion of shared virtual memory that is usually NUMA
Hardware is used to automatically translate a memory address into a local address or a remote memory address (via message passing)
Software approaches add a programming layer to simplify access to shared objects (hiding the message passing communications)

Computation-Communication Ratio
The computation-communication ratio: t_comp / t_comm
Usually assessed analytically and verified empirically
High communication overhead decreases speedup, so the ratio should be as high as possible
For example: data size n, number of processors P, and ratio t_comp / t_comm = 1000n / 10^2 n
S_P = t_s / t_P = 1000n / (1000n / P + 10^2 n)
(figure: speedup curves versus data size for P = 1, 2, 4, 8)
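Working the formula above for the plotted processor counts (the data size n cancels): S_1 = 1000/1100, about 0.91; S_2 = 1000/600, about 1.67; S_4 = 1000/350, about 2.86; S_8 = 1000/225, about 4.4. As P grows, the speedup approaches 1000/10^2 = 10, i.e. the computation-communication ratio itself is the asymptotic limit.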

Mesh Topology
Network of P nodes has mesh size √P × √P
Diameter 2(√P - 1)
Torus network wraps the ends
Diameter √P - 1
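A concrete instance of the formulas above: P = 64 nodes form an 8 × 8 mesh with diameter 2(8 - 1) = 14 hops; wrapping the ends into a torus roughly halves the worst-case distance.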

Hypercube Topology
A d-dimensional hypercube has P = 2^d nodes
Diameter is d = log2 P
Node addressing is simple: the node number of a nearest neighbor node differs in one bit
Routing algorithm flips bits to determine possible paths, e.g. from node 001 to 111 there are two shortest paths: 001 -> 011 -> 111 and 001 -> 101 -> 111
(figure: hypercubes for d = 2, 3, 4)
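A small C sketch of the bit-flipping routing rule (illustrative; the function names and the choice to flip the lowest differing bit first are assumptions), reproducing the 001 to 111 example above:

/* Flip the lowest bit in which the current node differs from the
 * destination until they match; the hop count equals the Hamming
 * distance, which is at most d = log2 P. */
#include <stdio.h>

static void print_bits(unsigned node, int d)
{
    for (int b = d - 1; b >= 0; b--)
        putchar((node >> b) & 1 ? '1' : '0');
}

static void route(unsigned src, unsigned dst, int d)
{
    unsigned node = src;
    print_bits(node, d);
    while (node != dst) {
        unsigned diff = node ^ dst;       /* bits still to be corrected */
        unsigned bit  = diff & -diff;     /* lowest differing bit */
        node ^= bit;                      /* cross the link in that dimension */
        printf(" -> ");
        print_bits(node, d);
    }
    putchar('\n');
}

int main(void)
{
    route(1, 7, 3);   /* node 001 to node 111 on a d=3 hypercube */
    return 0;
}

Flipping the highest differing bit first instead would print the other shortest path, 001 -> 101 -> 111.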

Cross-bar Switches
Processors and memories are connected by a set of switches
Enables simultaneous (contention free) communication between processor i and memory s(i), where s is an arbitrary permutation of 1..P
Example: s(1)=2, s(2)=1, s(3)=3
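As an aside (not in the original material), the contention-free condition is simply that s is a permutation, i.e. no memory module is requested by two processors at once; a small C check makes that concrete:

#include <stdbool.h>
#include <stdio.h>

/* Requests s[i] (processor i -> memory s[i]) are contention free on a
 * crossbar exactly when s is a permutation of 0..P-1. */
static bool contention_free(const int s[], int P)
{
    bool used[64] = { false };            /* sketch assumes P <= 64 */
    for (int i = 0; i < P; i++) {
        if (s[i] < 0 || s[i] >= P || used[s[i]])
            return false;                 /* two processors target one memory */
        used[s[i]] = true;
    }
    return true;
}

int main(void)
{
    int s[] = { 1, 0, 2 };                /* the example above, 0-based */
    printf(contention_free(s, 3) ? "contention free\n" : "conflict\n");
    return 0;
}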

Multistage Interconnect Network
4 × 4 two-stage interconnect; 8 × 8 three-stage interconnect
Each switch has an upper output (0) and a lower output (1)
A message travels through a switch based on the destination address
Each bit in the destination address is used to control a switch from start to destination
For example, from 001 to 100: the first switch selects the lower output (1), the second switch selects the upper output (0), the third switch selects the upper output (0)
Contention can occur when two messages are routed through the same switch
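A minimal C sketch of this destination-tag routing (illustrative; the code and naming are assumptions, though the output follows the 001 to 100 case above):

/* At stage k of a log2(N)-stage network, the k-th most significant bit
 * of the destination address selects the switch output: 0 = upper,
 * 1 = lower. */
#include <stdio.h>

static void route(unsigned dst, int stages)
{
    for (int k = stages - 1; k >= 0; k--) {
        int bit = (dst >> k) & 1;
        printf("stage %d: %s output (%d)\n",
               stages - k, bit ? "lower" : "upper", bit);
    }
}

int main(void)
{
    route(4, 3);   /* destination 100: lower, upper, upper */
    return 0;
}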

Pipeline and Vector Processors
Vector processors run operations on multiple data elements simultaneously
A vector processor has a maximum vector length, e.g. 512
Strip mining the loop results in an outer loop with stride 512 to enable vectorization of longer vector operations
Pipelined vector architectures dispatch multiple vector operations per clock cycle
Vector chaining allows the result of a previous vector operation to be directly fed into the next operation in the pipeline

Original loop:
DO i = 0,9999
  z(i) = x(i) + y(i)
ENDDO

Strip-mined loop (19 full strips of 512 plus a remainder of 272 elements):
DO j = 0,9216,512
  DO i = 0,511
    z(j+i) = x(j+i) + y(j+i)
  ENDDO
ENDDO
DO i = 0,271
  z(9728+i) = x(9728+i) + y(9728+i)
ENDDO

Vector (array syntax) form:
DO j = 0,9216,512
  z(j:j+511) = x(j:j+511) + y(j:j+511)
ENDDO
z(9728:9999) = x(9728:9999) + y(9728:9999)
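The same strip-mining transformation rendered in C (a sketch; VLEN = 512 and N = 10000 follow the example above, the generic remainder handling is an addition):

/* The outer loop walks the array in chunks of the maximum vector
 * length; each chunk (including the shorter last one) is what the
 * hardware can execute as a single vector operation. */
#define VLEN 512
#define N    10000

void vadd(double z[N], const double x[N], const double y[N])
{
    for (int j = 0; j < N; j += VLEN) {
        int len = (N - j < VLEN) ? N - j : VLEN;   /* last chunk is the remainder */
        for (int i = 0; i < len; i++)
            z[j + i] = x[j + i] + y[j + i];
    }
}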

Comparison: Bandwidth, Latency and Capacity

Further Reading
[PP2] pages 13-26
[SPC] pages 71-95
[] pages 25-28